crieur/documentation/design/retrieve.md
koalp c4ab210c4d
feat: add retrieval application and one newspaper
A first example as well as some documentation have been added

The first example builds an article location and downloads the article as
an html String.

The documentation explains how the application has been designed, what its
goal is, and what its intended architecture is
2021-04-23 22:12:02 +02:00


---
title: crieur-retrieve design
---
# Self-contained html
Exporting the article as a self-contained html page may be the easiest and most reliable approach, as it keeps the
original ui of the newspaper and does not require exporting to a different format.
Creating reusable methods that build a self-contained html page will make it easier to write
`Newspaper`s. Those methods would be part of a `crieur-retrieve-tool` library.
The `self_contained_html` function has been created to do this.
```rust
pub fn self_contained_html<S: AsRef<str>>(
    html: S,
    downloader: &dyn Fn(Url) -> Option<Bytes>,
) -> String
```
## Script removal
Nothing should be executed by the exported html page.
Scripts can be found in `<script>` elements as well as in event handler attributes (e.g. `onclick`,
`onmousedown`); both have to be removed.
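As an illustration of the first half of this step, here is a minimal sketch that drops `<script>` elements with a plain text scan. The function name `strip_script_elements` is hypothetical, and the scan is deliberately naive: it is not a real html parser, it would also match an unrelated tag whose name merely starts with `script`, and it does not touch event handler attributes. It only uses the standard library; the actual implementation would rather rely on a DOM library.

```rust
/// Removes every `<script ...>...</script>` element from `html`.
/// Hypothetical helper: a naive, case-insensitive text scan, not an html parser.
pub fn strip_script_elements(html: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to those of `html`.
    let lower = html.to_ascii_lowercase();
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(start) = lower[pos..].find("<script").map(|i| i + pos) {
        // Keep everything that precedes the opening tag.
        out.push_str(&html[pos..start]);
        // Skip just past the closing tag, or drop the rest if it is missing.
        pos = match lower[start..].find("</script>") {
            Some(end) => start + end + "</script>".len(),
            None => html.len(),
        };
    }
    // Keep whatever follows the last script element.
    out.push_str(&html[pos..]);
    out
}
```

An unclosed `<script>` makes the sketch drop the rest of the page, which errs on the side of never keeping executable content.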
## CSS
CSS should be retrieved and included in the web page.
To keep the web pages minimal, it would be nice to remove all unused CSS, but that may be technically difficult.
## Images
All images should be included in the html page. This can be done by encoding them in base64.
A drawback is that it takes more space: base64 output is about a third larger than the binary input.
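The following sketch shows what embedding an image could look like: the image bytes are base64-encoded and wrapped in a `data:` URI that can replace the `src` attribute. The names `base64_encode` and `data_uri` are hypothetical, and the encoder is written by hand only to keep the example free of dependencies; a real implementation would probably use an existing crate.

```rust
/// Standard base64 alphabet (RFC 4648).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes `bytes` as padded base64. Every 3 input bytes become 4 characters.
pub fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into a 24-bit group, missing bytes count as 0.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        // Pad with '=' when the last chunk has fewer than three bytes.
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Builds a `data:` URI usable as the `src` of an `<img>` element.
pub fn data_uri(mime: &str, bytes: &[u8]) -> String {
    format!("data:{};base64,{}", mime, base64_encode(bytes))
}
```

Since 3 bytes turn into 4 characters, a 300 KiB image becomes roughly 400 KiB of text in the page.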
## (optional) Custom filters
`Newspaper` creators could be allowed to write custom html filters.
The different filters that creators may want to write are:
- `delete`: delete parts of the page that are useless, based on CSS selectors (navbars, account, comments)
- `link rewrite`: rewrite links so they are absolute. It can be useful if you want to keep external links to other articles, to the comment sections, to the main page of the newspaper, etc.
- other filters: asking users what filters they want to write could be useful to know if features are lacking
`delete` filters seem the most useful and are easy to implement, as you can just provide a list of CSS selectors.
The others still need to be designed.
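Since the `link rewrite` filter is not designed yet, the following is only a sketch of its core operation: turning a relative `href` into an absolute one, given the URL of the page. The function name `absolutize` and the `news.example` URLs are hypothetical. The sketch uses plain string handling and ignores `..` segments, query strings and protocol-relative links; a real filter would rather rely on a URL library.

```rust
/// Resolves `href` against `page_url`. Hypothetical, simplified helper.
pub fn absolutize(page_url: &str, href: &str) -> String {
    // Links with a scheme (`https:`, `mailto:`), anchors and
    // protocol-relative links are left untouched.
    if href.contains(':') || href.starts_with('#') || href.starts_with("//") {
        return href.to_string();
    }
    let after_scheme = match page_url.find("://") {
        Some(i) => i + 3,
        None => return href.to_string(),
    };
    // The origin ends at the first '/' after "scheme://", or at the end of the URL.
    let origin_end = page_url[after_scheme..]
        .find('/')
        .map_or(page_url.len(), |i| after_scheme + i);
    if href.starts_with('/') {
        // Root-relative link: append it to the origin.
        return format!("{}{}", &page_url[..origin_end], href);
    }
    // Relative link: append it to the directory of the page.
    match page_url[origin_end..].rfind('/') {
        Some(i) => format!("{}{}", &page_url[..origin_end + i + 1], href),
        None => format!("{}/{}", page_url, href),
    }
}
```

For a page at `https://news.example/politics/article.html`, the link `/img/a.png` resolves to `https://news.example/img/a.png` and `b.html` resolves to `https://news.example/politics/b.html`.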
## Minify
The html and css are minified to take as little space as possible.
**unimplemented** Image sizes could be reduced if they are too big. A format such as webp could
also be used.
## Inspiration
- [monolith](https://github.com/y2z/monolith), a CLI tool for saving complete web pages as a single HTML file
  - not really a library (yet?)
  - lacks custom selectors for removing unwanted parts
  - not async
## Libraries
[lol-html](https://github.com/cloudflare/lol-html) is a great library designed to be fast: it streams through the document rather than parsing, storing and modifying it. Unfortunately, it isn't compatible with async downloads, as the library relies on setting up handlers (functions) that are run during the processing, and those functions can't be async.
Therefore, a library that seems to be less widely used, [nipper](https://github.com/importcjj/nipper), has been chosen. The `Document` type of this library is not `Send`, so it can't be held across an `await` in a `Future` that has to be `Send`. To circumvent this issue, the `Document` is recreated after each `await`. The overhead of doing so has not been measured yet.
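The workaround can be sketched as follows. To stay self-contained, the example does not use nipper: `Document` is a hypothetical stand-in that is `!Send` because it holds an `Rc`, and `fetch` is a hypothetical async download that performs no real request. The point is only the shape of the code: the document lives in a scope that ends before each `await`, and is parsed again afterwards. It assumes the 2018 edition or later.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a parsed document; the `Rc` makes it `!Send`.
struct Document {
    html: Rc<String>,
}

impl Document {
    fn parse(html: &str) -> Self {
        Document { html: Rc::new(html.to_string()) }
    }

    /// Naive scan for the values of `src="..."` attributes.
    fn sources(&self) -> Vec<String> {
        self.html
            .split("src=\"")
            .skip(1)
            .filter_map(|rest| rest.split('"').next())
            .map(str::to_string)
            .collect()
    }

    fn replace_source(&mut self, from: &str, to: &str) {
        let updated = self.html.replace(from, to);
        self.html = Rc::new(updated);
    }

    fn into_html(self) -> String {
        self.html.as_str().to_string()
    }
}

/// Hypothetical async download; a real one would perform an HTTP request.
async fn fetch(url: &str) -> Option<String> {
    Some(format!("data:,{}", url))
}

pub async fn embed_sources(html: String) -> String {
    // 1. Parse, extract what is needed, and drop the document before awaiting.
    let sources = {
        let doc = Document::parse(&html);
        doc.sources()
    };

    // 2. Await while only `Send` data (strings) is alive.
    let mut fetched = Vec::new();
    for src in sources {
        let data = fetch(&src).await;
        fetched.push((src, data));
    }

    // 3. Recreate the document after the awaits and apply the results.
    let mut doc = Document::parse(&html);
    for (src, data) in fetched {
        if let Some(data) = data {
            doc.replace_source(&src, &data);
        }
    }
    doc.into_html()
}

/// Compile-time check: only accepts values that are `Send`.
pub fn assert_send<T: Send>(_value: &T) {}
```

Calling `assert_send` on the future returned by `embed_sources` compiles, whereas keeping a single `Document` alive across the `fetch(..).await` would make the same call fail to compile. The cost of this pattern is one extra parse per batch of awaits, which is the overhead mentioned above.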
# Downloader
A `downloader` tool helps to write `Newspaper` interfaces. The `Download` trait allows users to provide their own `downloader`. It also helps with unit testing, as a dummy downloader can be created.
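To illustrate the unit-testing benefit, here is a minimal sketch of a dummy downloader. The shape of the trait is an assumption: the real `Download` trait may be async and use dedicated `Url` and `Bytes` types, while this sketch is synchronous and uses `&str` and `Vec<u8>` to remain dependency-free. `DummyDownloader`, `with_page` and `fetch_text` are hypothetical names.

```rust
use std::collections::HashMap;

/// Assumed, simplified shape of the trait; the real one may differ.
pub trait Download {
    fn download(&self, url: &str) -> Option<Vec<u8>>;
}

/// Dummy downloader serving canned responses from memory, for unit tests.
pub struct DummyDownloader {
    pages: HashMap<String, Vec<u8>>,
}

impl DummyDownloader {
    pub fn new() -> Self {
        DummyDownloader { pages: HashMap::new() }
    }

    /// Registers the body returned for `url`.
    pub fn with_page(mut self, url: &str, body: &[u8]) -> Self {
        self.pages.insert(url.to_string(), body.to_vec());
        self
    }
}

impl Download for DummyDownloader {
    fn download(&self, url: &str) -> Option<Vec<u8>> {
        // Unknown URLs behave like failed downloads.
        self.pages.get(url).cloned()
    }
}

/// Example of code that depends on the trait only, not on the network.
pub fn fetch_text(downloader: &dyn Download, url: &str) -> Option<String> {
    let bytes = downloader.download(url)?;
    String::from_utf8(bytes).ok()
}
```

A test can then build `DummyDownloader::new().with_page("https://news.example/a", b"<p>hi</p>")` and pass it wherever a `Download` implementation is expected, without any network access.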