crieur/documentation/design/retrieve.md
koalp c4ab210c4d
feat: add retrieval application and one newspaper
A first example as well as some documentation have been added

The first example builds an article location and downloads the article as
an HTML String.

The documentation explains how it has been designed, what the goal of the
application is, and its intended architecture.
2021-04-23 22:12:02 +02:00


title
crieur-retrieve design

Self-contained html

Exporting the article as a self-contained HTML page may be the easiest and most reliable approach, as it keeps the original UI of the newspaper and does not require exporting to a different format.

Creating reusable methods to build a self-contained HTML page will make it easier to write Newspapers. Those methods would be part of a crieur-retrieve-tool library.

The self_contained_html function has been created to do this.

pub fn self_contained_html<S: AsRef<str>>(
    html: S,
    downloader: &dyn Fn(Url) -> Option<Bytes>,
) -> String

Script removal

Nothing should be executed by the exported html page.

Scripts are found in <script> elements as well as in event handler attributes (e.g. onclick, onmousedown).
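As an illustration of the two cases above, here is a minimal sketch using only the standard library. The function names `strip_script_elements` and `is_event_handler_attribute` are hypothetical and are not part of crieur-retrieve; a real implementation would work on a parsed DOM rather than on raw strings.

```rust
/// Removes every `<script ...>...</script>` element from `html`.
/// Matching is ASCII case-insensitive. Limitations of this sketch: any tag
/// whose name merely starts with "script" is also matched, a closing tag
/// written with inner whitespace (`</script >`) is not recognised, and an
/// unclosed `<script>` drops the rest of the input.
fn strip_script_elements(html: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(start) = lower[pos..].find("<script").map(|i| i + pos) {
        // Keep everything before the opening tag.
        out.push_str(&html[pos..start]);
        match lower[start..].find("</script>") {
            Some(end) => pos = start + end + "</script>".len(),
            None => return out,
        }
    }
    out.push_str(&html[pos..]);
    out
}

/// Returns true for attribute names that look like event handlers.
/// This is deliberately conservative: every name starting with "on" is
/// treated as a handler and would be removed.
fn is_event_handler_attribute(name: &str) -> bool {
    let name = name.to_ascii_lowercase();
    name.len() > 2 && name.starts_with("on")
}
```

When walking the attributes of each element, any attribute for which `is_event_handler_attribute` returns true would be dropped.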

CSS

CSS should be retrieved and included in the web page.

To make the web pages minimal, it would be nice to remove all unused CSS, but that may be difficult technically.

Images

All images should be included in the HTML page. This can be done by encoding them as base64. A drawback is that base64 takes more space: the encoded data is roughly a third larger than the original.
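To make the base64 approach concrete, here is a small standard-library-only sketch that encodes bytes and builds a `data:` URI suitable for an `src` attribute. The helper names `base64_encode` and `data_uri` are assumptions made for this example; in practice an existing base64 crate would likely be used instead.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes `data` using the standard base64 alphabet with `=` padding.
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling missing bytes.
        let b0 = chunk[0] as u32;
        let b1 = chunk.get(1).copied().unwrap_or(0) as u32;
        let b2 = chunk.get(2).copied().unwrap_or(0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        // The last two characters are padding when the chunk is short.
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Builds a `data:` URI that can replace the `src` of an `<img>` element.
fn data_uri(mime: &str, data: &[u8]) -> String {
    format!("data:{};base64,{}", mime, base64_encode(data))
}
```

Every 3 input bytes become 4 output characters, which is where the size overhead comes from.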

(optional) Custom filters

Allowing Newspaper creators to write custom html filters can allow to

The different filters that creators may want to write are:

  • delete: delete parts of the page that are useless, based on CSS selectors (navbars, account, comments)
  • link rewrite: rewrite links so they are absolute. This can be useful if you want to keep external links to other articles, to the comment sections, to the main page of the newspaper, etc.
  • other filters: asking users which filters they want to write could be useful to find out whether features are lacking

Delete filters seem the most useful and are easy to implement, as you can just provide a list of CSS selectors.

The others still need to be designed.
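Since the link rewrite filter is not designed yet, the following is only a rough sketch of what making a link absolute could look like, using the standard library alone. The function `absolutize` and its helpers are hypothetical, the base URL is assumed to carry no query or fragment, and `..` segments are not normalised; a real implementation would more likely rely on `Url::join` from the `url` crate.

```rust
/// Splits a base URL into (origin, path),
/// e.g. ("https://example.org", "/news/today.html").
fn split_origin(base: &str) -> Option<(&str, &str)> {
    let authority_start = base.find("://")? + 3;
    let path_start = base[authority_start..]
        .find('/')
        .map(|i| i + authority_start)
        .unwrap_or(base.len());
    Some((&base[..path_start], &base[path_start..]))
}

/// Returns true when `link` starts with something that looks like a URL scheme.
fn has_scheme(link: &str) -> bool {
    match link.find(':') {
        Some(i) if i > 0 => link[..i]
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.')),
        _ => false,
    }
}

/// Rewrites `link` so that it is absolute with respect to `base`.
/// Returns `None` when `base` is not an absolute URL.
fn absolutize(base: &str, link: &str) -> Option<String> {
    if has_scheme(link) {
        // Already absolute (https:, mailto:, data:, ...): keep as is.
        return Some(link.to_string());
    }
    let (origin, path) = split_origin(base)?;
    if let Some(rest) = link.strip_prefix("//") {
        // Protocol-relative link: reuse the scheme of the base URL.
        let scheme = &base[..base.find("://")?];
        return Some(format!("{}://{}", scheme, rest));
    }
    if link.starts_with('/') {
        // Root-relative link: append to the origin.
        return Some(format!("{}{}", origin, link));
    }
    // Relative link: resolve against the directory of the base path.
    let dir = match path.rfind('/') {
        Some(i) => &path[..=i],
        None => "/",
    };
    Some(format!("{}{}{}", origin, dir, link))
}
```

A filter built on this would apply `absolutize` to every `href` and `src` in the page, using the article URL as the base.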

Minify

The HTML and CSS are minified to take as little space as possible.

(unimplemented) Image size could be reduced if images are too big. A format such as WebP could also be used.

Inspiration

  • monolith, a CLI tool for saving complete web pages as a single HTML file
    • not really a library (yet ?)
    • lacks custom selector for removal of unwanted parts
    • not async

Libraries

lol-html is a great library and is designed to be fast, as it streams through the document rather than parsing, storing and modifying it. Unfortunately, it isn't compatible with async downloads: the library relies on setting up handlers (functions) that are run during processing, and those functions can't be async.

Therefore nipper, a library that seems to be less widely used, has been chosen. The Document type of this library is not Send, so it can't be shared between two different Futures or kept alive across an await in a Future that has to be Send. To circumvent this issue, the Document is recreated after each await. The overhead of doing so has not been measured yet.
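The "recreate after each await" workaround can be illustrated without nipper. In the sketch below, `Document` is a stand-in type made non-Send with an `Rc`, and `download`, `inline_resource`, `inline_all`, `assert_send` and `block_on` are all hypothetical names introduced for the example. It is not the crieur-retrieve code, only a demonstration that the resulting Future stays Send when the non-Send value never lives across an await.

```rust
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Stand-in for a DOM type that is not `Send` (the `Rc` makes it so).
struct Document {
    html: Rc<String>,
}

impl Document {
    fn parse(html: &str) -> Self {
        Document { html: Rc::new(html.to_string()) }
    }

    fn replace_first(&self, from: &str, to: &str) -> String {
        self.html.replacen(from, to, 1)
    }
}

/// Stand-in for an async download; real code would perform network I/O here.
async fn download(url: &str) -> String {
    format!("[content of {}]", url)
}

/// Synchronous step: the Document is created and dropped inside this
/// function, so it never exists while the caller is suspended at an await.
fn inline_resource(html: &str, url: &str, content: &str) -> String {
    let doc = Document::parse(html);
    doc.replace_first(url, content)
}

/// Only `Send` values (Strings) are alive at the await point, and the
/// Document is rebuilt from the current HTML after each download.
async fn inline_all(mut html: String, urls: Vec<String>) -> String {
    for url in urls {
        let content = download(&url).await;
        html = inline_resource(&html, &url, &content);
    }
    html
}

/// Compile-time check that a value is `Send`.
fn assert_send<T: Send>(value: T) -> T {
    value
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, only here to run the example.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = Box::pin(future);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}
```

The cost of this pattern is one full re-parse per await, which is the overhead mentioned above.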

Downloader

A downloader tool helps to write Newspaper interfaces. The Download trait allows the user to provide their own downloader; it also helps with unit testing, as a dummy downloader can be created.
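To show why a trait helps with testing, here is a sketch of a dummy downloader. The trait shape below is an assumption: the real Download trait may differ (it may be async, for example), and `DummyDownloader` and `fetch_or_placeholder` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous shape of the trait; the actual `Download`
/// trait in crieur-retrieve may have a different signature.
trait Download {
    fn download(&self, url: &str) -> Option<Vec<u8>>;
}

/// Test double that serves canned responses instead of using the network.
struct DummyDownloader {
    responses: HashMap<String, Vec<u8>>,
}

impl DummyDownloader {
    fn new() -> Self {
        DummyDownloader { responses: HashMap::new() }
    }

    /// Registers a canned response for `url`.
    fn with(mut self, url: &str, body: &[u8]) -> Self {
        self.responses.insert(url.to_string(), body.to_vec());
        self
    }
}

impl Download for DummyDownloader {
    fn download(&self, url: &str) -> Option<Vec<u8>> {
        self.responses.get(url).cloned()
    }
}

/// Example of code written against the trait rather than a concrete
/// downloader: it falls back to a placeholder when the download fails.
fn fetch_or_placeholder(downloader: &dyn Download, url: &str) -> Vec<u8> {
    downloader
        .download(url)
        .unwrap_or_else(|| b"missing".to_vec())
}
```

Code such as `fetch_or_placeholder` can then be unit tested with a `DummyDownloader` and used in production with a network-backed implementation.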