A first example as well as some documentation have been added. The first example builds an article location and downloads the article as an HTML String. The documentation explains how the application has been designed, what its goal is, and its intended architecture.
title |
---|
crieur-retrieve design |
Self-contained html
Exporting the article as a self-contained HTML page may be the easiest and most reliable approach, as it keeps the original UI of the newspaper and does not require exporting to a different format.
Creating reusable methods to build a self-contained HTML page will make it easier to write `Newspaper`s. Those methods would be part of a `crieur-retrieve-tool` library.
The `self_contained_html` function has been created to do this.
pub fn self_contained_html<S: AsRef<str>>(
html: S,
downloader: &dyn Fn(Url) -> Option<Bytes>,
) -> String
Script removal
Nothing should be executed by the exported html page.
Scripts can be embedded in `<script>` elements as well as in event handler attributes (e.g. `onclick`, `onmousedown`). Both need to be removed.
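As a rough illustration of the two things to strip, the sketch below uses only the standard library and a naive text scan. The names `strip_script_elements` and `is_event_handler_attribute` are hypothetical and not part of the project; a real implementation would work on a parsed document rather than on raw text, since this scan would also match tags such as `<scripts>` and misses closing tags written with extra whitespace.

```rust
/// Removes every `<script ...>...</script>` element from `html`.
/// Naive text scan, for illustration only.
fn strip_script_elements(html: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string,
    // so indices found in `lower` can be used to slice `html`.
    let lower = html.to_ascii_lowercase();
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(start) = lower[pos..].find("<script").map(|i| i + pos) {
        // Keep everything before the opening tag.
        out.push_str(&html[pos..start]);
        // Skip past the closing tag, or drop the rest if it is missing.
        pos = match lower[start..].find("</script>") {
            Some(end) => start + end + "</script>".len(),
            None => html.len(),
        };
    }
    out.push_str(&html[pos..]);
    out
}

/// Heuristic: HTML event handler attributes all start with `on`
/// (`onclick`, `onmousedown`, ...), compared case-insensitively.
fn is_event_handler_attribute(name: &str) -> bool {
    let name = name.to_ascii_lowercase();
    name.len() > 2 && name.starts_with("on")
}
```

With a parser-based approach, the same attribute predicate could be applied to each attribute of each element before serializing the page.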
CSS
CSS should be retrieved and included in the web page.
To make the web pages minimal, it would be nice to remove all unused CSS, but that may be difficult technically.
Images
All images should be included in the HTML page. This can be done by embedding them as base64 data. A drawback is that base64 takes more space than the original binary (about one third more).
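To make the base64 approach concrete, here is a minimal sketch that encodes image bytes and wraps them in a `data:` URI suitable for an `src` attribute. It is written against the standard library only; the function names `base64_encode` and `to_data_uri` are assumptions for illustration, and the project may well rely on an existing crate for the encoding instead.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 encoding with `=` padding.
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, padding with zeros.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        // Emit four 6-bit symbols, replacing missing ones with '='.
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[(n >> 6) as usize & 63] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[n as usize & 63] as char
        } else {
            '='
        });
    }
    out
}

/// Builds a `data:` URI that can replace the `src` of an `<img>` element.
fn to_data_uri(mime: &str, data: &[u8]) -> String {
    format!("data:{};base64,{}", mime, base64_encode(data))
}
```

Every three input bytes become four output characters, which is where the size overhead comes from.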
(optional) Custom filters
`Newspaper` creators could be allowed to write custom HTML filters.
The different filters that creators may want to write are:
- `delete`: delete parts of the page that are useless, based on CSS selectors (navbars, account, comments)
- `link rewrite`: rewrite links so they are absolute. This can be useful if you want to keep external links, links to other articles, to the comment sections, to the main page of the newspaper, etc.
- other filters: asking users what filters they want to write could be useful to know if features are lacking
`delete` filters seem the most useful and are easy to implement, as you can just provide a list of CSS selectors.
The others still need to be designed.
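Since the link rewrite filter is not designed yet, the following is only one possible sketch of its core step: resolving an `href` against the URL of the article. The function name `absolutize` is hypothetical, and the resolution is deliberately simplified (no handling of `../`, fragments, queries or `mailto:` links). In practice, the `join` method of the `url` crate's `Url` type would be the more robust choice.

```rust
/// Resolves `href` against `base` and returns an absolute URL.
/// Returns `None` when `base` has no scheme. Simplified on purpose.
fn absolutize(base: &str, href: &str) -> Option<String> {
    // Split `base` into scheme and origin (scheme + host).
    let scheme_end = base.find("://")?;
    let authority_start = scheme_end + 3;
    let origin_end = base[authority_start..]
        .find('/')
        .map_or(base.len(), |i| authority_start + i);
    let origin = &base[..origin_end];

    Some(if href.contains("://") {
        // Already absolute: keep as is.
        href.to_string()
    } else if let Some(rest) = href.strip_prefix("//") {
        // Protocol-relative: reuse the scheme of the base URL.
        format!("{}://{}", &base[..scheme_end], rest)
    } else if href.starts_with('/') {
        // Root-relative: append to the origin.
        format!("{}{}", origin, href)
    } else {
        // Relative: append to the directory of the base URL.
        let dir_end = base[origin_end..]
            .rfind('/')
            .map_or(origin_end, |i| origin_end + i);
        format!("{}/{}", &base[..dir_end], href)
    })
}
```

A filter built on this would visit every `href` and `src` attribute of the page and replace its value with the resolved URL.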
Minify
The HTML and CSS are minified to take as little space as possible.
(unimplemented) Image size could be reduced when images are too big. A format such as WebP could also be used.
Inspiration
- monolith, a CLI tool for saving complete web pages as a single HTML file
  - not really a library (yet?)
  - lacks custom selectors for removal of unwanted parts
  - not async
Libraries
lol-html is a great library designed to be fast, as it streams through the document rather than parsing, storing and modifying it. Unfortunately, it isn't compatible with async downloads: the library relies on setting up handlers (functions) that are run during processing, and those functions can't be async.
Therefore nipper, a library that seems to be less widely used, has been chosen. The `Document` type of this library is not `Send`, so it can't be used across two different `Future`s. To circumvent this issue, the `Document` is recreated after each `await`. The overhead of doing so has not been measured yet.
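The pattern of recreating the document after each `await` can be sketched without nipper. In the example below, `Doc` is a hypothetical stand-in that is not `Send` because it holds an `Rc`, `fetch_css` is a placeholder for an asynchronous download, and `block_on` is a tiny executor included only so the sketch runs with the standard library. None of these names come from the project.

```rust
use std::future::Future;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Stand-in for a parsed document that is not `Send` (it holds an `Rc`).
struct Doc {
    html: Rc<String>,
}

impl Doc {
    fn parse(html: &str) -> Doc {
        Doc { html: Rc::new(html.to_string()) }
    }

    /// Serializes the document with one more `<style>` element appended.
    fn with_style(&self, css: &str) -> String {
        format!("{}<style>{}</style>", self.html, css)
    }
}

/// Placeholder for an asynchronous download.
async fn fetch_css(url: &str) -> String {
    format!("/* from {} */", url)
}

/// Only a `String` lives across each `await`; the `Doc` is rebuilt afterwards
/// and dropped before the next one, so the returned future stays `Send`.
async fn inline_styles(mut html: String, urls: Vec<String>) -> String {
    for url in urls {
        let css = fetch_css(&url).await;
        let doc = Doc::parse(&html); // recreated after the await
        html = doc.with_style(&css); // serialized before the next await
    }
    html
}

/// Compile-time check that a value is `Send`.
fn assert_send<T: Send>(_: &T) {}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, sufficient for this sketch.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = Box::pin(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

The cost of this approach is one parse and one serialization per download, which is the overhead that remains to be measured.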
Downloader
A `downloader` tool helps to write `Newspaper` interfaces. The `Download` trait allows users to provide their own `downloader`; it also helps with unit testing, as a dummy downloader can be created.
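The testing benefit can be shown with a small sketch. The trait below is a simplified, synchronous stand-in for the `Download` trait: its method name, the use of `&str` for URLs and `Vec<u8>` for the body are assumptions made to keep the example free of dependencies. `DummyDownloader` and `fetch_text` are likewise hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous stand-in for the `Download` trait.
trait Download {
    fn download(&self, url: &str) -> Option<Vec<u8>>;
}

/// Dummy downloader for unit tests: serves canned responses from memory.
struct DummyDownloader {
    responses: HashMap<String, Vec<u8>>,
}

impl DummyDownloader {
    fn new(responses: &[(&str, &str)]) -> Self {
        let responses = responses
            .iter()
            .map(|(url, body)| (url.to_string(), body.as_bytes().to_vec()))
            .collect();
        DummyDownloader { responses }
    }
}

impl Download for DummyDownloader {
    fn download(&self, url: &str) -> Option<Vec<u8>> {
        self.responses.get(url).cloned()
    }
}

/// Code under test only depends on the trait, not on a network client.
fn fetch_text(downloader: &dyn Download, url: &str) -> Option<String> {
    let bytes = downloader.download(url)?;
    String::from_utf8(bytes).ok()
}
```

A test can then build a `DummyDownloader` with a few known pages and check the output of the retrieval code without any network access.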