A first example as well as some documentation have been added. The first example builds an article location and downloads the article as an HTML `String`. The documentation explains how the application has been designed, what its goal is, and its intended architecture.
---
title: crieur-retrieve design
---

# Self-contained html

Exporting the article as a self-contained HTML page may be the easiest and most reliable approach, as it keeps the original UI of the newspaper and does not require exporting to a different format.

Creating reusable methods for building a self-contained HTML page will make it easier to write `Newspaper`s. Those methods would be part of a `crieur-retrieve-tool` library.

The `self_contained_html` function has been created to do this.

```rust
pub fn self_contained_html<S: AsRef<str>>(
    html: S,
    downloader: &dyn Fn(Url) -> Option<Bytes>,
) -> String
```

## Script removal

Nothing should be executed by the exported HTML page.

Scripts live in `<script>` elements as well as in event-handler attributes (e.g. `onclick`, `onmousedown`); both must be stripped.

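As an illustration only (the actual implementation should operate on a parsed DOM rather than raw strings, and must also drop event-handler attributes), removing `<script>` elements can be sketched with a naive string scan:

```rust
// Naive sketch: remove every `<script>…</script>` element from an HTML
// string. Not the real crieur implementation, just the idea.
fn strip_scripts(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find("<script") {
        // keep everything before the script element
        out.push_str(&rest[..start]);
        match rest[start..].find("</script>") {
            // skip past the closing tag and continue scanning
            Some(end) => rest = &rest[start + end + "</script>".len()..],
            // unclosed script element: drop the remainder
            None => return out,
        }
    }
    out.push_str(rest);
    out
}
```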
## CSS

CSS should be retrieved and included in the web page.

To keep the page minimal, it would be nice to remove all unused CSS, but that may be technically difficult.

## Images

All images should be included in the HTML page. This can be done by encoding them as base64 data URIs.

A drawback is that base64 encoding takes more space.

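A sketch of the embedding step (`to_data_uri` is a hypothetical helper name; real code would use the `base64` crate rather than this hand-rolled encoder):

```rust
// Base64 alphabet, as defined by RFC 4648.
const B64: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Sketch: build a base64 `data:` URI so an image can be embedded directly
// in a `src` attribute instead of being fetched from the network.
fn to_data_uri(mime: &str, data: &[u8]) -> String {
    let mut b64 = String::new();
    for chunk in data.chunks(3) {
        // Pad the chunk to 3 bytes and pack it into 24 bits.
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = (u32::from(b[0]) << 16) | (u32::from(b[1]) << 8) | u32::from(b[2]);
        // Emit chunk.len() + 1 symbols, then `=` padding.
        for i in 0..4 {
            if i <= chunk.len() {
                b64.push(B64[((n >> (18 - 6 * i)) & 63) as usize] as char);
            } else {
                b64.push('=');
            }
        }
    }
    format!("data:{};base64,{}", mime, b64)
}
```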
## (options) Custom filters

Allowing `Newspaper` creators to write custom HTML filters would let them tailor the extracted page to each site.

The different filters that creators may want to write are:

- `delete`: delete parts of the page that are useless (navbars, account widgets, comments), based on a CSS selector
- `link rewrite`: rewrite links so they are absolute. This can be useful to keep external links working, such as links to other articles, to the comment section, or to the main page of the newspaper
- other filters: asking users which filters they would like to write could help reveal missing features

The `delete` filter seems the most useful and is easy to implement, as the creator can simply provide a list of CSS selectors.

The others still need to be designed.

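One possible shape for the filter hook, as a sketch (the type and function names are assumptions, not the actual crieur-retrieve API): filters as boxed closures over the HTML string, applied in the order the `Newspaper` registers them.

```rust
// Hypothetical filter pipeline: each filter rewrites the HTML string.
type HtmlFilter = Box<dyn Fn(String) -> String>;

// Apply every registered filter in order, threading the HTML through.
fn apply_filters(html: String, filters: &[HtmlFilter]) -> String {
    filters.iter().fold(html, |html, filter| filter(html))
}
```

A `delete` filter would then be one closure built from a list of CSS selectors, while `link rewrite` would be another closure that absolutizes URLs.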
## Minify

The HTML and CSS are minified to take as little space as possible.

**unimplemented** Image sizes could be reduced when images are too big. A format such as WebP could also be used.

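To illustrate the idea (real minifiers, such as the `minify-html` crate, are HTML-aware and do far more; the function name here is hypothetical), a naive whitespace collapser might look like:

```rust
// Naive sketch: collapse every run of whitespace to a single space to
// shrink the serialized page. A real minifier must respect elements such
// as `<pre>`, where whitespace is significant.
fn collapse_whitespace(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut last_was_ws = false;
    for c in html.chars() {
        if c.is_whitespace() {
            if !last_was_ws {
                out.push(' ');
            }
            last_was_ws = true;
        } else {
            out.push(c);
            last_was_ws = false;
        }
    }
    out
}
```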
## Inspiration

- [monolith](https://github.com/y2z/monolith), a CLI tool for saving complete web pages as a single HTML file
  - not really a library (yet?)
  - lacks custom selectors for removing unwanted parts
  - not async

|
## Libraries

[lol-html](https://github.com/cloudflare/lol-html) is a great library and is designed to be fast, as it streams through the HTML rather than parsing, storing, and modifying it. Unfortunately, it isn't compatible with async downloads: the library relies on setting up handler functions that are run during processing, and those functions can't be async.

Therefore, a less widely used library, [nipper](https://github.com/importcjj/nipper), has been chosen. The `Document` type of this library is not `Send`, so it can't be used in two different `Future`s. To circumvent this issue, the `Document` is recreated after each `await`. The overhead of doing so has not been measured yet.

# Downloader

A `downloader` tool helps with writing `Newspaper` interfaces. The `Download` trait allows the user to provide their own `downloader`; it also helps with unit testing, as a dummy downloader can be created.
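A sketch of the testing benefit (trait and method names are assumptions, not the actual crieur API, and the real trait is presumably async):

```rust
// Hypothetical simplification of the `Download` trait: abstracting
// downloads behind a trait lets tests substitute canned bytes for real
// network access.
trait Download {
    fn download(&self, url: &str) -> Option<Vec<u8>>;
}

// A dummy downloader returning fixed bytes, for unit tests.
struct DummyDownloader;

impl Download for DummyDownloader {
    fn download(&self, _url: &str) -> Option<Vec<u8>> {
        Some(b"<html></html>".to_vec())
    }
}
```

A `Newspaper` implementation written against the trait can then be exercised in tests without touching the network.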