url whitelist match #11

Open
opened 2021-04-26 17:59:37 +02:00 by koalp · 0 comments
Owner

For now, only the hostname is checked. Some URLs might have the right hostname but not be articles (configuration pages, etc.).

It would be nice to have a mechanism to avoid downloading malformed pages.

Maybe it could be part of Metadata for newspapers, as an Option.

For the implementation, it could be:

  • a URL whitelist
  • a URL blacklist
  • a whitelist for the HTML content (e.g. must have body > div.article, etc.)
  • most promising: a Trait with a function checking the page (url, body). It is the most generic option, but it would require helper functions (perhaps using the other ideas in this list?) to ease its implementation.
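
To make the last option more concrete, here is a minimal sketch of what such a trait and its helpers could look like. Every name in it (`PageCheck`, `PathWhitelist`, `PathBlacklist`, `BodyContains`, `All`, `should_download`) is hypothetical and not taken from the existing code. The body check is a plain substring test standing in for a real selector match such as `body > div.article`, and the path is extracted by hand rather than with a URL parsing crate.

```rust
/// Hypothetical trait: decides whether a fetched page looks like an article.
pub trait PageCheck {
    fn is_article(&self, url: &str, body: &str) -> bool;
}

/// Returns the path part of a URL (everything from the first `/` after the host).
/// Simplified on purpose; a real implementation would use a URL parser.
fn path_of(url: &str) -> &str {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    match rest.find('/') {
        Some(i) => &rest[i..],
        None => "/",
    }
}

/// Helper: accepts only URLs whose path starts with an allowed prefix.
pub struct PathWhitelist {
    pub prefixes: Vec<String>,
}

impl PageCheck for PathWhitelist {
    fn is_article(&self, url: &str, _body: &str) -> bool {
        let path = path_of(url);
        self.prefixes.iter().any(|p| path.starts_with(p.as_str()))
    }
}

/// Helper: rejects URLs whose path starts with a denied prefix.
pub struct PathBlacklist {
    pub prefixes: Vec<String>,
}

impl PageCheck for PathBlacklist {
    fn is_article(&self, url: &str, _body: &str) -> bool {
        let path = path_of(url);
        !self.prefixes.iter().any(|p| path.starts_with(p.as_str()))
    }
}

/// Helper: requires a marker in the HTML body.
/// A substring test is a crude stand-in for a CSS selector check.
pub struct BodyContains {
    pub marker: String,
}

impl PageCheck for BodyContains {
    fn is_article(&self, _url: &str, body: &str) -> bool {
        body.contains(self.marker.as_str())
    }
}

/// Combinator: the page is accepted only if every inner check accepts it.
pub struct All(pub Vec<Box<dyn PageCheck>>);

impl PageCheck for All {
    fn is_article(&self, url: &str, body: &str) -> bool {
        self.0.iter().all(|c| c.is_article(url, body))
    }
}

/// With no check configured (`None`), behaviour stays as it is today:
/// the page is accepted once the hostname has matched.
pub fn should_download(check: Option<&dyn PageCheck>, url: &str, body: &str) -> bool {
    check.map_or(true, |c| c.is_article(url, body))
}
```

With this shape, a newspaper could carry an `Option<Box<dyn PageCheck>>` in its metadata and build it from the simpler whitelist and blacklist helpers, while a newspaper with unusual needs could implement the trait directly.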
koalp added the status: review_needed and type: enhancement labels 2021-04-26 17:59:37 +02:00
koalp added the status: accepted label and removed the status: review_needed label 2021-05-13 20:53:49 +02:00
koalp added this to the v0.1.2 - further bug resolution milestone 2021-05-13 20:53:58 +02:00