trafilatura
Public
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
article-extractor, corpus-builder, corpus-tools, crawler, html-to-markdown, html2text, llm, news-aggregator, news-crawler, nlp
Created:2019-04-08T19:38:48
Updated:2025-03-26T19:40:42
https://trafilatura.readthedocs.io
Stars:4.6K
Stars Increase:15