This release provides updated evaluation results for the news scrapers in our evaluation pipeline. Instructions on reproducing the results can be found in the repository's README.md.
Results
The following table summarizes the overall performance of Fundus and the other evaluated scrapers in terms of averaged ROUGE-LSum precision, recall, and F1-score, each reported with its standard deviation. In addition, we list each scraper's version at evaluation time. The table is sorted by F1-score in descending order:
Fundus-Evaluation v0.2.0
Scraper | Precision | Recall | F1-Score | Version |
---|---|---|---|---|
Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.4.1 |
Trafilatura | 93.91±12.89 | 96.85±15.69 | 93.62±16.73 | 1.12.0 |
news-please | 97.95±10.08 | 91.89±16.15 | 93.39±14.52 | 1.6.13 |
BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.1 |
BoilerNet | 85.96±18.55 | 91.21±19.15 | 86.52±18.03 | / |
Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
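To illustrate how a `mean±std` cell and the F1 ordering in such a table can be derived from per-article scores, here is a minimal sketch. It is not the evaluation pipeline's code: the scraper names and scores are made up, the `summarize` helper is hypothetical, and the use of the population standard deviation is an assumption (the pipeline may use the sample standard deviation instead).

```python
from statistics import mean, pstdev


def summarize(scores: list[float]) -> str:
    """Format per-article scores in [0, 1] as 'mean±std' in percent.

    Assumption: population standard deviation (pstdev); swap in
    statistics.stdev for the sample standard deviation.
    """
    percentages = [100 * score for score in scores]
    return f"{mean(percentages):.2f}±{pstdev(percentages):.2f}"


# Hypothetical per-article F1 scores, NOT taken from the evaluation.
per_article_f1 = {
    "scraper_a": [1.0, 0.75, 0.5],
    "scraper_b": [1.0, 1.0, 0.5],
}

# Sort scrapers by mean F1 in descending order, as in the table.
ranking = sorted(
    per_article_f1,
    key=lambda name: mean(per_article_f1[name]),
    reverse=True,
)

for name in ranking:
    print(f"{name} | {summarize(per_article_f1[name])}")
```

The large standard deviations in the table are consistent with this kind of per-article aggregation: a few articles with poor extractions pull the spread up even when the mean stays high.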
Previous Results
Fundus-Evaluation v0.1.0
Scraper | Precision | Recall | F1-Score | Version |
---|---|---|---|---|
Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.2.2 |
Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 | 1.7.0 |
BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.0 |
news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 | 1.5.44 |
BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 | / |
Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
Cite
Please cite the following paper when using Fundus or building upon our work:
@inproceedings{dallabetta-etal-2024-fundus,
title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
author = "Dallabetta, Max and
Dobberstein, Conrad and
Breiding, Adrian and
Akbik, Alan",
editor = "Cao, Yixin and
Feng, Yang and
Xiong, Deyi",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-demos.29",
pages = "305--314",
abstract = "This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.",
}
What's Changed
- Update News Scrapers and Evaluation Results by @dobbersc in #10
- Add "Contributing" and "Questions and Support" Section by @dobbersc in #11
- Update our Paper Citation to ACL by @dobbersc in #12
Full Changelog: v0.1.0...v0.2.0