Releases · dobbersc/fundus-evaluation

This release provides updated evaluation results for the news scrapers in our evaluation pipeline. Instructions on reproducing the results can be found in the repository's README.md.

Results

The following table summarizes the overall performance of Fundus and the other evaluated scrapers in terms of averaged ROUGE-LSum precision, recall, and F1-score, each with its standard deviation. We also list each scraper's version at evaluation time. The table is sorted by F1-score in descending order:

Fundus-Evaluation v0.2.0

Scraper	Precision	Recall	F1-Score	Version
Fundus	99.89 ±0.57	96.75 ±12.75	97.69 ±9.75	0.4.1
Trafilatura	93.91 ±12.89	96.85 ±15.69	93.62 ±16.73	1.12.0
news-please	97.95 ±10.08	91.89 ±16.15	93.39 ±14.52	1.6.13
BTE	81.09 ±19.41	98.23 ±8.61	87.14 ±15.48	/
jusText	86.51 ±18.92	90.23 ±20.61	86.96 ±19.76	3.0.1
BoilerNet	85.96 ±18.55	91.21 ±19.15	86.52 ±18.03	/
Boilerpipe	82.89 ±20.65	82.11 ±29.99	79.90 ±25.86	1.3.0
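
As a rough illustration of how scores of this kind are derived, the sketch below computes token-level ROUGE-L precision, recall, and F1 for one extraction against its reference text, then formats a list of per-article scores as mean ± standard deviation. It is a simplification and not the pipeline's implementation: ROUGE-LSum works on sentence-split text, whereas this sketch uses plain ROUGE-L over whitespace tokens. The function names and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, extraction):
    """Plain ROUGE-L (not ROUGE-LSum) over whitespace tokens.

    Returns (precision, recall, f1) as fractions in [0, 1].
    """
    ref, ext = reference.split(), extraction.split()
    lcs = lcs_length(ref, ext)
    precision = lcs / len(ext) if ext else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def summarize(values):
    """Format per-article scores as 'mean ±std' in percent.

    Assumption: population standard deviation (pstdev).
    """
    return f"{100 * mean(values):.2f} ±{100 * pstdev(values):.2f}"
```

An extraction that contains the full reference plus extra boilerplate keeps a recall of 1.0 but loses precision, which matches the pattern of high recall and lower precision in some rows of the table.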

Previous Results

Fundus-Evaluation v0.1.0

Scraper	Precision	Recall	F1-Score	Version
Fundus	99.89 ±0.57	96.75 ±12.75	97.69 ±9.75	0.2.2
Trafilatura	90.54 ±18.86	93.23 ±23.81	89.81 ±23.69	1.7.0
BTE	81.09 ±19.41	98.23 ±8.61	87.14 ±15.48	/
jusText	86.51 ±18.92	90.23 ±20.61	86.96 ±19.76	3.0.0
news-please	92.26 ±12.40	86.38 ±27.59	85.81 ±23.29	1.5.44
BoilerNet	84.73 ±20.82	90.66 ±21.05	85.77 ±20.28	/
Boilerpipe	82.89 ±20.65	82.11 ±29.99	79.90 ±25.86	1.3.0
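
To make the difference between the two evaluation rounds easier to read, the following small sketch computes the change in mean F1-score per scraper from v0.1.0 to v0.2.0. The numbers are copied from the two tables above; the variable names are chosen for illustration only.

```python
# Mean F1-scores copied from the v0.1.0 and v0.2.0 tables above.
f1_v010 = {
    "Fundus": 97.69, "Trafilatura": 89.81, "BTE": 87.14, "jusText": 86.96,
    "news-please": 85.81, "BoilerNet": 85.77, "Boilerpipe": 79.90,
}
f1_v020 = {
    "Fundus": 97.69, "Trafilatura": 93.62, "news-please": 93.39, "BTE": 87.14,
    "jusText": 86.96, "BoilerNet": 86.52, "Boilerpipe": 79.90,
}

# Difference in percentage points, rounded to the tables' precision.
deltas = {name: round(f1_v020[name] - f1_v010[name], 2) for name in f1_v020}

# Keep only scrapers whose mean F1-score actually changed.
changed = {name: delta for name, delta in deltas.items() if delta != 0}

for name, delta in sorted(changed.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.2f}")
# news-please: +7.58
# Trafilatura: +3.81
# BoilerNet: +0.75
```

The remaining scrapers report identical mean F1-scores in both rounds.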

Cite

Please cite the following paper when using Fundus or building upon our work:

@inproceedings{dallabetta-etal-2024-fundus,
    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
    author = "Dallabetta, Max  and
      Dobberstein, Conrad  and
      Breiding, Adrian  and
      Akbik, Alan",
    editor = "Cao, Yixin  and
      Feng, Yang  and
      Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.29",
    pages = "305--314",
    abstract = "This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.",
}

What's Changed

Update News Scrapers and Evaluation Results by @dobbersc in #10
Add "Contributing" and "Questions and Support" Section by @dobbersc in #11
Update our Paper Citation to ACL by @dobbersc in #12

Full Changelog: v0.1.0...v0.2.0

Results

Fundus-Evaluation v0.1.0

Scraper	Precision	Recall	F1-Score	Version
Fundus	99.89 ±0.57	96.75 ±12.75	97.69 ±9.75	0.2.2
Trafilatura	90.54 ±18.86	93.23 ±23.81	89.81 ±23.69	1.7.0
BTE	81.09 ±19.41	98.23 ±8.61	87.14 ±15.48	/
jusText	86.51 ±18.92	90.23 ±20.61	86.96 ±19.76	3.0.0
news-please	92.26 ±12.40	86.38 ±27.59	85.81 ±23.29	1.5.44
BoilerNet	84.73 ±20.82	90.66 ±21.05	85.77 ±20.28	/
Boilerpipe	82.89 ±20.65	82.11 ±29.99	79.90 ±25.86	1.3.0
