Releases: dobbersc/fundus-evaluation

v0.2.0

14 Aug 04:24
d43db54

This release provides updated evaluation results for the news scrapers in our evaluation pipeline. Instructions on reproducing the results can be found in the repository's README.md.

Results

The following table summarizes the overall performance of Fundus and the other evaluated scrapers as the mean ROUGE-LSum precision, recall, and F1-score, each reported with its standard deviation (mean±standard deviation). The Version column gives each scraper's version at evaluation time (/ where no version is given). Rows are sorted by F1-score in descending order:

Fundus-Evaluation v0.2.0

| Scraper | Precision | Recall | F1-Score | Version |
|---|---|---|---|---|
| Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.4.1 |
| Trafilatura | 93.91±12.89 | 96.85±15.69 | 93.62±16.73 | 1.12.0 |
| news-please | 97.95±10.08 | 91.89±16.15 | 93.39±14.52 | 1.6.13 |
| BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
| jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.1 |
| BoilerNet | 85.96±18.55 | 91.21±19.15 | 86.52±18.03 | / |
| Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |

Previous Results

Fundus-Evaluation v0.1.0

| Scraper | Precision | Recall | F1-Score | Version |
|---|---|---|---|---|
| Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.2.2 |
| Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 | 1.7.0 |
| BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
| jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.0 |
| news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 | 1.5.44 |
| BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 | / |
| Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
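
As an illustration of how the mean±standard deviation cells and the F1 ordering in the tables above could be derived from per-article scores, here is a minimal sketch. It is not the repository's evaluation code: the function names, the input layout, and the choice of the population standard deviation (`pstdev`) are assumptions made for this example, and the scores used with it would have to come from the actual evaluation pipeline.

```python
from statistics import mean, pstdev


def format_cell(scores):
    """Format per-article scores in [0, 1] as 'mean±std' in percent.

    Assumption: population standard deviation; the evaluation may use
    the sample standard deviation instead.
    """
    return f"{100 * mean(scores):.2f}±{100 * pstdev(scores):.2f}"


def rank_by_f1(per_scraper):
    """Return (scraper, cells) rows sorted by mean F1, highest first.

    `per_scraper` is a hypothetical mapping:
    scraper name -> {"precision": [...], "recall": [...], "f1": [...]}.
    """
    rows = []
    for name, metrics in per_scraper.items():
        cells = {metric: format_cell(values) for metric, values in metrics.items()}
        rows.append((mean(metrics["f1"]), name, cells))
    # Sort on the unrounded mean F1 so that rounding cannot change the order.
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(name, cells) for _, name, cells in rows]
```

Calling `rank_by_f1` on a dictionary of per-article scores returns one row per scraper, ready to be printed in the Precision, Recall, F1-Score column order used above.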

Cite

Please cite the following paper when using Fundus or building upon our work:

@inproceedings{dallabetta-etal-2024-fundus,
    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
    author = "Dallabetta, Max  and
      Dobberstein, Conrad  and
      Breiding, Adrian  and
      Akbik, Alan",
    editor = "Cao, Yixin  and
      Feng, Yang  and
      Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.29",
    pages = "305--314",
    abstract = "This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.",
}

What's Changed

  • Update News Scrapers and Evaluation Results by @dobbersc in #10
  • Add "Contributing" and "Questions and Support" Section by @dobbersc in #11
  • Update our Paper Citation to ACL by @dobbersc in #12

Full Changelog: v0.1.0...v0.2.0

v0.1.0

10 Aug 18:25
fb1bcc9

This is the initial release to reproduce the evaluation results from our paper "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions". Instructions on reproducing the results can be found in the repository's README.md.

Results

The following table summarizes the overall performance of Fundus and the other evaluated scrapers as the mean ROUGE-LSum precision, recall, and F1-score, each reported with its standard deviation (mean±standard deviation). The Version column gives each scraper's version at evaluation time (/ where no version is given). Rows are sorted by F1-score in descending order:

| Scraper | Precision | Recall | F1-Score | Version |
|---|---|---|---|---|
| Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.2.2 |
| Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 | 1.7.0 |
| BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
| jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.0 |
| news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 | 1.5.44 |
| BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 | / |
| Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
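
To give an intuition for what the precision and recall columns measure, the following sketch computes a simplified ROUGE-L score from the longest common subsequence (LCS) of whitespace tokens. This is an assumption-laden illustration, not the evaluation's implementation: ROUGE-LSum additionally works at the sentence level, real implementations tokenize and normalize differently, and the function names here are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, extraction):
    """Return (precision, recall, f1) of an extraction against a reference text.

    Simplification: whitespace tokenization, no stemming, no sentence splitting.
    """
    ref, ext = reference.split(), extraction.split()
    if not ref or not ext:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, ext)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(ext)  # share of extracted tokens found in the reference
    recall = lcs / len(ref)     # share of reference tokens that were extracted
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this simplified view, an extraction that contains the full article plus leftover page boilerplate keeps recall at 1.0 but loses precision, while an extraction that drops paragraphs loses recall instead.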

Full Changelog: https://github.com/dobbersc/fundus-evaluation/commits/v0.1.0