Documentation changes v0.9.2 (#604)
* feat(doc): 📝 adding evaluation results

* feat(doc): 🚀 Documentation Update. Added Examples, documented new features
AndyTheFactory committed Jan 16, 2024
1 parent c3976c7 commit 911f503
Showing 13 changed files with 435 additions and 62 deletions.
48 changes: 34 additions & 14 deletions README.md
@@ -4,12 +4,15 @@
[![Coverage status](https://coveralls.io/repos/github/AndyTheFactory/newspaper4k/badge.svg?branch=master)](https://coveralls.io/github/AndyTheFactory/newspaper4k)
[![Documentation Status](https://readthedocs.org/projects/newspaper4k/badge/?version=latest)](https://newspaper4k.readthedocs.io/en/latest/)

At the moment the Newspaper4k Project is a fork of the well known newspaper3k by [codelucas](https://github.com/codelucas/newspaper) which was not updated since Sept 2020. The initial goal of this fork is to keep the project alive and to add new features and fix bugs.
At the moment the Newspaper4k Project is a fork of the well-known newspaper3k by [codelucas](https://github.com/codelucas/newspaper), which has not been updated since September 2020. The initial goal of this fork is to keep the project alive and to add new features and fix bugs.

I have duplicated all issues from the original project and will try to fix them. If you have any issues or feature requests, please open an issue here.

**Experimental ChatGPT helper bot for Newspaper4k:**
[![ChatGPT helper](docs/user_guide/assets/chatgpt_chat.png)](https://chat.openai.com/g/g-OxSqyKAhi-newspaper-4k-gpt)
| <!-- --> | <!-- --> |
|-------------|-------------|
| **Experimental ChatGPT helper bot for Newspaper4k:** | [![ChatGPT helper](docs/user_guide/assets/chatgpt_chat200x75.png)](https://chat.openai.com/g/g-OxSqyKAhi-newspaper-4k-gpt)|



## Python compatibility
- Recommended: Python 3.8+
@@ -29,10 +32,10 @@ You can start directly from the command line, using the included CLI:
python -m newspaper --url="https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html" --language=en --output-format=json --output-file=article.json

```

More information about the CLI can be found in the [CLI documentation](https://newspaper4k.readthedocs.io/en/latest/user_guide/cli_reference.html).
## Using the Python API

Alternatively, you can use the Python API:
Alternatively, you can use Newspaper4k in Python:

### Processing one article / URL at a time

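A minimal sketch of the single-article workflow, using the `newspaper.article` convenience function (listed in the API reference changes below; the URL and printed fields are illustrative):

```python
import newspaper

# article() downloads and parses the page in one call
article = newspaper.article('https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html')

print(article.title)
print(article.authors)
print(article.publish_date)
print(article.text[:200])
```
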
@@ -82,22 +85,22 @@ import newspaper

cnn_paper = newspaper.build('http://cnn.com', number_threads=3)
print(cnn_paper.category_urls())
> ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
> 'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
> 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']
>> ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
>> 'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
>> 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']

article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
> ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
> 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
> 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']
>> ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
>> 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
>> 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']

article = cnn_paper.articles[0]
article.download()
article.parse()

print(article.title)
> المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
>> المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى

```
Or, if you want to fetch articles in bulk from the website (bear in mind that this can take a long time and may get your IP blocked by the news site):
@@ -130,15 +133,15 @@ article.download()
article.parse()

print(article.title)
> 晶片大战:台湾厂商助攻华为突破美国封锁?
>> 晶片大战:台湾厂商助攻华为突破美国封锁?

if article.config.use_meta_language:
# If we use the autodetected language, this config attribute will be true
print(article.meta_lang)
else:
print(article.config.language)

> zh
>> zh
```

# Docs
@@ -158,8 +161,25 @@ detailed guides using newspaper.
- Automatic article text summarization
- Author extraction from text
- Easy-to-use Command Line Interface (`python -m newspaper....`)
- Output in various formats (JSON, CSV, text)
- Works in 10+ languages (English, Chinese, German, Arabic, ...)

# Evaluation

## Evaluation Results


Using the dataset from [ScrapingHub](https://github.com/scrapinghub/article-extraction-benchmark), I created an [evaluator script](tests/evaluation/evaluate.py) that compares the performance of newspaper against its previous versions. This way we can see how newspaper updates improve or worsen the library's performance.

| Version | Corpus BLEU Score | Corpus Precision Score | Corpus Recall Score | Corpus F1 Score |
|--------------------|-------------------|------------------------|---------------------|-----------------|
| Newspaper3k 0.2.8 | 0.8660 | 0.9128 | 0.9071 | 0.9100 |
| Newspaper4k 0.9.0 | 0.9212 | 0.8992 | 0.9336 | 0.9161 |
| Newspaper4k 0.9.1 | 0.9224 | 0.8895 | 0.9242 | 0.9065 |
| Newspaper4k 0.9.2 | 0.9426 | 0.9070 | 0.9087 | 0.9078 |

Precision, Recall, and F1 are computed using the overlap of shingles with n-grams of size 4. The corpus BLEU score is computed using [NLTK's bleu_score](https://www.nltk.org/api/nltk.translate.bleu).
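
A minimal sketch of how these metrics can be computed (an illustration, not the repository's [evaluator script](tests/evaluation/evaluate.py); the two example strings are made up):

```python
from nltk.translate.bleu_score import corpus_bleu

def shingles(tokens, n=4):
    # All contiguous n-grams ("shingles") of a token sequence
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_scores(extracted, reference, n=4):
    pred = shingles(extracted.split(), n)
    ref = shingles(reference.split(), n)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    tp = len(pred & ref)  # shingles present in both texts
    precision = tp / len(pred)
    recall = tp / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

extracted = "the quick brown fox jumps over the lazy dog today"
reference = "the quick brown fox jumps over a lazy dog today"
print(overlap_scores(extracted, reference))

# corpus_bleu expects, for each hypothesis, a list of reference token lists
print(corpus_bleu([[reference.split()]], [extracted.split()]))
```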

# Requirements and dependencies

The following system packages are required:
114 changes: 107 additions & 7 deletions docs/user_guide/advanced.rst
@@ -11,30 +11,58 @@ Multi-threading article downloads

**Downloading articles one at a time is slow.** But spamming a single news source
like cnn.com with tons of threads or with ASYNC-IO will cause rate limiting
and also doing that is very mean.
and doing that can also lead to your IP being blocked by the site.

We solve this problem by allocating 1-2 threads per news source, which both greatly
speeds up the download time and remains respectful to the site.

.. code-block:: python

    import newspaper
    from newspaper import news_pool
    from newspaper.mthreading import fetch_news

    slate_paper = newspaper.build('http://slate.com')
    tc_paper = newspaper.build('http://techcrunch.com')
    espn_paper = newspaper.build('http://espn.com')

    papers = [slate_paper, tc_paper, espn_paper]
    news_pool.set(papers, threads_per_source=2) # (3*2) = 6 threads total
    news_pool.join()
    results = fetch_news(papers, threads=4)

    # At this point, you can safely assume that download() has been
    # called on every single article for all 3 sources.

    print(slate_paper.articles[10].html)
    print(slate_paper.articles[10].title)
    # '<html> ...'

In addition to :any:`Source` objects, :any:`fetch_news` also accepts :any:`Article` objects or plain URLs.

.. code-block:: python

    from newspaper import Article

    article_urls = [f'https://abcnews.go.com/US/x/story?id={i}' for i in range(106379500, 106379520)]
    articles = [Article(url=u) for u in article_urls]
    results = fetch_news(articles, threads=4)

    urls = [
        "https://www.foxnews.com/media/homeowner-new-florida-bill-close-squatting-loophole-return-some-fairness",
        "https://edition.cnn.com/2023/12/27/middleeast/dutch-diplomat-humanitarian-aid-gaza-sigrid-kaag-intl/index.html",
    ]
    results = fetch_news(urls, threads=4)

    # or everything at once
    papers = [slate_paper, tc_paper, espn_paper]
    papers.extend(articles)
    papers.extend(urls)
    results = fetch_news(papers, threads=4)

**Note:** In previous versions of newspaper, this could be done with the ``news_pool`` call, but it was not very robust
and was replaced with a ``ThreadPoolExecutor``-based implementation.

Keeping just the HTML of the main article body
------------------------------------------------

@@ -191,12 +219,84 @@ The full available options are available under the :any:`Configuration` section
Caching
-------

TODO
The Newspaper4k library provides a simple caching mechanism that avoids repeatedly downloading the same article. Additionally, when building a :any:`Source` object, the category URL detection is cached for 24 hours.

Both mechanisms are enabled by default. Article caching is controlled by the ``memoize_articles`` parameter of the :any:`newspaper.build()` function or, when creating a :any:`Source` object directly, by the ``memoize_articles`` parameter of its constructor. Setting it to ``False`` disables the caching mechanism.

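For example, article caching can be turned off at build time (a minimal sketch; the URL is illustrative):

.. code-block:: python

    import newspaper

    # Build the source without remembering previously seen article URLs;
    # every build re-discovers the full article list.
    cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
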
The category detection caching is controlled by the ``utils.cache_disk.enabled`` setting. Setting it to ``False`` disables the caching decorator on the ``Source._get_category_urls(..)`` method.

For example:

.. code-block:: python

    import newspaper
    from newspaper import utils

    cbs_paper = newspaper.build('http://cbs.com')

    # Disable category detection caching
    utils.cache_disk.enabled = False
    cbs_paper2 = newspaper.build('http://cbs.com') # The categories will be re-detected

    # Enable category detection caching
    utils.cache_disk.enabled = True
    cbs_paper3 = newspaper.build('http://cbs.com') # The cached category urls will be loaded

Proxy Usage
--------------

TODO
Websites often block repeated access from a single IP address, or limit access from certain geographic locations (for legal reasons, etc.). To bypass these restrictions, you can use a proxy. Newspaper supports proxies by passing the ``proxies`` parameter to the :any:`Article` object's constructor or the :any:`Source` object's constructor. The ``proxies`` parameter should be a dictionary, as required by the ``requests`` library, with the following format:

.. code-block:: python

    from newspaper import Article

    # Define your proxy
    proxies = {
        'http': 'http://your_http_proxy:port',
        'https': 'https://your_https_proxy:port'
    }

    # URL of the article you want to scrape
    url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'

    # Create an Article object, passing the proxies parameter
    article = Article(url, proxies=proxies)

    # Download and parse the article
    article.download()
    article.parse()

    # Access the article's title and text
    print("Title:", article.title)
    print("Text:", article.text)

Or the shorter version:

.. code-block:: python

    from newspaper import article

    # Define your proxy
    proxies = {
        'http': 'http://your_http_proxy:port',
        'https': 'https://your_https_proxy:port'
    }

    # URL of the article you want to scrape
    url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'

    # article() downloads and parses in one call
    news_article = article(url, proxies=proxies)

    # Access the article's title and text
    print("Title:", news_article.title)
    print("Text:", news_article.text)

Cookie Usage (simulate a logged-in user)
----------------------------------------
23 changes: 23 additions & 0 deletions docs/user_guide/api_reference.rst
@@ -6,6 +6,20 @@ Newspaper API
.. autosummary::
    :toctree: generated

Function calls
--------------

.. autofunction:: newspaper.article

.. autofunction:: newspaper.build

.. autofunction:: newspaper.mthreading.fetch_news

.. autofunction:: newspaper.hot

.. autofunction:: newspaper.languages
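
As a quick orientation, a minimal sketch of how these entry points can be called (the URLs are illustrative; ``hot()`` and ``languages()`` are assumed to behave as in newspaper3k, returning trending Google terms and printing the supported languages):

.. code-block:: python

    import newspaper
    from newspaper.mthreading import fetch_news

    # Download and parse a single article in one call
    article = newspaper.article('https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html')

    # Discover a source's categories, feeds and article URLs
    source = newspaper.build('https://edition.cnn.com')

    # Threaded download of a batch of articles
    results = fetch_news(source.articles[:10], threads=4)

    print(newspaper.hot())  # trending terms
    newspaper.languages()   # prints the supported languages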


Configuration
-------------

@@ -44,7 +58,9 @@ Source
.. automethod:: newspaper.Source.purge_articles()
.. automethod:: newspaper.Source.feeds_to_articles()
.. automethod:: newspaper.Source.categories_to_articles()
.. automethod:: newspaper.Source.generate_articles()
.. automethod:: newspaper.Source.download_articles()
.. automethod:: newspaper.Source.download()
.. automethod:: newspaper.Source.size()

Category
@@ -55,3 +71,10 @@ Category
Feed
----
.. autoclass:: newspaper.source.Feed


Exceptions
----------
.. autoclass:: newspaper.ArticleException

.. autoclass:: newspaper.ArticleBinaryDataException
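
A minimal sketch of handling these exceptions (assuming, as in newspaper3k, that :any:`ArticleException` is raised when a download fails or when ``parse()`` is called without a successful download; the URL is illustrative):

.. code-block:: python

    from newspaper import Article, ArticleException

    article = Article('https://example.com/some-article')
    try:
        article.download()
        article.parse()
        print(article.title)
    except ArticleException as exc:
        # Download failed, or parse() ran without a successful download
        print(f"Could not process article: {exc}")
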
Binary file added docs/user_guide/assets/chatgpt_chat200x75.png
File renamed without changes
