feat(doc): 🚀 Documentation Update. Added Examples, documented new features
AndyTheFactory committed Jan 16, 2024
1 parent 383ffcc commit 10d77d2
Showing 13 changed files with 418 additions and 62 deletions.
31 changes: 17 additions & 14 deletions README.md
@@ -4,12 +4,15 @@
[![Coverage status](https://coveralls.io/repos/github/AndyTheFactory/newspaper4k/badge.svg?branch=master)](https://coveralls.io/github/AndyTheFactory/newspaper4k)
[![Documentation Status](https://readthedocs.org/projects/newspaper4k/badge/?version=latest)](https://newspaper4k.readthedocs.io/en/latest/)

-At the moment the Newspaper4k Project is a fork of the well known newspaper3k by [codelucas](https://github.com/codelucas/newspaper) which was not updated since Sept 2020. The initial goal of this fork is to keep the project alive and to add new features and fix bugs.
+At the moment, the Newspaper4k Project is a fork of the well-known newspaper3k by [codelucas](https://github.com/codelucas/newspaper), which has not been updated since September 2020. The initial goal of this fork is to keep the project alive, add new features, and fix bugs.

I have duplicated all issues from the original project and will try to fix them. If you have any issues or feature requests, please open an issue here.

-**Experimental ChatGPT helper bot for Newspaper4k:**
-[![ChatGPT helper](docs/user_guide/assets/chatgpt_chat.png)](https://chat.openai.com/g/g-OxSqyKAhi-newspaper-4k-gpt)
+| <!-- --> | <!-- --> |
+|-------------|-------------|
+| **Experimental ChatGPT helper bot for Newspaper4k:** | [![ChatGPT helper](docs/user_guide/assets/chatgpt_chat200x75.png)](https://chat.openai.com/g/g-OxSqyKAhi-newspaper-4k-gpt)|



## Python compatibility
- Recommended: Python 3.8+
@@ -29,10 +32,10 @@ You can start directly from the command line, using the included CLI:
```bash
python -m newspaper --url="https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html" --language=en --output-format=json --output-file=article.json

```

More information about the CLI can be found in the [CLI documentation](https://newspaper4k.readthedocs.io/en/latest/user_guide/cli_reference.html).
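
For example, the resulting `article.json` can be inspected with a few lines of Python (a sketch; the exact set of keys depends on the CLI version, and `"title"` is an assumption here):

```python
import json

# Load the JSON file written by the CLI command above
with open("article.json", encoding="utf-8") as f:
    data = json.load(f)

print(data.get("title"))
```
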
## Using the Python API

-Alternatively, you can use the Python API:
+Alternatively, you can use Newspaper4k in Python:

### Processing one article / url at a time
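
A minimal sketch (the URL is only an example; `newspaper.article()` is a convenience call that downloads and parses in one step):

```python
import newspaper

article = newspaper.article('https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html')

print(article.title)
print(article.text[:200])  # first 200 characters of the parsed text
```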

@@ -82,22 +85,22 @@ import newspaper

cnn_paper = newspaper.build('http://cnn.com', number_threads=3)
print(cnn_paper.category_urls())
-> ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
-> 'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
-> 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']
+>> ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
+>> 'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
+>> 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']

article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
-> ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
-> 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
-> 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']
+>> ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
+>> 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
+>> 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']

article = cnn_paper.articles[0]
article.download()
article.parse()

print(article.title)
-> المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
+>> المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى

```
Or, if you want to fetch articles from the website in bulk (keep in mind that this could take a long time and could get your IP blocked by the news site):
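
For instance, a rough sketch (assuming `Source.download_articles()`, listed in the API reference, fetches every article discovered on the source):

```python
import newspaper

cnn_paper = newspaper.build('https://edition.cnn.com', number_threads=3)
cnn_paper.download_articles()  # fetch all discovered articles

print(cnn_paper.size())  # number of articles found on the source
```
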
@@ -130,15 +133,15 @@ article.download()
article.parse()

print(article.title)
-> 晶片大战:台湾厂商助攻华为突破美国封锁?
+>> 晶片大战:台湾厂商助攻华为突破美国封锁?

if article.config.use_meta_language:
# If we use the autodetected language, this config attribute will be true
print(article.meta_lang)
else:
print(article.config.language)

-> zh
+>> zh
```

# Docs
114 changes: 107 additions & 7 deletions docs/user_guide/advanced.rst
@@ -11,30 +11,58 @@ Multi-threading article downloads

**Downloading articles one at a time is slow.** But spamming a single news source
like cnn.com with tons of threads or with ASYNC-IO will cause rate limiting
-and also doing that is very mean.
+and can also lead to your IP being blocked by the site.

We solve this problem by allocating 1-2 threads per news source, which greatly
speeds up the download time while remaining respectful.

.. code-block:: python

     import newspaper
-    from newspaper import news_pool
+    from newspaper.mthreading import fetch_news

     slate_paper = newspaper.build('http://slate.com')
     tc_paper = newspaper.build('http://techcrunch.com')
     espn_paper = newspaper.build('http://espn.com')

     papers = [slate_paper, tc_paper, espn_paper]
-    news_pool.set(papers, threads_per_source=2) # (3*2) = 6 threads total
-    news_pool.join()
+    results = fetch_news(papers, threads=4)

     # At this point, you can safely assume that download() has been
     # called on every single article for all 3 sources.

-    print(slate_paper.articles[10].html)
-    #'<html> ...'
+    print(slate_paper.articles[10].title)

In addition to :any:`Source` objects, :any:`fetch_news` also accepts :any:`Article` objects or plain URLs.

.. code-block:: python

    from newspaper import Article

    article_urls = [f'https://abcnews.go.com/US/x/story?id={i}' for i in range(106379500, 106379520)]
    articles = [Article(url=u) for u in article_urls]
    results = fetch_news(articles, threads=4)

    urls = [
        "https://www.foxnews.com/media/homeowner-new-florida-bill-close-squatting-loophole-return-some-fairness",
        "https://edition.cnn.com/2023/12/27/middleeast/dutch-diplomat-humanitarian-aid-gaza-sigrid-kaag-intl/index.html",
    ]
    results = fetch_news(urls, threads=4)

    # or everything at once
    papers = [slate_paper, tc_paper, espn_paper]
    papers.extend(articles)
    papers.extend(urls)
    results = fetch_news(papers, threads=4)

**Note:** in previous versions of newspaper, this could be done with the ``news_pool`` call, but it was not very robust;
it was replaced with a ``ThreadPoolExecutor``-based implementation.
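
Conceptually, the new implementation simply maps downloads over a thread pool. A simplified sketch of the idea (not the library's actual code):

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    def process(article):
        # Download and parse one Article object
        article.download()
        article.parse()
        return article

    # articles is a list of Article objects, as in the example above
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process, articles))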

Keeping just the HTML of the main article body
------------------------------------------------

@@ -191,12 +219,84 @@ The full available options are available under the :any:`Configuration` section
Caching
-------

-TODO
+The Newspaper4k library provides a simple caching mechanism that avoids repeatedly downloading the same article. Additionally, when building a :any:`Source` object, the category URL detection is cached for 24 hours.

Both mechanisms are enabled by default. Article caching is controlled by the ``memoize_articles`` parameter of the :any:`newspaper.build()` function or, alternatively, of the :any:`Source` constructor. Setting it to ``False`` disables the caching mechanism.
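
For example, a minimal sketch of disabling article caching:

.. code-block:: python

    import newspaper

    # With memoize_articles=False, previously seen articles are not filtered
    # out on subsequent builds of the same source
    cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)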

Category detection caching is controlled by the ``utils.cache_disk.enabled`` setting. Setting it to ``False`` disables the caching decorator on the ``Source._get_category_urls(..)`` method.

For example:

.. code-block:: python

    import newspaper
    from newspaper import utils

    cbs_paper = newspaper.build('http://cbs.com')

    # Disable category detection caching
    utils.cache_disk.enabled = False
    cbs_paper2 = newspaper.build('http://cbs.com')  # The categories will be re-detected

    # Re-enable category detection caching
    utils.cache_disk.enabled = True
    cbs_paper3 = newspaper.build('http://cbs.com')  # The cached category urls will be loaded

Proxy Usage
--------------

-TODO
+Websites often block repeated access from a single IP address, and some limit access from certain geographic locations (for legal reasons, among others). To bypass these restrictions, you can use a proxy. Newspaper supports proxies via the ``proxies`` parameter of the :any:`Article` and :any:`Source` constructors. The ``proxies`` parameter should be a dictionary, as required by the ``requests`` library, with the following format:

.. code-block:: python

    from newspaper import Article

    # Define your proxy
    proxies = {
        'http': 'http://your_http_proxy:port',
        'https': 'https://your_https_proxy:port'
    }

    # URL of the article you want to scrape
    url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'

    # Create an Article object, passing the proxies parameter
    article = Article(url, proxies=proxies)

    # Download and parse the article
    article.download()
    article.parse()

    # Access the article's title and text
    print("Title:", article.title)
    print("Text:", article.text)

Or the shorter version:

.. code-block:: python

    from newspaper import article

    # Define your proxy
    proxies = {
        'http': 'http://your_http_proxy:port',
        'https': 'https://your_https_proxy:port'
    }

    # URL of the article you want to scrape
    url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'

    # The article() helper creates, downloads, and parses the Article in one step
    news_article = article(url, proxies=proxies)

    # Access the article's title and text
    print("Title:", news_article.title)
    print("Text:", news_article.text)

Cookie Usage (simulate a logged-in user)
----------------------------------------
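
One possible approach is to fetch the page yourself with your login cookies and hand the HTML to the article. A sketch, assuming ``Article.download()`` accepts pre-fetched HTML via ``input_html`` (as in newspaper3k); the URL and cookie below are placeholders:

.. code-block:: python

    import requests
    from newspaper import Article

    url = 'https://example.com/members-only-article'  # placeholder URL
    cookies = {'sessionid': 'your-session-cookie'}    # captured after logging in

    html = requests.get(url, cookies=cookies, timeout=10).text

    article = Article(url)
    article.download(input_html=html)  # reuse the HTML fetched with our session
    article.parse()
    print(article.title)
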
23 changes: 23 additions & 0 deletions docs/user_guide/api_reference.rst
@@ -6,6 +6,20 @@ Newspaper API
.. autosummary::
:toctree: generated

Function calls
--------------

.. autofunction:: newspaper.article

.. autofunction:: newspaper.build

.. autofunction:: newspaper.mthreading.fetch_news

.. autofunction:: newspaper.hot

.. autofunction:: newspaper.languages
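
A quick sketch of two of these helpers (see the generated docs for exact signatures and return values; the usage below is an assumption based on the function names):

.. code-block:: python

    import newspaper

    print(newspaper.hot())  # currently trending terms
    newspaper.languages()   # prints the languages supported by the library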


Configuration
-------------

@@ -44,7 +58,9 @@ Source
.. automethod:: newspaper.Source.purge_articles()
.. automethod:: newspaper.Source.feeds_to_articles()
.. automethod:: newspaper.Source.categories_to_articles()
.. automethod:: newspaper.Source.generate_articles()
.. automethod:: newspaper.Source.download_articles()
.. automethod:: newspaper.Source.download()
.. automethod:: newspaper.Source.size()

Category
@@ -55,3 +71,10 @@ Category
Feed
----
.. autoclass:: newspaper.source.Feed


Exceptions
----------
.. autoclass:: newspaper.ArticleException

.. autoclass:: newspaper.ArticleBinaryDataException
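
A typical pattern is to guard download and parse calls (a sketch; assuming :any:`ArticleException` is raised when an article cannot be downloaded or parsed):

.. code-block:: python

    from newspaper import Article, ArticleException

    try:
        article = Article('https://example.com/some-article')  # placeholder URL
        article.download()
        article.parse()
        print(article.title)
    except ArticleException as exc:
        print(f"Could not process article: {exc}")
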
Binary file added docs/user_guide/assets/chatgpt_chat200x75.png
File renamed without changes
