Using BeautifulSoup to gather article contents from CNN and export them to a CSV file.
Author: Faris Durrani
GitHub: https://github.com/farisdurrani/CNNWebScraper
Originally created to develop an ML-based bias analyzer for popular news websites for AGORAx (plans delayed indefinitely), this Python script uses BeautifulSoup to recursively scrape all text articles accessible from the CNN Site Map for the selected years. See the CNN Site Map 2016 for an example. You can enable multiprocessing for much faster scraping at the expense of computer resources. Results are exported to a CSV file with a random name.
For each article, the following are retrieved:
- timestamp: Date of publication (yyyy-mm-dd)
- webUrl: URL of article
- headline: Headline of article
- sectionName: Section of article
- site: Hardcoded to CNN
- bodyContent: Full text of article body, HTML-sanitized by BeautifulSoup, truncated to the first 31,500 characters (~5,000 words)
- article_length: Count of characters of full article body before truncation
- author_name: Author name of article
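As a rough illustration of the fields above, the sketch below assembles one row and applies the 31,500-character truncation while recording the original length. It is not the script's actual code: the helper names `build_row` and `rows_to_csv`, the column order, and the example URL are assumptions made for illustration, and the body text is assumed to have already been extracted from the HTML.

```python
import csv
import io

MAX_BODY_CHARS = 31_500  # truncation limit stated in this README

# Assumed column order; it simply mirrors the field list above.
FIELDNAMES = [
    "timestamp", "webUrl", "headline", "sectionName",
    "site", "bodyContent", "article_length", "author_name",
]


def build_row(timestamp, web_url, headline, section_name, body_text, author_name):
    """Hypothetical helper: build one CSV row from already-extracted text."""
    return {
        "timestamp": timestamp,                     # yyyy-mm-dd
        "webUrl": web_url,
        "headline": headline,
        "sectionName": section_name,
        "site": "CNN",                              # hardcoded, as described above
        "bodyContent": body_text[:MAX_BODY_CHARS],  # truncated body
        "article_length": len(body_text),           # length before truncation
        "author_name": author_name,
    }


def rows_to_csv(rows):
    """Hypothetical helper: serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values, not real scraped data.
example_row = build_row(
    "2022-01-01", "https://www.cnn.com/example", "Example headline",
    "us", "x" * 40_000, "Jane Doe",
)
example_csv = rows_to_csv([example_row])
```

Keeping `article_length` separate from the truncated `bodyContent` lets later analysis tell which articles were cut off.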
- Python 3.10
- Install the required packages in requirements.txt
- Optional: Modify the search options in cnn_scraper.py to control what is scraped from https://www.cnn.com/:
  - SELECTED_DATES: calendar dates (01, 02, ..., 31) of articles to scrape
  - SELECTED_MONTHS: calendar months (01, 02, ..., 12) of articles to scrape. Can also be set() to include all months
  - SELECTED_YEARS: calendar years (2010, 2009, 2022, ...) of articles to scrape. Can also be set() to include all years
  - SELECTED_TOPICS: topics to scrape. See CNN Site Map 2016 for article section examples
  - OUTPUT_FILENAME: output filename
  - GET_EVERY_X_ARTICLE_PER_MONTH_TOPIC: 1 to get all articles, 2 to get every other article only, etc.
  - USE_MULTIPROCESSING: True to enable very fast scraping, at the risk of crashing the machine if resources are insufficient. False otherwise
- Run python cnn_scraper.py and see the output in /outputs by default. A sample output is described below.
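The search options in the steps above might be filled in along the following lines. This is only a sketch: the values are illustrative rather than the script's defaults, the use of zero-padded strings inside sets is an assumption, and the helpers `is_selected` and `take_every_x` are hypothetical stand-ins for however cnn_scraper.py actually applies these settings.

```python
# Illustrative values only; these are not the script's defaults.
SELECTED_DATES = {"01", "15"}            # assumed: zero-padded strings
SELECTED_MONTHS = set()                  # empty set: include all months
SELECTED_YEARS = {"2022"}
SELECTED_TOPICS = {"us"}                 # assumed topic name
OUTPUT_FILENAME = "my_cnn_articles.csv"  # hypothetical filename
GET_EVERY_X_ARTICLE_PER_MONTH_TOPIC = 2  # every other article
USE_MULTIPROCESSING = False


def is_selected(value, selected):
    """Hypothetical helper: an empty set means 'include everything'."""
    return not selected or value in selected


def take_every_x(articles, x):
    """Hypothetical helper: keep every x-th article, starting with the first.

    Whether the real script starts counting at the first article is an
    assumption of this sketch.
    """
    return articles[::x]


# With the settings above, any month passes but only the year 2022 does.
month_ok = is_selected("07", SELECTED_MONTHS)
year_ok = is_selected("2021", SELECTED_YEARS)
sampled = take_every_x(["a", "b", "c", "d", "e"], GET_EVERY_X_ARTICLE_PER_MONTH_TOPIC)
```

A value of 1 for GET_EVERY_X_ARTICLE_PER_MONTH_TOPIC leaves the list unchanged, which matches "get all articles".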
See sample output in cnn_articles-2022-3409.csv for scraped US articles in January 2022. Note that opening the file in Excel automatically formats the date.
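To inspect the output without Excel reformatting the dates, the CSV can be read with Python's standard csv module. The sketch below assumes the header row uses the field names listed earlier and that the file is UTF-8 encoded; `read_articles` is a hypothetical helper, and the inline sample row is made-up placeholder data rather than real scraped content.

```python
import csv
import io
from datetime import date


def read_articles(csv_file):
    """Hypothetical helper: parse rows, converting the date and length columns."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["timestamp"] = date.fromisoformat(row["timestamp"])  # yyyy-mm-dd
        row["article_length"] = int(row["article_length"])
        rows.append(row)
    return rows


# Placeholder data standing in for a real output file.
sample = io.StringIO(
    "timestamp,webUrl,headline,sectionName,site,bodyContent,article_length,author_name\r\n"
    "2022-01-05,https://www.cnn.com/example,Example headline,us,CNN,Body text,9,Jane Doe\r\n"
)
articles = read_articles(sample)

# For a real file (path and encoding are assumptions):
# with open("outputs/cnn_articles-2022-3409.csv", newline="", encoding="utf-8") as f:
#     articles = read_articles(f)
```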
Sometimes the app retrieves no articles and outputs an empty CSV after one or two minutes of execution. This could be due to network restrictions after excessive scraping; try again. The app prints logs to the terminal as it scrapes articles. Once you see the logs, you shouldn't face any network restrictions for the rest of that execution.
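If rerunning by hand becomes tedious, the "try again" step could be automated with a small wrapper. This is a sketch only and not part of the script: `run_with_retries` is a hypothetical helper, `scrape` stands for any callable that returns a list of scraped rows, and the attempt count and wait time are arbitrary assumptions.

```python
import time


def run_with_retries(scrape, max_attempts=3, wait_seconds=60.0):
    """Call scrape() until it returns a non-empty list or attempts run out."""
    rows = []
    for attempt in range(1, max_attempts + 1):
        rows = scrape()
        if rows:
            break  # got articles, stop retrying
        if attempt < max_attempts:
            time.sleep(wait_seconds)  # pause before the next attempt
    return rows
```

Waiting between attempts, rather than retrying immediately, gives any temporary network restriction a chance to lift.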
CNNWebScraper is MIT licensed, as found in the LICENSE file.
CNNWebScraper documentation is Creative Commons licensed, as found in the LICENSE-docs file.