CNNWebScraper

Uses BeautifulSoup to gather article contents from CNN and export them to a CSV file.

Author: Faris Durrani
GitHub: https://github.com/farisdurrani/CNNWebScraper

Description

Originally created to develop an ML-based bias analyzer for popular news websites for AGORAx (plans delayed indefinitely), this Python script uses BeautifulSoup to recursively scrape all text articles reachable from the CNN Site Map for the selected years. See CNN Site Map 2016 for an example. You can enable multiprocessing for much faster scraping at the expense of computer resources. Results are exported to a CSV with a random name.

For each article, the following are retrieved:

timestamp: Date of publication (yyyy-mm-dd)
webUrl: URL of article
headline: Headline of article
sectionName: Section of article
site: Hardcoded to CNN
bodyContent: Full text of article body, HTML-sanitized by BeautifulSoup, truncated to the first 31,500 characters (~5,000 words)
article_length: Count of characters of full article body before truncation
author_name: Author name of article
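
The relationship between bodyContent and article_length can be illustrated with a minimal sketch. The helper name `build_row`, the constant name `MAX_BODY_CHARS`, and the sample values are assumptions made for illustration and are not taken from cnn_scraper.py; only the column names and the 31,500-character limit come from the list above.

```python
import csv
import io

# Assumed constant name; the 31,500 limit comes from the field list above.
MAX_BODY_CHARS = 31_500

FIELDNAMES = [
    "timestamp", "webUrl", "headline", "sectionName",
    "site", "bodyContent", "article_length", "author_name",
]


def build_row(timestamp, web_url, headline, section, body, author):
    """Hypothetical helper: assemble one CSV row for a scraped article."""
    return {
        "timestamp": timestamp,                # yyyy-mm-dd
        "webUrl": web_url,
        "headline": headline,
        "sectionName": section,
        "site": "CNN",                         # hardcoded, as documented
        "bodyContent": body[:MAX_BODY_CHARS],  # keep the first 31,500 characters
        "article_length": len(body),           # measured before truncation
        "author_name": author,
    }


row = build_row(
    "2022-01-15",
    "https://www.cnn.com/example",  # placeholder URL
    "Example headline",
    "us",
    "x" * 40_000,                   # made-up body longer than the limit
    "Jane Doe",
)

# Write the row to an in-memory CSV so the sketch has no side effects.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerow(row)
```

Because article_length is computed before truncation, a row whose article_length exceeds 31,500 indicates that bodyContent was cut short.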

How to Use

Requirements

Python 3.10
Install the required packages listed in requirements.txt (for example, with pip install -r requirements.txt)

Usage

Optional: To change what is scraped from https://www.cnn.com/, edit the following search options in cnn_scraper.py:
- SELECTED_DATES - calendar dates (01, 02, ..., 31) of articles to scrape
- SELECTED_MONTHS - calendar months (01, 02, ..., 12) of articles to scrape. Can also be set() to include all months
- SELECTED_YEARS - calendar years (2010, 2009, 2022, ...) of articles to scrape. Can also be set() to include all years
- SELECTED_TOPICS - topics to scrape. See CNN Site Map 2016 for article section examples
- OUTPUT_FILENAME - output filename
- GET_EVERY_X_ARTICLE_PER_MONTH_TOPIC - 1 to get all articles, 2 to get every other article only, etc.
- USE_MULTIPROCESSING - True to enable very fast scraping, at the risk of crashing the machine if it has insufficient resources. False otherwise
Run python cnn_scraper.py. By default, the output is written to /outputs. A sample output is described in the Output section below.
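
As a rough illustration of the options above, the following sketch shows one possible configuration and how the empty-set and every-X rules could behave. The values, the use of sets of strings, and the helpers `is_selected` and `sample_every_x` are assumptions for illustration; check cnn_scraper.py for the actual types and logic.

```python
# Example values only; the value types (sets of strings) are an assumption.
SELECTED_DATES = {"01", "15"}
SELECTED_MONTHS = {"01"}
SELECTED_YEARS = {"2022"}
SELECTED_TOPICS = {"us"}               # assumed topic name
OUTPUT_FILENAME = "cnn_articles.csv"   # placeholder filename
GET_EVERY_X_ARTICLE_PER_MONTH_TOPIC = 2
USE_MULTIPROCESSING = False


def is_selected(value, selected):
    """Hypothetical filter: an empty set means 'include everything'."""
    return not selected or value in selected


def sample_every_x(articles, x):
    """Hypothetical sampler: x=1 keeps all, x=2 keeps every other article."""
    return articles[::x]


kept = sample_every_x(["a", "b", "c", "d", "e"],
                      GET_EVERY_X_ARTICLE_PER_MONTH_TOPIC)
```

Whether sampling starts from the first article, as this sketch does, is also an assumption.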

Output

See the sample output in cnn_articles-2022-3409.csv for scraped US articles from January 2022. Note that Excel automatically reformats the date column when it opens the file.
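
To avoid Excel's automatic date formatting, the CSV can be read directly in Python. The sketch below uses only the standard library and assumes the file has a header row with the column names listed in the Description; the function name `read_articles` and the sample row are made up for illustration.

```python
import csv
import io
from datetime import date


def read_articles(csv_file):
    """Yield rows as dicts, parsing the yyyy-mm-dd timestamp explicitly."""
    for row in csv.DictReader(csv_file):
        row["timestamp"] = date.fromisoformat(row["timestamp"])
        row["article_length"] = int(row["article_length"])
        yield row


# In-memory stand-in for an output file; the values are made up.
sample = io.StringIO(
    "timestamp,webUrl,headline,sectionName,site,bodyContent,"
    "article_length,author_name\n"
    "2022-01-15,https://www.cnn.com/example,Example headline,us,CNN,"
    "Body text,9,Jane Doe\n"
)
articles = list(read_articles(sample))
```

With a real output file, pass a file object opened with open(path, newline="", encoding="utf-8") instead of the in-memory sample; the encoding is an assumption.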

Troubleshooting

Sometimes the app does not retrieve any articles and outputs an empty CSV after one or two minutes of running. This may be caused by network restrictions after excessive scraping. If it happens, run the script again. The app prints logs to the terminal as it scrapes articles; once these logs appear, you should not face further network restrictions for that execution.

License

CNNWebScraper is MIT licensed, as found in the LICENSE file.

CNNWebScraper documentation is Creative Commons licensed, as found in the LICENSE-docs file.
