- Build a simple web scraper that returns the content of a news article when given a specific URL. Real products that use similar technology include price-tracking websites and SEO audit tools, which may scrape top search results.
Choose one news website - see article examples below for inspiration. Given a specific article URL from the website of your choice, return the title and content of the article to the user.
Example article URLs:
https://www.nytimes.com/2020/09/02/opinion/remote-learning-coronavirus.html
https://www.washingtonpost.com/technology/2020/09/25/privacy-check-blacklight/
https://edition.cnn.com/travel/article/scenic-airport-landings-2020/index.html
For an extra challenge: Parse out information such as the article title, updated date, and byline to return separately to the user.
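One way to approach the extra challenge is to read the metadata that many news sites embed in `<meta>` tags. The sketch below uses Beautiful Soup and assumes the page exposes `og:title`, `article:modified_time`, and an `author` meta tag. These tag names are assumptions and not every site provides them, so check the HTML of the site you choose. `meta_content` and `parse_metadata` are hypothetical helper names.

```python
from bs4 import BeautifulSoup


def meta_content(soup, attrs):
    """Return the `content` of the first <meta> tag matching `attrs`, or None."""
    tag = soup.find("meta", attrs=attrs)
    return tag.get("content") if tag else None


def parse_metadata(html):
    """Pull the title, updated date, and byline out of a page's <meta> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # Assumption: the site publishes Open Graph / article metadata.
        "title": meta_content(soup, {"property": "og:title"}),
        "updated": meta_content(soup, {"property": "article:modified_time"}),
        # Assumption: the byline is stored in <meta name="author">.
        "byline": meta_content(soup, {"name": "author"}),
    }
```

Any field the page does not provide comes back as `None`, so the caller can fall back to selecting visible elements instead.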
You could run the finished scraper from the command line like this:
> python scrape_newyorktimes.py news_url
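A minimal sketch of how a script could accept `news_url` as its single argument, using the standard-library `argparse` module. `build_parser` and `main` are hypothetical names, and the body of `main` is only a placeholder for the real fetching and parsing code.

```python
import argparse


def build_parser():
    """Create a parser that takes the article URL as its only argument."""
    parser = argparse.ArgumentParser(
        prog="scrape_newyorktimes.py",
        description="Print the title and content of a news article.",
    )
    parser.add_argument("news_url", help="full URL of the article to scrape")
    return parser


def main(argv=None):
    """Parse the command line; `argv=None` means 'use sys.argv'."""
    args = build_parser().parse_args(argv)
    # Placeholder: the real script would fetch and parse args.news_url here.
    print(f"Would scrape: {args.news_url}")
    return 0
```

Calling `main()` under the usual `if __name__ == "__main__":` guard makes the file runnable as shown in the command above.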
We suggest using an HTTP library like Requests to get the raw HTML of the URL, then a parsing library like Beautiful Soup to parse the content. Alternatively, you can use a Python scraping tool like Scrapy.
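The Requests plus Beautiful Soup approach might look roughly like the sketch below. It assumes the headline is the first `<h1>` and that the body paragraphs sit inside an `<article>` element. Real sites differ, so inspect the page and adjust the selectors. The function names and the `User-Agent` value are illustrative choices.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download the raw HTML for `url`, raising an exception on HTTP errors."""
    # Assumption: some sites reject the default client string, so send our own.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; article-scraper-exercise)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_article(html):
    """Return the title and body text found in `html`."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the headline is the first <h1>; fall back to <title>.
    heading = soup.find("h1") or soup.find("title")
    title = heading.get_text(strip=True) if heading else ""

    # Assumption: body paragraphs live inside <article>; otherwise use every <p>.
    container = soup.find("article") or soup
    paragraphs = [" ".join(p.get_text().split()) for p in container.find_all("p")]
    content = "\n\n".join(text for text in paragraphs if text)

    return {"title": title, "content": content}
```

A caller would chain the two steps, for example `parse_article(fetch_html(news_url))`. Keeping the download separate from the parsing makes it easy to test the parser against saved HTML.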
- You can use XPath to select elements that have no class or enclosing div to target
- Take note of the Python version you have installed!
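To illustrate the XPath tip, here is a small sketch that selects an element by its position in the tree instead of by class. It uses the lxml library, which is our choice for the example and is not named above. The rule "the first `<p>` that follows the `<h1>`" is an assumed page layout.

```python
from lxml import html as lxml_html


def first_paragraph_after_heading(page_source):
    """Return the text of the first <p> sibling that follows an <h1>, or None."""
    tree = lxml_html.fromstring(page_source)
    # No class or id needed: walk from the <h1> to its next <p> sibling.
    matches = tree.xpath("//h1/following-sibling::p[1]")
    return matches[0].text_content().strip() if matches else None
```

Scrapy responses accept the same kind of expression through `response.xpath(...)`.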
# run scrapy
> scrapy runspider news.py
# create a csv file
> scrapy runspider news.py -o nyt.csv