
A web scraper designed to extract articles from the most popular news sources.


mateusvictor/scrapews


scrapews

A news scraper made in Python using the packages requests and lxml.

from scrapews.scrapers import NewYorkTimes

ny_scraper = NewYorkTimes()

# Request the site's HTML and extract the articles
ny_scraper.scrape()
# Send the scraped data to the API server
ny_scraper.send_to_server()

print(ny_scraper.data.get('articles'))

Idea

The core idea of the scrapews scraper is to request the HTML of a news site and use XPath expressions to extract the primary information about each article, such as its title, description and URL.
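The extraction step can be sketched as follows. This is a minimal illustration and not the project's actual code: the sample markup, the XPath expressions and the `extract_articles` helper are all assumptions made for the example. In a real run the HTML would come from a `requests.get(...)` call instead of an inline string.

```python
from lxml import html

# Hypothetical markup standing in for a news site's front page.
SAMPLE_HTML = """
<html><body>
  <article>
    <h2><a href="/world/story-one">Story one</a></h2>
    <p class="summary">First summary.</p>
  </article>
  <article>
    <h2><a href="/tech/story-two">Story two</a></h2>
    <p class="summary">Second summary.</p>
  </article>
</body></html>
"""


def extract_articles(page_source):
    """Return a list of dicts with the title, description and url of each article."""
    tree = html.fromstring(page_source)
    articles = []
    # Each <article> node is queried with XPath expressions relative to itself.
    for node in tree.xpath("//article"):
        articles.append({
            "title": node.xpath("string(.//h2/a)").strip(),
            "description": node.xpath("string(.//p[@class='summary'])").strip(),
            "url": node.xpath("string(.//h2/a/@href)"),
        })
    return articles


articles = extract_articles(SAMPLE_HTML)
for article in articles:
    print(article["title"], article["url"])
```

Every real site needs its own XPath expressions, so a scraper for a given source would mostly differ in those strings.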

Combined with a RESTful API service, the scraper can be used to feed a content aggregator app, for example.
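As a rough sketch of how scraped data might be handed to such an API, the snippet below builds a JSON POST request without sending it. The endpoint URL, the payload shape and the `build_request` helper are hypothetical. It uses the standard library's `urllib.request` to stay self-contained, whereas the project itself lists requests as a dependency.

```python
import json
import urllib.request

# Hypothetical endpoint of the content aggregator's API.
API_URL = "http://localhost:8000/api/articles/"


def build_request(articles, url=API_URL):
    """Build (but do not send) a POST request carrying the articles as JSON."""
    body = json.dumps({"articles": articles}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request([
    {"title": "Story one", "description": "First summary.", "url": "/world/story-one"},
])
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request, timeout=10)
```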

See the base_scraper class for a closer look at how the code works.

Installation

  • First, clone this repo
git clone https://github.com/mateusvictor/scrapews.git
  • Change into the project directory
cd scrapews/
  • Create a virtualenv in the project directory
python -m venv venv
  • Activate the virtualenv (the command below is for Windows; on macOS or Linux, run source venv/bin/activate instead)
venv\Scripts\activate.bat
  • Install the project dependencies
pip install -r requirements.txt
