A news scraper written in Python using the requests and lxml packages.
from scrapews.scrapers import NewYorkTimes
ny_scraper = NewYorkTimes()
ny_scraper.scrape()
ny_scraper.send_to_server()
print(ny_scraper.data.get('articles'))
The core idea of the scrapews scraper is to request the HTML of a news site and extract from it, through XPath expressions, the primary information about each article, such as its title, description and URL.
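As a rough illustration of that idea, here is a minimal sketch using requests and lxml. It is not the project's actual implementation: the XPath expressions, the `fetch_html`, `first_text` and `extract_articles` helpers, and the sample markup are all hypothetical, and a real news site needs its own expressions.

```python
import requests
from lxml import html

# Hypothetical XPath expressions; every news site needs its own set.
ARTICLE_XPATH = '//div[@class="story"]'
TITLE_XPATH = ".//h2/a/text()"
URL_XPATH = ".//h2/a/@href"
DESCRIPTION_XPATH = ".//p/text()"


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def first_text(node, xpath):
    """Return the first XPath match, stripped, or None if nothing matched."""
    matches = node.xpath(xpath)
    return matches[0].strip() if matches else None


def extract_articles(page_source):
    """Parse HTML and collect title, description and url for each article."""
    tree = html.fromstring(page_source)
    articles = []
    for node in tree.xpath(ARTICLE_XPATH):
        articles.append({
            "title": first_text(node, TITLE_XPATH),
            "description": first_text(node, DESCRIPTION_XPATH),
            "url": first_text(node, URL_XPATH),
        })
    return articles


# Inline sample markup (made up) so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <div class="story">
    <h2><a href="https://example.com/a1">First headline</a></h2>
    <p>Short summary of the first story.</p>
  </div>
  <div class="story">
    <h2><a href="https://example.com/a2">Second headline</a></h2>
  </div>
</body></html>
"""

if __name__ == "__main__":
    for article in extract_articles(SAMPLE_HTML):
        print(article)
```

For a live page, `extract_articles(fetch_html(url))` would combine both steps. Fields that a site does not provide come back as `None` rather than raising an error.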
Combined with a RESTful API service, the scraper can be used to feed a content aggregator app, for example.
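Feeding such an API could look roughly like the sketch below. This is only an assumption about how the hand-off might work, not a description of what send_to_server does: the endpoint URL, the payload shape, and the `build_request` and `send_articles` helpers are hypothetical.

```python
import json

import requests

# Hypothetical endpoint; replace it with the aggregator API you actually run.
API_URL = "http://localhost:8000/articles/"


def build_request(articles, url=API_URL):
    """Prepare (but do not send) a POST request carrying the articles as JSON."""
    request = requests.Request("POST", url, json={"articles": articles})
    return request.prepare()


def send_articles(articles, url=API_URL, timeout=10):
    """Send the prepared request and return the HTTP status code."""
    with requests.Session() as session:
        response = session.send(build_request(articles, url), timeout=timeout)
    response.raise_for_status()
    return response.status_code


if __name__ == "__main__":
    sample = [{"title": "First headline", "description": None,
               "url": "https://example.com/a1"}]
    prepared = build_request(sample)
    # Inspect the request without touching the network.
    print(prepared.method, prepared.url)
    print(json.loads(prepared.body))
```

Building the request separately from sending it makes the payload easy to inspect or test before any server is involved.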
Check out the base_scraper class for a better understanding of the code.
- First, clone this repo
git clone https://github.com/mateusvictor/scrapews.git
- Change into the project directory
cd scrapews/
- Create a virtualenv in the project directory
python -m venv venv
- Activate the virtualenv (the command below is for Windows; on Linux or macOS, run source venv/bin/activate instead)
venv\Scripts\activate.bat
- Install the project dependencies
pip install -r requirements.txt