Juillermo/Spiderman

Spiderman

Spiders are invading El Mundo! You'd better hide, or they'll take your info.

That's basically what's going on. I've built some basic spiders with the Scrapy library for Python and had them crawl through the different sections of the Spanish digital newspaper 'El Mundo', visiting each article and extracting both its title and body. These items are then passed into a pipeline, which sends them to an Elasticsearch server (index: 'el-mundo', type: 'Article'), accessed through elasticsearch-py.
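As a rough illustration of the pipeline step, here is a minimal sketch of a Scrapy item pipeline that indexes each article. It is not the repository's actual code: the class name `ElasticsearchPipeline`, the `client` parameter, and the `http://localhost:9200` address are assumptions made for the example. The `index(index=..., doc_type=..., body=...)` call follows older elasticsearch-py releases that still accept a document type; newer releases drop `doc_type`.

```python
class ElasticsearchPipeline:
    """Hypothetical item pipeline: sends each scraped article to Elasticsearch."""

    index_name = "el-mundo"  # index named in this README
    doc_type = "Article"     # type named in this README (older Elasticsearch only)

    def __init__(self, client=None):
        # Allow a client to be injected, e.g. a stub when testing.
        self.client = client

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        if self.client is None:
            # Imported lazily so the class can be used without the library.
            from elasticsearch import Elasticsearch
            # Assumption: a local server on the default port.
            self.client = Elasticsearch(["http://localhost:9200"])

    def process_item(self, item, spider):
        # Keep only the two fields the spiders collect.
        document = {"title": item["title"], "body": item["body"]}
        self.client.index(
            index=self.index_name,
            doc_type=self.doc_type,
            body=document,
        )
        # Return the item so any later pipeline stage still receives it.
        return item
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting in the project's `settings.py`; the exact module path depends on how the project is laid out.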

Start a new crawl with:

$ scrapy crawl mundo

More information about the project is available at the following link: https://www.youtube.com/watch?v=wLg318iEUPs

About

Simple project for web-crawling with Scrapy
