This is a fork of the original IMDB Scraper repo.
This is a Scrapy project that crawls the IMDb website, scrapes movie information, and stores the data in JSON format and/or in an Elasticsearch index.
You can set the search queries in the JSON config file config/queries.json. You can build your own query here: imdb.com/search/title.
To store the scraped data in Elasticsearch, enable the pipeline in the ITEM_PIPELINE dict in config/scrapy.py (enabled by default) and set the following environment variables:
ES_HOST, ES_PORT, ES_USERNAME, ES_SECRET, ES_INDEX
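As a rough illustration of how these variables might be consumed, the sketch below gathers them into a dict and fails early if any are missing. Only the five variable names come from this README; the helper name load_es_settings, the integer conversion of ES_PORT, and the sample values are assumptions and are not taken from the project's pipeline code.

```python
import os

# Variable names as listed in this README; everything else is illustrative.
REQUIRED_ES_VARS = ("ES_HOST", "ES_PORT", "ES_USERNAME", "ES_SECRET", "ES_INDEX")


def load_es_settings(environ=None):
    """Hypothetical helper: collect the Elasticsearch settings from the environment."""
    environ = os.environ if environ is None else environ

    # Report every missing or empty variable at once instead of failing one by one.
    missing = [name for name in REQUIRED_ES_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    settings = {name: environ[name] for name in REQUIRED_ES_VARS}
    # Assumption: the port is numeric, so convert it for client libraries.
    settings["ES_PORT"] = int(settings["ES_PORT"])
    return settings


if __name__ == "__main__":
    # Sample values for demonstration only; replace them with your own cluster details.
    sample = {
        "ES_HOST": "localhost",
        "ES_PORT": "9200",
        "ES_USERNAME": "elastic",
        "ES_SECRET": "changeme",
        "ES_INDEX": "movies",
    }
    print(load_es_settings(sample))
```

Checking the variables before the crawl starts means a misconfigured environment is reported immediately rather than when the first item reaches the pipeline.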
If you enable the FEED_URI and FEED_FORMAT settings in config/scrapy.py, the data will be stored in a JSON file named movie.json located at IMDB-Scraper/imdb-scraper/data/movie.json.
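For orientation, the two settings might look roughly like the fragment below once enabled. FEED_URI and FEED_FORMAT are standard Scrapy settings, but the exact values used in this project's config/scrapy.py are not shown here, so the relative path is an assumption. Recent Scrapy releases (2.1 and later) deprecate both settings in favour of FEEDS, so this form may produce a deprecation warning depending on the installed version.

```python
# Assumed excerpt of config/scrapy.py with the feed export enabled.
# The values are illustrative; check the file in the repo for the real ones.

FEED_FORMAT = "json"          # serialise the scraped items as a single JSON array
FEED_URI = "data/movie.json"  # assumed output path, relative to where the crawl is started
```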
- Clone the repo and navigate into the IMDB-Scraper folder.
$ git clone https://github.com/dojutsu-user/IMDB-Scraper.git
$ cd IMDB-Scraper/
- Create and activate a virtual environment.
(IMDB-Scraper) $ pipenv shell
- Install all dependencies.
(IMDB-Scraper) $ pipenv install
- Navigate into the imdb_scraper folder.
(IMDB-Scraper) $ cd imdb_scraper/
- Start the crawler.
(IMDB-Scraper) $ scrapy crawl movie
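Once the crawl has finished with the JSON feed export enabled, the output can be inspected with a few lines of Python. The sketch below assumes the file is a single JSON array of item objects, which is what Scrapy's JSON feed exporter writes. The helper name load_movies and the path are assumptions, and no field names are assumed, since the item schema is not described here.

```python
import json
from pathlib import Path


def load_movies(path):
    """Hypothetical helper: return the list of items stored in a Scrapy JSON feed file."""
    with Path(path).open(encoding="utf-8") as handle:
        items = json.load(handle)
    # Scrapy's JSON exporter writes one top-level array; anything else is unexpected.
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


if __name__ == "__main__":
    # Assumed path, relative to the current working directory.
    movies = load_movies("data/movie.json")
    print(f"{len(movies)} items loaded")
    if movies:
        # Show which fields the first item carries without assuming their names.
        print("fields:", sorted(movies[0]))
```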
The project and the obtained dataset are intended only for educational purposes. The project is completely open source, and there is no intention to commercialise all or any part of it. The developer does not own any of the resources used and does not claim to hold permissions for their use.