This project scrapes the IMDb Top 250 movies list, extracts details about each movie, and exports the data to JSON, Excel, XML, CSV, and a SQLite database. It can also be scheduled to run daily so the exported data stays up to date.
The IMDb scraper is implemented in Python and utilizes the following libraries:
- selenium for web scraping
- BeautifulSoup for HTML parsing
- pandas for data manipulation and exporting to Excel
- xml.etree.ElementTree for exporting data to XML
- sqlite3 for SQLite database operations
The scraper extracts details for each movie in the IMDb Top 250 list, including title, year, length, image URL, rating, votes, URL, story, genres, directors, writers, stars, and popularity.
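As a rough illustration of how one scraped entry could be represented, here is a minimal sketch using a dataclass. The class name `Movie`, the helper `to_flat_row`, and the field names and types are assumptions made for this example; the actual scraper may store its records differently.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Movie:
    """One Top 250 entry. Field names and types are illustrative assumptions."""
    title: str
    year: Optional[int] = None
    length: Optional[str] = None
    image_url: Optional[str] = None
    rating: Optional[float] = None
    votes: Optional[int] = None
    url: Optional[str] = None
    story: Optional[str] = None
    genres: List[str] = field(default_factory=list)
    directors: List[str] = field(default_factory=list)
    writers: List[str] = field(default_factory=list)
    stars: List[str] = field(default_factory=list)
    popularity: Optional[int] = None


def to_flat_row(movie: Movie) -> dict:
    """Join list-valued fields so the record fits a single CSV or SQLite row."""
    row = asdict(movie)
    for key, value in row.items():
        if isinstance(value, list):
            row[key] = ", ".join(value)
    return row
```

Keeping list fields such as genres as real lists suits JSON and XML output, while the flattened form is easier to write to tabular formats.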
Before running the IMDb scraper, ensure you have the required dependencies installed. You can install them using pip with the provided requirements.txt file.
- Clone this repository to your local machine.
- Navigate to the project directory.
- Install dependencies using the following command:
pip install -r requirements.txt
To use the IMDb scraper:
- Clone this repository to your local machine.
- Navigate to the project directory.
- Run the scraper using the following command:
python scraper.py
The scraper will extract data from the IMDb Top 250 list, process it, and export it to JSON, Excel, XML, CSV, and SQLite database formats in the output_data folder.
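The export step could look roughly like the sketch below, which writes a list of flat dictionaries to JSON, CSV, XML, and SQLite using only the standard library. The function name `export_all`, the file names, and the table name `movies` are assumptions for illustration, not the project's actual code. Excel output is left out here because it depends on pandas and an Excel writer engine.

```python
import csv
import json
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path


def export_all(movies, out_dir="output_data"):
    """Write a non-empty list of flat dicts to JSON, CSV, XML, and SQLite.

    Assumes every dict has the same keys and that the keys are valid
    XML tag names and SQL column names (hypothetical file/table names).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    columns = list(movies[0].keys())

    # JSON: the whole list as one array.
    with open(out / "movies.json", "w", encoding="utf-8") as fh:
        json.dump(movies, fh, ensure_ascii=False, indent=2)

    # CSV: one header row, then one row per movie.
    with open(out / "movies.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(movies)

    # XML: one <movie> element per record, one child element per field.
    root = ET.Element("movies")
    for movie in movies:
        node = ET.SubElement(root, "movie")
        for key in columns:
            ET.SubElement(node, key).text = str(movie[key])
    ET.ElementTree(root).write(
        str(out / "movies.xml"), encoding="utf-8", xml_declaration=True
    )

    # SQLite: recreate the table so repeated runs do not duplicate rows.
    conn = sqlite3.connect(str(out / "movies.db"))
    try:
        col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute("DROP TABLE IF EXISTS movies")
        conn.execute(f"CREATE TABLE movies ({col_defs})")
        conn.executemany(
            f"INSERT INTO movies VALUES ({placeholders})",
            [tuple(m[c] for c in columns) for m in movies],
        )
        conn.commit()
    finally:
        conn.close()
    return out
```

Dropping and recreating the table is a simple choice for a daily refresh; a project that needs history across runs would instead append rows with a scrape date.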
The scraper can be automated to run daily at a specified time using the schedule library. The automate.py script schedules the scraper to run every day at 6:30 AM. Adjust the schedule timing in the script if needed.
To run the automation script:
python automate.py
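For readers who want to adjust the timing, a daily job built on the schedule library might be set up along these lines. This is a sketch under assumptions, not the contents of automate.py: the helper names `build_command`, `run_scraper`, and `run_daily` are hypothetical, and it assumes the scraper is started by running scraper.py from the project directory.

```python
import subprocess
import sys
import time


def build_command(script="scraper.py"):
    """Command that runs the scraper with the current Python interpreter."""
    return [sys.executable, script]


def run_scraper():
    """Run the scraper as a subprocess; a failed run does not stop the scheduler."""
    result = subprocess.run(build_command())
    if result.returncode != 0:
        print(f"scraper exited with code {result.returncode}")


def run_daily(at_time="06:30"):
    """Register the daily job, then poll for pending runs until interrupted."""
    import schedule  # third-party package: pip install schedule

    # Times use 24-hour "HH:MM" format in the machine's local time.
    schedule.every().day.at(at_time).do(run_scraper)
    while True:
        schedule.run_pending()
        time.sleep(30)
```

Calling `run_daily()` starts the loop with the 6:30 AM default, and `run_daily("18:00")` would move the run to the evening. The process has to stay running for the job to fire, so on a server a cron entry or systemd timer is a common alternative.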