This Python-based web scraper leverages Selenium to automate the extraction of job postings from Indeed.com. It's an ideal tool for developers, data scientists, and recruiters looking to gather job data efficiently. The scraper stores job details in a MongoDB database and exports the data to a CSV file for easy analysis.
- Targeted Job Scraping: Customize your search criteria with specific keywords and locations to gather the most relevant job postings.
- Database & CSV Integration: Effortlessly save scraped data into MongoDB and export it to CSV for easy access and analysis.
- Smart Filtering: Automatically filter job postings based on the latest posting dates (last 14 days).
- Robust Retry Mechanism: Utilize built-in retries to handle CAPTCHA challenges and minimize scraping interruptions.
- Python 3.7 or higher
- Google Chrome
- MongoDB
- ChromeDriver
Follow these steps to set up and run the Indeed Job Post Scraper on your local machine:
- Clone the repository: `git clone https://github.com/namdharayush/Indeed-Job-Post-Scraper-with-Selenium.git`
- Install the required Python packages: `pip install -r requirements.txt`
- Set up MongoDB:
  - Ensure MongoDB is installed and running on your local machine.
  - Modify the `Indeed_Mongo` class in `indeed_mongo.py` if needed to match your MongoDB connection settings.
- Download ChromeDriver:
  - Download the ChromeDriver version that matches your installed Chrome browser from the official ChromeDriver download page.
  - Place the `chromedriver` executable in a directory on your system's PATH.
- Run the scraper: `python indeed.py`
- Modify Keywords and Locations:
  - In the `scrape` method, modify the `keywords` dictionary to include your desired job titles and locations.
- CSV Output:
  - The scraped data will be saved to `indeed.csv` in the project directory.
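The keyword and location setup described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the shape of the `keywords` dictionary (job title mapped to a list of locations), the `build_search_urls` helper, and the `q`, `l`, and `fromage` query parameters are all assumptions based on Indeed's public search URLs, so check them against the `scrape` method before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical shape: job title -> list of locations.
# The real dictionary inside scrape() may be structured differently.
keywords = {
    "python developer": ["New York, NY", "Remote"],
    "data scientist": ["Austin, TX"],
}

BASE_URL = "https://www.indeed.com/jobs"


def build_search_urls(keywords, max_age_days=14):
    """Return one search URL per (job title, location) pair."""
    urls = []
    for title, locations in keywords.items():
        for location in locations:
            # q = search query, l = location, fromage = maximum posting age
            # in days (parameter names are assumptions, not taken from the repo).
            query = urlencode({"q": title, "l": location, "fromage": max_age_days})
            urls.append(f"{BASE_URL}?{query}")
    return urls


for url in build_search_urls(keywords):
    print(url)
```

Keeping titles and locations in one dictionary means a new search only requires a new entry, with no change to the scraping loop.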
The scraper saves all job postings to a MongoDB database named `indeed` under the `jobs` collection. The `Indeed_Mongo` class manages all MongoDB operations, including inserting job data and clearing old postings (14 days or older).
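The insert-and-cleanup behavior described above might look roughly like the sketch below. It is not the `Indeed_Mongo` implementation: the function names and the `scraped_at` field are hypothetical, and the connection URI assumes a default local MongoDB. Only `insert_one` and `delete_many` are standard `pymongo` collection methods.

```python
from datetime import datetime, timedelta, timezone


def save_job(collection, job, now=None):
    """Insert one job document, stamped with the time it was scraped."""
    now = now or datetime.now(timezone.utc)
    # 'scraped_at' is a hypothetical field name used for the age check below.
    document = dict(job, scraped_at=now)
    collection.insert_one(document)
    return document


def clear_old_jobs(collection, max_age_days=14, now=None):
    """Delete postings that are max_age_days old or older; return the count."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    result = collection.delete_many({"scraped_at": {"$lte": cutoff}})
    return result.deleted_count


def get_jobs_collection(uri="mongodb://localhost:27017/"):
    """Return the 'jobs' collection of the 'indeed' database."""
    from pymongo import MongoClient  # imported lazily; requires pymongo

    return MongoClient(uri)["indeed"]["jobs"]
```

A typical call sequence would be `jobs = get_jobs_collection()`, then `save_job(jobs, {...})` for each posting and `clear_old_jobs(jobs)` once per run.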
- User-Agent:
  - Customize the user-agent in the `IndeedScraper` class using the `fake_useragent` library (commented out in the code).
- Retry Mechanism:
  - Adjust the retry logic in the `all_jobs_for_while` method to suit your needs.
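As a starting point for adjusting the retry logic, here is one possible shape for a retry loop with exponential backoff. It does not reproduce `all_jobs_for_while`: the `CaptchaEncountered` exception, the `with_retries` helper, and its parameters are all hypothetical names introduced for illustration.

```python
import time


class CaptchaEncountered(Exception):
    """Hypothetical exception raised when a page shows a CAPTCHA."""


def with_retries(fetch, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except CaptchaEncountered:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. Raising `max_attempts` or `base_delay` trades run time for a better chance of getting past temporary blocks.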
- CAPTCHA Handling:
  - If the scraper encounters CAPTCHA blocks, you can modify or add retry logic to handle them more effectively.
- No Jobs Found:
  - If no jobs are found, ensure that the XPath selectors are current, as Indeed's website structure may change.