This project aims to develop a web scraper using Python. The scraper is designed to collect data from competitor websites, news articles, and market research reports. This data will provide valuable market insights, supporting competitive analysis and strategic decision-making.
The Web Scraper systematically gathers data to help understand market trends, competitor strategies, and overall market dynamics. Key benefits include:
- Market Insights: Gaining a deeper understanding of current market trends to identify emerging patterns and industry shifts.
- Competitive Analysis: Collecting data on competitors' offerings, pricing, marketing campaigns, and market positioning.
- Data-Driven Decisions: Empowering the company to make informed and strategic business decisions based on the collected data.
Competitor Websites:
- McKinsey & Company: mckinsey.com
- Boston Consulting Group: bcg.com
- Deloitte: deloitte.com
- Data to Scrape: Service offerings, case studies, client testimonials, thought leadership articles, and market insights.
Industry News and Reports:
- Bloomberg: bloomberg.com
- Reuters: reuters.com
- Financial Times: ft.com
- Data to Scrape: Latest news articles, market reports, financial analysis, and global economic trends.
Market Research Portals:
- Statista: statista.com
- MarketResearch.com: marketresearch.com
- IBISWorld: ibisworld.com
- Data to Scrape: Market research reports, industry analysis, statistics, and trends.
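To make the mapping from a target site to a scraper concrete, here is a minimal sketch of a Scrapy spider for one of these sources. It is illustrative only: the spider name, start URL, and CSS selectors below are placeholders, not taken from the project's actual spiders.

```python
import scrapy


class McKinseyInsightsSpider(scrapy.Spider):
    # Hypothetical example; the real spiders live in deloitte_scraper/ and
    # mckinsey_scraper/ and use their own names and selectors.
    name = "mckinsey_insights_example"
    allowed_domains = ["mckinsey.com"]
    start_urls = ["https://www.mckinsey.com/featured-insights"]

    def parse(self, response):
        # Follow links that look like insight articles (placeholder filter).
        for href in response.css("a::attr(href)").getall():
            if "/featured-insights/" in href:
                yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Emit one item per article; selectors are placeholders.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(default="").strip(),
            "body": " ".join(response.css("p::text").getall()),
        }
```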
Installation:
Clone the Repository:
git clone https://github.com/intel00000/web_scraper_Hasmo.git
cd web_scraper_Hasmo
Create a Virtual Environment:
- Windows:
python -m venv venv
- Linux & macOS:
python3 -m venv venv
Activate the Virtual Environment:
- Windows:
.\venv\Scripts\Activate.ps1
- Linux & macOS:
source ./venv/bin/activate
Install the Required Packages:
pip install -r requirements.txt
Running the Spiders:
To enable the summary and Google Sheets update features, obtain an OpenAI API key and a Google service account JSON private key:
- Create a .env file in the project root with the following content:
OPENAI_API_KEY={Your openai API key}
- Download the Google service account private key as JSON, save it in the project root, and rename it to credentials.json.
- It should have the following format:
{ "type": "service_account", "project_id": "", "private_key_id": "", "private_key": "", "client_email": "xxx@developer.gserviceaccount.com", "client_id": "", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "", "universe_domain": "googleapis.com" }
Navigate to the desired scraper directory:
- For Deloitte:
cd deloitte_scraper
- For McKinsey:
cd mckinsey_scraper
To list all available spiders, run:
scrapy list
To run a specific spider, use:
scrapy crawl spider_name
Replace spider_name with the name of the spider you wish to run.
Alternatively, to run all spiders at once, use:
python run_all_spiders.py
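The exact contents of run_all_spiders.py are not reproduced here; as a rough sketch, a script like it can drive every spider in the current Scrapy project with CrawlerProcess (run it from inside the scraper directory so scrapy.cfg and settings.py are picked up):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()           # loads settings.py via scrapy.cfg
loader = SpiderLoader.from_settings(settings)

process = CrawlerProcess(settings)
for spider_name in loader.list():           # same names as `scrapy list`
    process.crawl(spider_name)
process.start()                             # blocks until all spiders finish
```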
Data Storage:
- The scraped data will be saved in the data/raw/ directory as CSV or JSON files.
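For downstream analysis the files in data/raw/ can be loaded directly; a quick sketch follows (the filenames are examples, not the scrapers' actual output names).

```python
import json
from pathlib import Path

import pandas as pd

raw_dir = Path("data/raw")

# Load a CSV output file into a DataFrame (example filename).
df = pd.read_csv(raw_dir / "mckinsey_insights.csv")

# Load a JSON output file as a list of records (example filename).
with open(raw_dir / "deloitte_case_studies.json", encoding="utf-8") as f:
    records = json.load(f)
```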
Configuring the Scraper:
- Adjust configuration settings, such as target URLs, data points to extract, and output formats, in the settings.py file within the respective scraper directory (deloitte_scraper or mckinsey_scraper).
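The options below are standard Scrapy settings of the kind adjusted in that file; the values are illustrative and the project's actual settings.py may define different ones.

```python
# Illustrative excerpt of a scraper's settings.py (values are examples).
BOT_NAME = "deloitte_scraper"

ROBOTSTXT_OBEY = True                 # respect robots.txt
DOWNLOAD_DELAY = 1.0                  # throttle requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# FEEDS is a built-in Scrapy setting; this example writes items to data/raw/.
FEEDS = {
    "../data/raw/%(name)s.json": {"format": "json", "overwrite": True},
}
```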
Project Structure:
web_scraper_Hasmo/
├── .env # add your OPENAI_API_KEY here
├── .gitignore
├── credentials.json # Google GCP service account API
├── README.md # Documentation for the project
├── requirements.txt # List of required Python packages
│
├── data/ # Scraped data
│ └── raw/ # Subdirectory containing raw CSV and JSON data files
│
├── deloitte_scraper/ # Contains the Scrapy project for Deloitte data
│ ├── scrapy.cfg # Scrapy configuration file
│ └── deloitte_scraper/
│ ├── items.py # Scraped items structure
│ ├── middlewares.py # Middlewares for Scrapy
│ ├── pipelines.py # Pipeline for processing scraped data
│ ├── settings.py # Scrapy settings
│ ├── spiders/ # Spiders directory
│
├── mckinsey_scraper/ # Contains the Scrapy project for McKinsey data
│ ├── scrapy.cfg # Scrapy configuration file
│ └── mckinsey_scraper/
│ ├── items.py # Scraped items structure
│ ├── middlewares.py # Middlewares for Scrapy
│ ├── pipelines.py # Pipeline for processing scraped data
│ ├── settings.py # Scrapy settings
│ ├── spiders/ # Spiders directory
│
├── notebooks/ # Jupyter notebooks for BCG scraping and testing
│ ├── bcg_capabilities.ipynb # Notebook for scraping BCG capabilities
│ ├── bcg_industries.ipynb # Notebook for scraping BCG industries
│ ├── bcg_search_results.ipynb # Notebook for scraping BCG search
│ ├── helper_functions.py # Helper functions adapted from Scrapy pipelines
│ └── scrapy.ipynb # Notebook for testing
│
└── scripts/ # Directory containing testing scripts
├── google_sheet_testing.py # Google Sheets pipeline
├── openai_testing.py # OpenAI API pipeline
├── sample_input.json # Sample input JSON file for testing
└── sample_output_with_summaries.json # Sample output JSON with generated summaries
- data/: Directory where the scraped data is saved.
- venv/: Virtual environment directory (created during setup).
- requirements.txt: List of required Python packages.
- README.md: Project documentation.
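The scripts/ directory listed above holds standalone tests for the Google Sheets and OpenAI pipelines. A rough sketch of what the summarization step can look like is below; the actual openai_testing.py may differ, and the model name and the field names in sample_input.json are assumptions.

```python
import json
import os

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with open("scripts/sample_input.json", encoding="utf-8") as f:
    articles = json.load(f)  # a list of dicts with a "body" field is assumed

for article in articles:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is illustrative
        messages=[
            {"role": "system", "content": "Summarize the article in three sentences."},
            {"role": "user", "content": article.get("body", "")},
        ],
    )
    article["summary"] = response.choices[0].message.content

with open("scripts/sample_output_with_summaries.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2)
```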
For any issues, questions, or contributions, please open an issue or submit a pull request on GitHub.