This project aims to develop a web scraper using Python. The scraper is designed to collect data from competitor websites, news articles, and market research reports. This data will provide valuable market insights, supporting competitive analysis and strategic decision-making.
The Web Scraper systematically gathers data to help understand market trends, competitor strategies, and overall market dynamics. Key benefits include:
- Market Insights: Gaining a deeper understanding of current market trends to identify emerging patterns and industry shifts.
- Competitive Analysis: Collecting data on competitors' offerings, pricing, marketing campaigns, and market positioning.
- Data-Driven Decisions: Empowering the company to make informed and strategic business decisions based on the collected data.
Competitor Websites:
- McKinsey & Company: mckinsey.com
- Boston Consulting Group: bcg.com
- Deloitte: deloitte.com
- Data to Scrape: Service offerings, case studies, client testimonials, thought leadership articles, and market insights.
Industry News and Reports:
- Bloomberg: bloomberg.com
- Reuters: reuters.com
- Financial Times: ft.com
- Data to Scrape: Latest news articles, market reports, financial analysis, and global economic trends.
Market Research Portals:
- Statista: statista.com
- MarketResearch.com: marketresearch.com
- IBISWorld: ibisworld.com
- Data to Scrape: Market research reports, industry analysis, statistics, and trends.
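To make the mapping from a target site to a scraper concrete, here is a minimal sketch of a Scrapy spider for one of these sources. It is illustrative only: the spider name, start URL, and CSS selectors below are placeholders, not taken from the project's actual spiders.

```python
import scrapy


class McKinseyInsightsSpider(scrapy.Spider):
    # Hypothetical example; the real spiders live in deloitte_scraper/ and
    # mckinsey_scraper/ and use their own names and selectors.
    name = "mckinsey_insights_example"
    allowed_domains = ["mckinsey.com"]
    start_urls = ["https://www.mckinsey.com/featured-insights"]

    def parse(self, response):
        # Follow links that look like insight articles (placeholder filter).
        for href in response.css("a::attr(href)").getall():
            if "/featured-insights/" in href:
                yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Emit one item per article; selectors are placeholders.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(default="").strip(),
            "body": " ".join(response.css("p::text").getall()),
        }
```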
Installation:
Clone the Repository:
git clone https://github.com/intel00000/web_scraper_Hasmo.git
cd web_scraper_Hasmo
Create a Virtual Environment:
- Windows:
python -m venv venv
- Linux & macOS:
python3 -m venv venv
Activate the Virtual Environment:
- Windows:
.\venv\Scripts\Activate.ps1
- Linux & macOS:
source ./venv/bin/activate
Install the Required Packages:
pip install -r requirements.txt
Running the Spiders:
To enable the summary and Google Sheets update features, obtain an OpenAI API key and a Google service account JSON private key:
- Create a .env file in the project root with the following content:
OPENAI_API_KEY={Your openai API key}
- Download the Google service account private key as JSON, save it in the project root, and rename it to credentials.json.
- It should have the following format:
{ "type": "service_account", "project_id": "", "private_key_id": "", "private_key": "", "client_email": "xxx@developer.gserviceaccount.com", "client_id": "", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "", "universe_domain": "googleapis.com" }
Navigate to the desired scraper directory:
- For Deloitte:
cd deloitte_scraper
- For McKinsey:
cd mckinsey_scraper
To list all available spiders, run:
scrapy list
To run a specific spider, use:
scrapy crawl spider_name
Replace spider_name with the name of the spider you wish to run.
Alternatively, to run all spiders at once, use:
python run_all_spiders.py
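The exact contents of run_all_spiders.py are not reproduced here; as a rough sketch, a script like it can drive every spider in the current Scrapy project with CrawlerProcess (run it from inside the scraper directory so scrapy.cfg and settings.py are picked up):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()           # loads settings.py via scrapy.cfg
loader = SpiderLoader.from_settings(settings)

process = CrawlerProcess(settings)
for spider_name in loader.list():           # same names as `scrapy list`
    process.crawl(spider_name)
process.start()                             # blocks until all spiders finish
```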
Data Storage:
- The scraped data will be saved in the data/raw/ directory as CSV or JSON files.
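For downstream analysis the files in data/raw/ can be loaded directly; a quick sketch follows (the filenames are examples, not the scrapers' actual output names).

```python
import json
from pathlib import Path

import pandas as pd

raw_dir = Path("data/raw")

# Load a CSV output file into a DataFrame (example filename).
df = pd.read_csv(raw_dir / "mckinsey_insights.csv")

# Load a JSON output file as a list of records (example filename).
with open(raw_dir / "deloitte_case_studies.json", encoding="utf-8") as f:
    records = json.load(f)
```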
Configuring the Scraper:
- Adjust configuration settings, such as target URLs, data points to extract, and output formats, in the settings.py file within the respective scraper directory (deloitte_scraper or mckinsey_scraper).
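The options below are standard Scrapy settings of the kind adjusted in that file; the values are illustrative and the project's actual settings.py may define different ones.

```python
# Illustrative excerpt of a scraper's settings.py (values are examples).
BOT_NAME = "deloitte_scraper"

ROBOTSTXT_OBEY = True                 # respect robots.txt
DOWNLOAD_DELAY = 1.0                  # throttle requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# FEEDS is a built-in Scrapy setting; this example writes items to data/raw/.
FEEDS = {
    "../data/raw/%(name)s.json": {"format": "json", "overwrite": True},
}
```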
Project Structure:
web_scraper_Hasmo/
├── .env # add your OPENAI_API_KEY here
├── .gitignore
├── credentials.json # Google GCP service account API
├── README.md # Documentation for the project
├── requirements.txt # List of required Python packages
│
├── data/ # Scraped data
│ └── raw/ # Subdirectory containing raw CSV and JSON data files
│
├── deloitte_scraper/ # Contains the Scrapy project for Deloitte data
│ ├── scrapy.cfg # Scrapy configuration file
│ └── deloitte_scraper/
│ ├── items.py # Scraped items structure
│ ├── middlewares.py # Middlewares for Scrapy
│ ├── pipelines.py # Pipeline for processing scraped data
│ ├── settings.py # Scrapy settings
│ ├── spiders/ # Spiders directory
│
├── mckinsey_scraper/ # Contains the Scrapy project for McKinsey data
│ ├── scrapy.cfg # Scrapy configuration file
│ └── mckinsey_scraper/
│ ├── items.py # Scraped items structure
│ ├── middlewares.py # Middlewares for Scrapy
│ ├── pipelines.py # Pipeline for processing scraped data
│ ├── settings.py # Scrapy settings
│ ├── spiders/ # Spiders directory
│
├── notebooks/ # Jupyter notebooks for BCG scraping and testing
│ ├── bcg_capabilities.ipynb # Notebook for scraping BCG capabilities
│ ├── bcg_industries.ipynb # Notebook for scraping BCG industries
│ ├── bcg_search_results.ipynb # Notebook for scraping BCG search
│ ├── helper_functions.py # Helper functions adapted from Scrapy pipelines
│ └── scrapy.ipynb # Notebook for testing
│
└── scripts/ # Directory containing testing scripts
├── google_sheet_testing.py # Google Sheets pipeline
├── openai_testing.py # OpenAI API pipeline
├── sample_input.json # Sample input JSON file for testing
└── sample_output_with_summaries.json # Sample output JSON with generated summaries
- data/: Directory where the scraped data is saved.
- venv/: Virtual environment directory (created during setup).
- requirements.txt: List of required Python packages.
- README.md: Project documentation.
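The scripts/ directory listed above holds standalone tests for the Google Sheets and OpenAI pipelines. A rough sketch of what the summarization step can look like is below; the actual openai_testing.py may differ, and the model name and the field names in sample_input.json are assumptions.

```python
import json
import os

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with open("scripts/sample_input.json", encoding="utf-8") as f:
    articles = json.load(f)  # a list of dicts with a "body" field is assumed

for article in articles:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is illustrative
        messages=[
            {"role": "system", "content": "Summarize the article in three sentences."},
            {"role": "user", "content": article.get("body", "")},
        ],
    )
    article["summary"] = response.choices[0].message.content

with open("scripts/sample_output_with_summaries.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2)
```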
For any issues, questions, or contributions, please open an issue or submit a pull request on GitHub.