The purpose of this repository is to act as a web scraping training ground for those hoping to develop their skills in Python. It was presented by Lorae Stojanovic to the Brookings Institution as part of the Brookings Data Network presentation series on June 20, 2024.
The repository has three main elements:
- Instructional slides: These slides (produced using Jupyter) walk users through web scraping techniques such as:
  - Making network requests
  - Parsing HTML code
  - Using Selenium
  - Discovering hidden APIs
  - Rendering JSON data sets
- Instructional webpage: This webpage, hosted on GitHub, acts as an illustrative example website that users can web scrape directly as they follow along with instructions from the slides.
- Sample code: The code walks users through:
  - Web scraping static content using HTTP requests and parsing the response using the beautifulsoup4 library
  - Web scraping dynamic content using the selenium library
  - Requesting APIs using the requests library
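To give a feel for the "parsing HTML" step before opening the sample code, here is a minimal, dependency-free sketch. It deliberately uses Python's built-in html.parser module rather than beautifulsoup4, so it is not the repository's own code; the LinkCollector class, the extract_links function, and the SAMPLE_HTML string are all hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href=...> tag is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical HTML, standing in for the body of an HTTP response.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <a href="/first">First link</a>
  <p>Some text <a href="https://example.com/second">Second <b>link</b></a></p>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, "->", text)
```

In a real scrape the HTML string would come from an HTTP response instead of a constant, and beautifulsoup4 offers a more convenient interface for the same kind of extraction.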
If you'd simply like to view the educational materials, please navigate to:
Instructional slides: https://lorae.github.io/web-scraping-tutorial/advanced-web-scraping.slides.html
Instructional website: https://lorae.github.io/web-scraping-tutorial
Instructional code: https://lorae.github.io/sample_code
If you'd like to run the source code on your computer, proceed with the steps below.
- Navigate to the directory where you'd like to save the project:
cd your/path/to/your/desired/folder
- Clone the repository:
git clone https://github.com/lorae/web-scraping-tutorial
- Set your working directory in the repository:
cd web-scraping-tutorial
- Install Poetry: Poetry is a dependency management and packaging tool for Python. Follow the instructions for your operating system to install Poetry. If you already have Poetry installed, you may safely skip this step.
  - For bash/zsh (Linux/macOS):
  curl -sSL https://install.python-poetry.org | python3 -
  - For Windows PowerShell:
  (Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
  After installation, ensure that Poetry is added to your PATH. You can verify the installation by running:
  poetry --version
- Install Dependencies: Use Poetry to install the project dependencies and set up the virtual environment.
poetry install
- Activate the Poetry virtual environment:
poetry shell
You are now set up and ready to work on the project!
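If you'd like to confirm that the environment is usable, a quick check along the following lines may help. It is only a sketch: it assumes the three libraries named in the sample code description are the ones that matter (the project may declare others), and check_modules and REQUIRED_MODULES are hypothetical names, not part of the repository.

```python
from importlib.util import find_spec

# Import names, not package names: beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = ("requests", "bs4", "selenium")


def check_modules(names=REQUIRED_MODULES):
    """Return {module_name: bool} saying whether each module can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it from inside the activated virtual environment. Any module reported as MISSING suggests that the dependency installation step did not complete in that environment.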
- Set your working directory in the repository:
cd your/path/to/web-scraping-tutorial
- Activate the Poetry virtual environment:
poetry shell
TODO: explain how to run the .ipynb file
TODO
The schematic below illustrates the basic file structure of the project.
web-scraping-tutorial/
│
├── .gitignore
├── README.md # The file you are currently reading
├── advanced-web-scraping.slides.html # Formatted presentation slides
├── index.html # Produces instructional website
│
├── sample_code/ # Sample code for users to scrape the instructional website
│ ├── requests_bs4_scraping.py # Sample code for scraping static content
│ ├── selenium_scraping.py # Sample code for scraping dynamic content
│ └── api_requests.py # Sample code for requesting APIs
│
├── slides_content/ # Used to produce the Jupyter notebook generating presentation slides
│ ├── advanced-web-scraping.ipynb # Jupyter notebook producing presentation slides
│ └── images/ # Images used in the slides
│     ├── all-http-requests-screenshot.png
│     ├── brookings-edu-screenshot.png
│     ├── client-server-request-response.png
│     ├── client-server.png
│     └── ...
│
└── web_content/ # Content used in index.html (instructional website)
    ├── css/ # Website styling
    │   └── styles.css
    │
    ├── data/ # Data used in index.html external requests
    │   ├── gdp-data.csv # Used to populate the GDP graph
    │   └── web-scraping-resources.json # Used to populate cards for external resources
    │
    └── images/ # Images used in index.html
        ├── book.jpg
        ├── building_blocks.jpg
        ├── chat_box.jpg
        ├── computer_cloud.jpg
        └── ...
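The data/ folder hints at the "hidden API" idea from the slides: when a page loads its content from a separate JSON file, that file can often be requested directly instead of scraping the rendered HTML. The sketch below rests on two assumptions: that the site is served from the repository root, so the file path in the tree maps directly onto the URL path, and that the payload is ordinary JSON. The names data_url and summarize and the EXAMPLE payload are hypothetical, and the real file's fields may differ.

```python
import json
from urllib.parse import urljoin

SITE_ROOT = "https://lorae.github.io/web-scraping-tutorial/"
# Assumption: the repository path maps one-to-one onto the URL path.
DATA_PATH = "web_content/data/web-scraping-resources.json"


def data_url(root=SITE_ROOT, path=DATA_PATH):
    """Build the (assumed) direct URL of the JSON file behind the cards."""
    return urljoin(root, path)


def summarize(payload_text):
    """Parse JSON text and describe its top-level shape without assuming fields."""
    data = json.loads(payload_text)
    if isinstance(data, list):
        keys = sorted({k for item in data if isinstance(item, dict) for k in item})
        return {"type": "list", "length": len(data), "keys": keys}
    if isinstance(data, dict):
        return {"type": "dict", "length": len(data), "keys": sorted(data)}
    return {"type": type(data).__name__, "length": None, "keys": []}


# Hypothetical payload for illustration only.
EXAMPLE = '[{"title": "A", "url": "https://example.com/a"}, {"title": "B"}]'

if __name__ == "__main__":
    print(data_url())
    print(summarize(EXAMPLE))
```

With the requests library installed, the text passed to summarize could come from requests.get(data_url()).text. Inspecting the shape first is a cautious way to learn a payload's structure before writing code that depends on specific field names.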