Purpose

This repository is a web scraping training ground for those looking to develop their Python skills. Lorae Stojanovic presented it to the Brookings Institution on June 20, 2024, as part of the Brookings Data Network presentation series.

The repository has three main elements:

Instructional slides: These slides (produced using Jupyter) walk users through web scraping techniques such as:
- Making network requests
- Parsing HTML code
- Using Selenium
- Discovering hidden APIs
- Rendering JSON data sets

Instructional webpage: This webpage, hosted on GitHub, is an example website that users can scrape directly as they follow along with the slides.

Sample code: The code walks users through:
- Scraping static content using HTTP requests and parsing the response with the beautifulsoup4 library
- Scraping dynamic content using the selenium library
- Requesting APIs using the requests library
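
The static-content workflow listed above (fetch a page, then parse its HTML) can be sketched with only the standard library. This is a minimal illustration, not the repository's sample code: it uses `html.parser` as a stand-in for beautifulsoup4, and the HTML snippet, the `LinkCollector` class, and the `extract_links` helper are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a response body; in practice this
# string would come from an HTTP request to the page being scraped.
SAMPLE_HTML = """
<html><body>
  <h1>Resources</h1>
  <a href="/guide.html">Guide</a>
  <a href="https://example.com/docs">Docs</a>
  <a>No link here</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> tag being read, if any
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._current_href = href
                self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a linked <a> tag.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(f"{text}: {href}")
```

With beautifulsoup4, the same extraction is usually a one-line search over parsed tags; the event-driven parser here only makes the underlying steps visible.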

Getting Started

If you'd simply like to view the educational materials, please navigate to:

Instructional slides: https://lorae.github.io/web-scraping-tutorial/advanced-web-scraping.slides.html

Instructional website: https://lorae.github.io/web-scraping-tutorial

Instructional code: https://lorae.github.io/sample_code

Instructions for running the project

If you'd like to run the source code on your computer, proceed with the steps below.

If you're running the project for the first time

Navigate to the directory where you'd like to save the project:
```
cd your/path/to/your/desired/folder
```

Clone the repository:

```
git clone https://github.com/lorae/web-scraping-tutorial
```

Set your working directory in the repository:
```
cd web-scraping-tutorial
```
Install Poetry:

Poetry is a dependency management and packaging tool for Python. Follow the instructions for your operating system to install Poetry. If you already have Poetry installed, you may safely skip this step.
- For bash/zsh (Linux/macOS):
```
curl -sSL https://install.python-poetry.org | python3 -
```
- For Windows PowerShell:
```
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
```
After installation, ensure that Poetry is added to your PATH. You can verify the installation by running:
```
poetry --version
```
Install Dependencies: Use Poetry to install the project dependencies and set up the virtual environment.
```
poetry install
```
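Activate the Poetry virtual environment (on Poetry 2.0 or later, `poetry shell` requires the separate poetry-plugin-shell plugin; prefixing commands with `poetry run` works without it):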
```
poetry shell
```

You are now set up and ready to work on the project!
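
A quick way to confirm that the environment is usable is to check that the scraping libraries can be imported. The sketch below assumes the project's dependencies include the three libraries named in this README (requests, beautifulsoup4, selenium); `REQUIRED_MODULES` and `check_modules` are hypothetical names for this example, not part of the repository.

```python
from importlib.util import find_spec

# Import names for the three libraries named in this README (assumed to be
# project dependencies). The beautifulsoup4 package is imported as "bs4".
REQUIRED_MODULES = ("requests", "bs4", "selenium")


def check_modules(names=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_modules().items():
        status = "ok" if available else "MISSING"
        print(f"{name}: {status}")
```

Run it from inside the activated environment; any module reported as missing suggests that the install step did not complete or that a different Python interpreter is active.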

If you've run the project before

Set your working directory in the repository:
```
cd your/path/to/web-scraping-tutorial
```
Activate the Poetry virtual environment:
```
poetry shell
```

To generate the .html slides using Jupyter:

TODO: explain how to run the .ipynb file

To run the website locally:

TODO

Project Structure

The schematic below illustrates the basic file structure of the project.

web-scraping-tutorial/
│
├── .gitignore
├── README.md # The file you are currently reading
├── advanced-web-scraping.slides.html # Formatted presentation slides
├── index.html # Produces instructional website
│
├── sample_code/ # Sample code for users to scrape the instructional website
│ ├── requests_bs4_scraping.py # Sample code for scraping static content
│ ├── selenium_scraping.py # Sample code for scraping dynamic content
│ └── api_requests.py # Sample code for requesting APIs
│
├── slides_content/ # Used to produce the Jupyter notebook generating presentation slides
│ ├── advanced-web-scraping.ipynb # Jupyter notebook producing presentation slides
│ └── images/ # Images used in the slides
│   ├── all-http-requests-screenshot.png
│   ├── brookings-edu-screenshot.png
│   ├── client-server-request-response.png
│   ├── client-server.png
│   └── ...
│
└── web_content/ # Content used in index.html (instructional website)
  ├── css/ # Website styling
  │ └── styles.css
  │
  ├── data/ # Data used in index.html external requests
  │ ├── gdp-data.csv # Used to populate the GDP graph
  │ └── web-scraping-resources.json # Used to populate cards for external resources
  │
  └── images/ # Images used in index.html
    ├── book.jpg
    ├── building_blocks.jpg
    ├── chat_box.jpg
    ├── computer_cloud.jpg
    └── ...
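
The files under `web_content/data/` are the kind of payload a scraper receives when it requests a data endpoint directly instead of parsing rendered HTML. The sketch below parses inline strings with the standard library. The column names (`year`, `gdp`), the JSON keys (`title`, `url`), and all values are invented placeholders, since the schemas of the real files are not described here; `parse_csv_rows` and `parse_json_records` are likewise hypothetical helpers.

```python
import csv
import io
import json

# Placeholder payloads: the real gdp-data.csv and
# web-scraping-resources.json may use entirely different fields.
SAMPLE_CSV = "year,gdp\n2021,100.0\n2022,104.5\n"
SAMPLE_JSON = '[{"title": "Example resource", "url": "https://example.com"}]'


def parse_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array at the top level")
    return records


if __name__ == "__main__":
    for row in parse_csv_rows(SAMPLE_CSV):
        # csv.DictReader yields every value as a string; convert as needed.
        print(row["year"], float(row["gdp"]))
    for record in parse_json_records(SAMPLE_JSON):
        print(record["title"], record["url"])
```

In a real scrape, the text passed to these helpers would be the body of an HTTP response rather than a hard-coded string.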
