Amsterdam.nl Web Scraper

Scripts to scrape contents (text and images) from www.amsterdam.nl and process the HTML into clean text files.

Background

These scripts scrape and process the contents of the Amsterdam.nl website, extracting text and images for analysis and archival purposes. The project uses asynchronous requests so that many pages and resources can be downloaded concurrently.
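To illustrate the asynchronous approach, here is a minimal sketch of bounded concurrent fetching using only the standard library. It is not the code in scrape_amsterdam_nl.py: the names `fetch_all`, `fake_fetch` and `max_concurrent` are hypothetical, and `fake_fetch` stands in for a real HTTP request so the example runs without network access.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=5):
    """Fetch all URLs concurrently, with at most max_concurrent in flight.

    `fetch` is any coroutine function taking a URL and returning its body.
    Returns (results, failed): a dict of url -> body and a list of failed URLs.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    results, failed = {}, []

    async def worker(url):
        # The semaphore caps how many requests run at the same time.
        async with semaphore:
            try:
                results[url] = await fetch(url)
            except Exception:
                # Record the failure so the URL can be retried later.
                failed.append(url)

    await asyncio.gather(*(worker(url) for url in urls))
    return results, failed


async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request (assumption, no network)."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise ConnectionError(url)
    return f"<html>{url}</html>"
```

Calling `asyncio.run(fetch_all(urls, fake_fetch))` returns the fetched pages together with the URLs that failed. In a real scraper, `fake_fetch` would be replaced by a coroutine that performs the actual request.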

Folder Structure

src: scraper codebase

Installation

Clone this repository:

git clone https://github.com/Amsterdam-AI-Team/amsterdam-nl-website-scraper.git

Install all dependencies:
```
pip install -r requirements.txt
```

The code has been tested with Python 3.10.0 on Linux, macOS and Windows.

Usage

Step 1: Navigate to scripts

First, navigate to the source directory:

cd src

Step 2: Scrape HTML and Images

Second, run the scrape_amsterdam_nl.py script to scrape HTML pages and images from the Amsterdam.nl website:

python3 scrape_amsterdam_nl.py

This will download and save all HTML pages and images from the specified URLs into designated directories.

Note: re-running this script also reads the failed_html.txt file and retries the HTML pages that failed to download in previous runs.
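The retry behaviour can be pictured roughly as follows. This is a sketch under assumptions, not the script's actual logic: it assumes failed_html.txt holds one URL per line (the file format is not documented here), and `retry_failed` and `download` are hypothetical names.

```python
from pathlib import Path


def retry_failed(failed_path, download):
    """Retry every URL listed in failed_path and keep only those that fail again.

    Assumes one URL per line (an assumption about the file format).
    `download` is any callable that raises an exception on failure.
    """
    path = Path(failed_path)
    if not path.exists():
        return []

    lines = path.read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]

    still_failed = []
    for url in urls:
        try:
            download(url)
        except Exception:
            still_failed.append(url)

    # Rewrite the file so the next run only retries what is still failing.
    path.write_text("".join(url + "\n" for url in still_failed), encoding="utf-8")
    return still_failed
```

Rewriting the file after each pass keeps the list of failures short, so repeated runs converge on the pages that are genuinely unreachable.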

Step 3: Convert HTML to Text

Third, after scraping, run the html_to_txt.py script to convert the downloaded HTML pages into clean text files:

python3 html_to_txt.py

This will process the HTML files, extracting the main content and saving it as text files.
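As a rough illustration of what converting HTML to clean text involves, the sketch below uses Python's built-in `html.parser`. It is an assumption-laden example rather than the logic of html_to_txt.py: the `TextExtractor` class, the `html_to_text` function, and the choice of which tags to skip or treat as line breaks are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping non-content elements (assumed tag sets)."""

    SKIP = {"script", "style", "nav", "header", "footer"}
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "main"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            # End of a block-level element: start a new line of text.
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Skipping navigation, header and footer elements is one simple way to approximate "main content"; the real script may use a different parser or different rules.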

Contributing

Feel free to help out! Open an issue, submit a PR or contact us.

Acknowledgements

This repository was created by Amsterdam Intelligence for the City of Amsterdam.

License

This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).
