Skip to content

Latest commit

 

History

History
81 lines (46 loc) · 2.16 KB

README.md

File metadata and controls

81 lines (46 loc) · 2.16 KB

Amsterdam.nl Web Scraper

Scripts to scrape contents (text and images) from www.amsterdam.nl and process the HTML into clean text files.

amsterdam.nl homescreen

Background

These scripts are designed to scrape and process the contents of the Amsterdam.nl website, extracting text and images for analysis and archival purposes. The project makes use of asynchronous requests to efficiently handle multiple pages and resources.

Folder Structure

  • src scraper codebase

Installation

  1. Clone this repository:

    git clone https://github.com/Amsterdam-AI-Team/amsterdam-nl-website-scraper.git
  2. Install all dependencies:

    pip install -r requirements.txt

The code has been tested with Python 3.10.0 on Linux/MacOS/Windows.

Usage

Step 1: Navigate to scripts

First, navigate to the source directory:

cd src

Step 2: Scrape HTML and Images

Second, run the scrape_amsterdam_nl.py script to scrape HTML pages and images from the Amsterdam.nl website:

python3 scrape_amsterdam_nl.py

This will download and save all HTML pages and images from the specified URLs into designated directories.

Note: re-running this script also goes over the failed_html.txt file to retry failed htmls in previous iterations.

Step 3: Convert HTML to Text

Third, after scraping, run the html_to_txt.py script to convert the downloaded HTML pages into clean text files:

python3 html_to_txt.py

This will process the HTML files, extracting the main content and saving it as text files.

Contributing

Feel free to help out! Open an issue, submit a PR or contact us.

Acknowledgements

This repository was created by Amsterdam Intelligence for the City of Amsterdam.

License

This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).