Scripts to scrape contents (text and images) from www.amsterdam.nl and process the HTML into clean text files.
The scripts scrape the Amsterdam.nl website and extract its text and images for analysis and archival purposes. Pages and resources are fetched with asynchronous requests, so many of them can be downloaded concurrently.
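As an illustration of the asynchronous approach, here is a minimal sketch using only the standard library. It is not the repository's actual implementation: the names fetch_all and fake_fetch, the concurrency limit, and the example URLs are assumptions made for this sketch, and the real HTTP call is replaced by a stand-in so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 10,
) -> tuple[dict[str, str], list[str]]:
    """Fetch every URL concurrently; return (pages, failed_urls)."""
    semaphore = asyncio.Semaphore(max_concurrency)
    pages: dict[str, str] = {}
    failed: list[str] = []

    async def worker(url: str) -> None:
        async with semaphore:  # cap the number of in-flight requests
            try:
                pages[url] = await fetch(url)
            except Exception:
                failed.append(url)  # remember the URL for a later retry

    await asyncio.gather(*(worker(url) for url in urls))
    return pages, failed


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request (assumption, offline only)."""
    await asyncio.sleep(0)
    if url.endswith("/broken"):
        raise ConnectionError(url)
    return f"<html><body>{url}</body></html>"


if __name__ == "__main__":
    # Example URLs are placeholders, not a list taken from the scraper.
    demo_urls = ["https://www.amsterdam.nl/", "https://www.amsterdam.nl/broken"]
    pages, failed = asyncio.run(fetch_all(demo_urls, fake_fetch))
    print(len(pages), "fetched;", len(failed), "failed")
```

Passing the fetch coroutine in as a parameter keeps the concurrency logic independent of whichever HTTP client is used, and the semaphore keeps the number of simultaneous requests bounded.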
src: scraper codebase
Clone this repository:
git clone https://github.com/Amsterdam-AI-Team/amsterdam-nl-website-scraper.git
Install all dependencies:
pip install -r requirements.txt
The code has been tested with Python 3.10.0 on Linux, macOS, and Windows.
First, navigate to the source directory:
cd src
Second, run the scrape_amsterdam_nl.py script to scrape HTML pages and images from the Amsterdam.nl website:
python3 scrape_amsterdam_nl.py
This will download and save all HTML pages and images from the specified URLs into designated directories.
Note: re-running this script also reads the failed_html.txt file and retries the HTML pages that failed to download in previous runs.
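The retry behaviour could look roughly like the sketch below. It assumes that failed_html.txt holds one URL per line and that a download function reports success as a boolean. Both are assumptions for illustration, and retry_failed is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path
from typing import Callable


def retry_failed(failed_file: Path, download: Callable[[str], bool]) -> list[str]:
    """Retry each URL listed in failed_file; keep only those that fail again."""
    if not failed_file.exists():
        return []
    # Assumed format: one URL per line; blank lines are ignored.
    lines = failed_file.read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    # download(url) is expected to return True on success, False on failure.
    still_failing = [url for url in urls if not download(url)]
    # Rewrite the file so the next run only retries what is still broken.
    failed_file.write_text(
        "".join(f"{url}\n" for url in still_failing), encoding="utf-8"
    )
    return still_failing
```

Rewriting the file after each pass means the list shrinks over successive runs instead of accumulating URLs that have already been recovered.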
Third, after scraping, run the html_to_txt.py script to convert the downloaded HTML pages into clean text files:
python3 html_to_txt.py
This will process the HTML files, extracting the main content and saving it as text files.
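To give an idea of what extracting the main content can involve, here is a minimal sketch built on the standard library's html.parser. It assumes the main content sits inside a <main> element and that script and style blocks should be dropped. The actual html_to_txt.py may use a different parser and different rules, and the names MainTextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect visible text inside <main>, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.main_depth = 0  # > 0 while inside a <main> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.main_depth and not self.skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text found inside <main>, one text fragment per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Navigation, footers, and inline scripts fall outside the collected text, which is what keeps the resulting files clean.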
Feel free to help out! Open an issue, submit a PR or contact us.
This repository was created by Amsterdam Intelligence for the City of Amsterdam.
This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).