Amsterdam.nl Web Scraper

Scripts to scrape contents (text and images) from www.amsterdam.nl and process the HTML into clean text files.

[Image: amsterdam.nl home screen]

Background

These scripts scrape and process the contents of the Amsterdam.nl website, extracting text and images for analysis and archival purposes. The project uses asynchronous requests so that many pages and resources can be fetched concurrently rather than one at a time.
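
As an illustration of the asynchronous approach, the following is a minimal sketch and not the code in `src`. It assumes a hypothetical `fetch` coroutine that returns the body of a page, and it caps the number of requests in flight with a semaphore. `fake_fetch` and the `page-N` URLs are placeholders so the sketch runs without network access; a real scraper would swap in an HTTP client call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable

# Hypothetical type: any coroutine that takes a URL and returns the page body.
Fetcher = Callable[[str], Awaitable[str]]


async def fetch_all(
    urls: Iterable[str], fetch: Fetcher, max_concurrent: int = 10
) -> Dict[str, str]:
    """Fetch every URL concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results: Dict[str, str] = {}

    async def worker(url: str) -> None:
        # The semaphore keeps the scraper from flooding the server.
        async with semaphore:
            results[url] = await fetch(url)

    await asyncio.gather(*(worker(url) for url in urls))
    return results


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (assumption, for illustration only)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


def demo() -> Dict[str, str]:
    # Placeholder URLs; the real list of pages comes from the scraper itself.
    urls = [f"https://www.amsterdam.nl/page-{i}" for i in range(5)]
    return asyncio.run(fetch_all(urls, fake_fetch, max_concurrent=2))
```

Limiting concurrency is a design choice worth keeping even in a small scraper, since it trades a little speed for politeness towards the target site.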

Folder Structure

  • src: scraper codebase

Installation

  1. Clone this repository:

    git clone https://github.com/Amsterdam-AI-Team/amsterdam-nl-website-scraper.git
  2. Install all dependencies:

    pip install -r requirements.txt

The code has been tested with Python 3.10.0 on Linux, macOS, and Windows.

Usage

Step 1: Navigate to the scripts

First, navigate to the source directory:

cd src

Step 2: Scrape HTML and Images

Second, run the scrape_amsterdam_nl.py script to scrape HTML pages and images from the Amsterdam.nl website:

python3 scrape_amsterdam_nl.py

This will download and save all HTML pages and images from the specified URLs into designated directories.
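
How a URL maps to a file on disk is not documented here. One plausible scheme, shown purely as an assumption, derives a flat file name from the URL path. The function name `url_to_path` and the `html` output directory are hypothetical and may differ from what `scrape_amsterdam_nl.py` actually does.

```python
from pathlib import Path
from urllib.parse import urlparse


def url_to_path(url: str, output_dir: str = "html") -> Path:
    """Map a page URL to a local .html file path (hypothetical naming scheme)."""
    parsed = urlparse(url)
    # Drop leading/trailing slashes and flatten the rest;
    # an empty path means the site root, stored as "index".
    slug = parsed.path.strip("/").replace("/", "_") or "index"
    return Path(output_dir) / f"{slug}.html"
```

For example, `url_to_path("https://www.amsterdam.nl/afval/grofvuil/")` yields `html/afval_grofvuil.html` under this scheme.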

Note: re-running this script also reads the failed_html.txt file and retries the HTML pages that failed to download in previous runs.
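
The retry behaviour could look roughly like the sketch below. It assumes that failed_html.txt holds one URL per line, which this README does not confirm, and it takes a hypothetical `download` callable that returns `True` on success. The real script may store and process failures differently.

```python
from pathlib import Path
from typing import Callable, List


def retry_failed(failed_file: Path, download: Callable[[str], bool]) -> List[str]:
    """Retry each URL listed in `failed_file`; keep only those that fail again."""
    if not failed_file.exists():
        return []
    # Assumed format: one URL per line, blank lines ignored.
    lines = failed_file.read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    still_failing = [url for url in urls if not download(url)]
    # Rewrite the file so the next run only sees the remaining failures.
    failed_file.write_text(
        "".join(f"{url}\n" for url in still_failing), encoding="utf-8"
    )
    return still_failing
```

Rewriting the file after each pass means repeated runs converge on the set of pages that are genuinely unreachable.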

Step 3: Convert HTML to Text

Third, after scraping, run the html_to_txt.py script to convert the downloaded HTML pages into clean text files:

python3 html_to_txt.py

This will process the HTML files, extracting the main content and saving it as text files.
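
To give a feel for this step, here is a minimal sketch built on the standard library's `html.parser`. It is not the logic in `html_to_txt.py`: the set of skipped tags and the one-text-node-per-line output are assumptions chosen for illustration, and the actual script may use a different parser and different rules for finding the main content.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring scripts, styles, and page chrome."""

    # Assumed set of tags whose contents are not main content.
    SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty text nodes.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        # Each text node ends up on its own line.
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because every text node becomes its own line, inline markup such as bold or links splits a sentence across lines; a production converter would need to treat block and inline elements differently.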

Contributing

Feel free to help out! Open an issue, submit a PR or contact us.

Acknowledgements

This repository was created by Amsterdam Intelligence for the City of Amsterdam.

License

This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).
