Amsterdam.nl Web Scraper

Scripts to scrape contents (text and images) from www.amsterdam.nl and process the HTML into clean text files.

Background

These scripts scrape and process the contents of the Amsterdam.nl website, extracting text and images for analysis and archival purposes. The project uses asynchronous requests so that multiple pages and resources can be fetched concurrently.
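To illustrate the asynchronous approach described above, here is a minimal sketch using only the standard library. It is not the repository's implementation: the names `fetch_all`, `fake_fetch`, and `max_concurrency` are hypothetical, and `fake_fetch` stands in for whatever HTTP client the real script uses. It shows concurrent fetching with a cap on the number of simultaneous requests.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

# Hypothetical fetcher type: takes a URL and returns the page body.
Fetcher = Callable[[str], Awaitable[str]]


async def fetch_all(
    urls: Iterable[str], fetch: Fetcher, max_concurrency: int = 10
) -> Dict[str, Optional[str]]:
    """Fetch every URL concurrently, at most max_concurrency at a time.

    URLs whose fetch raises map to None so the caller can log them for a retry.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Optional[str]:
        async with semaphore:  # limits the number of in-flight requests
            try:
                return await fetch(url)
            except Exception:
                return None

    url_list = list(urls)
    bodies = await asyncio.gather(*(fetch_one(u) for u in url_list))
    return dict(zip(url_list, bodies))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call; fails for URLs containing 'broken'."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if "broken" in url:
        raise ConnectionError(url)
    return f"<html>{url}</html>"
```

Passing the fetcher in as a parameter keeps the concurrency logic independent of any particular HTTP library, for example `asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=5))`.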

Folder Structure

src: scraper codebase

Installation

Clone this repository:

git clone https://github.com/Amsterdam-AI-Team/amsterdam-nl-website-scraper.git

Install all dependencies:
```
pip install -r requirements.txt
```

The code has been tested with Python 3.10.0 on Linux, macOS, and Windows.

Usage

Step 1: Navigate to scripts

First, navigate to the source directory:

cd src

Step 2: Scrape HTML and Images

Second, run the scrape_amsterdam_nl.py script to scrape HTML pages and images from the Amsterdam.nl website:

python3 scrape_amsterdam_nl.py

This will download and save all HTML pages and images from the specified URLs into designated directories.

Note: re-running this script also reads the failed_html.txt file and retries the HTML pages that failed to download in previous runs.
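The retry behaviour in the note above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: it assumes failed_html.txt holds one URL per line, and the names `load_failed_urls`, `retry_failed`, and the `download` callback are hypothetical.

```python
from pathlib import Path
from typing import Callable, List


def load_failed_urls(path: Path) -> List[str]:
    """Return the unique URLs in the failure log, in order; [] if the log is missing."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(line.strip() for line in lines if line.strip()))


def retry_failed(path: Path, download: Callable[[str], bool]) -> List[str]:
    """Retry every logged URL and rewrite the log with those that still fail.

    `download` is assumed to return True on success and False on failure.
    """
    still_failing = [url for url in load_failed_urls(path) if not download(url)]
    path.write_text("".join(url + "\n" for url in still_failing), encoding="utf-8")
    return still_failing
```

Rewriting the log after each pass means it shrinks as retries succeed, so repeated runs only revisit pages that are still failing.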

Step 3: Convert HTML to Text

Third, after scraping, run the html_to_txt.py script to convert the downloaded HTML pages into clean text files:

python3 html_to_txt.py

This will process the HTML files, extracting the main content and saving it as text files.
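As a rough idea of what such a conversion involves, the sketch below extracts readable text with Python's built-in `html.parser`. It is an assumption-laden illustration: html_to_txt.py may use a different library and different rules for finding the main content, and the names `TextExtractor`, `html_to_text`, `SKIPPED_TAGS`, and `BLOCK_TAGS` are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping page chrome and non-content tags."""

    # Assumed set of tags whose contents are not main content.
    SKIPPED_TAGS = {"script", "style", "head", "nav", "footer"}
    # Tags that start or end a line in the text output.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "section", "article",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS and self._skip_depth == 0:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse whitespace runs (including newlines) to single spaces.
            self._chunks.append(re.sub(r"\s+", " ", data))

    def text(self) -> str:
        lines = (" ".join(line.split()) for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The sketch treats inline tags such as `<b>` as part of the surrounding line and only breaks lines at block-level tags, which keeps sentences intact in the output.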

Contributing

Feel free to help out! Open an issue, submit a PR or contact us.

Acknowledgements

This repository was created by Amsterdam Intelligence for the City of Amsterdam.

License

This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).
