A scraper for ratsinfo.leipzig.de.

Requirements:
- Python 3.8
- pyenv (optional)
Build the Docker image:
docker build -t codeforleipzig/allris-scraper:latest .
Run the Docker container:
docker run -v $(pwd)/data:/app/data --rm codeforleipzig/allris-scraper
It is recommended to use a virtual environment to isolate this project's libraries from your operating system's environment. To do so, run the following in the project directory:
# create the virtual environment in the project directory; do this once
python3 -m venv venv
# activate the environment; do this before working with the scraper
source venv/bin/activate
# install the required libraries
pip3 install -r requirements.txt
To run the scraper with Python:
python3 ./1_read_paper_json.py --page_from 1 --page_to 1000 --modified_to 2023-04-27 --modified_from 2023-04-19
python3 ./2_download_pdfs.py
python3 ./3_txt_extraction.py
python3 ./4_srm_import.py
The scraper writes its output to the data directory. One file is written per scraping session; the filename convention is <OParl object type>_<current timestamp>.jl. For example, when scraping papers: paper_2020-06-19T10-19-16.jl.
The output is a feed in JSON Lines format, i.e. one scraped JSON document per line. For inspecting the data, the jq tool is useful and can be used like this:
# all documents in the file
cat path/to/file | jq .
# only the first document
head -n1 path/to/file | jq .
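The same feed can also be loaded in Python. A minimal sketch, assuming the example filename from above (the name field is part of the OParl Paper schema):

import json

# read one OParl document per line from a JSON Lines feed
with open("data/paper_2020-06-19T10-19-16.jl") as feed:
    papers = [json.loads(line) for line in feed if line.strip()]

print(len(papers), "documents loaded")
print(papers[0].get("name"))  # OParl Paper objects carry a human-readable name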
The method download_pdfs() in the leipzig.py file downloads all PDFs linked in the JSONLines files and saves them in data/pdfs. Files that already exist in that folder are not downloaded again.
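The skip-if-present behaviour can be sketched as follows; the helper name, the URL handling, and the use of requests are illustrative assumptions, not the actual implementation in leipzig.py:

import os
import requests

def download_pdf(url, target_dir="data/pdfs"):
    # hypothetical helper; the real logic lives in download_pdfs() in leipzig.py
    os.makedirs(target_dir, exist_ok=True)
    filename = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    if os.path.exists(filename):
        return filename  # already downloaded, skip
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename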
From the PDF files, TXT files can be generated with the extract_text_from_pdfs_recursively() method in txt_extraction.py, using Tika. The TXT files are saved to data/txts. Files that have already been extracted to that folder are skipped.
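A minimal sketch of such an extraction step with the tika Python package; the paths and the helper name are illustrative, the actual logic lives in extract_text_from_pdfs_recursively():

import os
from tika import parser  # needs a Java runtime; spawns a local Tika server on first use

def pdf_to_txt(pdf_path, txt_dir="data/txts"):
    # hypothetical helper mirroring the skip-if-present behaviour described above
    os.makedirs(txt_dir, exist_ok=True)
    txt_name = os.path.basename(pdf_path).replace(".pdf", ".txt")
    txt_path = os.path.join(txt_dir, txt_name)
    if os.path.exists(txt_path):
        return txt_path  # already extracted, skip
    parsed = parser.from_file(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(parsed.get("content") or "")
    return txt_path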
Scrapy allows for configuration on various levels. General configuration can be found in allris/settings.py. For the purposes of this project, relevant values are overridden in leipzig.py. By default, the configuration is geared towards development needs: aggressive caching is enabled (HTTPCACHE_ENABLED) and the number of scraped pages is limited (CLOSESPIDER_PAGECOUNT).
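In a Scrapy spider, such per-project overrides typically look like the sketch below; the class name and the values are illustrative, not the actual contents of leipzig.py:

import scrapy

class LeipzigSpider(scrapy.Spider):
    name = "leipzig"
    # per-spider overrides of the defaults in allris/settings.py
    custom_settings = {
        "HTTPCACHE_ENABLED": True,      # cache responses aggressively during development
        "CLOSESPIDER_PAGECOUNT": 100,   # stop the crawl after a limited number of pages
    }

    def parse(self, response):
        # parsing logic omitted in this sketch
        pass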
Prerequisite: leipzig.py scraper has been run and downloaded files to data/pdfs.
Run
python3 ./txt_extraction.py
to extract the texts from the PDFs. Files will be created under data/txts.
Prerequisite: txt_extraction.py has been run.
Run
python3 ./nlp.py
to join those text files as rows into a CSV file, which is created as data/data.csv. This file can be used for further NLP processing.
nlp.py provides a method read_txts_into_dataframe() to read all TXT files in data/txts into a pandas DataFrame, and a method write_df_to_csv() to save this DataFrame in CSV format as data.csv in the data folder.
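What these two methods amount to can be sketched with plain pandas; the column names are assumptions, not necessarily the ones used in nlp.py:

import glob
import os
import pandas as pd

# read every TXT file into one row of a DataFrame
rows = []
for path in glob.glob("data/txts/*.txt"):
    with open(path, encoding="utf-8") as f:
        rows.append({"file": os.path.basename(path), "text": f.read()})
df = pd.DataFrame(rows)

# write the DataFrame to CSV for further NLP processing
df.to_csv("data/data.csv", index=False)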
To make the obtained documents more accessible to users interested in certain topics, topic modeling has been run on the extracted documents with the R software tidyToPān. The resulting model will later be used, e.g., for a search function.
For NLP processing with spaCy, download the German language model:
python -m spacy download de_core_news_sm
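Once downloaded, the model can be loaded with spaCy as sketched below; the sample sentence is only an illustration:

import spacy

# load the German language model downloaded above
nlp = spacy.load("de_core_news_sm")
doc = nlp("Der Stadtrat von Leipzig tagt im Neuen Rathaus.")
print([(token.text, token.pos_) for token in doc])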