Disease database

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Code to create a historical disease database (19th-20th century) for municipalities in the Netherlands.

(Figure: Cholera in the Netherlands)

Preparation

This project uses pyproject.toml to handle its dependencies. You can install them using pip like so:

pip install .

However, we recommend using uv to manage the environment. First install uv, then clone or download this repo, and run:

uv sync

This will automatically install the right Python version, create a virtual environment, and install the required packages. If you choose not to use uv, you can replace uv run in the code examples in this repo with python.

Note: on macOS, if you encounter error: command 'cmake' failed: No such file or directory, you need to install cmake first (brew install cmake). Similarly, you may have to install apache-arrow separately (brew install apache-arrow).

Once these dependency issues are solved, run uv sync one more time.

Data extraction (1830-1879)

Between 1830 and 1879, Delpher historical news article data can be downloaded manually from here. The downloaded files, which are zip archives, take up a lot of disk space because of their inefficient data format.

The src/process_open_archive/extract_article_data.py script extracts the title and text of each article from these zip archives. It stores the extracted data as a polars dataframe with three columns: article_id, article_title, and article_text. Finally, this dataframe is saved as a much smaller parquet file (article_data_{start_year}_{end_year}.parquet) under processed_data/texts/from_1830_to_1879/.
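For reference, the output has the following shape (a minimal sketch; the example row mirrors the record shown in the Database creation section below):

import polars as pl

# Illustrative output of extract_article_data.py: three string columns,
# written to a compact parquet file.
articles = pl.DataFrame(
    {
        "article_id": ["ddd:010041217:mpeg21:a0001"],
        "article_title": [None],
        "article_text": ["De GOUVERNEUR der PROVINCIE GELDERLAND ..."],
    },
    schema={
        "article_id": pl.String,
        "article_title": pl.String,
        "article_text": pl.String,
    },
)
articles.write_parquet(
    "processed_data/texts/from_1830_to_1879/article_data_1830_1879.parquet"
)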

With the src/process_open_archive/extract_meta_data.py script, we extract metadata about both the newspapers and the individual articles. This results in two polars dataframes, saved in parquet format under processed_data/metadata/newspapers/from_1830_to_1879 and processed_data/metadata/articles/from_1830_to_1879, respectively (a quick way to inspect them follows the list below).

  1. newspaper_meta_data_{start_year}_{end_year}.parquet includes these columns: newspaper_name, newspaper_location, newspaper_date, newspaper_years_digitalised, newspaper_years_issued, newspaper_language, newspaper_temporal, newspaper_publisher and newspaper_spatial.
  2. article_meta_data_{start_year}_{end_year}.parquet includes these columns: newspaper_id, article_id and article_subject.
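To verify that these outputs contain the documented columns, you can glimpse at them in the same way as elsewhere in this readme:

import polars as pl

# Peek at the first record of each metadata output to check its columns.
pl.scan_parquet(
    "processed_data/metadata/newspapers/from_1830_to_1879/*.parquet"
).head(1).collect().glimpse()
pl.scan_parquet(
    "processed_data/metadata/articles/from_1830_to_1879/*.parquet"
).head(1).collect().glimpse()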

Before you run the following scripts, make sure to put all the Delpher zip files under raw_data/open_archive.

uv run src/process_open_archive/extract_article_data.py
uv run src/process_open_archive/extract_meta_data.py

Then, run

uv run src/process_open_archive/combine_and_chunk.py

to join all the available datasets and create a yearly-chunked series of parquet files in the folder processed_data/combined.

Data harvesting (1880-1940)

After 1880, the data is not public and can only be obtained through the Delpher API:

  1. Obtain an API key (which looks like this: df2e02aa-8504-4af2-b3d9-64d107f4479a) from the Royal Library / the Delpher maintainers, then put the API key in the file src/harvest_delpher_api/apikey.txt.
  2. Harvest the data following the readme in the Delpher API folder: src/harvest_delpher_api/readme.md.

Database creation

After the data for 1830-1940 has been harvested and processed, the folder processed_data/combined should be filled with .parquet files. The first record looks like this:

import polars as pl
pl.scan_parquet("processed_data/combined/*.parquet").head(1).collect().glimpse()
$ newspaper_id                 <str> 'ddd:010041217:mpeg21'
$ article_id                   <str> 'ddd:010041217:mpeg21:a0001'
$ article_subject              <str> 'artikel'
$ article_title                <str> None
$ article_text                 <str> 'De GOUVERNEUR der PROVINCIE GELDERLAND ...'
$ newspaper_name               <str> 'Arnhemsche courant'
$ newspaper_location           <str> 'Arnhem'
$ newspaper_date              <date> 1830-01-02
$ newspaper_years_digitalised  <str> '1814 t/m 1850'
$ newspaper_years_issued       <str> '1814-2001'
$ newspaper_language           <str> 'nl'
$ newspaper_temporal           <str> 'Dag'
$ newspaper_publisher          <str> 'C.A. Thieme'
$ newspaper_spatial            <str> 'Regionaal/lokaal'

Step 1: pre-processing / re-partitioning

To make our data processing much faster, we will now process all these files into a hive-partitioned parquet folder, with subfolders for each year. This is done using the following code:

uv run src/create_database/preproc.py

After this step, the folder processed_data/partitioned will contain differently organized parquet files, but with exactly the same information.
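As an illustration, the repartitioning amounts to something like the following sketch (the actual logic lives in src/create_database/preproc.py; here we assume the year is taken from newspaper_date):

import polars as pl
from pathlib import Path

# Minimal sketch: write one subfolder per year, following the
# hive-partitioning convention (year=YYYY).
lf = pl.scan_parquet("processed_data/combined/*.parquet")
for year in range(1830, 1941):
    out_dir = Path(f"processed_data/partitioned/year={year}")
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk = lf.filter(pl.col("newspaper_date").dt.year() == year).collect()
    chunk.write_parquet(out_dir / "00000000.parquet")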

Step 2: database computation

NB: from this step onwards, we ran the code on a Linux (Ubuntu) machine with more than 200 cores and 1 TB of memory.

The next step is to create the actual database we are interested in. There are three inputs for this:

Input                                             Description
raw_data/manual_input/disease_search_terms.xlsx   A list of diseases and their regex search definitions
raw_data/manual_input/location_search_terms.xlsx  A list of locations and their regex search definitions
processed_data/partitioned/**/*.parquet           The texts of all articles from 1830-1940

The following command takes these inputs, performs the regex searches, and outputs (many) .parquet files to processed_data/database_flat. On our big machine, this takes about 12 hours.

uv run src/create_database/main.py

It may be better to run this in the background, immune to hangup signals:

nohup uv run src/create_database/main.py &

The resulting data looks approximately like this:

import polars as pl
pl.scan_parquet("processed_data/database_flat/*.parquet").head().collect()
shape: (5, 8)
┌──────┬───────┬────────────┬────────┬────────────┬─────────┬───────────────┬─────────┐
│ year ┆ month ┆ n_location ┆ n_both ┆ location   ┆ cbscode ┆ amsterdamcode ┆ disease │
│ ---  ┆ ---   ┆ ---        ┆ ---    ┆ ---        ┆ ---     ┆ ---           ┆ ---     │
│ i32  ┆ i8    ┆ u32        ┆ u32    ┆ str        ┆ i32     ┆ i32           ┆ str     │
╞══════╪═══════╪════════════╪════════╪════════════╪═════════╪═══════════════╪═════════╡
│ 1834 ┆ 6     ┆ 1          ┆ 0      ┆ Aagtekerke ┆ 1000    ┆ 10531         ┆ typhus  │
│ 1833 ┆ 12    ┆ 3          ┆ 0      ┆ Aagtekerke ┆ 1000    ┆ 10531         ┆ typhus  │
│ 1834 ┆ 9     ┆ 1          ┆ 0      ┆ Aagtekerke ┆ 1000    ┆ 10531         ┆ typhus  │
│ 1832 ┆ 5     ┆ 1          ┆ 0      ┆ Aagtekerke ┆ 1000    ┆ 10531         ┆ typhus  │
│ 1831 ┆ 4     ┆ 2          ┆ 0      ┆ Aagtekerke ┆ 1000    ┆ 10531         ┆ typhus  │
└──────┴───────┴────────────┴────────┴────────────┴─────────┴───────────────┴─────────┘

In this format, the column n_location contains the number of articles in which the location / municipality was detected, and the column n_both contains the number of those location-mentioning articles that also mention the disease.
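Conceptually, the per-month counting works like this sketch (hedged: the real implementation is src/create_database/main.py, and the two regexes below merely stand in for the search definitions from the xlsx files):

import polars as pl

# Hypothetical regexes standing in for one location and one disease definition.
location_re = r"(?i)aagtekerke"
disease_re = r"(?i)typhus|tyfus"

counts = (
    pl.scan_parquet("processed_data/partitioned/**/*.parquet")
    .with_columns(
        year=pl.col("newspaper_date").dt.year(),
        month=pl.col("newspaper_date").dt.month(),
        has_location=pl.col("article_text").str.contains(location_re),
        has_disease=pl.col("article_text").str.contains(disease_re),
    )
    .group_by("year", "month")
    .agg(
        n_location=pl.col("has_location").sum(),
        n_both=(pl.col("has_location") & pl.col("has_disease")).sum(),
    )
    .collect()
)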

Step 3: post-processing

The last step is to organise the data (e.g., sorting by date), compute the normalized mentions, and add uncertainty intervals (through the Jeffreys interval):

uv run src/create_database/postproc.py
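The interval itself can be computed as in this sketch, assuming the standard Beta(1/2, 1/2) prior formulation of the Jeffreys interval and SciPy as a stand-in for whatever postproc.py uses:

from scipy.stats import beta

# Jeffreys 95% interval for n_both "successes" out of n_location "trials":
# the 2.5% and 97.5% quantiles of Beta(n_both + 1/2, n_location - n_both + 1/2),
# with the conventional boundary adjustments.
def jeffreys_interval(n_both: int, n_location: int, level: float = 0.95):
    a = n_both + 0.5
    b = n_location - n_both + 0.5
    lower = 0.0 if n_both == 0 else beta.ppf((1 - level) / 2, a, b)
    upper = 1.0 if n_both == n_location else beta.ppf(1 - (1 - level) / 2, a, b)
    return lower, upper

print(jeffreys_interval(0, 6))  # (0.0, ~0.330), cf. the Aalsmeer row below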

The resulting data folder processed_data/database looks like this:

database/
├── disease=cholera/
│   └── 00000000.parquet
├── disease=diphteria/
│   └── 00000000.parquet
├── disease=dysentery/
│   └── 00000000.parquet
├── disease=influenza/
│   └── 00000000.parquet
├── disease=malaria/
│   └── 00000000.parquet
├── disease=measles/
│   └── 00000000.parquet
├── disease=scarletfever/
│   └── 00000000.parquet
├── disease=smallpox/
│   └── 00000000.parquet
├── disease=tuberculosis/
│   └── 00000000.parquet
└── disease=typhus/
    └── 00000000.parquet

Now, for example, the typhus mentions in early 1835 look like this:

import polars as pl
lf = pl.scan_parquet("processed_data/database/**/*.parquet")
lf.filter(pl.col("disease") == "typhus", pl.col("year") == 1835).head(10).collect()
┌─────────┬──────┬───────┬───────────────┬─────────┬───────────────┬─────────────────────┬───────┬──────────┬────────────┬────────┐
│ disease ┆ year ┆ month ┆ location      ┆ cbscode ┆ amsterdamcode ┆ normalized_mentions ┆ lower ┆ upper    ┆ n_location ┆ n_both │
│ ---     ┆ ---  ┆ ---   ┆ ---           ┆ ---     ┆ ---           ┆ ---                 ┆ ---   ┆ ---      ┆ ---        ┆ ---    │
│ str     ┆ i32  ┆ i8    ┆ str           ┆ i32     ┆ i32           ┆ f64                 ┆ f64   ┆ f64      ┆ u32        ┆ u32    │
╞═════════╪══════╪═══════╪═══════════════╪═════════╪═══════════════╪═════════════════════╪═══════╪══════════╪════════════╪════════╡
│ typhus  ┆ 1835 ┆ 1     ┆ Aalsmeer      ┆ 358     ┆ 11264         ┆ 0.0                 ┆ 0.0   ┆ 0.330389 ┆ 6          ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Aalst         ┆ 1001    ┆ 11423         ┆ 0.0                 ┆ 0.0   ┆ 0.444763 ┆ 4          ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Aalten        ┆ 197     ┆ 11046         ┆ 0.0                 ┆ 0.0   ┆ 0.853254 ┆ 1          ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Aarlanderveen ┆ 1002    ┆ 11242         ┆ 0.0                 ┆ 0.0   ┆ 0.330389 ┆ 6          ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Aduard        ┆ 2       ┆ 10999         ┆ 0.0                 ┆ 0.0   ┆ 0.262217 ┆ 8          ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Akersloot     ┆ 360     ┆ 10346         ┆ 0.0                 ┆ 0.0   ┆ 0.666822 ┆ 2          ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Alblasserdam  ┆ 482     ┆ 11327         ┆ 0.0                 ┆ 0.0   ┆ 0.666822 ┆ 2          ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Alkmaar       ┆ 361     ┆ 10527         ┆ 0.0                 ┆ 0.0   ┆ 0.045246 ┆ 54         ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Alphen        ┆ 1008    ┆ 10517         ┆ 0.0                 ┆ 0.0   ┆ 0.11147  ┆ 21         ┆ 0      │
│ typhus  ┆ 1835 ┆ 1     ┆ Ambt Delden   ┆ 142     ┆ 11400         ┆ 0.0                 ┆ 0.0   ┆ 0.444763 ┆ 4          ┆ 0      │
└─────────┴──────┴───────┴───────────────┴─────────┴───────────────┴─────────────────────┴───────┴──────────┴────────────┴────────┘

Data analysis

For a basic analysis after the database has been created, take a look at the file src/analysis/query_db.py.
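For instance, a simple query against the final database could look like this (a hypothetical example in the spirit of that script):

import polars as pl

# Monthly normalized cholera mentions for Amsterdam, with uncertainty bounds.
cholera_amsterdam = (
    pl.scan_parquet("processed_data/database/**/*.parquet")
    .filter(pl.col("disease") == "cholera", pl.col("location") == "Amsterdam")
    .sort("year", "month")
    .select("year", "month", "normalized_mentions", "lower", "upper")
    .collect()
)
print(cholera_amsterdam)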

For more in-depth analysis and usage scripts, take a look at our analysis repository: disease_database_analysis.

Contact

This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.

Do you have questions, suggestions, or remarks? File an issue in the issue tracker or feel free to contact the team at odissei-soda.nl.