Skip to content

davidmz/anonymize-db

Repository files navigation

FreeFeed database anonymizer

This tool anonymizes the FreeFeed database dump.

It takes the pg_dump result (in plain SQL format) and anonymizes it in the following way:

  • Some (configurable) tables are keeping untouched.
  • Some (configurable) tables are cleaning up and becoming empty.
  • In all other tables:
    • Some or all (configurable) UUIDs are irreversibly encrypting. They remain correct UUIDs, however it is impossible to restore the original UUIDs back.
    • Some (configurable) column values are replacing by lorem ipsum-like content. This applies to texts, names, emails etc. There a several rules for different types of content.
    • Some (configurable) column values are replacing by the predefined constant values.

Build

Use Go 1.14+ to build this program:

go get github.com/davidmz/anonymize-db

Use

Flags of anonymize-db:
  -c, --config string   config file name (JSON)
  -i, --input string    input file name (STDIN by default)
  -o, --output string   output file name (STDOUT by default)

The only required option is the path to config file. This file contains specific anonymizing rules for different database tables. There are two predefined configuration files in this repository:

  • config.anon.json — fully anonymizes database with all UUIDs and user data;
  • config.candy.json — anonymizes database for the stage server (candy.freefeed.net): usernames, hashed passwords and UUIDs are not changes (so users can sign in with their freefeed credentials), but all texts and attachments are anonymizes;

The simplest usage is:

pg_dump -d freefeed -U freefeed -F p | anonymize-db -c config.anon.json > anonymized.sql

Configure

The config file contains specific anonymizing rules for different database tables. The file contains a mapping between the full-qualified table names (including schema names, like "public.users") and instructions.

The top-level encryptUUIDs boolean value defines whether the all database UUIDs should be encrypted.

The { "keep": true } per-table instruction means that the table must be keeping untouched.

The { "clean": true } per-table instruction means that the table must be truncated and become empty.

The { "columns": {...} } per-table instruction defines the per-column anonymizing rules. Any column that are not listed here (and any tables that are not listed in the file) will be processed in general way: only the UUIDs will be encrypted if the top-level encryptUUIDs is true.

The per-column rules are:

  • text — long texts like posts and comments bodies;
  • shorttext — short texts like names and titles;
  • uniqword — unique words like usernames (the same username always converted to the same word);
  • uniqemail — unique email addresses;
  • shortid — unique 8-characters hexadecimal IDs (the same source always converted to the same short ID);
  • uuids — encrypt any UUIDs in column (if the global encryptUUIDs parameter isn't set);
  • set:VALUE — constant value to write to the column.

The set: value format is the same as the PostgreSQL COPY format, so to set a NULL value use set:\N rule (and add the second backslash for JSON).

Only non-NULL and non-empty columns processed.

Provided configurations

The provided configs are compatible with FreeFeed DB scheme with the last migration 20240831105941_attachments_ext_index.ts.

The hashed_password value in the public.users table (config.anon.json) is a hashed 'tester' string (so all users in resulted database will have the same 'tester' password).

Both configs are cleans up the posts and comments body_tsvector fields. You must rebuild the search index after the anonymization if you want to use search.

Debug

Set DEBUG=anon environment variable to see some debug information during the process.