MeasuringFairnessWithBiasedData

This repository contains the code and some artifacts to reproduce the paper:

Sarah Schröder, Alexander Schulz, Ivan Tarakanov, Robert Feldhans and Barbara Hammer. Measuring fairness with biased data: A case study on the effects of unsupervised data in fairness evaluation. Accepted at IWANN 2023 and to be published in the LNCS proceedings.

Requirements

We provide our conda environment in env.yml. The following packages are necessary to run the notebooks:

  • numpy
  • scipy
  • pandas
  • matplotlib
  • seaborn
  • pytorch
  • scikit-learn

In addition, we used a custom wrapper for Hugging Face embeddings.

Dataset

The paper investigates the quality of the BIOS dataset. The dataset itself is not published for copyright and data protection reasons; however, a crawler to download it can be found here. The results of our review are provided in data/BIOS_LABELS.csv. It contains no sensitive information such as the biographies themselves or names, only the original ids, labels and paths together with our review results, so that it can be merged with a locally downloaded copy of the BIOS dataset.

The following columns from the original dataset were kept:

  • path
  • gender
  • start_pos
  • URI
  • auto_raw_title (renamed from raw_title)
  • auto_title (renamed from title)

These columns refer to our review:

  • review (1 if the sample was reviewed, otherwise 0)
  • raw_titles (a list of raw_titles determined by the annotator)
  • titles (a list of titles derived from the raw titles and the title_lookup.json, including only occupations used in the paper)
  • valid (1 if the annotator decided the sample was a valid biography (see paper), otherwise 0; -1 for samples not reviewed)
  • style_valid (1 if the annotator decided the text style matched a biography, in addition to being valid; otherwise 0, or -1 if not reviewed or invalid)
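
A quick way to get a feel for the labels file is to load it with pandas and inspect the review columns. This is a minimal sketch, assuming only that data/BIOS_LABELS.csv parses with pandas' default CSV settings:

```python
import pandas as pd

# Load the review labels shipped with this repository.
labels = pd.read_csv("data/BIOS_LABELS.csv")

# Columns kept from the original dataset plus our review columns.
print(labels.columns.tolist())

# Fraction of samples that were actually reviewed (review == 1).
print((labels["review"] == 1).mean())

# Distribution of validity decisions among the reviewed samples.
print(labels.loc[labels["review"] == 1, "valid"].value_counts())
```

Note that raw_titles and titles hold lists and will typically come back from read_csv as plain strings; check the file itself for how they are serialized before parsing them.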

Reproducing the paper

Finding candidate samples for review

Use filter_dataset.ipynb to obtain the selection of approx. 20,000 potentially problematic samples that we used for the review.
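
If you only need the resulting selection rather than the filtering procedure itself, the candidates can also be recovered from data/BIOS_LABELS.csv via the review column. This is a sketch under the assumption that every filtered candidate was subsequently reviewed:

```python
import pandas as pd

labels = pd.read_csv("data/BIOS_LABELS.csv")

# Samples with review == 1 are the candidates that went into the review.
candidates = labels[labels["review"] == 1]
print(f"{len(candidates)} reviewed candidate samples")
```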

Add review labels to dataset

If you downloaded the BIOS dataset yourself, you may use add_changes_to_dataset.ipynb to add our labels to the dataset. Move the downloaded BIOS.pkl to the data/ directory or change the path in the notebook.
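
In outline, the merge performed by the notebook looks roughly like the following. This is a minimal sketch, not the notebook itself: it assumes BIOS.pkl unpickles to a list of dicts whose keys include the identifying columns listed above, and the choice of merge keys is hypothetical; add_changes_to_dataset.ipynb is authoritative.

```python
import pickle

import pandas as pd

# Load the crawled BIOS dataset (assumed to unpickle to a list of dicts).
with open("data/BIOS.pkl", "rb") as f:
    bios = pd.DataFrame(pickle.load(f))

# Load the review labels. auto_raw_title/auto_title were renamed from
# raw_title/title, so they do not collide with the crawled columns.
labels = pd.read_csv("data/BIOS_LABELS.csv")

# Attach the review labels via the identifying columns kept from the
# original dataset (hypothetical key choice).
merged = bios.merge(labels, on=["path", "gender", "start_pos", "URI"], how="left")
```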

Experiments from the paper

To reproduce our experiments, you first need to download the dataset yourself and merge our labels (see above). Then you can run them with data_review.ipynb. This includes training 30 BERT models, so it may take a while.
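
Given the label semantics above, filtered variants of the dataset, such as one without the biographies the annotators judged invalid, can be constructed directly from the label columns. The variants below are hypothetical illustrations; the actual experimental conditions are defined in data_review.ipynb and the paper:

```python
import pandas as pd

labels = pd.read_csv("data/BIOS_LABELS.csv")

# Drop samples explicitly judged invalid (valid == 0); valid == -1
# marks samples that were never reviewed and is kept here.
without_invalid = labels[labels["valid"] != 0]

# Stricter variant: keep only samples whose text style was explicitly
# confirmed to match a biography.
style_confirmed = labels[labels["style_valid"] == 1]

print(len(without_invalid), len(style_confirmed))
```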
