This repository contains the code to reproduce the results of the paper "Bias Silhouette Analysis: Towards Assessing the Quality of Bias Metrics for Word Embedding Models", as presented at the IJCAI 2021 conference. Please find the full reference below:
@InProceedings{spliethoever:2021,
title = {Bias Silhouette Analysis: Towards Assessing the Quality of Bias Metrics for Word Embedding Models},
author = {Maximilian Splieth{\"o}ver and Henning Wachsmuth},
booktitle = {Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, {IJCAI-21}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {Zhi-Hua Zhou},
pages = {552--559},
year = {2021},
month = {aug},
doi = {10.24963/ijcai.2021/77},
url = {https://doi.org/10.24963/ijcai.2021/77},
}
You can find the paper's supplementary material in the bias-silhouette-analysis-supplementary.pdf file.
The code in this repository was tested, and the results were created, with Python 3.8 on Linux, as defined in the Pipfile.
All required Python packages and their versions are defined in the Pipfile. When using pipenv, install the requirements with:
$ pipenv install
Further, some of the scripts build on the spaCy language model en_core_web_sm, which needs to be installed as well:
$ python -m spacy download en_core_web_sm
The study builds on different pre-trained word embedding models. Both the GloVe embedding model and the ConceptNet Numberbatch model are downloaded at runtime by the embeddings library (only on the first run; subsequent runs re-use the downloaded models). Since the library did not support the latter model at the time of evaluation, we ship a custom implementation with this repository.
The central processing and evaluation pipeline is defined and commented in the run-pipeline.sh file. All critical settings are defined as variables at the beginning of the file and can be adapted if necessary. By default, all generated outputs are placed in a sub-directory of output/metric-evaluation/, which is created at run time. Once you have adapted the variables, start the pipeline:
$ bash run-pipeline.sh
This code is built in a way that should make it fairly easy to extend and run with different metrics and word embedding models, as long as those fulfill certain requirements.
- Most importantly, each metric needs its own evaluation function. The evaluation function executes the metric evaluation with a given lexicon on a given embedding model. A custom implementation per metric is required since some metrics use a different number of input lexicons or need an input formatted in a certain way. If your metric uses four input lexicons, the evaluation function of the WEAT metric, weat_evaluation, can be used as a guideline (and probably almost entirely re-used). If your metric uses three inputs, the evaluation function of the RNSB metric, rnsb_evaluation, can be of help. For the documentation and parameters of such a function, please also refer to one of the evaluation function implementations.
- Secondly, you need to add the metric to the metric_evaluation.py file. For this, you can follow the necessary additions from, e.g., the WEAT metric, which you'll find in lines 118ff. and 183. If your metric uses either the WEAT or the RNSB lexicon format, you can use the lexicon preparation functions implemented in webias/utils.py. Otherwise, you need to supply your own lexicon preparation, which creates all the different lexicon shuffles/variations with which a metric will be evaluated.
- Lastly, you have to adapt the run-pipeline.sh file to actually run the evaluation with your new metric.
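To illustrate the first item, here is a minimal sketch of what an evaluation function for a metric with four input lexicons could look like. It computes a WEAT-style effect size on made-up two-dimensional vectors. The function name toy_weat_evaluation, its signature, and the plain dict used as embedding lookup are assumptions for illustration; they do not mirror the actual weat_evaluation implementation in this repository.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word, attr_a, attr_b, vectors):
    """Mean similarity of `word` to attribute set A minus that to set B."""
    sim_a = mean(cosine(vectors[word], vectors[a]) for a in attr_a)
    sim_b = mean(cosine(vectors[word], vectors[b]) for b in attr_b)
    return sim_a - sim_b


def toy_weat_evaluation(target_x, target_y, attr_a, attr_b, vectors):
    """Hypothetical evaluation function: four lexicons in, one score out."""
    assoc_x = [association(w, attr_a, attr_b, vectors) for w in target_x]
    assoc_y = [association(w, attr_a, attr_b, vectors) for w in target_y]
    # Assumption: population standard deviation over both target sets;
    # implementations differ in this detail.
    spread = pstdev(assoc_x + assoc_y)
    return (mean(assoc_x) - mean(assoc_y)) / spread


# Made-up 2-d vectors, for illustration only.
vectors = {
    "x1": [1.0, 0.1], "x2": [0.9, 0.2],
    "y1": [0.1, 1.0], "y2": [0.2, 0.9],
    "a1": [1.0, 0.0], "b1": [0.0, 1.0],
}
score = toy_weat_evaluation(
    ["x1", "x2"], ["y1", "y2"], ["a1"], ["b1"], vectors
)
print(round(score, 3))
```

A positive score means the first target set is more strongly associated with the first attribute set than the second target set is. Swapping the two target sets flips the sign.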
- If your embedding model is in the standard word2vec format that can be read with the gensim library (basically a .txt file where each line represents a token and its vector, space separated), simply copy the file to the data/word-vectors directory and adapt the variables at the top of the run-pipeline.sh file accordingly (namely, you need to adjust the contents of the MODEL_PATHS and MODEL_LOWERCASED variables).
- If the model is not in that format, a new embedding model reader might need to be implemented. You can add a new class in webias/word_vectors.py that inherits from the BaseEmbeddings abstract class. Furthermore, an additional special case needs to be added to the metric_evaluation.py file in line 33ff. and to the filter_bias_lexicons.py file in line 48ff.
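To check whether a model file matches the text format described above before copying it, a small parser like the following can help. This is a sketch under assumptions: the function read_word2vec_text is hypothetical and not part of this repository, and it treats a first line consisting of exactly two integers as the optional "vocabulary size, dimensions" header.

```python
import io


def looks_like_header(fields):
    """True for a '<vocab_size> <dimensions>' line of two integers."""
    return len(fields) == 2 and all(field.isdigit() for field in fields)


def read_word2vec_text(lines):
    """Parse word2vec text-format lines into a {token: vector} dict."""
    vectors = {}
    dimensions = None
    for line_number, line in enumerate(lines):
        fields = line.rstrip().split(" ")
        if line_number == 0 and looks_like_header(fields):
            continue  # header line, not a token
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        vector = [float(value) for value in fields[1:]]
        if dimensions is None:
            dimensions = len(vector)
        elif len(vector) != dimensions:
            raise ValueError(f"inconsistent vector size for {fields[0]!r}")
        vectors[fields[0]] = vector
    return vectors


# Made-up sample content, for illustration only.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
vectors = read_word2vec_text(sample)
print(sorted(vectors), len(vectors["king"]))
```

For a real file, pass an open file handle instead of the io.StringIO sample.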
Since, at the time of conducting the experiments, there was no official WEAT implementation available publicly, we re-implemented the approach from the information available in the original paper and its supplementary material (you can find both here). While the evaluation results of the pre-trained word embedding models with our implementation are not exactly the same as the published ones, we attribute those small differences to implementation details. You can run the score replications with
$ python -m unittest webias.tests.weat_score_replication_w2v
for the word2vec embedding model and
$ python -m unittest webias.tests.weat_score_replication_glove
for the GloVe embedding model. A test passes if the replicated score lies within a boundary specified in the webias/constants.py file. The tests use the word lists published in the original paper, which can also be found at webias/data/weat_tests.json.
The tests additionally require you to download the word2vec embedding model from here and place the extracted file (GoogleNews-vectors-negative300.bin) into the data/word-vectors directory. As described above, the GloVe embeddings will be downloaded automatically at runtime (if they are not already present).
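The idea of a replication test that passes within a boundary can be sketched as follows. All names and numbers here are hypothetical placeholders: the real boundary is defined in webias/constants.py, the real reference scores come from the original paper, and compute_score merely stands in for running the re-implemented metric on an embedding model.

```python
import unittest

# Hypothetical values, for illustration only.
ALLOWED_DEVIATION = 0.05
REFERENCE_SCORE = 1.50


def compute_score():
    """Stand-in for evaluating the metric on an embedding model."""
    return 1.47


class ScoreReplicationTest(unittest.TestCase):
    def test_score_is_close_to_reference(self):
        # Passes when |computed - reference| <= ALLOWED_DEVIATION.
        self.assertAlmostEqual(
            compute_score(), REFERENCE_SCORE, delta=ALLOWED_DEVIATION
        )


# Run the test case programmatically and collect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScoreReplicationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Using an absolute tolerance keeps the test robust against small numerical differences while still failing on a substantially different score.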
Since, at the time of conducting the experiments, there was no official RNSB implementation available publicly, we re-implemented the approach from the information available in the original paper (you can find it here). While the evaluation results of the pre-trained word embedding models with our implementation are not exactly the same as the published ones, we attribute those small differences to implementation details. You can run the score replications with
$ python -m unittest webias.tests.rnsb_score_replication_w2v
for the word2vec embedding model,
$ python -m unittest webias.tests.rnsb_score_replication_conceptnet
for the Numberbatch embedding model, and
$ python -m unittest webias.tests.rnsb_score_replication_glove
for the GloVe embedding model. A test passes if the replicated score lies within a boundary specified in the webias/constants.py file. The tests use the word lists published in the original paper, which can also be found at webias/data/rnsb_tests.json.
The tests additionally require you to download the word2vec embedding model from here and place the extracted file (GoogleNews-vectors-negative300.bin) into the data/word-vectors directory. As described above, the GloVe and Numberbatch embedding models will be downloaded automatically at runtime (if they are not already present).
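For orientation, the final step of the RNSB score can be sketched as follows: the negative-sentiment probabilities that a classifier assigns to the identity terms are normalized into a distribution, and the score is the KL divergence of that distribution from the uniform one. This sketch is an assumption-laden simplification. It skips the classifier training on the positive and negative word lists entirely, the function name is hypothetical, and the input probabilities are made up; it is not the implementation shipped in this repository.

```python
import math


def rnsb_from_negative_probabilities(negative_probs):
    """KL divergence of the normalized probabilities from uniform.

    `negative_probs` maps each identity term to the (hypothetical)
    probability of being classified as negative.
    """
    total = sum(negative_probs.values())
    distribution = [p / total for p in negative_probs.values()]
    uniform = 1.0 / len(distribution)
    # Terms with zero probability contribute nothing to the sum.
    return sum(p * math.log(p / uniform) for p in distribution if p > 0)


# Made-up classifier outputs, for illustration only.
example = {"term_a": 0.30, "term_b": 0.35, "term_c": 0.60}
print(rnsb_from_negative_probabilities(example))
```

A score of zero means the negative sentiment is spread evenly over all identity terms; larger values indicate a more uneven spread.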
The implementation of the ECT metric was taken from the code published by the authors. A more detailed description can be found in the comments of the code file. As our implementation uses essentially the same code as published by the authors, we didn't see a need to additionally verify the implementation.
The file data/social-bias-lexicon.json contains a multitude of word lists compiled from different related works. The sources are referenced in the respective "source" field of each list. Only a selection of those lists was used in our original publication, though.
Contributions and PRs are welcome! Please try to follow the flake8 and editorconfig rules specified in the respective files (there are editor plugins for both). :)