CPAE-PyTorch is a PyTorch re-implementation of CPAE (Consistency Penalized AutoEncoder). The model was introduced in the EMNLP 2018 paper "Auto-Encoding Dictionary Definitions into Consistent Word Embeddings", and its original implementation can be found here.
This repo was developed on Python 3.6, PyTorch 1.0.0, and AllenNLP 0.8.5.
You can create the experiment environment with conda as follows:
conda env create -f environment.yml
Or, install dependencies step by step:
conda create -n cpae-pytorch python=3.6
conda activate cpae-pytorch
conda install pytorch=1.0 cudatoolkit=9.0 -c pytorch
pip install allennlp jsonlines
The default configuration is provided in `training_config`, which you can play with. You can change `alpha` (the autoencoding coefficient) and `beta` (the consistency penalty coefficient) to switch between the plain AutoEncoder and the Consistency Penalized AutoEncoder, or provide pre-trained word embeddings to improve performance.
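For instance, a plain AutoEncoder corresponds to disabling the consistency penalty. The fragment below is a hypothetical sketch of such an override; the actual key names and model structure in `training_config/cpae.jsonnet` may differ:

```jsonnet
// Hypothetical configuration fragment (key names are assumptions,
// not taken from the repo's cpae.jsonnet).
{
  model: {
    type: "cpae",
    alpha: 1.0,  // autoencoding coefficient
    beta: 0.0,   // consistency penalty; 0 reduces CPAE to a plain AutoEncoder
  },
}
```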
As the implementation is based on AllenNLP, a flexible and configurable library, you can swap any component for a counterpart, add new components, or remove unnecessary ones.
For convenience and fair comparison, we include `en_wn_full_all.jsonl` and `vocab.txt` in the `data` directory; both were generated by the original CPAE code.
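The data file follows the JSON-lines convention: one JSON object per line. The field names in the sketch below are illustrative assumptions, not the actual schema of `en_wn_full_all.jsonl`; inspect the file itself for the real keys.

```python
import io
import json

# A made-up record for illustration; the real field names in
# en_wn_full_all.jsonl may differ.
sample = io.StringIO(
    '{"word": "dog", "definition": "a domesticated carnivorous mammal"}\n'
)

def read_jsonl(stream):
    """Yield one parsed object per non-empty line, as the jsonlines format prescribes."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(read_jsonl(sample))
print(records[0]["word"])  # -> dog
```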
To train a model, just run as follows:
allennlp train -s path/to/serialization/directory training_config/cpae.jsonnet --include-package cpae
Generate definition embeddings using AllenNLP's predictor:
allennlp predict path/to/serialization/directory/model.tar.gz data/en_wn_full_all.jsonl --output-file path/to/serialization/directory/definition_embeddings.txt --include-package cpae --predictor cpae_definition_embedding_generator --batch-size 32 --cuda-device 0 --silent
sed -i 's/^"//g' path/to/serialization/directory/definition_embeddings.txt
sed -i 's/"$//g' path/to/serialization/directory/definition_embeddings.txt
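The two `sed` commands strip the double quotes that wrap each output line. If `sed` is unavailable (e.g. on Windows), the same cleanup can be done with a short Python sketch:

```python
# Equivalent of the two sed commands above: remove one leading and one
# trailing double quote from a line, leaving other quotes untouched.
def strip_quotes(line: str) -> str:
    line = line.rstrip("\n")
    if line.startswith('"'):
        line = line[1:]
    if line.endswith('"'):
        line = line[:-1]
    return line

print(strip_quotes('"dog 0.1 0.2 0.3"'))  # -> dog 0.1 0.2 0.3
```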
After generating the definition embeddings (GloVe text format, i.e., no header line), they can be evaluated or used just like ordinary word embeddings.
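In the GloVe text format, each line is a token followed by its vector components, space-separated, with no header line. A minimal loader (a sketch for illustration, not the repo's own code):

```python
import io

def load_glove(stream):
    """Parse GloVe-format text embeddings: "token v1 v2 ... vd" per line, no header."""
    vectors = {}
    for line in stream:
        token, *values = line.rstrip("\n").split(" ")
        vectors[token] = [float(v) for v in values]
    return vectors

# Tiny made-up file contents for demonstration.
sample = io.StringIO("dog 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vecs = load_glove(sample)
print(vecs["dog"])  # -> [0.1, 0.2, 0.3]
```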
We compare our re-implemented models with the original models using the included word-embeddings-benchmarks toolkit (the original version of the toolkit can be found here).
As shown below, the re-implemented models achieve performance comparable to, and sometimes better than, the originals.
Model | MEN-dev | MEN-test | MTurk | RG65 | RW | SCWS | SimLex333 | SimLex999 | SimVerb3500-dev | SimVerb3500-test | WS353 | WS353R | WS353S | AP | BLESS | Battig | ESSLI_1a | ESSLI_2b | ESSLI_2c | MSR | SemEval2012_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
our AE (alpha=1, beta=0) | 0.399109683 | 0.44381856 | 0.374776443 | 0.520243471 | 0.186448245 | 0.495065492 | 0.253624435 | 0.368178852 | 0.357852756 | 0.349119334 | 0.430635419 | 0.292890592 | 0.55375016 | 0.514925373 | 0.59 | 0.228445804 | 0.545454545 | 0.7 | 0.444444444 | 0.083862055 | 0.1 | 0.128368539 |
original AE (alpha=1, beta=0) | 0.384803476 | 0.424013127 | 0.374223152 | 0.596125059 | 0.141162454 | 0.47554452 | 0.26243494 | 0.334538441 | 0.367640014 | 0.331242873 | 0.407243453 | 0.26709243 | 0.526226658 | 0.480099502 | 0.515 | 0.225960619 | 0.568181818 | 0.675 | 0.511111111 | 0.088518215 | 0.1045 | 0.117133135 |
our CPAE (alpha=1, beta=8) | 0.498663069 | 0.496606982 | 0.433008813 | 0.634411542 | 0.256603718 | 0.551788864 | 0.259022761 | 0.394054538 | 0.425242418 | 0.368174528 | 0.543721278 | 0.440885165 | 0.634893993 | 0.509950249 | 0.5 | 0.243356911 | 0.590909091 | 0.725 | 0.466666667 | 0.025890299 | 0.047125 | 0.129653634 |
original CPAE (alpha=1, beta=8) | 0.498157962 | 0.495570312 | 0.434743114 | 0.556321716 | 0.234406662 | 0.537071954 | 0.242319671 | 0.387031863 | 0.415217566 | 0.347100864 | 0.480991963 | 0.382172741 | 0.5842947 | 0.509950249 | 0.47 | 0.240298222 | 0.613636364 | 0.75 | 0.577777778 | 0.016373312 | 0.030875 | 0.117190979 |
our CPAE (alpha=1, beta=64, word2vec) | 0.660632874 | 0.668232132 | 0.542060783 | 0.811922197 | 0.324839691 | 0.627628157 | 0.346681441 | 0.471233914 | 0.484940154 | 0.435970855 | 0.600053185 | 0.478884821 | 0.709479011 | 0.641791045 | 0.67 | 0.319441789 | 0.772727273 | 0.75 | 0.577777778 | 0.027629963 | 0.04625 | 0.183607132 |
original CPAE (alpha=1, beta=64, word2vec, reported in paper) | 0.651 | 0.638 | 0.615 | 0.72 | - | 0.604 | 0.309 | 0.458 | 0.441 | 0.423 | 0.613 | - | - | - | - | - | - | - | - | - | - | - |
(The original models correspond to the `s2sg_w2v_defs_1_pen0` and `s2sg_w2v_defs_1_pen8` configurations, respectively.)
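The word-similarity columns above (MEN, RG65, SimLex999, etc.) are Spearman correlations between the cosine similarity of the embeddings and human ratings. A self-contained sketch of that scoring on made-up data (real benchmarks ship thousands of rated word pairs):

```python
import math

# Toy embeddings and hypothetical human similarity scores, for illustration only.
emb = {
    "dog": [1.0, 0.2, 0.0],
    "cat": [0.9, 0.3, 0.1],
    "car": [0.0, 0.1, 1.0],
}
pairs = [("dog", "cat"), ("dog", "car"), ("cat", "car")]
gold = [9.0, 2.0, 1.5]  # made-up human ratings for the pairs above

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank(xs):
    # Assign ranks 1..n by value (ties ignored in this tiny example).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order, start=1):
        r[i] = float(pos)
    return r

def spearman(xs, ys):
    # Spearman's rho is the Pearson correlation of the ranks.
    rx, ry = rank(xs), rank(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

pred = [cosine(emb[a], emb[b]) for a, b in pairs]
print(round(spearman(pred, gold), 3))  # -> 0.5
```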