Position: Middle NLP Engineer.
Word Sense Induction assignment
The task is based on: link
Data is stored in the data folder.
Since the data is stored as a git submodule (URL), run the following to download it:
git submodule init
git submodule update
Train data: main/active-dict
Test data: additional/active-rutenten
Baseline: script, main/active-dict, additional/active-rutenten
This folder contains the third-party libraries, repositories, and weights used in the experiments:
- bertwsi - Word Sense Induction with BERT
- spacy-ru - Russian language models for spaCy (used in bertwsi)
- simple_elmo - a simple library for working with pre-trained ELMo models in TensorFlow
Since the modules are git repositories, they are stored as submodules.
To download them, run:
git submodule init
git submodule update
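To confirm that the submodules were actually fetched, the output of `git submodule status` can be inspected: git prefixes each submodule that has not been initialized with `-`. The following is a minimal sketch of such a check; the helper name `uninitialized_submodules` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the repository).
# Reads `git submodule status` output on stdin and prints the path of
# every submodule that is not initialized yet (status line starts with "-").
# Assumes submodule paths contain no spaces.
uninitialized_submodules() {
    awk '/^-/ { print $2 }'
}

# Usage, from the repository root:
#   git submodule status | uninitialized_submodules
```

Empty output means that every submodule has been initialized.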
To install the AdaGram model, run:
pip install git+https://github.com/lopuhin/python-adagram.git
ruscorpora_mean_hs.model.bin.gz: word2vec weights (from RusVectores), stored using Git LFS.
To download them, install Git LFS and run:
git lfs fetch
git lfs checkout
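If the LFS download did not run, the weights file in the working tree is only a small text pointer whose first line is `version https://git-lfs.github.com/spec/v1`. The following sketch checks for that case; the helper name `is_lfs_pointer` is hypothetical, and the location of the weights file inside the repository is not stated here, so the path in the usage line is an assumption.

```shell
# Hypothetical helper (not part of the repository).
# Returns 0 if the file is still a Git LFS pointer, i.e. the real
# content has not been downloaded yet; returns non-zero otherwise.
is_lfs_pointer() {
    # A pointer file is plain text starting with the LFS spec version line.
    # -a makes grep treat binary input (e.g. a real .gz file) as text.
    head -c 64 "$1" | LC_ALL=C grep -qa '^version https://git-lfs\.github\.com/spec/v1'
}

# Usage (adjust the path to wherever the file lives in the checkout):
#   is_lfs_pointer ruscorpora_mean_hs.model.bin.gz && echo "weights not downloaded yet"
```

The two commands above can also be combined as `git lfs pull`, which fetches the objects and checks them out in one step.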
To download the ruwikiruscorpora_lemmas_elmo_1024_2019 ELMo weights (from RusVectores), run:
./download_elmo_weights.sh
Note: the script should be run from ml_interviews/mts/modules.
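One way to satisfy the working-directory requirement without changing the current shell's directory is to run the script in a subshell. This is a sketch only; the helper name `run_from` is hypothetical, and the usage line assumes that ml_interviews/mts/modules is reachable from the current directory.

```shell
# Hypothetical helper (not part of the repository).
# Runs a command from inside the given directory, using a subshell so
# the caller's working directory is left unchanged.
run_from() {
    # $1: directory to run in; remaining arguments: the command to run
    ( cd "$1" && shift && "$@" )
}

# Usage:
#   run_from ml_interviews/mts/modules ./download_elmo_weights.sh
```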
All solutions are stored in the solutions folder.
All predictions are stored in the predictions folder.