This repository accompanies the paper:
NNOSE: Nearest Neighbor Occupational Skill Extraction
Mike Zhang, Rob van der Goot, Min-Yen Kan, and Barbara Plank. To appear at EACL 2024.
Clone the repository. If you use conda
, please install the accompanying environment by:
# create the environment
conda env create -f environment.yml
# activate environment
conda activate nnose
# install torch separately
pip3 install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
There is a separate environment for generating the UMAP plot.
# create the environment
conda env create -f environment_umap.yml
# activate environment
conda activate nnose_umap
# install torch separately
pip3 install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
! The UMAP plot can only be created once we obtained the embeddings using run_scripts/get_representations.sh
.
All experiments are ran on python=3.9
and torch==1.10.1
.
JobBERTa will be released when the paper is accepted. You can check how JobBERTa is trained using:
run_scripts/run_mlm.sh
.
The MLM script is derived from HuggingFace and can be found in src/utils/run_mlm.py
.
To fine-tune the models used in the paper, run the following script:
bash run_scripts/run_trainer.sh
We have put the extraction of embeddings from the training datasets and training the datastore in one file.
We have two types of datastores in our experiments, an in-dataset datastore ({D}) and an 'all' datastore (
To create the in-dataset datastore:
bash run_scripts/get_representations_dataset.sh
To create the \forall datastore:
bash run_scripts/get_representations.sh
To do a hyperparameter sweep for NNOSE to get the best working k neighbors, lambda, and temperature run:
bash run_scripts/run_inference_sweep.sh
We have two scripts to run test/inference, which also outputs the predictions in a separate output file. To do this, please look at the following scripts:
run_scripts/run_test.sh
or to do test/inference with NNOSE:
run_scripts/run_test_index.sh
This analysis doesn't need any already ran experiments, to do this run:
python3 src/analysis/skill_distribution.py
This analysis requires the output of the models from step (4) above:
Example
python3 src/analysis/get_long_tail.py \
--train_dir data/skillspan/train.json \
--prediction_dir results/<name_of_file> \
This analysis also requires the output of the models from step (4) above:
python3 src/analysis/skill_distribution.py \
--prediction_dir results/<name_of_file_predictions> \
--prediction_dir_knn results/<name_of_file_predictions_knn> \
This analysis requires you to have trained the models. Note that there is also an "all" model, which is the concatenation of all datasets.
This one is done manually with the run_scripts/run_test.sh
script:
Example
MODEL="jobbert" # "roberta" "jobberta"
DATASET="skillspan" # "sayfullina" "green"
TIMESTAMP=$(date +%F_%T)
python3 src/run_inference.py \
--model_name_or_path "tmp-5e-5/"$DATASET"/$MODEL/"* \
--train_file data/"$DATASET"/train.json \
--validation_file data/sayfullina/dev.json \
--text_column_name tokens \
--label_column_name tags_skill \
--seed 113412 \
--write_output "results/run_test_$TIMESTAMP/" \
You can change the --validation_file
flag to the dataset you want to apply it on. In this case, we use sayfullina
.
Please change to the UMAP environment as stated in the Getting Started
section. The only difference is the pytorch version.
This analysis further requires you to have ran step (2): Obtaining Embeddings + Creating the Datastore.
Example:
python3 src/analysis/plot_umap.py --output_dir plots/
WARNING: This script takes a lot of time if run for the first time (around 45-60 minutes on a good machine).
If there's any questions, please reach out to <email>.