This repo contains the code and data for my term paper "Fine-tuning language models on text clustering."
This folder contains the clustering module. It should be installed using pip.
Contains the code for all experiments presented in the term paper (and artefacts of old experiments).
Stores the datasets used for all experiments.
Will contain the results of each experiment.
Contains some notebooks used to create plots and tables, plus some relics of the development process.
Install all dependencies, either using pip or conda.
# Very likely it won't work:
pip install -r requirements.txt
or
# This will create a new conda environment.
# Either:
conda create --name <env> --file requirements_conda.txt
# Or (better):
conda env create -f conda_environment.yml
# The latter will create a conda env named deep_text_clustering, containing nearly all required packages.
If you are using conda, it is necessary to install these packages manually:
- plotly-orca
- torchtext
- umap-learn
- transformers
- cudatoolkit
- sacred
Install the clustering module using pip:
# From project root.
pip install -e .
Each dataset comes with a script that downloads the data and creates the dataset as CSV files.
To create the AG_News datasets, first run make_ag_news_datasets.py from the datasets/ag_news folder.
To create each subset and save the train and validation splits, run ag_news_create_splits.py from its location inside the datasets folder.
To create these datasets just run their creation script from inside their folders.
The experiments have to be started from inside the experiments folder. Sacred is used to keep track of all experiments. If you want Sacred to save all results to MongoDB, set the following environment variables:
- MONGO_SACRED_ENABLED = true
- (If it is set to false, the experiments are tracked using local files only.)
- mongo_user = MongoDB username
- mongo_pass = Password for the user
- mongo_host = Host of the MongoDB instance
- mongo_port = Port of the MongoDB instance
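The variables above could be consumed roughly as follows. This is only a sketch, not the repository's actual code: the helper name mongo_url_from_env is hypothetical, and it assumes that an unset MONGO_SACRED_ENABLED or any value other than "true" means MongoDB tracking is off.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote_plus


def mongo_url_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Build a MongoDB URL from the environment, or return None if disabled.

    Hypothetical helper for illustration; the experiments may read the
    variables differently.
    """
    # Assumption: only the literal value "true" (case-insensitive) enables MongoDB.
    if env.get("MONGO_SACRED_ENABLED", "false").strip().lower() != "true":
        return None

    user = env["mongo_user"]
    password = env["mongo_pass"]
    host = env["mongo_host"]
    port = env["mongo_port"]

    # Percent-encode the credentials so characters like "@" or ":" stay valid in the URL.
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"
```

The returned URL is the kind of string a MongoDB-backed Sacred observer would typically be given; a None result would mean only local file tracking is used.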
To run an experiment without MongoDB tracking, start it with the variable set to false:
MONGO_SACRED_ENABLED=false python <experiment>.py
Regardless of whether MongoDB is enabled, each experiment is tracked locally and its results are written to results/sacred_runs.
Parameters of each run can be changed using the sacred syntax:
python <experiment>.py with lr=1.0 n_epochs=1
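To illustrate what such key=value overrides amount to, here is a rough stdlib-only sketch that turns them into a dictionary of config updates. It is not Sacred's own parser, which handles more cases (nested keys, named configs); the function name parse_overrides is hypothetical, and parsing values as Python literals is an assumption made for this example.

```python
import ast
from typing import Any, Dict, List


def parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Turn ["lr=1.0", "n_epochs=1"] into {"lr": 1.0, "n_epochs": 1}.

    Illustrative only; Sacred's real command-line handling is more elaborate.
    """
    updates: Dict[str, Any] = {}
    for arg in args:
        key, sep, raw = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        try:
            # Interpret the value as a Python literal (int, float, list, ...).
            updates[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a literal is kept as a plain string.
            updates[key] = raw
    return updates
```

With the command shown above, lr would arrive as a float and n_epochs as an int, which is why the values need no quoting on the command line.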
The results of each run will be written to a folder following the pattern: <project_root>/results/<experiment_name>/<timestamp>
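The folder pattern can be sketched with pathlib as below. The function name run_dir is hypothetical, and the timestamp format is an assumption; the README does not specify which format the experiments actually use.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Union


def run_dir(
    project_root: Union[str, Path],
    experiment_name: str,
    now: Optional[datetime] = None,
) -> Path:
    """Return <project_root>/results/<experiment_name>/<timestamp>."""
    now = now or datetime.now()
    # Assumed format; the real runs may format the timestamp differently.
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(project_root) / "results" / experiment_name / timestamp
```

Passing an explicit datetime makes the result reproducible, which helps when locating the output of a specific run.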