rna_k_mer_tokenizer.py: creates the tokenizer .json file by reading the k-mer pretraining data.
bert-rna-model.json: BERT configuration adapted from an online example. The number of layers and the vocabulary size are reduced, and num_labels is added.
bert-rna-6-mer-tokenizer.json: Output of run_k_mer_tokenizer.py.
make_k_mers.py: converts a nucleotide sequence into a sequence of k-mers for a given k.
run_mlm.py: masked language model (MLM) pretraining. Modified to pretrain from scratch and to read sequence data; default values are updated for our purposes.
fintune.py: fine-tunes the pretrained model on the family classification task.
plot_metrics.py: takes a checkpoint directory and plots loss and accuracy.
plot_dataset.py: reports the dataset's sequence-length distribution and size.
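
The k-mer and tokenizer steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in make_k_mers.py or the tokenizer script: the function names, the stride parameter, the whitespace-separated k-mer format, and the list of special tokens are all assumptions. The default k=6 follows the "6-mer" in the tokenizer file name.

```python
from collections import Counter


def make_k_mers(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (hypothetical helper)."""
    sequence = sequence.strip().upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(k_mer_lines,
                special_tokens=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Assign ids to special tokens first, then to k-mers by descending frequency.

    Assumes each input line is a whitespace-separated list of k-mers.
    """
    counts = Counter(token for line in k_mer_lines for token in line.split())
    vocab = {token: idx for idx, token in enumerate(special_tokens)}
    for token, _ in counts.most_common():
        vocab.setdefault(token, len(vocab))
    return vocab


if __name__ == "__main__":
    k_mers = make_k_mers("AUGGCUACGU", k=6)
    print(" ".join(k_mers))
    print(build_vocab([" ".join(k_mers)]))
```

With stride=1, a sequence of length n produces n - k + 1 overlapping k-mers, so the vocabulary can hold at most 4^k distinct k-mers plus the special tokens.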
conda create -n CS230 python=3.10
conda activate CS230
pip install -r requirements.txt
python run_mlm.py --output_dir ./out_mlm
python run_mlm.py --output_dir ./out_mlm --resume ./out_mlm/checkpoint-XXXX
python run_cls.py --output_dir ./out_cls --model_name_or_path ./out_mlm/
python run_cls.py --output_dir ./out_cls --resume ./out_cls/checkpoint-XXXX
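
The checkpoint directories written by the commands above are what plot_metrics.py takes as input. As a rough sketch of how the logged metrics might be read back, assuming the scripts use the Hugging Face Trainer (which stores its logs under "log_history" in a trainer_state.json file inside each checkpoint): the helper names below are hypothetical, and the metric keys actually present depend on how the run was configured.

```python
import json
from pathlib import Path


def read_log_history(checkpoint_dir):
    """Return the logged metric entries stored in a checkpoint directory.

    Assumes a Hugging Face Trainer checkpoint containing trainer_state.json.
    """
    state_path = Path(checkpoint_dir) / "trainer_state.json"
    with state_path.open() as handle:
        return json.load(handle).get("log_history", [])


def extract_series(log_history, key):
    """Collect (step, value) pairs from every log entry that reports `key`."""
    return [
        (entry["step"], entry[key])
        for entry in log_history
        if key in entry and "step" in entry
    ]


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py <checkpoint directory>
    history = read_log_history(sys.argv[1])
    for step, loss in extract_series(history, "loss"):
        print(step, loss)
```

The resulting (step, value) pairs can be passed to any plotting library; keys such as "loss" and "eval_loss" are the ones the Trainer commonly logs, but they should be checked against the actual file.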