To train the model, use the following command:
bash train.sh ${CUDA_DEVICE}
The project consists of the following files:
data/
: We use the ChEBI-20 dataset from text2mol for the main experiments. For the training, validation, and test sets, we discard invalid molecules without any chemical bonds. Additionally, we add CanonicalSMILES and molecule names from PubChem to these three sets.
graph_data/
: unzip mol_graphs.zip from text2mol
token_embedding_dict.npy
: from text2mol
training.csv
: processed by preprocess.py based on training.txt from text2mol
val.csv
: processed by preprocess.py based on val.txt from text2mol
test.csv
: processed by preprocess.py based on test.txt from text2mol
preprocess.py
: run python3 preprocess.py
allenai_scibert_scivocab_uncased/
: SciBERT path.
config.json
train.sh
main.py
modeling.py
dataloader.py
chemutils.py
utils.py
requirements.txt
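The data/ entry above mentions discarding molecules without any chemical bonds. A minimal sketch of such a filter is shown here, assuming each record is a (cid, smiles) pair. The helper names has_bonds and drop_bondless and the example records are hypothetical, and the repository's preprocess.py may do this differently. The check is a pure-Python heuristic: a SMILES string is treated as bondless when every "."-separated fragment is a single atom token. With a cheminformatics toolkit such as RDKit, one would more likely parse the SMILES and check `mol.GetNumBonds()`.

```python
import re

# Hypothetical helper, not taken from the repository's preprocess.py.
# One SMILES atom token: a bracket atom, a two-letter halogen,
# a one-letter organic-subset atom (aliphatic or aromatic), or a wildcard.
_SINGLE_ATOM = re.compile(r"(\[[^\[\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*)")


def has_bonds(smiles: str) -> bool:
    """Return True if any '.'-separated fragment holds more than one atom."""
    fragments = [f for f in smiles.strip().split(".") if f]
    return any(_SINGLE_ATOM.fullmatch(f) is None for f in fragments)


def drop_bondless(records):
    """Keep only (cid, smiles) pairs whose SMILES contains at least one bond."""
    return [(cid, smi) for cid, smi in records if has_bonds(smi)]


# Made-up example records, for illustration only.
examples = [("a", "CCO"), ("b", "*"), ("c", "[Na+].[Cl-]"), ("d", "c1ccccc1")]
kept = drop_bondless(examples)
print(kept)
```

In this sketch the wildcard "*" and the salt "[Na+].[Cl-]" are dropped because neither contains a bond between two atoms, while ethanol and benzene are kept. An empty SMILES string is also treated as bondless.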