We provide both the raw and processed data at this HuggingFace link.
mkdir -p pcqm4mv2/raw
cd pcqm4mv2/raw
wget http://ogb-data.stanford.edu/data/lsc/pcqm4m-v2-train.sdf.tar.gz
tar -xf pcqm4m-v2-train.sdf.tar.gz
wget http://ogb-data.stanford.edu/data/lsc/pcqm4m-v2.zip
unzip pcqm4m-v2.zip
mv pcqm4m-v2/raw/data.csv.gz .
rm pcqm4m-v2.zip
rm -rf pcqm4m-v2
wget https://dataverse.harvard.edu/api/access/datafile/4327252
mv 4327252 rdkit_folder.tar.gz
tar -xvf rdkit_folder.tar.gz
Install it following the google drive link here.
Automatically installed under folder .QM9/raw
.
Automatically installed under folder ./MD17
.
In March 2023 (or even earlier), they updated the MD17 FTP site, and the previous datasets are missing. We may need to keep and upload a version to the website.
Download the dataset from this link, and put the file 12672038.zip
under ./rMD17
folder.
unzip 12672038.zip
tar xjf rmd17.tar.bz2
mv rmd17/npz_data .
mv rmd17/splits .
We use this repo: git@github.com:TUM-DAML/gemnet_pytorch.git
.
mkdir -p lba/raw
mkdir -p lba/processed
cd lba/raw
# wget http://www.pdbbind.org.cn/download/pdbbind_v2015_refined_set.tar.gz
# wget http://www.pdbbind.org.cn/download/pdbbind_v2018_refined.tar.gz
# wget http://www.pdbbind.org.cn/download/pdbbind_v2019_refined.tar.gz
# wget https://zenodo.org/record/4914718/files/LBA-split-by-sequence-identity-30-indices.tar.gz
wget http://www.pdbbind.org.cn/download/PDBbind_v2020_refined.tar.gz
tar -xzvf PDBbind_v2020_refined.tar.gz
wget https://zenodo.org/record/4914718/files/LBA-split-by-sequence-identity-30.tar.gz
tar -xzvf LBA-split-by-sequence-identity-30.tar.gz
mv split-by-sequence-identity-30/indices ../processed/
mv split-by-sequence-identity-30/targets ../processed/
mkdir -p lep/raw
mkdir -p lep/processed
cd lep/raw
wget https://zenodo.org/record/4914734/files/LEP-raw.tar.gz
tar -xzvf LEP-raw.tar.gz
wget https://zenodo.org/record/4914734/files/LEP-split-by-protein.tar.gz
tar -xzvf LEP-split-by-protein.tar.gz
wget http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip
unzip chem_dataset.zip
dataset_list=(tox21 toxcast clintox bbbp sider muv hiv bace)
for dataset in "${dataset_list[@]}"; do
mkdir -p molecule_datasets/"$dataset"/raw
cp dataset/"$dataset"/raw/* molecule_datasets/"$dataset"/raw/
done
rm -rf dataset
wget -O malaria-processed.csv https://raw.githubusercontent.com/HIPS/neural-fingerprint/master/data/2015-06-03-malaria/malaria-processed.csv
mkdir -p ./molecule_datasets/malaria/raw
mv malaria-processed.csv ./molecule_datasets/malaria/raw/malaria.csv
wget -O cep-processed.csv https://raw.githubusercontent.com/HIPS/neural-fingerprint/master/data/2015-06-02-cep-pce/cep-processed.csv
mkdir -p ./molecule_datasets/cep/raw
mv cep-processed.csv ./molecule_datasets/cep/raw/cep.csv
Check this link.
ProtFunct
is for taskEC
HomologyTAPE
is for taskFOLD
Or
cd EC; python download.py
cd FOLD; python download.py
mkdir MatBench
cd MatBench
wget https://figshare.com/ndownloader/files/17494820
mv 17494820 expt_is_metal.json.gz
gzip -d expt_is_metal.json.gz
wget https://figshare.com/ndownloader/files/17494814
mv 17494814 expt_gap.json.gz
gzip -d expt_gap.json.gz
wget https://figshare.com/ndownloader/files/17494637
mv 17494637 glass.json.gz
gzip -d glass.json.gz
wget https://figshare.com/ndownloader/articles/9755486/versions/2
mv 2 perovskites.json.gz
unzip perovskites.json.gz
rm perovskites.json.gz
rm 17494805_perovskites.json.gz
gzip -d 17494808_perovskites.json.gz
mv 17494808_perovskites.json perovskites.json
wget https://figshare.com/ndownloader/files/17476067
mv 17476067 dielectric.json.gz
gzip -d dielectric.json.gz
wget https://figshare.com/ndownloader/files/17476064
mv 17476064 log_gvrh.json.gz
gzip -d log_gvrh.json.gz
wget https://figshare.com/ndownloader/files/17476061
mv 17476061 log_kvrh.json.gz
gzip -d log_kvrh.json.gz
wget https://figshare.com/ndownloader/files/17476046
mv 17476046 jdft2d.json.gz
gzip -d jdft2d.json.gz
wget https://figshare.com/ndownloader/files/17476040
mv 17476040 steels.json.gz
gzip -d steels.json.gz
wget https://figshare.com/ndownloader/files/17476037
mv 17476037 phonons.json.gz
gzip -d phonons.json.gz
wget https://figshare.com/ndownloader/files/17476034
mv 17476034 mp_is_metal.json.gz
gzip -d mp_is_metal.json.gz
wget https://figshare.com/ndownloader/files/17476028
mv 17476028 mp_e_form.json.gz
gzip -d mp_e_form.json.gz
wget https://figshare.com/ndownloader/files/17084741
mv 17084741 mp_gap.json.gz
gzip -d mp_gap.json.gz
The dataset size can match with MatBenchmark v0.1.
mkdir QMOF
cd QMOF
wget https://figshare.com/ndownloader/articles/13147324/versions/13
mv 13 qmof_database_v13.zip
unzip qmof_database_v13.zip
unzip qmof_database.zip
cd qmof_database
python xyz_to_cifs.py
cd ../..
Or follow this link for prediction on QMOF DB v13.