BarcodeMamba, a performant and efficient foundation model for DNA barcodes in biodiversity analysis.
- Check out our paper
- Check out our poster
- Our pretrained models are at HuggingFace
These instructions are for Linux (or Windows Subsystem for Linux).
- Install CUDA 12.1 from the NVIDIA CUDA Toolkit website, or load the cuda-12.1 module if it has been installed.
source /etc/profile.d/lmod.sh
module load cuda-12.1
- Install the required libraries:
git clone https://github.com/bioscan-ml/BarcodeMamba.git
cd BarcodeMamba
conda env create -n BarcodeMamba -f environment.yml
conda activate BarcodeMamba
pip install https://github.com/state-spaces/mamba/releases/download/v2.2.1/mamba_ssm-2.2.1+cu122torch2.3cxx11abiFALSE-cp38-cp38-linux_x86_64.whl
pip install https://github.com/Dao-AILab/causal-conv1d/releases/download/v1.3.0.post1/causal_conv1d-1.3.0.post1+cu122torch2.3cxx11abiFALSE-cp38-cp38-linux_x86_64.whl
- Download the datasets preprocessed by BarcodeBERT authors
mkdir data
wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
unzip data.zip
mv new_data/* data/
rm -r new_data
rm data.zip
- We recommend using A40 GPUs to avoid IndexError
- For pretraining with character-level tokenization:
python train.py dataset=CanadianInvertebrates1.5M-pretrain tokenizer=char dataset.input_path=/absolute/path/to/data
If you are using multiple gpus, use torchrun to start training:
torchrun \
--nproc_per_node=4 \
--nnodes=1 \
--node_rank=0 \
train.py dataset=CanadianInvertebrates1.5M-pretrain tokenizer=char dataset.input_path=/absolute/path/to/data
- Pretraining with K-mer tokenization:
python train.py dataset=CanadianInvertebrates1.5M-pretrain tokenizer=k-mer tokenizer.k_mer=6 dataset.input_path=/absolute/path/to/data
- Finetuning and testing with character-level tokenization:
python train.py dataset=CanadianInvertebrates1.5M-finetune dataset.input_path=/absolute/path/to/data model.n_classes=1653 tokenizer=char train.pretrained_model_path=/absolute/path/to/ckpt
- Finetuning and testing with K-mer tokenization:
python train.py dataset=CanadianInvertebrates1.5M-finetune dataset.input_path=/absolute/path/to/data model.n_classes=1653 tokenizer=k_mer tokenizer.k_mer=6 train.pretrained_model_path=/absolute/path/to/ckpt
Run the following commands to build a folder_of_pretrained_barcode_mamba: (taking BarcodeMamba-dim384-layer2-char
as an example)
mkdir BarcodeMamba-dim384-layer2-char && cd BarcodeMamba-dim384-layer2-char
mkdir .hydra checkpoints
wget -O .hydra/config.yaml https://huggingface.co/bioscan-ml/BarcodeMamba/resolve/main/BarcodeMamba-dim384-layer2-char/.hydra/config.yaml?download=true
wget -O checkpoints/last.ckpt https://huggingface.co/bioscan-ml/BarcodeMamba/resolve/main/BarcodeMamba-dim384-layer2-char/checkpoints/last.ckpt?download=true
- Linear probing on seen species:
Note that this will generate a probing_outputs
directory for logs.
python linear_probing.py -d /absolute/path/to/folder_of_pretrained_barcode_mamba --input-path /absolute/path/to/data
- 1-NN probing on unseen species:
Note that this will generate a probing_outputs
directory for logs.
python knn_probing.py -d /absolute/path/to/folder_of_pretrained_barcode_mamba --input-path /absolute/path/to/data
For BarcodeBERT, CNN, DNABERT, DNABERT-2 baselines, please refer to the instructions in BarcodeBERT.
- Finetuning a HyenaDNA model on our task.
python ssm_baselines.py model_name=hyenadna dataset.input_path=/absolute/path/to/data checkpoint=LongSafari/hyenadna-small-32k-seqlen-hf
- Finetuning a Caduceus model on our task.
python ssm_baselines.py model_name=mambadna dataset.input_path=/absolute/path/to/data checkpoint=kuleshov-group/caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
- Using pretrained baseline models for probing
Linear probing:
Note that this will generate a baseline_probing_outputs_linear
directory for logs.
python linear_probing_baseline.py --model-name hyenadna --input_path /absolute/path/to/data --checkpoint LongSafari/hyenadna-small-32k-seqlen-hf
python linear_probing_baseline.py --model-name mambadna --input_path /absolute/path/to/data --checkpoint kuleshov-group/caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
1-NN probing:
Note that this will generate a baseline_probing_outputs_knn
directory for logs.
python knn_probing_baseline.py --model-name hyenadna --input_path /absolute/path/to/data --checkpoint LongSafari/hyenadna-small-32k-seqlen-hf
python knn_probing_baseline.py --model-name mambadna --input_path /absolute/path/to/data --checkpoint kuleshov-group/caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
Other variants of HyenaDNA and Caduceus models used in our paper:
checkpoint=LongSafari/hyenadna-tiny-1k-seqlen-d256-hf
checkpoint=kuleshov-group/caduceus-ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
checkpoint=kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16
checkpoint=kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16
If you find BarcodeMamba useful, please consider citing:
@inproceedings{
gao2024barcodemamba,
title={BarcodeMamba: State Space Models for Biodiversity Analysis},
author={Tiancheng Gao and Graham W.~Taylor},
booktitle={{NeurIPS} 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges},
year={2024},
url={https://openreview.net/forum?id=6ohFEFTr10}
}