Protein domain prediction (for 17,000+ types of protein sequences)
Download the data from Kaggle and place it in a folder.
It's expected that the folder structure is as follows:
--data_dir = "data/random_split"
data
└── random_split
    ├── dev
    ├── test
    └── train
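As a quick sanity check, the snippet below (a minimal sketch, assuming the default --data_dir shown above) verifies that all three partitions are present:

```python
# Minimal sanity check for the expected data layout (paths follow the tree above)
from pathlib import Path

data_dir = Path("data/random_split")
for partition in ("train", "dev", "test"):
    folder = data_dir / partition
    if not folder.is_dir():
        raise FileNotFoundError(f"Expected partition folder is missing: {folder}")
print("All partitions found under", data_dir)
```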
Download the language encoder from here and place it in a folder.
--lang_params = "path/to/lang_params/sample.pickle"
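Since the language encoder ships as a pickle, it can be inspected directly (a minimal sketch; the object's exact type and contents are an assumption, so check the repo's data-loading code for how it is actually used):

```python
# Illustrative only: load and inspect the pickled language encoder
import pickle

with open("path/to/lang_params/sample.pickle", "rb") as f:
    lang_params = pickle.load(f)

print(type(lang_params))  # e.g. a fitted amino-acid/label encoder (assumption)
```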
Download any model checkpoint and put it in a folder. You can then specify the parameters in the scripts as follows:
--model_checkpoint = "path/to/model_weights/sample.ckpt"
--lang_params = "path/to/lang_params/sample.pickle"
# and so on...
| Models | Download link (weights) | Test accuracy |
|---|---|---|
| Default ProtoCNN | link | 87.46% |
| Default ProtoCNN + hyperparameter tuning | link | 90.08% |
| Custom Model (more details in Model Specification) | link | 92.31% |
docker build . -t instadeep:latest
# CPU only
docker run --rm -it --entrypoint bash instadeep:latest
# GPU
docker run --rm -it --entrypoint bash --gpus=all instadeep:latest
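If the image does not already contain the dataset, you will likely want to mount the repository into the container; the /workspace mount target below is an assumption, so adjust it to the image's actual working directory.
# GPU, with the current directory mounted into the container (mount target is an assumption)
docker run --rm -it --gpus=all -v "$(pwd):/workspace" -w /workspace --entrypoint bash instadeep:latest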
python src/visualizations/visualize.py --data_dir data/random_split --save_path reports/data_visualizations --partition "train"
Many other options are available as well; please see python src/visualizations/visualize.py --help
(Note: batch_size needs to be much smaller on CPU, e.g. --batch_size=1. To use the GPU, pass the --gpu flag.)
python src/train.py --batch_size=256
Many other options are available as well; please see python src/train.py --help
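For example, combining the flags mentioned above (--gpu and --batch_size):
# GPU training
python src/train.py --gpu --batch_size=256
# CPU training (use a much smaller batch size)
python src/train.py --batch_size=1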
python src/visualizations/visualize_training_vals.py --metrics_file "path/to/file/sample.csv" --save_path "path/to/folder"
Many other options are available as well; please see python src/visualizations/visualize_training_vals.py --help
python src/predict.py --input_seq="Protein_seq" --model_checkpoint="lightning_logs/version_10/checkpoints/epoch=2-step=12738.ckpt"
Many other options are available as well; please see python src/predict.py --help
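For instance, with a raw amino-acid string (the sequence below is made up purely for illustration; replace it and the checkpoint path with your own):
python src/predict.py --input_seq="MGLSDGEWQLVLNVWGKVEAD" --model_checkpoint="lightning_logs/version_10/checkpoints/epoch=2-step=12738.ckpt"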
python src/evaluate.py --gpu --model_checkpoint="lightning_logs/version_10/checkpoints/epoch=2-step=12738.ckpt" --test_set_dir="data/random_split/test"
Many other options are available as well; please see python src/evaluate.py --help
(Tested on Python version 3.10.13)
# Install requirements (python 3.10)
pip install -r requirements.txt
# Export python path
export PYTHONPATH="${PYTHONPATH}:full/path/to/the/folder/Instadeep_takehome/"
"""Run any of the above commands now."""
We will be using pytest for this.
# Run tests
coverage run -m pytest src/tests/
# Generate coverage report
coverage report -m
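Optionally, coverage can also write a browsable HTML report (output goes to htmlcov/ by default):
# Generate an HTML coverage report
coverage html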
# Run TensorBoard by pointing --logdir at the folder containing the tfevents file.
tensorboard --logdir=path/to/tensorboard/folder/sample_folder
- I modified the model architecture by increasing the number of residual blocks, adding convolutional layers, increasing layer sizes, changing the input and output channels, and making other small changes (a rough sketch is shown after this list).
- The model architecture is as follows: default ProtoCNN (left), modified bigger model (right).
- Note: running the bigger model requires further changes (e.g., to the ProtoCNN class as well); it will not work out of the box.
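The snippet below is a minimal sketch of the kind of changes described above (more residual blocks, wider channels, a pooling + linear head); the class names and layer sizes are illustrative assumptions and are not the repository's actual ProtoCNN code.

```python
# Illustrative sketch only -- NOT the repo's ProtoCNN implementation.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Dilated 1D-conv residual block in the spirit of ProtCNN-style models."""

    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection


class BiggerProtoCNN(nn.Module):
    """A 'bigger' variant: wider channels and more residual blocks (all sizes are assumptions)."""

    def __init__(self,
                 vocab_size: int = 22,      # amino-acid alphabet size incl. padding/unknown (assumption)
                 num_classes: int = 17930,  # set to the actual number of protein families
                 channels: int = 256,
                 num_blocks: int = 4,
                 seq_len: int = 120):
        super().__init__()
        self.stem = nn.Conv1d(vocab_size, channels, kernel_size=1)  # wider input projection
        self.blocks = nn.Sequential(*[ResidualBlock(channels, dilation=2 ** i) for i in range(num_blocks)])
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
        self.head = nn.Linear(channels * ((seq_len + 1) // 2), num_classes)

    def forward(self, x):  # x: (batch, vocab_size, seq_len) one-hot encoded sequences
        h = self.pool(self.blocks(self.stem(x)))
        return self.head(torch.flatten(h, start_dim=1))


if __name__ == "__main__":
    # Quick shape check with a dummy one-hot batch
    model = BiggerProtoCNN()
    dummy = torch.zeros(2, 22, 120)
    print(model(dummy).shape)  # torch.Size([2, 17930])
```

With these defaults the sketch maps a (batch, 22, 120) tensor to (batch, 17930) logits; as noted above, plugging such a variant into training also requires updating the ProtoCNN class, so it will not work out of the box.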