fine-tuning with more protein sequences #53

Open
avilella opened this issue Aug 13, 2024 · 10 comments

Comments

@avilella

Hi, I have a corpus of about 500,000 protein sequences and would like to apply them to existing models like ESM2 or this one for predicting the fitness effect of changing an amino-acid for another.
How could I add my sequences to the models referred in this repo to then use the modified model for such task? Thanks.

@LTEnjoy
Contributor

LTEnjoy commented Aug 13, 2024

Hi, this is typically a regression task: you input a protein sequence into the model and it outputs a value for the fitness effect. If your 500,000 protein sequences are derived from the same wild-type protein, then a typical pipeline for fine-tuning SaProt would be:

  1. Retrieve the structure of the wild-type protein, normally from AlphaFold2, and encode it into a structural sequence using Foldseek.
  2. Construct structure-aware (SA) sequences by combining each of your protein sequences with the structural sequence. For instance, if the protein sequence is MEEV and the structural sequence is ddvv, then the SA sequence is Md Ed Ev Vv. Doing this for all 500,000 sequences gives you an SA training set.
  3. Fine-tune SaProt. SaProt can be loaded through the Hugging Face interface, so you can initialize it with a regression head and then fine-tune it on your dataset.
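
As a rough illustration of step 2, here is a minimal Python sketch that interleaves an amino-acid sequence with a structural sequence of the same length. The function name `build_sa_tokens` is hypothetical and not part of the SaProt codebase; the sketch assumes every variant has the same length as the wild type, so one structural sequence can be reused for all of them.

```python
from typing import List


def build_sa_tokens(aa_seq: str, struct_seq: str) -> List[str]:
    """Pair each residue with the structural token at the same position."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError("amino-acid and structural sequences must have equal length")
    # Upper-case residue letter followed by lower-case structural letter, e.g. "Md".
    return [aa.upper() + st.lower() for aa, st in zip(aa_seq, struct_seq)]


# Wild type and one hypothetical variant sharing the same structural sequence.
struct_seq = "ddvv"
for seq in ["MEEV", "MEKV"]:
    print(seq, "->", " ".join(build_sa_tokens(seq, struct_seq)))
# MEEV -> Md Ed Ev Vv
# MEKV -> Md Ed Kv Vv
```

The tokens are joined with spaces here only to match the notation in the example above; check the repository README for the exact string format the tokenizer expects.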

The steps above are the usual pipeline for fine-tuning SaProt on your own dataset. It might be complicated for people who are not very familiar with ML techniques. Alternatively, we recommend using ColabSaprot to train your own model with only a few clicks, see here. With ColabSaprot you only have to upload your dataset, and the system will automatically train the model on your data. It also plots the training curve so you can track the training process.
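
For readers taking the manual route, initializing the model with a regression head might look roughly like the sketch below. It is not taken from the SaProt repository: it assumes the `transformers` library is installed, that the checkpoint ID `westlake-repl/SaProt_650M_AF2` is the one you want, and that the generic `EsmForSequenceClassification` head is acceptable for your task. The helper name `load_saprot_regressor` is hypothetical.

```python
# Assumed Hugging Face checkpoint ID; substitute the SaProt weights you actually use.
DEFAULT_CHECKPOINT = "westlake-repl/SaProt_650M_AF2"


def load_saprot_regressor(checkpoint: str = DEFAULT_CHECKPOINT):
    """Load an ESM-style checkpoint with a newly initialized single-output head."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import EsmForSequenceClassification, EsmTokenizer

    tokenizer = EsmTokenizer.from_pretrained(checkpoint)
    # num_labels=1 gives a one-value output; transformers then treats the task as
    # regression and applies a mean-squared-error loss when labels are passed.
    model = EsmForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
    return tokenizer, model


# Example usage (downloads the weights, so it is left commented out):
# tokenizer, model = load_saprot_regressor()
# batch = tokenizer(["MdEdEvVv"], return_tensors="pt")
# prediction = model(**batch).logits  # shape (1, 1); meaningless until fine-tuned
```

The head is randomly initialized, so predictions only become meaningful after fine-tuning on labeled fitness values.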

@avilella
Author

Thanks, I'll have a look at ColabSaprot. My 500,000 protein sequences are part of a corpus that hasn't been seen by any model, but I could use AF2 or similar to generate 3D models for them. We don't have empirical data for the fitness, only the protein sequences, but this corpus of data hopefully will modify the existing models enough so that the answers are not biased by the species that are most represented, e.g. human or mouse. Hopefully that makes sense.

@LTEnjoy
Contributor

LTEnjoy commented Aug 13, 2024

If you don't have experimental labels for the fitness, you could predict the mutational effect in a zero-shot manner. In this case, you don't have to tune the model any further and can directly make predictions for the mutations you are interested in. ColabSaprot provides a specific module for doing so (see this part 3.2), or you can run the provided code to make predictions (see this part).
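
To give a feel for what zero-shot scoring means, here is a small sketch of the log-odds score commonly used with masked language models: mask the mutated position, read the model's probabilities for that position, and compare the mutant residue with the wild-type residue. This is an illustration under stated assumptions, not the ColabSaprot implementation; the function names are hypothetical and the probabilities are made-up numbers standing in for model output.

```python
import math
import re
from typing import Dict, Tuple


def parse_mutation(mutation: str) -> Tuple[str, int, str]:
    """Split a mutation such as 'E3K' into (wild-type residue, 1-based position, mutant residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation: {mutation!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def zero_shot_score(wt_seq: str, mutation: str,
                    masked_probs: Dict[int, Dict[str, float]]) -> float:
    """Return log p(mutant) - log p(wild type) at the masked position."""
    wt_aa, pos, mut_aa = parse_mutation(mutation)
    if wt_seq[pos - 1] != wt_aa:
        raise ValueError(f"position {pos} is {wt_seq[pos - 1]}, not {wt_aa}")
    probs = masked_probs[pos]
    return math.log(probs[mut_aa]) - math.log(probs[wt_aa])


# Made-up probabilities for position 3 of MEEV, as if read from a masked model.
example_probs = {3: {"E": 0.60, "K": 0.15}}
score = zero_shot_score("MEEV", "E3K", example_probs)
print(round(score, 4))  # -1.3863
```

A negative score means the model considers the mutant residue less likely than the wild type at that position, which is usually read as a predicted drop in fitness.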

Even though the model didn't see those protein sequences during training, I think it is capable of predicting the change in fitness to some degree. Hope you can try it out and advance your research :)

@wangjqspace

I have a quick question: if we want to fine-tune SaProt with our own labeled data, how should we prepare the .mdb file? The .mdb files on the website seem to be password-protected, so we can't inspect the data structure. Could you provide a non-password-protected version, for instance for the thermostability dataset? Thanks in advance!

@LTEnjoy
Contributor

LTEnjoy commented Aug 19, 2024

Hi, you could refer to this issue #16 for some details.

@wangjqspace

Thanks for the timely reply. Good day!

@wangjs188

Hello, extensive use of Colab's GPU requires some tricks and additional funding. I primarily conduct wet lab experiments and am not very strong in machine learning. Could you please advise how to convert a CSV file into the "foldseek" folder and the "normal" folder with MDP files for local training, similar to what ColabSaprot does? Is there a detailed tutorial available? Thanks.

@LTEnjoy
Contributor

LTEnjoy commented Sep 2, 2024

Hi, if you have local GPUs for training, you could deploy ColabSaprot on your local server without using Google Cloud. Here is a quick tutorial for the deployment: https://github.com/westlake-repl/SaprotHub/tree/main/local_server

@wangjs188

Thanks for the timely reply.

@har77774

Hello, can I post-train SaProt with my own protein sequences, rather than fine-tuning it?
