Structural Residue Variant Training on Single Protein in a Tetramer Complex #590

lloydtripp · 2024-03-07T15:54:19Z

lloydtripp
Mar 7, 2024

I’m a graduate student at Washington University in St. Louis doing some machine learning for variant effect predictions. I found the DeepRank software to be the best at capturing protein structure as a graph representation for machine learning regression (predicting a continuous variable). This was something I was hoping to implement in my case.

I wanted to make sure my experimental setup makes sense with your software. I used RoseTTAFold2 to predict structures for all possible single amino acid substitutions in a single chain that sits in a multimer (with 3 other chains held constant). These structures would be fed into DR2 to predict functional data (a deep mutational scan of protein function for each protein variant). It seems like the software is set up to capture only parts of a protein (a small peptide region and nearby atoms). Ideally, I’m taking in the whole structure and having it learn the structure to function relationship. The end goal is to learn what about the structure is important.

My main questions:

1. Is there a recommendation towards influence_radius and max_edge_length sizing in this case? I assume as large as possible but I could be wrong.
2. Will DR2 work to put in extreme values for influence_radius and max_edge_length during the QueryCollection to capture the graph representation of the entire protein (and complex)?
3. Is it reasonable to expect the ML to learn based on one protein? Usually these problems have many proteins in their dataset. I have 5k data points so I’m hoping for the best.

Thanks again for making very useful software! The documentation and code were all put together very professionally!

DaniBodor · 2024-03-08T16:43:37Z

DaniBodor
Mar 8, 2024
Maintainer

Hi @lloydtripp , thanks again for reaching out.

To answer your questions

> Is there a recommendation towards influence_radius and max_edge_length sizing in this case? I assume as large as possible but I could be wrong.

I find it hard to answer this tbh, because it's your experiment that you do as you think is best :)
If, as you say, you want to take "the whole structure and having it learn the structure to function relationship", then the only way to do that is to set these parameters very large (larger than the size of your structure) to ensure everything is included. However, this will severely slow the process, maybe even prohibitively so. How slow will obviously depend on your hardware, parallelization opportunities, etc., but I fear that it will still be very slow no matter where you run it. We've never tested this ourselves, though.
The question I would have, and which only you can answer, is: do you really need to graph the entire complex to get your answers? Would you expect a substitution on one end of the complex to have consequences on the other end, that are not already captured by the consequences on nearby residues?
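To make the trade-off above concrete, here is a minimal sketch in plain Python. It does not use DeepRank2 and is not how DeepRank2 builds its graphs; the helper names `max_distance_from` and `count_edges` are made up for illustration, and the coordinates are random stand-ins for residue positions rather than a real structure. It only shows two things: the radius needed around a variant site to include every residue, and how quickly the number of candidate edges grows as the edge cutoff increases.

```python
import itertools
import math
import random


def max_distance_from(site, coords):
    """Largest distance from `site` to any point in `coords`."""
    return max(math.dist(site, c) for c in coords)


def count_edges(coords, max_edge_length):
    """Number of point pairs no further apart than `max_edge_length` (brute force)."""
    return sum(
        1
        for a, b in itertools.combinations(coords, 2)
        if math.dist(a, b) <= max_edge_length
    )


# Synthetic stand-in for residue positions in angstroms; NOT a real structure.
rng = random.Random(0)
coords = [
    (rng.uniform(0, 60), rng.uniform(0, 60), rng.uniform(0, 60))
    for _ in range(300)
]
variant_site = coords[0]

# Any radius at or above this value includes every point around the variant site.
covering_radius = max_distance_from(variant_site, coords)

# Edge counts for increasingly permissive cutoffs.
edge_counts = {
    cutoff: count_edges(coords, cutoff) for cutoff in (5.0, 10.0, 20.0, 40.0)
}

for cutoff, n_edges in edge_counts.items():
    print(f"edge cutoff {cutoff:5.1f} A -> {n_edges} edges")
print(f"radius needed to cover everything: {covering_radius:.1f} A")
```

Once the cutoff exceeds the size of the structure, every pair of nodes is connected, so the edge count grows quadratically with the number of nodes. That growth is one plausible source of the slowdown described above.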

> Will DR2 work to put in extreme values for influence_radius and max_edge_length during the QueryCollection to capture the graph representation of the entire protein (and complex)?

The code is set up such that it should work in theory, but (as mentioned above) it might be too slow to work in practice.

> Is it reasonable to expect the ML to learn based on one protein? Usually these problems have many proteins in their dataset. I have 5k data points so I’m hoping for the best.

@gcroci2 is in a better place to answer this, so will leave it to her (also curious to hear whether you agree with what I said above).

> Thanks again for making very useful software! The documentation and code were all put together very professionally!

Thanks, very glad to hear it 😊


rgayatri · 2024-03-08T18:03:39Z

rgayatri
Mar 8, 2024
Maintainer

I agree with @DaniBodor here.
If you have the resources you could try multiple values for influence_radius and max_edge_length to find the optimal ones that work for your question.
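Trying multiple values could be organized as a small grid sweep. The sketch below is only a skeleton under stated assumptions: `sweep` and `score_fn` are hypothetical names, and `score_fn` stands for whatever pipeline you use to build the graphs with a given pair of values, train a model, and return a score on held-out data (higher is better). A dummy scoring function is included only so that the example runs.

```python
import itertools


def sweep(influence_radii, max_edge_lengths, score_fn):
    """Score every (influence_radius, max_edge_length) pair, best first.

    `score_fn` is a hypothetical user-supplied callable taking the two values
    and returning a validation score where higher is better.
    """
    results = []
    for radius, edge_len in itertools.product(influence_radii, max_edge_lengths):
        results.append((score_fn(radius, edge_len), radius, edge_len))
    return sorted(results, reverse=True)


def dummy_score(radius, edge_len):
    """Placeholder score so the sketch runs; replace with a real pipeline."""
    return -abs(radius - 15.0) - abs(edge_len - 5.0)


ranking = sweep([10.0, 15.0, 20.0], [4.5, 5.0, 6.0], dummy_score)
best_score, best_radius, best_edge_len = ranking[0]
print(f"best pair: influence_radius={best_radius}, max_edge_length={best_edge_len}")
```

Because each pair means regenerating the graphs and retraining, the cost of the sweep scales with the number of pairs, so a coarse grid is probably the place to start.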

Usually, I would recommend against using the whole protein complex (you can still try though), as in my experience the information gets lost in the noise. Am I correct in understanding that you have 5k data points for single amino acid substitutions in a single chain (~250 residues) of a tetramer? I suppose that, except for the substituted site, much of your graph would be identical, so it is possible that your model may overfit.
Ideally, the larger and more diverse the dataset, the better.


gcroci2 · 2024-03-11T15:37:47Z

gcroci2
Mar 11, 2024
Maintainer

Nice to read our first DeepRank2 discussion 🎉🎉 Thanks @lloydtripp for your comments and questions!

I completely agree with @DaniBodor and @rgayatri, and I don't have much to add to their comments.
As for your question number 3, I would say there is no unique answer, but I am happy to reason with you. As @rgayatri said, especially if you include the entire protein complex (which is always the same in the 5k data points), it is very likely that the DL model overfits to that protein. This problem can be (partially) solved by selecting a subregion of the protein around each variant (assuming that the 5k data points represent variants that are all different from each other), so that the network is trained on different regions of the protein rather than on the entire structure for each data point. That said, 5k data points for one protein does not seem like a lot for this type of problem, but it can still be a good starting point. It also depends on your research question, and whether you are only interested in that protein or whether you want to implement something more general, which should work on other proteins as well.
Second, you may have already done this, but I would recommend comparing with the state of the art (of which I am not aware), to see how varied the datasets used are and what performance is achieved. You might also find links to ready-to-use databases that you can use yourself to increase your performance.
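The point about subregions versus the entire structure can be illustrated with a toy calculation. This is a sketch in plain Python, not DeepRank2 code: the helper names `neighborhood` and `jaccard` are made up, and the coordinates are a regular 6 x 6 x 6 lattice with 10 Å spacing standing in for residue positions. It compares which nodes end up in the graph for two distant variant sites, once with a small radius and once with a radius larger than the whole structure.

```python
import math


def neighborhood(site_index, coords, radius):
    """Indices of all points within `radius` of the point at `site_index`."""
    center = coords[site_index]
    return {i for i, c in enumerate(coords) if math.dist(center, c) <= radius}


def jaccard(a, b):
    """Shared fraction of two index sets (1.0 means the sets are identical)."""
    return len(a & b) / len(a | b)


# Synthetic lattice standing in for residue positions; NOT a real structure.
coords = [
    (10.0 * x, 10.0 * y, 10.0 * z)
    for x in range(6)
    for y in range(6)
    for z in range(6)
]

# Two variant sites at opposite corners of the lattice.
site_a, site_b = 0, len(coords) - 1

overlaps = {}
for radius in (12.0, 1000.0):
    nodes_a = neighborhood(site_a, coords, radius)
    nodes_b = neighborhood(site_b, coords, radius)
    overlaps[radius] = jaccard(nodes_a, nodes_b)
    print(f"radius {radius:6.1f} A -> node overlap {overlaps[radius]:.2f}")
```

With the small radius the two variants see disjoint sets of nodes, while with the very large radius both graphs contain exactly the same nodes. In the second case the only thing distinguishing data points is the substituted residue itself, which matches the overfitting concern raised above.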

