(C) R. Das & Das laboratory, Stanford University 2018-2019
A Python package for modeling the statistical physics of RNA folding at the secondary structure level.
Goals:
- Code that is easy to read so that humans can easily extend it to model new RNA physics
- Code with numerous tests built in so that extensions are correct
- A package that can learn from the huge data sets our lab is collecting.
A separate C++ package, zetafoldplus, with the same functionality, matching Python bindings, and likely up to 100x the speed, is being developed in a private repository.
This code brings together features pioneered in (but scattered across) prior packages:
- Multi-strand calculations
- Circular RNAs
- Co-axial stacking
- True partition function calculations in N^3 time
- Base pair probability estimates
- Gradients of predicted observables with respect to energy model parameters, to enable learning from data
- Enumerative backtracking to get all structures and their Boltzmann weights
- Stochastic backtracking to get Boltzmann-sampled structures
- Minimum free energy structures
- Rapid calculation of gradients (mostly N^2) to enable efficient learning from large data sets
- Modeling of ligand/protein binding to RNA hairpins and internal loops (coming soon)
- Modeling of protein binding to RNA single-stranded segments (coming soon)
- Generalized base pairs (e.g., both Watson-Crick and Sugar/Hoogsteen G-A pairs) (coming soon)
- 'Classic' Turner2004 & ContraFold parameters (coming soon)
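To make the "partition function in N^3 time" item concrete, here is a minimal sketch of a cubic-time recursion on a deliberately simplified model. It is not zetafold's code or energy model: the uniform `pair_weight`, the `MIN_LOOP` constraint, and all function names are assumptions made for illustration. A brute-force enumeration is included as an independent check of the recursion.

```python
import itertools

# Toy model (assumption): Watson-Crick and G-U pairs only, a hairpin must
# enclose at least MIN_LOOP unpaired bases, and every pair has the same weight.
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3


def can_pair(seq, i, j):
    """True if bases i < j may pair in the toy model."""
    return j - i > MIN_LOOP and (seq[i], seq[j]) in ALLOWED


def partition_function(seq, pair_weight=2.0):
    """Sum of Boltzmann weights over all non-crossing structures, in O(N^3)."""
    n = len(seq)
    # z[i][j] is the partition function of seq[i:j]; empty segments have weight 1.
    z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            last = j - 1
            total = z[i][last]  # case 1: the last base is unpaired
            for k in range(i, last):  # case 2: the last base pairs with base k
                if can_pair(seq, k, last):
                    total += z[i][k] * pair_weight * z[k + 1][last]
            z[i][j] = total
    return z[0][n]


def is_valid(pairs):
    """Each base in at most one pair, and no two pairs cross."""
    used = set()
    for i, j in pairs:
        if i in used or j in used:
            return False
        used.update((i, j))
    for (i, j), (k, l) in itertools.combinations(pairs, 2):
        if i < k < j < l or k < i < l < j:
            return False
    return True


def brute_force(seq, pair_weight=2.0):
    """Exponential-time check: sum weights over every valid subset of pairs."""
    n = len(seq)
    candidates = [(i, j) for i in range(n) for j in range(i + 1, n) if can_pair(seq, i, j)]
    total = 0.0
    for r in range(len(candidates) + 1):
        for combo in itertools.combinations(candidates, r):
            if is_valid(combo):
                total += pair_weight ** r
    return total


seq = "GGGAAACCC"
print(partition_function(seq), brute_force(seq))
```

The two numbers should agree. The brute-force check is only practical for very short sequences, which is the reason a polynomial-time recursion matters.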
This code also presents entirely new features, based on recent theoretical insights from R. Das & laboratory:
- Cross-checks based on computation of the partition function N different ways for each RNA.
- Linear motifs identified by Rosetta or by crystallography as having favorable energy bonuses (coming soon)
- Loop penalties that rise like the logarithm of the number of loop nucleotides, still in N^3 time (coming soon)
- Parameters for chemically modified bases, and some modified backbones, based on Rosetta calculations (coming soon)
- Modeling of protein binding to RNA, including proper steric exclusion effects. (coming soon)
- Modeling of RNA tertiary contacts, through a novel iterative sampling method, Rosetta-calculated properties of the contacts, and efficient C_eff calculations. (coming soon)
- Tracking and propagation of estimated model uncertainties. (coming soon)
- Easy install through sudo pip (coming soon)
This code is released under the MIT license, so you can distribute it with your own code.
Clone this repository, and just type:
./zetafold.py
to run tests on a bunch of example sequences.
To run on tRNA(phe) from yeast and get a (pseudo)MFE structure:
./zetafold.py -s GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA --mfe
To get base pair probabilities for tRNA(phe) from yeast (takes about 2x the computation):
./zetafold.py -s GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA --bpp
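For readers new to the term, a base pair probability is the total Boltzmann weight of all structures that contain a given pair, divided by the partition function. The sketch below illustrates that definition by explicitly enumerating structures for a short sequence. It is a toy, not zetafold's implementation: the uniform `pair_weight`, the `MIN_LOOP` rule, and the function names are assumptions.

```python
from collections import defaultdict

# Toy model (assumption): Watson-Crick and G-U pairs, hairpins of >= 3 bases,
# and the same Boltzmann weight for every pair.
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3


def can_pair(seq, i, j):
    return j - i > MIN_LOOP and (seq[i], seq[j]) in ALLOWED


def enumerate_structures(seq, i=0, j=None):
    """Yield each non-crossing structure of seq[i:j] exactly once, as a pair list."""
    if j is None:
        j = len(seq)
    if j <= i:
        yield []
        return
    last = j - 1
    # The last base is unpaired.
    yield from enumerate_structures(seq, i, last)
    # The last base pairs with k; recurse to the left of k and inside the pair.
    for k in range(i, last):
        if can_pair(seq, k, last):
            for left in enumerate_structures(seq, i, k):
                for inner in enumerate_structures(seq, k + 1, last):
                    yield left + [(k, last)] + inner


def base_pair_probabilities(seq, pair_weight=2.0):
    """Return {(i, j): probability} from an explicit sum over all structures."""
    z = 0.0
    weighted = defaultdict(float)
    for pairs in enumerate_structures(seq):
        weight = pair_weight ** len(pairs)
        z += weight
        for pair in pairs:
            weighted[pair] += weight
    return {pair: w / z for pair, w in weighted.items()}


for pair, prob in sorted(base_pair_probabilities("GGGAAACCC").items()):
    print(pair, round(prob, 3))
```

Enumeration grows exponentially with sequence length, so it only serves to show what the reported probabilities mean.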
To circularize:
./zetafold.py -s GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA --circle
To run on a multi-strand system, type:
./zetafold.py -s GCAACG CGAAGC
To re-run tRNA as a totally weird circular permutation:
./zetafold.py -s UGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGAC --circle
This should give the same answer as the linear case above!
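If you want to script these command lines, a small wrapper along the following lines may help. It is only a sketch: the helper names are hypothetical, it uses just the flags shown above (-s, --mfe, --bpp, --circle), and it assumes zetafold.py sits in the current directory. Whether the flags can be combined, and what the output looks like, is not covered here, so the wrapper simply returns the raw text.

```python
import subprocess

TRNA_PHE = "GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCACCA"


def circular_permutation(sequence, cut):
    """Split a linear sequence at `cut` and list the 3' part first, as two strands."""
    return [sequence[cut:], sequence[:cut]]


def build_command(sequences, mfe=False, bpp=False, circle=False, script="./zetafold.py"):
    """Assemble the argument list for one of the invocations shown above."""
    command = [script, "-s", *sequences]
    if mfe:
        command.append("--mfe")
    if bpp:
        command.append("--bpp")
    if circle:
        command.append("--circle")
    return command


def run_zetafold(sequences, **options):
    """Run the script and return its standard output as text (format not parsed)."""
    result = subprocess.run(
        build_command(sequences, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Rebuild the circular-permutation command from the tRNA example above.
strands = circular_permutation(TRNA_PHE, 32)
print(" ".join(build_command(strands, circle=True)))
```

Calling `run_zetafold([TRNA_PHE], mfe=True)` from the repository root would then run the MFE example and hand back whatever the script prints.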
More information on making contributions coming soon.