Engineering of enzymes through UniRep machine learning and in silico evolution, targeting the Calvin cycle enzymes fructose-1,6-bisphosphatase/sedoheptulose-1,7-bisphosphatase (FBPase/SBPase) and ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco), among others.
Evotuning was performed on a GCP VM with two vCPUs, 13 GB RAM, and one NVIDIA Tesla T4 GPU with 16 GB VRAM. Other tasks were performed on two Linux systems: one with 16 cores and 128 GB RAM (Ubuntu 18.04.5 LTS), and one with 12 cores, 32 GB RAM, and an NVIDIA RTX 2070 SUPER GPU with 8 GB VRAM (Ubuntu 20.04.3 LTS). Top model fitting and in silico evolution (Usage steps 4-6) may be performed on a regular laptop.
Programs
Software | Version | Tested version | Note |
---|---|---|---|
Linux OS | Ubuntu 18.04.5 LTS and 20.04.3 LTS | ||
Bash | 4.0 | 4.4.20, 5.0.17 | |
Python | 3.7 | 3.7.6, 3.8.3 | |
R | 3.6.3 | 3.6.3, 4.1.1 | |
GNU parallel | 20161222 | 20161222 | |
hmmer | 3.1b2 | 3.1b2 | |
cd-hit | 4.8.1 | 4.8.1 | |
pigz | 2.4 | 2.4 | Optional; Only needed for phateR scripts. |
fasttreeMP | 2.1.11 | 2.1.11 | Optional; Not needed for Usage. |
Python libraries
Library | Version | Tested version | Note |
---|---|---|---|
BioPython | 1.76 | 1.76, 1.77 | |
python-levenshtein | 0.12.0 | 0.12.0 | |
jax | 0.2.5 | 0.2.5, 0.2.24 | |
jax-unirep | 2.1.0 | 2.1.0 | |
numpy | 1.18.1 | 1.18.1, 1.18.5 | |
pandas | 0.25.3 | 0.25.3, 1.0.5 | |
scipy | 1.4.1 | 1.4.1, 1.5.0 | |
scikit-learn | 0.22.1 | 0.22.1, 0.23.1 |
R libraries
Library | Version | Tested version | Note |
---|---|---|---|
tidyverse | 1.3.1 | 1.3.1 | |
egg | 0.4.5 | 0.4.5 | |
doMC | 1.3.7 | 1.3.7 | |
foreach | 1.5.1 | 1.5.1 | |
phateR | 1.0.7 | 1.0.7 | Optional; Not needed for Usage, only phateR scripts. |
phytools | 0.7-90 | 0.7-90 | Optional; Not needed for Usage. |
ggtree | 3.0.4 | 3.0.4 | Optional; Not needed for Usage. |
Download the FUREE repository from GitHub and enter the directory (should take less than a minute, depending on the internet connection):
git clone https://github.com/Asplund-Samuelsson/furee.git
cd furee
The analysis uses the user-friendly JAX implementation of UniRep named jax-unirep. It may be installed from PyPI as described below (see the jax-unirep GitHub repository for more details):
pip3 install jax-unirep
To enable CUDA GPU support, you may need to install the correct JAX packages; see instructions in the JAX repository.
This repository includes two evotuned models, for FBPase (`results/FBPase/evotuned/iter_final`) and Rubisco (`results/Rubisco/evotuned/iter_final`). Therefore the UniProt and NCBI taxonomy databases are only needed if you want to carry out steps 1-3 in the Usage section.
Training sequences will be extracted from the UniProt database.
Install the full UniProt database (required for meaningful evotuning).
Run these commands to download the necessary FASTA files for SwissProt and TrEMBL, constituting UniProt (may take several hours):
cd data/uniprot
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
zcat uniprot_sprot.fasta.gz uniprot_trembl.fasta.gz > uniprot.fasta
rm uniprot_sprot.fasta.gz uniprot_trembl.fasta.gz # Optional cleanup
cd ../..
Install the SwissProt database (recommended for testing).
Run these commands to download the necessary FASTA file for SwissProt (should take less than a minute):
cd data/uniprot
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
zcat uniprot_sprot.fasta.gz > uniprot.fasta
rm uniprot_sprot.fasta.gz # Optional cleanup
cd ../..
This analysis uses the NCBI taxonomy database to assign taxonomic classifications to UniProt sequences.
Install the NCBI taxonomy database (recommended for testing).
Run these commands to download the necessary `names.dmp` and `nodes.dmp` files (should take less than a minute):
cd data/ncbi/taxonomy
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz names.dmp nodes.dmp
rm taxdump.tar.gz # Optional cleanup
cd ../../..
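Both files use the NCBI `.dmp` format: fields separated by `\t|\t` and lines terminated by `\t|`. As a minimal sketch of how a taxonomy ID can be resolved to its named lineage from these files (the actual FUREE scripts may parse them differently; the excerpts below are synthetic and abbreviated):

```python
# Sketch: resolve an NCBI taxonomy ID to its named lineage using the
# names.dmp / nodes.dmp format. Synthetic, abbreviated excerpts are used
# here; the real files contain millions of nodes.

def parse_dmp(text):
    """Split NCBI .dmp lines into fields (separator '\t|\t', terminator '\t|')."""
    for line in text.strip().split("\n"):
        yield line.rstrip("\t|").split("\t|\t")

# Synthetic excerpts in the real .dmp format (lineage abbreviated)
NODES = "1\t|\t1\t|\tno rank\t|\n2\t|\t1\t|\tsuperkingdom\t|\n1148\t|\t2\t|\tspecies\t|"
NAMES = ("1\t|\troot\t|\t\t|\tscientific name\t|\n"
         "2\t|\tBacteria\t|\t\t|\tscientific name\t|\n"
         "1148\t|\tSynechocystis sp. PCC 6803\t|\t\t|\tscientific name\t|")

parent = {f[0]: f[1] for f in parse_dmp(NODES)}
name = {f[0]: f[1] for f in parse_dmp(NAMES) if f[3] == "scientific name"}

def lineage(taxid):
    """Walk from a taxid up to the root, collecting scientific names."""
    out = []
    while taxid in parent:
        out.append(name.get(taxid, taxid))
        if parent[taxid] == taxid:  # the root points to itself
            break
        taxid = parent[taxid]
    return list(reversed(out))

print(lineage("1148"))  # ['root', 'Bacteria', 'Synechocystis sp. PCC 6803']
```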
This guide describes how to get training data, carry out evotuning, fit a top model, and perform in silico evolution using FUREE.
Evotuned models for FBPase and Rubisco are provided (`results/FBPase/evotuned/iter_final/` and `results/Rubisco/evotuned/iter_final/`), so if you wish you may skip the first three steps and go directly to Fit a top model.
The evotuning training data are UniProt sequences extracted through iterative JackHMMer searches with a small, but diverse, set of query sequences related to the protein to be evolved in silico. When preparing the training data in the next step, the input is a FASTA file with such query sequences.
FUREE offers a programmatic approach to obtaining query sequence suggestions in the form of UniProt sequences representing KEGG orthologs (KOs). The program takes a comma-separated list of KOs (option `-k`) and/or a file with one KO per line (option `-i`) and saves the corresponding UniProt sequences to a FASTA file.
FBPase:
bash source/uniprot_sequences_from_KO.sh \
-i data/FBPase_KOs.txt \
-o intermediate/FBPase_KOs.fasta
Rubisco:
bash source/uniprot_sequences_from_KO.sh \
-k K01601 \
-o intermediate/Rubisco_KOs.fasta
Since we want a manageable number of sequences as queries for the JackHMMer search, we need to reduce them to cluster representatives based on sequence identity.
FBPase:
cd-hit -c 0.5 -n 2 \
-i intermediate/FBPase_KOs.fasta \
-o intermediate/FBPase_KOs.cdhit.fasta
Rubisco:
cd-hit -c 0.7 -n 5 \
-i intermediate/Rubisco_KOs.fasta \
-o intermediate/Rubisco_KOs.cdhit.fasta
Finally, we add another query sequence, e.g. the target FBPase or Rubisco from Synechocystis, that we think might be particularly important.
FBPase:
cat \
intermediate/FBPase_KOs.cdhit.fasta \
data/Syn6803_P73922_FBPase.fasta \
> intermediate/FBPase_queries.fasta
Rubisco:
cat \
intermediate/Rubisco_KOs.cdhit.fasta \
data/Syn6803_P54205_Rubisco.fasta \
> intermediate/Rubisco_queries.fasta
The query sequences are used for JackHMMer searches against the UniProt sequence database, and the hits are subsequently filtered (see the table below for details). We run a preparation script with the query sequences, the in silico evolution target sequence, the redundancy-filtering identity fraction, the number of MADs defining the allowed length range, and the Levenshtein distance cutoff, directing the output to a directory of choice.
Note 1: The training data preparation requires the UniProt database (`data/uniprot/uniprot.fasta`) and the NCBI taxonomy database (`names.dmp` and `nodes.dmp` in `data/ncbi/taxonomy/`). See Installation.
Note 2: Because of JackHMMer, the preparation may take several days if searching the full UniProt database. For testing the preparation script it is recommended to use only SwissProt, which can be searched in just a few minutes. See Installation.
FBPase:
bash source/preparation.sh \
intermediate/FBPase_queries.fasta \
data/Syn6803_P73922_FBPase.fasta \
1.0 1 300 \
results/FBPase
Rubisco:
bash source/preparation.sh \
intermediate/Rubisco_queries.fasta \
data/Syn6803_P54205_Rubisco.fasta \
0.99 1.5 400 \
results/Rubisco
# | Output | Description |
---|---|---|
1 | `jackhmmer/` | Use JackHMMer at default settings and a maximum of five iterations to find candidate UniProt sequences for evotuning. Require full sequence E-value < 0.01, best 1-domain E-value < 0.03, and number of hits within three median absolute deviations from the median (Hampel filter). |
2 | `cdhit/` | Use CD-HIT to make identified sequences unique. |
3 | `aa/` | Require that sequences consist of only standard amino acids. |
4 | `distance/` | Require that sequence lengths fall within the defined number of median absolute deviations from the median, and that the Levenshtein distance to the target is less than or equal to the defined value. |
5 | `taxonomy/` | Assess taxonomic distribution of training data. |
6 | `train.txt` | Save training sequences to file with one sequence per line. |
 | `hits.txt` | Summarize UniProt annotations of training sequences. |
The final unique and filtered training sequences are stored in the `train.txt` file in the output directory.
Rubisco:
results/Rubisco/train.txt
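The length and distance filters applied in step 4 of the preparation can be sketched as follows. This is an illustration with hypothetical parameter names, using a minimal edit-distance function in place of the python-levenshtein library; see `source/preparation.sh` for the real pipeline:

```python
# Illustrative sketch of the step 4 filters: keep sequences whose length
# lies within a number of median absolute deviations (MADs) of the median
# length, then whose Levenshtein distance to the target sequence is at
# most the cutoff. Parameter names are hypothetical.
from statistics import median

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def filter_training(seqs, target, n_mads=1.5, max_dist=400):
    lengths = [len(s) for s in seqs]
    med = median(lengths)
    mad = median(abs(l - med) for l in lengths)
    kept = [s for s in seqs if abs(len(s) - med) <= n_mads * mad]
    return [s for s in kept if levenshtein(s, target) <= max_dist]
```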
(Figure: Taxonomic distribution and Levenshtein distance to target Rubisco.)
The UniRep model weights, or parameters, must be re-trained, or evotuned, to the local evolutionary context of our in silico evolution target sequence. To do so we supply our training sequences to the evotuning script.
Note 1: Evotuning should be run on a GPU. Training a model on the CPU (option `--cpu`) is very slow and may even stall, misbehave, or crash.
Note 2: If running this on GCP, start `screen` first so the job continues after disconnecting. Press Ctrl-A followed by Ctrl-D to detach from the screen and keep it running.
Evotune with FBPase training sequences.
python3 source/evotune.py \
--epochs 100 --validation 0.05 \
--step 1e-5 --batch 128 --dumps 1 \
results/FBPase/train.txt \
results/FBPase/evotuned &
Evotune with Rubisco training sequences.
python3 source/evotune.py \
--epochs 100 --validation 0.05 \
--step 1e-5 --batch 128 --dumps 1 \
results/Rubisco/train.txt \
results/Rubisco/evotuned &
The evotuning produces various metadata, which are listed below.
Output | Description |
---|---|
`validation_sequences.txt` | Validation sequences, one per line. |
`evotuning.log` | Log with loss values and timings. |
`evotuning.tab` | Training and validation losses in tab-delimited format. |
`evotuning.png` | Plot of loss development across epochs. |
The evotuned model parameters are saved in a Python Pickle file.
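Parameters saved this way can be reloaded with the standard `pickle` module. The file name below is hypothetical; inspect the contents of the `iter_final` output directory for the actual file produced by `evotune.py`:

```python
# Sketch of the pickle round trip used to store and reuse evotuned
# parameters. The dictionary and file name here are stand-ins, not the
# real mLSTM weight layout.
import os
import pickle
import tempfile

params = {"b": [0.0, 0.1], "w": [[1.0, 2.0]]}  # stand-in for real weights
path = os.path.join(tempfile.mkdtemp(), "evotuned_params.pkl")

with open(path, "wb") as handle:
    pickle.dump(params, handle)       # what the evotuning step does, in spirit

with open(path, "rb") as handle:
    reloaded = pickle.load(handle)    # how a later step can reuse the weights

assert reloaded == params
```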
The validation sequences were transformed into UniRep representations (1,900 floating point values per protein) and then subjected to PHATE dimensionality reduction to visualize the "sequence landscape" as seen by the original UniRep model, and the models trained with FBPase and Rubisco sequences. The following scripts made this possible:
source/phate_fbpase_rubisco.sh
source/get_representations.py
source/phate_fbpase_rubisco.R
A top model leverages the protein representations produced by evotuned UniRep parameters to predict performance of novel sequences.
To fit a top model, it is necessary to provide sequences and associated values that are to be improved through directed evolution. Sequences and values should be saved in a tab-delimited format as in e.g. `data/FBPase_dummy.train.tab` (here we look only at the last 60 characters of each line to save space):
head -5 data/FBPase_dummy.train.tab | rev | cut -c 1-60 | rev
CGITPGTLMQGVQFFHNGARTQSLVISSQSRTARFVDTIHMFDKLEYVQLR 0.161198
CGITPGTLMEGVRFFHGGARTQSLVISSQSKTARFVDTVHMTDQPKTIQLK 0.078044
CGITPGTLMEGVRFFHGGARTQSLVISSQSKTARFVDTIHMFDQPKSIQLR 0.017744
CGITPGSLMEGVRFFGGGARTQSLVISNQSQTARFVDTIHLFDNVKSLQLR 0.173089
CGITPGTLMEGVRFFKGGARTQSLVISSQSQTARFVDTIHMFEEPKVLQLR 0.246331
We fit the Ridge Regression Sparse Refit top model using evotuned UniRep parameters for the underlying representations.
FBPase:
python3 source/train_top_model.py \
-p results/FBPase/evotuned/iter_final \
data/FBPase_dummy.train.tab \
intermediate/FBPase_dummy.top_model.pkl
Rubisco:
python3 source/train_top_model.py \
-p results/Rubisco/evotuned/iter_final \
data/Rubisco_dummy.train.tab \
intermediate/Rubisco_dummy.top_model.pkl
For additional options, refer to the help:
python3 source/train_top_model.py --help
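As a rough sketch of the idea behind a ridge regression sparse refit (fit ridge on all features, keep the features with the largest absolute coefficients, and refit on that subset), using synthetic data. This is an illustration of the general technique, not the actual logic of `source/train_top_model.py`, which may differ:

```python
# Sketch of a ridge regression "sparse refit" on synthetic data: a full
# ridge fit selects the most informative features, and a second ridge fit
# is made on that subset only. Dimensions are toy-sized stand-ins for the
# 1,900-dimensional UniRep representations.
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression coefficients."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))              # stand-in for representations
beta = np.zeros(20)
beta[:3] = [2.0, -1.5, 1.0]                # only three informative features
y = X @ beta + rng.normal(scale=0.1, size=60)

coef = ridge_fit(X, y, alpha=1.0)                   # full ridge fit
keep = np.argsort(np.abs(coef))[-5:]                # sparse step: top |coef|
coef_refit = ridge_fit(X[:, keep], y, alpha=1.0)    # refit on the subset

recovered = sorted(int(k) for k, c in zip(keep, coef_refit) if abs(c) > 0.5)
print(recovered)  # the informative features should be among these
```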
Predictions with an already fitted top model can be made on sequences in a file with one sequence per line.
FBPase:
fold -w 80 -s data/Syn6803_P73922_FBPase.txt
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRERMNKIHMRGRIVIGEGERDDAPMLYIGEEVGICTREDA
KSFCNPDELVEIDIAVDPCEGTNLVAYGQNGSMAVLAISEKGGLFAAPDFYMKKLAAPPAAKGHVDIDKSATENLKILSD
CLNRSIEELVVVVMDRPRHKELIQEIRNAGARVRLISDGDVSAAISCAFSGTNIHALMGIGAAPEGVISAAAMRCLGGHF
QGQLIYDPEVVKTGLIGESREGNLERLASMGIKNPDQVYNCEELACGETVLFAACGITPGTLMEGVRFFHGGVRTQSLVI
SSQSSTARFVDTVHMKESPKVIQLH
python3 source/top_model_prediction.py \
-p results/FBPase/evotuned/iter_final \
data/Syn6803_P73922_FBPase.txt \
intermediate/FBPase_dummy.top_model.pkl \
results/FBPase/Syn6803_P73922_FBPase.prediction.tab
The prediction for FBPase is expected to be `0.16359496894461598`.
Rubisco:
fold -w 80 -s data/Syn6803_P54205_Rubisco.txt
MVQAKAGFKAGVQDYRLTYYTPDYTPKDTDLLACFRMTPQPGVPAEEAAAAVAAESSTGTWTTVWTDNLTDLDRYKGRCY
DLEAVPNEDNQYFAFIAYPLDLFEEGSVTNVLTSLVGNVFGFKALRALRLEDIRFPVALIKTFQGPPHGITVERDKLNKY
GRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENINSQPFMRWRDRFLFVQEAIEKAQAETNEMKGHYLNVTAG
TCEEMMKRAEFAKEIGTPIIMHDFFTGGFTANTTLARWCRDNGILLHIHRAMHAVVDRQKNHGIHFRVLAKCLRLSGGDH
LHSGTVVGKLEGERGITMGFVDLMREDYVEEDRSRGIFFTQDYASMPGTMPVASGGIHVWHMPALVEIFGDDSCLQFGGG
TLGHPWGNAPGATANRVALEACVQARNEGRNLAREGNDVIREACRWSPELAAACELWKEIKFEFEAMDTL
python3 source/top_model_prediction.py \
-p results/Rubisco/evotuned/iter_final \
data/Syn6803_P54205_Rubisco.txt \
intermediate/Rubisco_dummy.top_model.pkl \
results/Rubisco/Syn6803_P54205_Rubisco.prediction.tab
The prediction for Rubisco is expected to be `0.1516921772674731`.
For additional options, refer to the help:
python3 source/top_model_prediction.py --help
The in silico evolution is carried out using a set of evotuned parameters, a top model, and one starting sequence.
FBPase:
python3 source/in_silico_evolution.py \
-s 50 -t 15 \
-p results/FBPase/evotuned/iter_final \
data/Syn6803_P73922_FBPase.txt \
intermediate/FBPase_dummy.top_model.pkl \
results/FBPase/Syn6803_P73922_FBPase.evolved.tab
Rubisco:
python3 source/in_silico_evolution.py \
-s 50 -t 15 \
-p results/Rubisco/evotuned/iter_final \
data/Syn6803_P54205_Rubisco.txt \
intermediate/Rubisco_dummy.top_model.pkl \
results/Rubisco/Syn6803_P54205_Rubisco.evolved.tab
The output tab-delimited file contains evolved sequences (column `sequences`), predicted values for each sequence (`scores`), status of acceptance for the next iteration in the evolution algorithm (`accept`), and the step (`step`).
For additional options, refer to the help:
python3 source/in_silico_evolution.py --help
We use multiple top models to guide the in silico evolution by extending the function in JAX-UniRep that decides whether a sequence is accepted (`is_accepted`) to an arbitrary number of scores.
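One way such an extension can work is to give each top model its own Metropolis acceptance term and accept a candidate only if a single uniform draw falls below the product of the terms. The sketch below is illustrative; the modified `is_accepted` in this repository may combine scores differently:

```python
# Sketch of a Metropolis-style acceptance criterion extended to multiple
# scores: each score contributes min(1, exp(delta / T)) and the candidate
# is accepted with the product of these probabilities.
import math
import random

def is_accepted_multi(best_scores, candidate_scores, temperature=0.1, rng=random):
    p = 1.0
    for best, cand in zip(best_scores, candidate_scores):
        # Improvements are always accepted (term = 1); deteriorations are
        # accepted with probability exp(delta / T). Capping the exponent
        # at 0 avoids overflow for large improvements.
        p *= math.exp(min((cand - best) / temperature, 0.0))
    return rng.random() < p

# When every score improves, the acceptance probability is 1
print(is_accepted_multi([0.4, 5.0], [0.5, 5.2]))  # True
```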
First we create tables with mutant data for 15 Synechocystis FBPase sequences from Feng et al. 2013:
python3 source/feng_mutants.py
data/FBPase_km.train.tab
data/FBPase_kcat.train.tab
Then we train three top models (tree height, Km, and kcat):
mkdir intermediate/top_models
python3 source/train_top_model.py \
-p results/FBPase/evotuned/iter_final \
data/FBPase_dummy.train.tab \
intermediate/top_models/height.pkl
python3 source/train_top_model.py \
-p results/FBPase/evotuned/iter_final \
--max_alpha 3 \
data/FBPase_km.train.tab \
intermediate/top_models/km.pkl
python3 source/train_top_model.py \
-p results/FBPase/evotuned/iter_final \
--max_alpha 3 \
data/FBPase_kcat.train.tab \
intermediate/top_models/kcat.pkl
Note: It might be necessary to adjust the regularization alpha range so as not to over-regularize the top model. This was done for the Km and kcat values in this example by capping regularization at alpha 10^3 instead of the default 10^6.
Finally, we may perform in silico evolution using three top models simultaneously:
python3 source/in_silico_evolution.py \
-s 500 -t 15 -T 0.001 \
-p results/FBPase/evotuned/iter_final \
data/Syn6803_P73922_FBPase.txt \
intermediate/top_models \
results/FBPase/Syn6803_P73922_FBPase.multi_evolved.tab
Note 1: NumPy might report `RuntimeWarning: overflow encountered in exp` in the `is_accepted` function during the MCMC sampling. This should be fine: it is associated with raising e to the power of a large negative number, which yields `0.0`, as intended.
Note 2: To make the evolution across multiple top models more even, it might be necessary to modify the MCMC temperature (`-T`) and the number of steps (`-s`).
With some shell code we may take a look at the accepted sequences in the output table, cutting off each column for convenience:
(
head -1 results/FBPase/Syn6803_P73922_FBPase.multi_evolved.tab
paste \
<(
grep True results/FBPase/Syn6803_P73922_FBPase.multi_evolved.tab | \
awk '{OFS="\t"} {for(i=1;i<=NF;i++) $i=substr($i,1,42)}1' | cut -f 1
) \
<(
grep True results/FBPase/Syn6803_P73922_FBPase.multi_evolved.tab | \
awk '{OFS="\t"} {for(i=1;i<=NF;i++) $i=substr($i,1,6)}1' | cut -f 2-
)
) | column -tn -s $'\t'
sequences km height kcat accept step
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.4403 0.1635 4.5727 True 0
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.4416 0.1649 4.7796 True 5
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.4712 0.1672 4.9850 True 13
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.4919 0.1692 5.0271 True 42
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.4958 0.1707 5.0537 True 56
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.5016 0.1710 5.0873 True 98
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.5075 0.1761 5.1062 True 116
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.5281 0.1762 5.2091 True 148
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.5307 0.1778 5.2345 True 152
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMRE 0.5307 0.1778 5.2345 True 153
MDSTLGLEIIEVVEQAAIASAKWMGKGEKNTADQVAVEAMIE 0.5476 0.1895 5.2951 True 154
MDSTLGLEIIEVVEQAAIASAKWMGFGEKNTADQVAVEAMIE 0.5501 0.2217 5.9309 True 161
MDSTLGLEIIEVVEQAAIASAKWMGFGEKNTADQVAVEAMIE 0.5507 0.2214 6.0149 True 166
MDSTLGLEIIEVVEQAAIASAKWMGFGEKNTADQVAVEAMIE 0.5554 0.2230 6.0231 True 216
MDSTLGLEIIEVVEQAAIASAKWMGFGEKNTADQVAVEAMIE 0.5572 0.2255 6.0607 True 233
MDSTLGLEIIEVVEQAAIASAKWMGFGEKNTADQVAVEAMIE 0.5673 0.2363 6.0729 True 259
MDSTLGLEIIEVVEQAAIASAKWMGFGEKNTADQVAVEAMIE 0.5675 0.2394 6.0828 True 350
Note 1: There are R41I and K26F mutations at steps 154 and 161.
Note 2: The first sequence is the wild-type Synechocystis FBPase. Its reported Km is 0.08 and its kcat is 10.5, so these top models are clearly performing suboptimally.
By default, the score(s) of a proposed mutant sequence are compared with the score(s) of the current best sequence by taking the difference. With a single score this works regardless of magnitude, but with multiple scores of different magnitudes it can bias the progress of the evolution. The `is_accepted` function raises e to the power of the difference divided by the temperature, so even a small fractional reduction in a high-magnitude score tends to produce values close to zero. Values close to zero are unlikely to exceed the uniformly sampled value between 0 and 1 that is required for acceptance. In other words, slight reductions in high-magnitude scores are more likely to be rejected than the same fractional reductions in low-magnitude scores. Therefore the option to compare scores by ratio (option `--ratio`, `-R`) instead of by difference was introduced.
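A small numeric illustration of this argument, assuming Metropolis-style acceptance probabilities (the exact expressions in `in_silico_evolution.py` may differ): the same 2% reduction is heavily penalized under the difference method when the score magnitude is high, while the ratio method treats both magnitudes alike.

```python
# Difference vs ratio comparison of a 2% score reduction at two
# different score magnitudes, at temperature T = 0.01.
import math

T = 0.01

def p_diff(best, cand):
    """Acceptance probability based on the score difference."""
    return math.exp(min((cand - best) / T, 0.0))

def p_ratio(best, cand):
    """Acceptance probability based on the score ratio."""
    return math.exp(min((cand / best - 1.0) / T, 0.0))

low_best, low_cand = 0.10, 0.098    # 2% reduction of a low-magnitude score
high_best, high_cand = 10.0, 9.8    # 2% reduction of a high-magnitude score

print(round(p_diff(low_best, low_cand), 3))     # 0.819 (e^-0.2)
print(round(p_diff(high_best, high_cand), 3))   # 0.0 (e^-20: near-certain rejection)
print(round(p_ratio(low_best, low_cand), 3))    # 0.135 (e^-2)
print(round(p_ratio(high_best, high_cand), 3))  # 0.135 (same as the low-magnitude case)
```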
To investigate the effect of difference and ratio comparison methods, as well as different temperature settings, multiple in silico evolution runs were carried out. Eight replicate in silico evolution chains were performed over 200 steps, with temperatures 0.003, 0.01, 0.03, 0.1, and 0.3, using difference or ratio as the comparison method:
bash source/diff_vs_ratio.sh
The results (`results/diff_vs_ratio/`) were visualized using R:
Rscript source/diff_vs_ratio.R
Evotuning using FBPase sequences was performed and evaluated in order to get acquainted with the JAX-UniRep framework and develop the tools in this repository. The steps to acquire example FBPase sequences and then evotune the UniRep model are described in `source/fbpase_evotuning_example.sh`.
Note: The analysis described below used an earlier version of JAX-UniRep and FUREE, and may be incompatible with the current versions.
The JackHMMer search for training sequences in UniProt begins with a set of known relatives of the target sequence. We obtain this set from KEGG orthologs K01086 and K11532 by following the instructions in this bash script:
source/get_FBPase_sequences.sh
...yielding this FASTA file with initial reference protein sequences:
data/kegg_uniprot.FBPase.fasta
JackHMMer was used to find 99,327 example sequences in the UniProt database. These sequences were mostly bacterial.
The evotuning held out 4,967 sequences for validation and optimized parameters using the remaining 94,360 sequences. The losses reported by the JAX-UniRep fit function for the training and validation sequences were plotted for 100 epochs of training with learning rate 1e-5 and batch size 128. The final iteration weights were accepted for future use since overfitting was not evident.
A set of dummy stability (Tm) values were generated to facilitate development of the top model and in silico evolution scripts. First, a phylogenetic tree was constructed based on 63 UniProt FBPase sequences (>85% identity) including the sequence in Synechocystis sp. PCC 6803. Ancestral sequence reconstruction was performed using the FireProt-ASR server. Changes in Tm were sampled going from the root sequence (Tm = 80°C) assuming decreasing stability with a target of Tm = 55°C at the furthest tip of the tree. The process was carried out as described in these scripts:
source/generate_dummy_Tm_data.sh
source/generate_dummy_ancestral_Tm.R
...finally yielding this table of 125 sequence-to-dummy-stability associations:
data/dummy.sequence_Tm.tab
...which represents typical input for training a top model.
The dummy Tm values are visualized in `data/dummy_ancestral_Tm_on_tree.pdf`. Unfortunately, the generated dummy stability values turned out to be pure noise and could not be used for development of the top model.
Instead of the failed stability values, the tree height (distance from the root) of each node (sequence) was used as detailed in these scripts:
source/evaluate_top_model.sh
source/evaluate_top_model.R
...that generated and evaluated the following height data:
data/FBPase_dummy.train.tab
data/FBPase_dummy.test.tab
Splitting the set of ancestral and contemporary sequences into 63 training sequences and 62 testing sequences allowed development of a Ridge regression sparse refit (SR) top model used to predict height in the phylogenetic tree. The top model showed test RMSE ≈ 0.0339 using original UniRep mLSTM parameters and test RMSE ≈ 0.0263 using evotuned parameters.
The validation sequences were transformed into representations using the original UniRep parameters as well as the freshly evotuned FBPase-specific parameters. The representations were in turn used to visualize the sequence landscape using PHATE. The evotuned landscape was distinct from the original UniRep landscape; possibly smoother and hopefully more information-rich.
The Synechocystis sp. PCC 6803 FBPase sequence was subjected to in silico evolution (trust radius 15, 50 steps) guided by the evotuned mLSTM and the top model fitted on tree height values. Subsequently, the 80 best evolved sequences were separately aligned with the original 63 example sequences from "Evaluation of a top model". Trees were then constructed to estimate the change in tree height of the evolved sequences. This showed that the evotuned parameters led to stronger evolution in the forward direction, and rightly made more conservative claims for evolution in the reverse direction (which appeared to be impossible).
Representations were obtained for five FBPase sequences using original and evotuned UniRep parameters. The sequences are Oligotropha carboxidovorans OM5 cbbF and glpX, Ralstonia eutropha H16 cbbF2 and fbp, and Synechocystis sp. PCC 6803 FBPase. The 1,900 representation values were ordered and plotted so that large standard deviations were located to the center of one 38 by 50 pixel image per sequence.
Johannes Asplund-Samuelsson (johannes.aspsam@gmail.com)