SRLV Genotyping Tools

Robert J. Gifford edited this page Nov 27, 2024 · 8 revisions

Overview

The SRLV extension of Lentivirus-GLUE provides functionality for genotyping small ruminant lentivirus (SRLV) sequences via maximum likelihood. Any genomic region can be genotyped using the approach implemented in SRLV-GLUE, provided the sequence is long enough: confident assignment typically requires more than 300 nucleotides.

Classification is based on maximum likelihood clade assignment (MLCA) as implemented in GLUE. Sequences are classified into genotypes and subtypes defined via phylogenetic analysis of full-length reference genome sequences.

Maximum Likelihood Clade Assignment (MLCA) in Lentivirus-GLUE

Lentivirus-GLUE uses Maximum Likelihood Clade Assignment (MLCA) to assign SRLV sequences to genotypes and subtypes.

MLCA is based on the Evolutionary Placement Algorithm (EPA), a feature of the highly optimized RAxML software. RAxML typically generates complete phylogenetic trees from multiple sequence alignments, but EPA allows for efficient clade assignment by placing new sequences onto an existing reference tree without recalculating the entire phylogeny. This efficiency makes EPA well-suited for virus sequence clade assignment, forming the foundation of the MLCA method integrated into GLUE.

In GLUE, the MLCA process is implemented using the maxLikelihoodGenotyper and maxLikelihoodPlacer modules.

Example Usage in Lentivirus-GLUE

The genotyping process in Lentivirus-GLUE can be executed through the command-line interface. Below is an example of using the MLCA genotyping module:

Mode path: /project/lentivirus
GLUE> module srlvMaxLikelihoodGenotyper genotype sequence -w "sequenceID = 'HM449618'"

This command genotypes the stored sequence(s) selected by the where clause (here, the sequence with ID HM449618) and outputs the assigned genotype and subtype clades for each one:

+============================+====================+===================+
|         queryName          | genotypeFinalClade | subtypeFinalClade |
+============================+====================+===================+
| ncbi-nuccore-srlv/HM449618 | AL_TREE_SRLV_A     | AL_TREE_SRLV_A8   |
+============================+====================+===================+
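
If the table above is captured as plain text (for example, copied from the console), it can be converted into records for downstream analysis. The following is a minimal Python sketch, not part of Lentivirus-GLUE: the function name `parse_glue_table` is hypothetical, and the parser assumes only the table layout shown above (border lines starting with `+`, data lines starting with `|`, first data line being the header).

```python
def parse_glue_table(text):
    """Parse a console table like the one above into a list of dicts.

    Hypothetical helper for illustration; it assumes the first '|' line
    is the header and every later '|' line is a result row.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+====+' border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
+============================+====================+===================+
|         queryName          | genotypeFinalClade | subtypeFinalClade |
+============================+====================+===================+
| ncbi-nuccore-srlv/HM449618 | AL_TREE_SRLV_A     | AL_TREE_SRLV_A8   |
+============================+====================+===================+
"""

results = parse_glue_table(sample)
for row in results:
    print(row["queryName"], row["genotypeFinalClade"], row["subtypeFinalClade"])
```

Each row becomes a dictionary keyed by the column names, so results for many queries can be filtered or tallied by genotype or subtype.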

The MLCA Algorithm

The MLCA algorithm operates in three stages: alignment, placement, and neighbor-weighting.

  1. Alignment Stage:
    The first step aligns the query sequences to a reference set of SRLV sequences. This is done with the MAFFT software, using the --add and --keeplength options, which integrate query sequences into the existing multiple sequence alignment without changing its length or column structure. Each query sequence is aligned independently, so the alignment of one query does not depend on which other queries are submitted with it.

  2. Placement Stage:
    In the placement stage, the extended alignment from the previous step is combined with a fixed reference tree. For each query sequence, the algorithm identifies potential placements on the tree that maximize the likelihood of the extended tree structure. Using RAxML's EPA subsystem, the algorithm inserts the query sequence at various points on the tree, optimizing the branch lengths and positions to find the most likely placements. A small set of high-likelihood placements is retained for further analysis.

  3. Neighbor-Weighting Stage:
    The final stage of the MLCA algorithm is neighbor-weighting, which summarizes the placement results by calculating clade weightings for each query sequence. The algorithm evaluates the evolutionary distance between the query sequence and its closest neighboring reference sequences. Since these neighbors are already assigned to specific clades, their proximity provides evidence for the query sequence's clade assignment: the closer the neighbor, the stronger the evidence. The algorithm then assigns the query sequence to a clade if that clade's calculated weight exceeds a predefined threshold.

    This neighbor-weighting mechanism relies on the evolutionary distances in the phylogenetic tree, where shorter branch lengths indicate closer genetic relationships. By focusing on nearby reference sequences, the algorithm effectively assigns query sequences to the most appropriate clades based on genetic similarity.
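
The alignment and neighbor-weighting stages described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not GLUE's implementation. The function names, the inverse-distance weighting formula, the use of a per-placement likelihood weight, and the 0.5 threshold are all assumptions chosen for illustration; GLUE's actual weighting and cut-off may differ. The MAFFT command is only built, not executed, and the file names are placeholders.

```python
from collections import defaultdict


def mafft_add_command(query_fasta, reference_alignment):
    """Build (without running) a MAFFT call that adds query sequences to
    an existing alignment while keeping the alignment length fixed."""
    return ["mafft", "--add", query_fasta, "--keeplength", reference_alignment]


def clade_weights(placements):
    """Summarize placements as normalized clade weights.

    placements: list of (likelihood_weight, neighbors) tuples, where
    neighbors is a list of (clade, distance) pairs for nearby references.
    Assumption: each neighbor contributes likelihood_weight / distance,
    so closer neighbors give stronger evidence.
    """
    totals = defaultdict(float)
    for likelihood_weight, neighbors in placements:
        for clade, distance in neighbors:
            # Guard against zero distance (identical sequences).
            totals[clade] += likelihood_weight / max(distance, 1e-6)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {clade: weight / grand_total for clade, weight in totals.items()}


def assign_clade(weights, threshold=0.5):
    """Return the best-supported clade if its weight exceeds the
    (assumed) threshold, otherwise None for 'unassigned'."""
    if not weights:
        return None
    best = max(weights, key=weights.get)
    return best if weights[best] > threshold else None


# Hypothetical placements for one query: two candidate positions on the tree.
placements = [
    (0.8, [("AL_TREE_SRLV_A8", 0.02), ("AL_TREE_SRLV_A1", 0.10)]),
    (0.2, [("AL_TREE_SRLV_A1", 0.05)]),
]
weights = clade_weights(placements)
print(mafft_add_command("query.fasta", "reference_alignment.fasta"))
print(weights)
print(assign_clade(weights))
```

Returning `None` when no clade clears the threshold reflects the idea that a query with weak or conflicting evidence is left unassigned at that level rather than forced into the nearest clade.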

Benefits of Using MLCA for SRLV Genotyping

Because EPA places each query on a fixed reference tree instead of recalculating the entire phylogeny, MLCA within Lentivirus-GLUE combines the accuracy of likelihood-based phylogenetic placement with computational efficiency. This makes it well-suited for large-scale sequence analysis in both research and clinical settings.

