Allow bidirectional coverage filtering #16

Open
apcamargo opened this issue Aug 25, 2024 · 1 comment
Labels
feature request New feature or request

Comments

@apcamargo

When dereplicating sequences, it is crucial to cluster only genomes that overlap substantially in both directions (i.e., high bidirectional coverage), so that a sequence entirely contained within a longer one is not clustered with it. For example:

idx1   idx2   id1     id2     tani       gani       ani        cov        num_alns   len_ratio
----   ----   -----   -----   --------   --------   --------   --------   --------   ---------
0      1      seq_1   seq_2   0.581627   0.412891   0.991158   0.416575   3          2.479452 
1      0      seq_2   seq_1   0.581627   1.000000   1.000000   1.000000   2          0.403315 

In the example above, no combination of --cov and --ani prevents these two sequences from being clustered together, because the seq_2 row has both ani and cov equal to 1.0 and therefore passes any cutoff. Although setting a high --tani cutoff could resolve this specific case, it would not be effective in more complex scenarios. For instance, if the goal is to connect pairs of genomes with a minimum bidirectional coverage of 75% and no minimum ANI, using --tani wouldn't work.
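To make the request concrete, here is a minimal sketch of what a bidirectional coverage filter could look like when applied to the rows of the table above. It is only an illustration: the function name `bidirectional_pairs` and the tuple layout are hypothetical, and this is not how Vclust or Clusty is implemented.

```python
from collections import defaultdict

# Rows copied from the example table above: (id1, id2, ani, cov)
rows = [
    ("seq_1", "seq_2", 0.991158, 0.416575),
    ("seq_2", "seq_1", 1.000000, 1.000000),
]


def bidirectional_pairs(rows, min_cov):
    """Return unordered pairs whose coverage is >= min_cov in BOTH directions.

    Hypothetical helper for illustration only.
    """
    cov_by_pair = defaultdict(dict)
    for id1, id2, _ani, cov in rows:
        key = tuple(sorted((id1, id2)))
        cov_by_pair[key][id1] = cov  # coverage as seen from id1

    kept = []
    for key, covs in cov_by_pair.items():
        # Require a row for each direction, and the weaker one must pass.
        if len(covs) == 2 and min(covs.values()) >= min_cov:
            kept.append(key)
    return kept


# With a 75% bidirectional cutoff the contained sequence is not linked.
print(bidirectional_pairs(rows, 0.75))
# A permissive cutoff keeps the pair.
print(bidirectional_pairs(rows, 0.40))
```

Taking the minimum of the two directional coverages is the design choice here: a pair is only as well covered as its weaker direction, which is what separates a contained sequence from a genuinely overlapping one.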

A related question: how does Clusty handle multiple edges between node pairs? For instance, with --metric ani --ani 0.1, both edges in the example would pass the filters. In such cases, how does Clusty determine the clustering weight? Does it use the maximum ANI value, the mean, the first/last value in the file, something else?

@aziele
Contributor

aziele commented Sep 23, 2024

Thank you for the great suggestion! In Vclust v1.1.1, we've introduced query and reference coverage (qcov and rcov) in the output, allowing for bidirectional coverage filtering to handle cases like the one you described.

When it comes to clustering with multiple edges between node pairs, Clusty selects the maximum value for clustering (e.g., ANI if --metric ani). All clustering algorithms in Clusty, except for the Leiden algorithm, are threshold-based and do not rely on weights (i.e., a genome is clustered if it meets the similarity threshold with a centroid, closest, or furthest member).
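As a rough illustration of the "maximum value" rule described above (not Clusty's actual code), the sketch below collapses the two directed rows from the example into a single undirected edge. The helper name `collapse_edges` and the tuple layout are assumptions made for this example.

```python
# Rows from the example table in this thread: (id1, id2, ani)
rows = [
    ("seq_1", "seq_2", 0.991158),
    ("seq_2", "seq_1", 1.000000),
]


def collapse_edges(rows, metric_index, threshold=0.0):
    """Keep one value per unordered pair: the maximum that passes the threshold.

    Hypothetical helper for illustration only.
    """
    best = {}
    for row in rows:
        value = row[metric_index]
        if value < threshold:
            continue  # this directed edge is filtered out
        key = tuple(sorted((row[0], row[1])))
        if key not in best or value > best[key]:
            best[key] = value
    return best


# Both rows pass a 0.1 ANI threshold; the larger ANI is kept for the pair.
print(collapse_edges(rows, metric_index=2, threshold=0.1))
```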

For the Leiden algorithm, if you want to include coverage information in the edge weight, you can use the gani metric. This is calculated as the number of identical nucleotides divided by the length of the query sequence (i.e., ANI multiplied by query coverage), which aligns with the approach used by IMG/PR.
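The relationship between gani, ANI, and query coverage can be checked against the numbers posted earlier in this thread. The short sketch below assumes that the cov column in that table is the query coverage of the row's first sequence. The reported values are rounded to six decimals, so the comparison uses a small tolerance.

```python
import math

# (ani, cov, reported gani) for the two rows of the example table
rows = [
    (0.991158, 0.416575, 0.412891),  # seq_1 vs seq_2
    (1.000000, 1.000000, 1.000000),  # seq_2 vs seq_1
]

for ani, cov, reported_gani in rows:
    gani = ani * cov  # gani = ANI multiplied by query coverage
    # Tolerance accounts for the six-decimal rounding of the inputs.
    assert math.isclose(gani, reported_gani, abs_tol=1e-5)
    print(f"ani={ani} cov={cov} reported gani={reported_gani}")
```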

@aziele aziele added the feature request New feature or request label Sep 23, 2024