Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEATURE - Detect tandem duplications with cigar #148

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions include/iGenVar.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ struct cmd_arguments
/* -c */ clustering_methods clustering_method{hierarchical_clustering}; // default: hierarchical clustering
/* -r */ refinement_methods refinement_method{no_refinement}; // default: no refinement
// SV specifications:
/* -e TODO (irallia 19.08.21): duplication_errors */
/* -k */ uint64_t min_var_length = 30;
/* -l */ uint64_t max_var_length = 100000;
/* -m */ uint64_t max_tol_inserted_length = 50;
Expand Down
90 changes: 90 additions & 0 deletions include/modules/sv_detection_methods/analyze_cigar_method.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,96 @@

#include "structures/junction.hpp" // for class Junction

/*! \brief This function checks if the inserted bases are tandem duplicated.
*
* \param[in] config - configuration for a semi-gobal alignment
* \param[in] min_length - minimum length of variants to detect (default 30 bp,
* \param[in] sequence - suffix or prefix sequence
* \param[in] inserted_bases - the inserted bases of the possible duplication
* \param[in] is_suffix - true: suffix case, false: prefix case
* param[out] match_score - if we found a matching duplication, this value represents the maximal
* amount of matches with the reference and thus the length of the existing
* duplicated part on the reference. If there is no duplication, this value
* is 0.
* param[out] length_of_single_dupl_sequence - greatest common divisor of length of inserted sequence and length of
* maching part -> length of a single duplicated sequence
* \returns std::tie(match_score, length_of_single_dupl_sequence) - a tuple of the resulting values
*
* \details In this function the inserted bases are recursively aligned segment by segment until it has been proven that
* it is a real duplication.
* For simplicity, we only consider the suffix case here, since the prefix case works the same way:
*
* suffix_sequence AAAACCGCGTAGCGGGGCGGG
* ||||||||||
* inserted_bases GCGGGGCGGGGCGGG -> unmatched_inserted_bases: GCGGG
* -> match_score = 10
* -> length_of_single_dupl_sequence = gcd(15, 10) = 5
*
* suffix_sequence AAAACCGCGTAGCGGGGCGGG
* |||||
* unmatched_inserted_bases GCGGG
*
* -> match_score = 5, length_of_single_dupl_sequence = gcd(15, 5) = 5
*/
std::tuple<size_t, size_t> align_suffix_or_prefix(auto const & config,
int32_t const min_length,
std::span<const seqan3::dna5> & sequence,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::span<const seqan3::dna5> & sequence,
std::span<const seqan3::dna5> const sequence,

std::span<const seqan3::dna5> & inserted_bases,
bool is_suffix);

/*! \brief This function checks if the inserted bases are tandem duplicated.
*
* \param[in] query_sequence - SEQ field of the SAM/BAM file
* \param[in] length - length of inserted part, given by the CIGAR
* \param[in] pos_ref - position of the inserted part in the ref (current position)
* \param[in] pos_read - position of the inserted part in the read (current position)
* \param[in] inserted_bases - the inserted bases of the possible duplication
* \param[in] min_length - minimum length of variants to detect (default 30 bp,
* expected to be non-negative)
* \param[in, out] pos_start_dup_seq - start position of the duplicated seq (excluding itself)
* \param[in, out] pos_end_dup_seq - end position of the duplicated seq (including itself)
* \param[in, out] tandem_dup_count - the number of tandem copies of the inserted sequence
* \returns duplicated_bases - the duplicated bases of the duplication
*
* \details If the inserted bases include one or more copies of a duplicated sequence, which is suffix or prefix of
* another copy, we have a Tandem Duplication. To check this, we have to do a semi-global alignment.
*
* Case 1: The duplication (insertion) comes after the matched sequence. Thus we need to check if the inserted
* sequence matches (partly) the suffix sequence. The length of the matching part yields to the amount
* of copies, thus we can calculate the tandem_dup_count and the inserted_bases.
*
* ref AAAACCGCGTAGCGGG----------TACGTAACGGTACG
* |||||||||||||| |||||||| -> inserted sequence: GCGGGGCGGG
* read AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
*
* suffix_sequence AAAACCGCGTAGCGGG -> free_end_gaps_sequence1_leading{true},
* ||||| free_end_gaps_sequence1_trailing{false}
* inserted_bases GCGGGGCGGG -> free_end_gaps_sequence2_leading{false},
* free_end_gaps_sequence2_trailing{true}
* -> tandem_dup_count = 3, duplicated_bases = GCGGG
*
* Case 2: The duplication (insertion) comes before the matched sequence.
* ref AAAACCGCGTA----------GCGGGTACGTAACGGTACG
* ||||||||| ||||||||||||| -> inserted sequence: GCGGGGCGGG
* read AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
*
* prefix_sequence GCGGGTACGTAACGGTACG -> free_end_gaps_sequence1_leading{false},
* ||||| free_end_gaps_sequence1_trailing{true}
* inserted_bases GCGGGGCGGG -> free_end_gaps_sequence2_leading{true},
* free_end_gaps_sequence2_trailing{false}
* -> tandem_dup_count = 3, duplicated_bases = GCGGG
Comment on lines +63 to +82
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other Idea:
create suffix tree of the inserted sequence and search for longest common repeated substring without overlap (with errors) and than map this repeated substring (without errors?).

Other input: Burrows Wheeler, occurence table, FM index; reg Expression -> build minimal automat; ZIP Hoffmann code

* \see For some complex examples see detection_test.cpp.
*/
std::span<seqan3::dna5 const> detect_tandem_duplication(seqan3::dna5_vector const & query_sequence,
int32_t length,
int32_t pos_ref,
int32_t pos_read,
std::span<seqan3::dna5 const> & inserted_bases,
int32_t const min_length,
int32_t & pos_start_dup_seq,
int32_t & pos_end_dup_seq,
size_t & tandem_dup_count);

/*! \brief This function steps through the CIGAR string and stores junctions with their position in reference and read.
*
* \param[in] read_name - QNAME field of the SAM/BAM file
Expand Down
Loading