FEATURE - Detect tandem duplications with cigar #148

Draft. Irallia wants to merge 4 commits into master from FEATURE/detect_tandem_duplications_with_CIGAR.


Irallia
Collaborator

@Irallia Irallia commented Aug 5, 2021

Resolves #166
With this PR we can now detect tandem duplications in the CIGAR string. For now, we only collect tandem duplications that contain no errors. A follow-up PR will allow errors as well, so I left some TODOs in the code.

@Irallia Irallia self-assigned this Aug 5, 2021
codecov bot commented Aug 5, 2021

Codecov Report

Merging #148 (4a44217) into master (74ddfb5) will decrease coverage by 0.31%.
The diff coverage is 94.73%.

@@            Coverage Diff             @@
##           master     #148      +/-   ##
==========================================
- Coverage   98.40%   98.09%   -0.32%     
==========================================
  Files          19       19              
  Lines         878      944      +66     
==========================================
+ Hits          864      926      +62     
- Misses         14       18       +4     
Impacted Files Coverage Δ
...ules/sv_detection_methods/analyze_cigar_method.cpp 95.83% <94.52%> (-4.17%) ⬇️
src/variant_detection/variant_output.cpp 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Irallia Irallia force-pushed the FEATURE/detect_tandem_duplications_with_CIGAR branch 4 times, most recently from a0408ac to 7c8e74b Compare August 10, 2021 17:28
@Irallia Irallia force-pushed the FEATURE/detect_tandem_duplications_with_CIGAR branch 5 times, most recently from f8e8809 to 5cd7ed5 Compare August 19, 2021 14:26
@Irallia Irallia requested a review from marehr August 19, 2021 14:26
auto & res = *results.begin();
// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
Collaborator Author

Careful: modulo behaves unexpectedly with negative values!
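
To make the concern concrete, here is a minimal sketch of how the built-in `%` treats a negative score. The scoring scheme (+1 per match, -100 per mismatch, fewer than 100 matches) is an assumption inferred from the `% 100` in the diff and is not confirmed anywhere in this PR. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Built-in % truncates toward zero, so the remainder takes the sign of the dividend.
constexpr std::int32_t truncated_remainder(std::int32_t score, std::int32_t base)
{
    return score % base;
}

// Remainder shifted into [0, base); only valid for a positive base.
constexpr std::int32_t non_negative_remainder(std::int32_t score, std::int32_t base)
{
    return ((score % base) + base) % base;
}

// Mirrors `size_t matches = score % 100;`: a negative remainder wraps around
// to a huge unsigned value when it is stored in a size_t.
constexpr std::size_t remainder_as_size_t(std::int32_t score, std::int32_t base)
{
    return static_cast<std::size_t>(truncated_remainder(score, base));
}

// ASSUMED scoring: 5 matches and 2 mismatches give 5 * 1 + 2 * (-100) = -195.
static_assert(truncated_remainder(-195, 100) == -95);    // not the 5 matches we wanted
static_assert(non_negative_remainder(-195, 100) == 5);   // recovers the match count
static_assert(remainder_as_size_t(-195, 100) > 1000u);   // wrapped around
```

Under that assumed scheme, the shifted remainder recovers the match count, while the plain `%` does not once the score drops below zero.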

Comment on lines +63 to +82
 * ref   AAAACCGCGTAGCGGG----------TACGTAACGGTACG
 *         ||||||||||||||          ||||||||  -> inserted sequence: GCGGGGCGGG
 * read    AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
 *
 * suffix_sequence AAAACCGCGTAGCGGG       -> free_end_gaps_sequence1_leading{true},
 *                            |||||          free_end_gaps_sequence1_trailing{false}
 * inserted_bases             GCGGGGCGGG  -> free_end_gaps_sequence2_leading{false},
 *                                           free_end_gaps_sequence2_trailing{true}
 * -> tandem_dup_count = 3, duplicated_bases = GCGGG
 *
 * Case 2: The duplication (insertion) comes before the matched sequence.
 * ref   AAAACCGCGTA----------GCGGGTACGTAACGGTACG
 *         |||||||||          |||||||||||||  -> inserted sequence: GCGGGGCGGG
 * read    AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
 *
 * prefix_sequence      GCGGGTACGTAACGGTACG  -> free_end_gaps_sequence1_leading{false},
 *                      |||||                   free_end_gaps_sequence1_trailing{true}
 * inserted_bases  GCGGGGCGGG                -> free_end_gaps_sequence2_leading{true},
 *                                              free_end_gaps_sequence2_trailing{false}
 * -> tandem_dup_count = 3, duplicated_bases = GCGGG
Collaborator Author

Other idea:
Build a suffix tree of the inserted sequence, search for the longest repeated substring without overlap (allowing errors), and then map this repeated substring (without errors?).

Other input: Burrows-Wheeler transform, occurrence table, FM index; regular expression -> build a minimal automaton; ZIP / Huffman coding
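
As a baseline to compare such ideas against, the error-free case can be sketched with a plain period check instead of a suffix tree or an alignment. This is not what the PR implements. It only handles exact repeats where the insertion follows the matched flank (Case 1 in the comment above), and `smallest_period` and `tandem_copy_count` are hypothetical names.

```cpp
#include <cstddef>
#include <string_view>

// Smallest p such that `sequence` consists of whole copies of its first p characters.
constexpr std::size_t smallest_period(std::string_view sequence)
{
    for (std::size_t period = 1; period < sequence.size(); ++period)
    {
        if (sequence.size() % period != 0)
            continue; // only whole copies count

        bool repeats = true;
        for (std::size_t i = period; i < sequence.size() && repeats; ++i)
            repeats = sequence[i] == sequence[i - period];

        if (repeats)
            return period;
    }
    return sequence.size(); // no shorter unit found: the sequence is its own unit
}

// Copies of the repeat unit inside the insertion, plus one if the reference
// flank directly before the insertion ends with the same unit.
constexpr std::size_t tandem_copy_count(std::string_view flank_before, std::string_view inserted)
{
    if (inserted.empty())
        return 0;

    std::size_t const period = smallest_period(inserted);
    std::string_view const unit = inserted.substr(0, period);
    bool const flank_matches = flank_before.size() >= period &&
                               flank_before.substr(flank_before.size() - period) == unit;

    return inserted.size() / period + (flank_matches ? 1 : 0);
}

// The example from the code comment: unit GCGGG, three copies in total.
static_assert(smallest_period("GCGGGGCGGG") == 5);
static_assert(tandem_copy_count("AAAACCGCGTAGCGGG", "GCGGGGCGGG") == 3);
```

The sketch does not cover insertions whose repeat unit is rotated relative to the flank, or repeats with mismatches. Those are the cases where the alignment-based approach or an index structure would be needed.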

*/
std::tuple<size_t, size_t> align_suffix_or_prefix(auto const & config,
int32_t const min_length,
std::span<const seqan3::dna5> & sequence,
Collaborator Author

Suggested change
- std::span<const seqan3::dna5> & sequence,
+ std::span<const seqan3::dna5> const sequence,

Member

@marehr marehr left a comment

some comments :)

// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
size_t mismatches = (res.score() - matches) * (-1);
Member

Isn't this the same as:

mismatches = floor(res.score() / 100) * 100;

?
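
One way to check this is to evaluate both expressions side by side. The sketch below uses signed 32-bit integers throughout (an assumption; the actual type of `res.score()` is not shown in this PR) and hypothetical function names. Because C++ integer division truncates toward zero, `score - score % 100` always equals `(score / 100) * 100`, so the two expressions have the same magnitude but opposite sign.

```cpp
#include <cstdint>

// The expression from the diff, evaluated with signed integers.
constexpr std::int32_t mismatch_term_from_diff(std::int32_t score)
{
    std::int32_t const matches = score % 100; // truncates toward zero
    return (score - matches) * (-1);
}

// The expression from the review comment. Integer division already truncates,
// so wrapping it in floor() would not change the result.
constexpr std::int32_t mismatch_term_from_review(std::int32_t score)
{
    return (score / 100) * 100;
}

static_assert(mismatch_term_from_diff(-195) == 100);
static_assert(mismatch_term_from_review(-195) == -100);
static_assert(mismatch_term_from_diff(42) == mismatch_term_from_review(42)); // both 0
```

Note that with a floating-point division, `floor(-1.95)` would be `-2`, so the result would change again. Which variant is intended depends on how the score is composed.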

Comment on lines +22 to +24
std::span<seqan3::dna5 const> & sequence,
std::span<seqan3::dna5 const> & inserted_bases,
Member

const

auto & res = *results.begin();
// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
Member

Can score be negative?

@Irallia Irallia force-pushed the FEATURE/detect_tandem_duplications_with_CIGAR branch from 5cd7ed5 to abbf672 Compare October 25, 2021 13:49
@Irallia Irallia force-pushed the FEATURE/detect_tandem_duplications_with_CIGAR branch from abbf672 to 5301059 Compare November 2, 2021 13:20
@Irallia Irallia force-pushed the FEATURE/detect_tandem_duplications_with_CIGAR branch from 5301059 to 4a44217 Compare January 28, 2022 12:11
Development

Successfully merging this pull request may close these issues.

[FEATURE] Call Tandem Duplications from CIGAR string
2 participants