Skip to content

Latest commit

 

History

History
53 lines (29 loc) · 3.49 KB

duplicate_marking_methods.MD

File metadata and controls

53 lines (29 loc) · 3.49 KB

Duplicate Marking Methods

Samtools markdup

(I checked the source code at https://github.com/samtools/samtools/blob/5fcc7a4fa822bb216030b6d6803e846f99b7ad41/bam_markdup.c to verify.)

  • Old version was named as "samtools rmdup"

  • It works with both single end and paired end sequencing data.

  • Excluded reads (by default): Secondary, supplementary, unmapped, QC failed, reads with read length bigger than 300 bases.

  • If the read is soft clipped, it calculates 5’ unclipped start position and uses it.

POS corresponds to the position of first CIGAR “consuming reference”. Soft clipping does not consume reference. Therefore if we have soft clipping at the beginning of the read, the POS is the position after soft clipping.

Source

  • They use a hashing method. In the hash they keep chromosome id, unclipped 5’ start position and orientation of the read.

If we use the option -d to give optical distance threshold, it uses this threshold to check the differences between x coordinates and y coordinates. For example: If we use -d 850; for this example Read01 x=100 y=200 Read02 x=900 y=1000 It will check x1-x2 AND y1-y2; they are smaller than or equal to our threshold it will mark them as optical duplicates using DT tag. In this case x1-x2=800 < 850 ,y1-y2=800 < 850. Note that it requires both xdiff and ydiff to be <= our threshold. These reads are optical duplicates to each other.

  • It does not take inter-tile duplicates into account. It does the comparisons within each tile (tile+swath combination).

  • Outputs estimated library size. Does not output a histogram (preseq format).

  • How it handles the paired end reads

Creates a hash of the current read and its pair Uses the unclipped start (or end based on orientation), reference id, orientation and whether the current read is leftmost of the pair Requires users to use samtools fixmate -m first. Markdup uses the tags fixmate sets: MC (mate cigar) and MS (mate score) to decide which read to mark as duplicate. Keeps the read with the highest score as “original”

Picardtools MarkDuplicates