-
Notifications
You must be signed in to change notification settings - Fork 3
Parse CIGAR and MD:Z string to get mutations (in Python3, with examples)
In this post, I would like to summary about how to get mutations from the SAM files using CIGAR string and MD:Z string. All the functions here can be found in the python script locate_mut.py
- How to read CIGAR string Link
- MD:Z tag:
MD:Z:[0-9]+(([A-Z]|^[A-Z]+)[0-9]+)* String for mismatching positions. The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.
- For MD:Z tag, it ignores insertions. We can combine these two to call mutations in a given read
- Insertion:
[ref_pos|inserted_bases|ins]
- Deletion:
[ref_pos|deleted_bases|del]
- SNP:
[ref_pos|ref_base|alt_base]
- We can see if there is an insertion by checking the CIGAR tag, if there's no insertion in a read, we can get all the mutations solely from the MD:Z tag.
- Note here all the coordinates in the output list are 1-based.
- In this example, base 141 of this read was deleted. that would convert to the reference base 1939
Example 1:
## cigar -- [(140, 'M'), (1, 'D'), (9, 'M')]
## MD:Z -- 140^G9
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGC
# sam read pos: 1799
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGCCTTTGCC
## mutations: ['1939|G|del']
- In this example, base 124 was deleted, 5bp after that, 3 bases were changed (ref:TGG)
Example 2:
## cigar -- [(123, 'M'), (1, 'D'), (27, 'M')]
## MD:Z -- 123^A5T0G0G19
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC
# sam read - pos: 1799
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCTGTTCGTTAAGGACGAGGCCTTTGCCC
# mutations: ['1922|A|del', '1928|T|G', '1929|G|T', '1930|G|T']
- In this example, base 69 was changed from
C
toA
, followed by 73 bases match and 1bp deletion
Example 3:
## CIGAR -- [(142, 'M'), (1, 'D'), (7, 'M')]
## MDZ -- 68C73^C7
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC
# sam read - pos: 1799
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGAATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCTTTGCC
# mutations: ['1867|C|A', '1941|C|del']
- how to divide CIGAR into a list of tuples
# can also do it with re
import cigar
CIGAR = "142M1D7M"
cigar_list = list(cigar.Cigar(CIGAR).items())
print(cigar_list)
# [(142, 'M'), (1, 'D'), (7, 'M')]
- Function to locate mutations in a given Sam file entry
def parse_mdz(cigar, mdz_raw, ref, read, pos, qual):
"""
cigar: CIGAR string from SAM file [(142, 'M'), (1, 'D'), (7, 'M')]
mdz: MD:Z tag from SAM file '68C73^C7'
ref: Reference sequence (template sequence)
read: Read from SAM file
read_pos: Mapping position of this read. Where the first base maps to
qual: Quality score sequence of this read
Get mutation from MD:Z string
Insertions are not represented by MD:Z tag
Insertions are processed from CIGAR string
"""
# check soft clipped bases
# soft clipped means that the bases on the read were clipped when mapping
clip = 0
# We can use this to adjust the starting point of the sequence
# remove soft clipped bases and adjust for location
if cigar[0][1] =="S": # it always occurs in the beginning or at the end
# adjust read position
clip += cigar[0][0]
# user regular expression to split the MD:Z string into chunks with mutations
# i.e `13C14^T4T0G46` -- ['13C', '14^T', '4T', '0G']
mdz = re.findall('.*?[.ATCG]+', mdz_raw)
r = re.compile("([0-9]+)([a-zA-Z\^]+)")
read_pos = 0 + clip # track positions on the read, it starts after the clipped region
mut_list = [] # store mutations
deleted_len = 0 # track number of bases that were deleted
for i in mdz: # parsing the read from left to right
# for each item in the MD:Z string
# split the item into number and letter
m = r.match(i)
match_len = int(m.group(1))
base = m.group(2)
if "^" not in base:
# this means a single nt change
# base = reference base
# pos+read_pos-clip = read starting point + # of bp mapped since that point - clipped base (because they are not on reference sequence)
# read_pos-deleted_len = # of bases mapped since the beginnig of this read - number of bases deleted
read_pos += match_len # update read position as we are moving to the right
mut_list.append(str(pos+read_pos-clip)+"|"+base+"|"+read[read_pos-deleted_len])
read_pos += 1 # add 1 for the current base
else: # deletion
read_pos += match_len # update read position as we are moving to the right
# pos+read_pos-clip = read starting point + # of bp mapped since that point - clipped base (because they are not on reference sequence)
# base[1:] = bases that were deleted
mut_list.append(str(pos+read_pos-clip)+"|"+base[1:]+"|del")
# update number of deleted bases
deleted_len += len(base[1:])
read_pos += len(base[1:])
return mut_list
-
In this case, we need to parse CIGAR string with MD:Z tag to find insertions
-
The MDZ part is slightly different because we also want to count for insertions when we try to identify SNPs
-
In this example, the insertion happened at base 93 but it doesn't show on the MD:Z tag. Instead it shows 95 bp mapped to the reference.
Example 1:
## CIGAR -- [(11, 'M'), (1, 'D'), (92, 'M'), (1, 'I'), (3, 'M'), (1, 'D'), (42, 'M')]
## MDZ -- 11^G95^G42
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC
# sam read: pos 1799
CTTGTCAATGTAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAAGTGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC
My approach is to get the insertion position and length first and store them into a list. We can obtain this information by looking at the CIGAR string. In the example above, the insertion happened after the reference base 103(11+92)
. (See the full function in locate_mut.py
)
# check CIGAR string and get a list contains the insertion position and inserted length
mut_list = []
# check insertion position in cigar string
# This position corresponds to the reference sequence
# i.e if pos = 110 means the base 110 on the read was inserted (all 1-based indices)
ins_pos_ref = 0 # on ref sequence where was the insertion
ins_pos = [] # store [ins_pos, ins_len]
mapped = 0
deleted = 0
for i in cigar:
if i[1] != "I": # not insertion
ins_pos_ref += int(i[0])
if i[1] == "M": # track number of bases mapped
mapped += int(i[0])
if i[1] == "D": # track number of bases deleted
deleted += int(i[0])
else:
# based on number of bases mapped, get the inserted base from the read
ins_base = read[mapped:mapped+int(i[0])]
# keep the insertion position and inserted lenth in a list
ins_pos.append([ins_pos_ref, int(i[0])])
# add the insertion to mut_list
mut_list.append(str(pos+mapped+deleted)+"|"+ins_base+"|ins")
Then we can merge this with the MDZ string. When parsing the MDZ string, we need to keep tracking if insertions occurred when it shows 'mapped' on MDZ This becomes a simpler problem if we think about that we have two input lists:
ins_pos = [[10, 1], [103, 1]] # list sorted base on the first element of each pair
pos = [14, 30, 70, 5]
Now we are trying to see at which point the sum of items in list_b is greater than items in list_a[i][0]
, if so, add list_a[i][1]
to the sum. This means that at which point we need to count the insertions as we are moving the cursor to the right.
For example, in this case the output would look like:
# [14+1, 14+1+30, 14+1+30+70+1, 14+1+30+70+1+5]
# 14+1 because 14 is greater than 10
# 14+1+30+70+1 because 14+1+30+70 is greater than 103
[15, 45, 116, 121]
Here is the code to locate mutations given insertion positions and insertion length. It is similar to the first function (they can be merged together into one function)
# parse snp and deletion from mdz string
# given the inserted position on reference sequence
r = re.compile("([0-9]+)([a-zA-Z\^]+)")
read_pos = 0 + clip # on the sam read
deleted_len = 0 # how many bp was deleted
map_pos = 0 # how many bp was mapped (non insertion track)
iter_ins = iter(ins_pos) # create iterator to go through ins_pos
ins = next(iter_ins, None) # if we reach the end, return None
inserted_pos = 0
for i in mdz:
# for each item in the MD:Z string
# split the item into number and letter
m = r.match(i)
match_len = int(m.group(1))
base = m.group(2)
map_pos += match_len # update how many bp are mapped
if ins and map_pos >= ins[0]:
inserted_pos += ins[1]
ins = next(iter_ins, None)
if "^" not in base:
# this means a single nt change
read_pos += match_len
mut_list.append(str(pos+read_pos-clip)+"|"+base+"|"+read[read_pos+inserted_pos-deleted_len])
map_pos += len(base)
read_pos += 1
else: # deletion
read_pos += match_len
mut_list.append(str(pos+read_pos-clip)+"|"+base[1:]+"|del")
deleted_len += len(base[1:])
read_pos += len(base[1:])
map_pos -= len(base[1:])
return mut_list