Skip to content

Parse CIGAR and MD:Z string to get mutations (in Python3, with examples)

Ryoga Li edited this page Jan 22, 2020 · 10 revisions

In this post, I would like to summary about how to get mutations from the SAM files using CIGAR string and MD:Z string. All the functions here can be found in the python script locate_mut.py

CIGAR and MD:Z string in SAM file

  1. How to read CIGAR string Link
  2. MD:Z tag:

MD:Z:[0-9]+(([A-Z]|^[A-Z]+)[0-9]+)* String for mismatching positions. The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.

  1. For MD:Z tag, it ignores insertions. We can combine these two to call mutations in a given read

The output contains 3 types of mutations

  1. Insertion: [ref_pos|inserted_bases|ins]
  2. Deletion: [ref_pos|deleted_bases|del]
  3. SNP: [ref_pos|ref_base|alt_base]

Parsing the MD:Z tag (without insertions)

  • We can see if there is an insertion by checking the CIGAR tag, if there's no insertion in a read, we can get all the mutations solely from the MD:Z tag.
  • Note here all the coordinates in the output list are 1-based.
  1. In this example, base 141 of this read was deleted. that would convert to the reference base 1939
Example 1:
## cigar -- [(140, 'M'), (1, 'D'), (9, 'M')]
## MD:Z  -- 140^G9
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGC
# sam read pos: 1799
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGCCTTTGCC
## mutations: ['1939|G|del']
  1. In this example, base 124 was deleted, 5bp after that, 3 bases were changed (ref:TGG)
Example 2: 
## cigar -- [(123, 'M'), (1, 'D'), (27, 'M')]
## MD:Z  -- 123^A5T0G0G19
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC
# sam read - pos: 1799
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCTGTTCGTTAAGGACGAGGCCTTTGCCC
# mutations: ['1922|A|del', '1928|T|G', '1929|G|T', '1930|G|T']
  1. In this example, base 69 was changed from C to A, followed by 73 bases match and 1bp deletion
Example 3: 
## CIGAR -- [(142, 'M'), (1, 'D'), (7, 'M')]
## MDZ   -- 68C73^C7
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC
# sam read - pos: 1799
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGAATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCTTTGCC
# mutations: ['1867|C|A', '1941|C|del']
  • how to divide CIGAR into a list of tuples
# can also do it with re
import cigar
CIGAR = "142M1D7M"
cigar_list = list(cigar.Cigar(CIGAR).items())
print(cigar_list)
# [(142, 'M'), (1, 'D'), (7, 'M')]
  • Function to locate mutations in a given Sam file entry
def parse_mdz(cigar, mdz_raw, ref, read, pos, qual):
  """
  cigar: CIGAR string from SAM file [(142, 'M'), (1, 'D'), (7, 'M')]
  mdz: MD:Z tag from SAM file '68C73^C7'
  ref: Reference sequence (template sequence) 
  read: Read from SAM file
  read_pos: Mapping position of this read. Where the first base maps to
  qual: Quality score sequence of this read
        
  Get mutation from MD:Z string
  Insertions are not represented by MD:Z tag
  Insertions are processed from CIGAR string
  """
  # check soft clipped bases
  # soft clipped means that the bases on the read were clipped when mapping
  clip = 0
  # We can use this to adjust the starting point of the sequence 
  # remove soft clipped bases and adjust for location 
  if cigar[0][1] =="S": # it always occurs in the beginning or at the end
     # adjust read position
     clip += cigar[0][0]

  # user regular expression to split the MD:Z string into chunks with mutations
  # i.e `13C14^T4T0G46` -- ['13C', '14^T', '4T', '0G']
  mdz = re.findall('.*?[.ATCG]+', mdz_raw)
  r = re.compile("([0-9]+)([a-zA-Z\^]+)")

  read_pos = 0 + clip # track positions on the read, it starts after the clipped region
  mut_list = [] # store mutations 
  deleted_len = 0 # track number of bases that were deleted
  for i in mdz: # parsing the read from left to right
      # for each item in the MD:Z string 
      # split the item into number and letter
      m = r.match(i)
      match_len = int(m.group(1))
      base = m.group(2)
      if "^" not in base:
         # this means a single nt change
         # base = reference base
         # pos+read_pos-clip = read starting point + # of bp mapped since that point - clipped base (because they are not on reference sequence)
         # read_pos-deleted_len = # of bases mapped since the beginnig of this read - number of bases deleted
         read_pos += match_len # update read position as we are moving to the right
         mut_list.append(str(pos+read_pos-clip)+"|"+base+"|"+read[read_pos-deleted_len])
         read_pos += 1 # add 1 for the current base

      else: # deletion
         read_pos += match_len # update read position as we are moving to the right
         # pos+read_pos-clip = read starting point + # of bp mapped since that point - clipped base (because they are not on reference sequence)
         # base[1:] = bases that were deleted
         mut_list.append(str(pos+read_pos-clip)+"|"+base[1:]+"|del")
         # update number of deleted bases
         deleted_len += len(base[1:])
         read_pos += len(base[1:])

   return mut_list

Parsing the MD:Z tag (with insertions)

  • In this case, we need to parse CIGAR string with MD:Z tag to find insertions

  • The MDZ part is slightly different because we also want to count for insertions when we try to identify SNPs

  • In this example, the insertion happened at base 93 but it doesn't show on the MD:Z tag. Instead it shows 95 bp mapped to the reference.

Example 1:
## CIGAR -- [(11, 'M'), (1, 'D'), (92, 'M'), (1, 'I'), (3, 'M'), (1, 'D'), (42, 'M')]
## MDZ   -- 11^G95^G42
# reference read
CTTGTCAATGTGAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAGTGGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC
# sam read: pos 1799
CTTGTCAATGTAAGGGTGAAAACATCACCAATGCCCCTGAACTGCAGCCGAATGCTGTCACTTGGGGCATCTTCCCTGGGCGAGAGATCATCCAGCCCACCGTAAGTGATCCCGTCAGCTTCATGTTCTGGAAGGACGAGGCCTTTGCC

My approach is to get the insertion position and length first and store them into a list. We can obtain this information by looking at the CIGAR string. In the example above, the insertion happened after the reference base 103(11+92). (See the full function in locate_mut.py)

# check CIGAR string and get a list contains the insertion position and inserted length

mut_list = []

# check insertion position in cigar string
# This position corresponds to the reference sequence
# i.e if pos = 110 means the base 110 on the read was inserted (all 1-based indices)

ins_pos_ref = 0 # on ref sequence where was the insertion
ins_pos = [] # store [ins_pos, ins_len]

mapped = 0
deleted = 0
for i in cigar:
  if i[1] != "I": # not insertion
    ins_pos_ref += int(i[0])
    if i[1] == "M": # track number of bases mapped
      mapped += int(i[0])
    if i[1] == "D": # track number of bases deleted
      deleted += int(i[0])
  else:
    # based on number of bases mapped, get the inserted base from the read
    ins_base = read[mapped:mapped+int(i[0])]
    # keep the insertion position and inserted lenth in a list
    ins_pos.append([ins_pos_ref, int(i[0])]) 
    # add the insertion to mut_list
    mut_list.append(str(pos+mapped+deleted)+"|"+ins_base+"|ins")

Then we can merge this with the MDZ string. When parsing the MDZ string, we need to keep tracking if insertions occurred when it shows 'mapped' on MDZ This becomes a simpler problem if we think about that we have two input lists:

ins_pos = [[10, 1], [103, 1]] # list sorted base on the first element of each pair
pos = [14, 30, 70, 5]

Now we are trying to see at which point the sum of items in list_b is greater than items in list_a[i][0], if so, add list_a[i][1] to the sum. This means that at which point we need to count the insertions as we are moving the cursor to the right.

For example, in this case the output would look like:

# [14+1, 14+1+30, 14+1+30+70+1, 14+1+30+70+1+5]
# 14+1 because 14 is greater than 10
# 14+1+30+70+1 because 14+1+30+70 is greater than 103
[15, 45, 116, 121]

Here is the code to locate mutations given insertion positions and insertion length. It is similar to the first function (they can be merged together into one function)

# parse snp and deletion from mdz string
# given the inserted position on reference sequence
r = re.compile("([0-9]+)([a-zA-Z\^]+)")
        
read_pos = 0 + clip # on the sam read
deleted_len = 0 # how many bp was deleted
map_pos = 0 # how many bp was mapped (non insertion track)
        
iter_ins = iter(ins_pos) # create iterator to go through ins_pos
ins = next(iter_ins, None) # if we reach the end, return None
        
inserted_pos = 0
for i in mdz:
  # for each item in the MD:Z string 
  # split the item into number and letter
  m = r.match(i)
  match_len = int(m.group(1))
  base = m.group(2)

  map_pos += match_len # update how many bp are mapped 
  if ins and map_pos >= ins[0]:
    inserted_pos += ins[1]
    ins = next(iter_ins, None)
            
  if "^" not in base:
    # this means a single nt change
    read_pos += match_len 
    mut_list.append(str(pos+read_pos-clip)+"|"+base+"|"+read[read_pos+inserted_pos-deleted_len])
    map_pos += len(base)
    read_pos += 1
  else: # deletion
    read_pos += match_len
    mut_list.append(str(pos+read_pos-clip)+"|"+base[1:]+"|del")
    deleted_len += len(base[1:])
    read_pos += len(base[1:])
    map_pos -= len(base[1:])
        
return mut_list