
Variant Normalization

Jacobo Coll Moragón edited this page Nov 9, 2015 · 8 revisions

Overview

A genomic variant is represented by a locus (chromosome + position), a reference sequence and a list of alternates.

Because of the VCF specification, it is common for the reference and alternate fields to contain extra bases that are not needed to represent the variant. It is completely valid to specify a variation as chr1:100:AC:AT, which is exactly the same variant as chr1:101:C:T.

The number of possible ways to represent the same genomic variant is potentially infinite, so the representation must be normalized in order to determine whether two representations describe the same variant or different ones.

Steps

Variant normalization performs several steps on each variant to achieve a full normalization.

Chromosome naming:

Because there is no standard for chromosome naming, it is common to see different names for the same chromosome, depending on the tools used, that differ only in a prefix added to the name. For example, we can see chr[1-22,X,Y] for the One Thousand Genomes Project. Since it is already known that the field names a chromosome, there is no need to add a prefix for each variant, so known prefixes are removed. The list of known chromosome prefixes is: chrom, chrm, chr and ch.

Reference/Alternate Trimming:

In each example below, the first line is the original variant and the second line is its normalized form. The columns are chromosome, ID, position, reference and alternate. An empty allele is written as -.

Simple trimming

chr1   .   100     CTC     CCC
chr1   .   101     T       C

Deletions

chr1   .   100     AT      A
chr1   .   101     T       -

Insertions

chr1   .   100     A       AT
chr1   .   101     -       T

Ambiguous trimming

chr1   .   100     AAA     A
chr1   .   100     AA      -

Complex trimming

chr1   .   100     ATC     ACCC
chr1   .   101     T       CC
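
The trimming examples above can be reproduced with a short Python sketch. It is a minimal illustration under stated assumptions, not the project's implementation: the Variant type and the trim function are hypothetical names, and the order of operations (shared trailing bases removed first, then shared leading bases, advancing the position by one for each leading base removed) is inferred from the ambiguous trimming example, where the position stays at 100. Empty alleles are written as -, as in the examples.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical container for a single-alternate variant."""
    chromosome: str
    position: int
    reference: str
    alternate: str


def trim(variant: Variant) -> Variant:
    """Remove bases shared by reference and alternate (sketch)."""
    pos, ref, alt = variant.position, variant.reference, variant.alternate

    # Trailing bases first: removing them does not move the position.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Then leading bases: each removed base shifts the position by one.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    # An allele that became empty is written as "-".
    return Variant(variant.chromosome, pos, ref or "-", alt or "-")


print(trim(Variant("chr1", 100, "CTC", "CCC")))  # simple trimming
print(trim(Variant("chr1", 100, "AT", "A")))     # deletion
print(trim(Variant("chr1", 100, "A", "AT")))     # insertion
print(trim(Variant("chr1", 100, "AAA", "A")))    # ambiguous trimming
```

Trimming trailing bases first is what keeps the ambiguous case at position 100. Trimming leading bases first would instead give position 101 for the same deletion. The sketch does not handle inputs whose reference and alternate are identical.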

Multi-allelic split: