Variant Normalization
A genomic variant is represented by a locus (chromosome + position), a reference sequence and a list of alternates.
Because of the VCF specification, it is common for the reference and alternate fields to contain extra bases that are not needed to represent the variant. It is completely valid to specify a variant as chr1:100:AC:AT, which is exactly the same variant as chr1:101:C:T.
The same genomic variant can therefore be represented in more than one way, so the representation must be normalized in order to determine whether two representations describe the same variant or different ones. Failing to do so will frequently result in inaccurate analyses.
Variant normalization applies several steps to each variant to achieve a full normalization.
Because there is no standard for chromosome naming, it is common to see different names for the same chromosome, depending on the tools used, that differ only by a prefix added to the name. For example, we can see chr[1-22,X,Y] for the One Thousand Genomes Project. Since the field is already known to contain a chromosome, no prefix is needed for each variant. The known chromosome prefixes are: chrom, chrm, chr and ch.
Reference and alternate trimming consists of removing the trailing (right trimming) and leading (left trimming) bases that are identical in both alleles.
Left aligning a variant means shifting its start position to the left, while keeping the same alleles, until it is no longer possible to do so.
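The trimming step can be sketched in Python as shown here. This is a hedged illustration only: `trim_alleles` is a hypothetical helper, it works on a single reference/alternate pair, and it only uses the bases present in the record (it does not consult a reference genome).

```python
def trim_alleles(position: int, reference: str, alternate: str):
    """Trim shared bases, right side first, then left side (hypothetical helper)."""
    # Right trimming: drop identical trailing bases; the position is unchanged.
    while reference and alternate and reference[-1] == alternate[-1]:
        reference, alternate = reference[:-1], alternate[:-1]
    # Left trimming: drop identical leading bases; each removed base moves
    # the start position one step to the right.
    while reference and alternate and reference[0] == alternate[0]:
        reference, alternate = reference[1:], alternate[1:]
        position += 1
    return position, reference, alternate


print(trim_alleles(100, "AC", "AT"))    # (101, 'C', 'T')
print(trim_alleles(100, "CTC", "CCC"))  # (101, 'T', 'C')
print(trim_alleles(100, "AT", "A"))     # (101, 'T', '')
```

An empty string stands for an empty allele, which is what remains of a pure insertion or deletion once all shared bases are removed.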
- Right and Left trimming
chr1 . 100 CTC CCC
chr1 . 101 T C
- Indels
Insertions and deletions are represented with empty alleles when they are not mixed with SNVs.
Deletion of one base T at position 101
chr1 . 100 AT A
chr1 . 101 TC C
chr1 . 101 T -
Insertion of one C at position 201 (between 200 and 201)
chr1 . 200 G GC
chr1 . 201 A CA
chr1 . 201 - C
- Ambiguous trimming and left alignment
When a deletion or insertion occurs within a sequence of repeated nucleotides, the position of the variant can be ambiguous. In this example, there are four possible ways to normalize the variant:
chr1 . 100 CTCTCA CTCA
chr1 . 100 CT -
chr1 . 101 TC -
chr1 . 102 CT -
chr1 . 103 TC -
Left alignment is guaranteed by performing the right trimming first. This variant will be normalized as:
chr1 . 100 CT -
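The ambiguity, and the effect of trimming from the right first, can be checked with a short Python sketch. Both function names are hypothetical, and the sketch only considers the bases present in the record, not the surrounding reference genome.

```python
def equivalent_deletions(start: int, reference: str, alternate: str):
    """List every (position, deleted bases) that turns reference into alternate."""
    size = len(reference) - len(alternate)
    placements = []
    for offset in range(len(reference) - size + 1):
        # Remove `size` bases at this offset and compare with the alternate.
        if reference[:offset] + reference[offset + size:] == alternate:
            placements.append((start + offset, reference[offset:offset + size]))
    return placements


def right_trim(reference: str, alternate: str):
    """Drop identical trailing bases; the start position does not change."""
    while reference and alternate and reference[-1] == alternate[-1]:
        reference, alternate = reference[:-1], alternate[:-1]
    return reference, alternate


print(equivalent_deletions(100, "CTCTCA", "CTCA"))
# [(100, 'CT'), (101, 'TC'), (102, 'CT'), (103, 'TC')]
print(right_trim("CTCTCA", "CTCA"))
# ('CT', '')
```

Right trimming consumes the whole alternate allele and leaves CT at the original position 100, which is the leftmost of the equivalent placements. Trimming from the left first would instead consume the shared leading bases and push the variant to position 103.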