Skip to content
pd3 edited this page May 8, 2014 · 20 revisions

Installing samtools and bcftools

git clone --branch=bcftools+calling git://github.com/samtools/htslib.git
git clone git://github.com/samtools/bcftools.git
git clone git://github.com/samtools/samtools.git
cd bcftools; make; cd ..
cd samtools; make;

Variant calling via mpileup
samtools mpileup -uf ref.fa aln.bam | bcftools call -mv -Oz > calls.vcf.gz
Note that the calling can be made more sensitive/restrictive by using different prior,
the default is bcftools call -P1e-3.

Variant filtering
Variant filtering is an area of active development and is not easy. The annotations produced by variant callers provide only indirect hints about quality of the calls, an approach which worked for one dataset may not work for another. Nowadays most powerful seem machine learning approaches such as SVM (not implemented in bcftools), see an example of SVM filtering pipeline here.

The least one can do is to apply fixed-threshold filtering. One can get a feel for typical distributions of annotations from a set of known good calls vs all calls. If you don’t already have a pipeline for this, use bcftools query to extract the annotations and plot manually.

A good measure of the callset quality is often ts/tv, the ratio of transitions and transversions. Try the following command with different QUAL thresholds
bcftools filter -i'%QUAL>20' calls.vcf.gz | bcftools stats | grep TSTV

Another useful metrics is sequencing depth (DP bigger than twice the average depth indicates problematic regions and is often enriched for artefacts); the minimum number of high-quality non-reference reads (DV); proximity to indels (-g), etc. To give a concrete example, the following filter seemed to work quite well for one particular dataset (human data, exomes):

bcftools filter -sLowQual -g3 -G10 \
-e'%QUAL<10 || (RPB<0.1 && %QUAL<15) || (AC<2 && %QUAL<15) || %MAX(DV)<=3 || %MAX(DV)/%MAX(DP)<=0.3' \
calls.vcf.gz

Consensus calling via vcf2fq
samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq

Obtaining the list (or number) of samples from a VCF
To get the list of samples:
bcftools query -l file.vcf.gz
To get the number of samples:
bcftools query -l file.vcf.gz | wc -l

Querying a VCF
Retrieve the CHROM, POS and FORMAT/GT field from a VCF, include only lines where ID column is annotated
bcftools query -i'%ID!="."' -f'%CHROM %POS[ %GT]\n' file.vcf.gz

Clone this wiki locally