A simple tool for determining whether two BAM files contain reads sequenced from the same sample or patient by counting genotype matches at common SNPs.
BAM-matcher is most useful for comparing human whole-genome sequencing (WGS), whole-exome sequencing (WES) and RNA sequencing (RNA-seq) data, but can also be customised to compare panel data or non-human data.
Once the configuration file is set up, to compare two BAM files (sample1.bam and sample2.bam), just run:
bam-matcher.py -B1 sample1.bam -B2 sample2.bam
which will give an output like:
BAM1: sample1.bam
BAM2: sample2.bam
depth threshold: 15
____________________________________
Positions with same genotype: 243
breakdown: hom: 51
het: 192
____________________________________
Positions with diff genotype: 158
breakdown:
                      BAM 1
              | het  | hom  | subset
       -------+------+------+-------
        het   |    0 |    0 |     76
       -------+------+------+-------
BAM 2   hom   |    0 |    0 |      -
       -------+------+------+-------
        subset|   82 |    - |      -
____________________________________
Total sites compared: 401
Fraction of common: 0.605985 (243/401)
CONCLUSION: DIFFERENT SOURCES
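The summary figures in the report above follow from simple counting. The sketch below is not BAM-matcher's own code; it is a minimal illustration that assumes diploid genotypes written as "A/G" and stored in plain dictionaries keyed by (chromosome, position). The function name `compare_genotypes` is hypothetical.

```python
def compare_genotypes(gt1, gt2):
    """Count matching and differing genotypes at sites present in both inputs.

    gt1, gt2: dicts mapping (chrom, pos) -> genotype string such as "A/G".
    This layout is an assumption made for illustration only.
    """
    same_hom = same_het = diff = 0
    for site in set(gt1) & set(gt2):
        # Sort alleles so that "A/G" and "G/A" compare as equal.
        a = sorted(gt1[site].split("/"))
        b = sorted(gt2[site].split("/"))
        if a == b:
            if len(set(a)) == 1:
                same_hom += 1
            else:
                same_het += 1
        else:
            diff += 1
    same = same_hom + same_het
    total = same + diff
    fraction = float(same) / total if total else 0.0
    return {
        "same": same,
        "hom": same_hom,
        "het": same_het,
        "diff": diff,
        "total": total,
        "fraction": fraction,
    }
```

With the counts from the report, 243 matching sites out of 243 + 158 = 401 compared sites gives 243 / 401 ≈ 0.605985, the "Fraction of common" value shown.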
See the wiki page for a detailed installation guide.
Python
(version 2.7)
Python libraries
- PyVCF
- ConfigParser
- Cheetah
- pysam (requires python-dev and zlib1g-dev libraries)
- fisher (requires numpy and python-dev libraries)
Variant callers (at least one is required)
- GATK (requires Java)
- VarScan2 (requires Java and Samtools)
- Freebayes
cd /directory/path/where/bam-matcher/is/to/be/installed/
git clone https://bitbucket.org/sacgf/bam-matcher.git
This provides:
# Python scripts and libraries
bam-matcher.py
bammatcher_methods.py
bammatcher_exp.py
generate_example_data.py
# template files for configuration and HTML output
bam-matcher.conf.template
bam_matcher_html_template
# VCF files and example chromosome map file
1kg.exome.highAF.1511.vcf
1kg.exome.highAF.3680.vcf
1kg.exome.highAF.7550.vcf
hg19.chromosome_map
# directory containing example BAM files
test_data/
# miscellaneous
contributors.txt
LICENSE
requirements
README.md
To make bam-matcher.py executable from anywhere in the system, add the directory containing bam-matcher.py to your PATH variable. For example, add this line to your ~/.bashrc:
export PATH=$PATH:/path/to/bam-matcher/
The repository includes 3 VCF files which can be used for comparing human data (hg19/GRCh37).
These VCF files contain variants extracted from the 1000 Genomes Project. All are exonic and have a global allele frequency between 0.45 and 0.55, so any given sample is about as likely to carry the REF allele as the ALT allele. The files differ only in the number of variants they contain.
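For panel or non-human data, a custom site list can be built with the same selection criterion. The following is a minimal sketch, not the procedure used to produce the bundled files. It assumes the source VCF stores a global allele frequency under the INFO key `AF`, which may differ in your data, and the helper names are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AF=0.5;DP=100' into a dict.

    Flag entries without a value are ignored.
    """
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
    return info


def high_af_sites(vcf_lines, low=0.45, high=0.55):
    """Yield (chrom, pos, ref, alt) for biallelic SNPs with low <= AF <= high.

    Assumes tab-separated VCF records with the allele frequency in INFO 'AF'.
    """
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        if len(ref) != 1 or len(alt) != 1:
            continue  # skip indels and multi-allelic records
        try:
            af = float(parse_info(fields[7]).get("AF", ""))
        except ValueError:
            continue  # AF missing or not a single number
        if low <= af <= high:
            yield chrom, int(pos), ref, alt
```

To use it on a file, pass an open file handle, for example `list(high_af_sites(open("my_sites.vcf")))`, where `my_sites.vcf` is a placeholder name.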
The repository also includes several BAM files which can be used for testing (under test_data directory), as well as the expected results for various settings.
BAM-matcher requires a configuration file. The default configuration path recognised by BAM-matcher is "bam-matcher.conf" in the same directory as bam-matcher.py.
cd /path/to/bam-matcher/
cp bam-matcher.conf.template bam-matcher.conf
Then edit the file bam-matcher.conf appropriately.
If the template configuration file is missing, it can be generated with the --generate-config (-G) option:
bam-matcher.py --generate-config path_to_file_to_be_generated
At the very minimum, you will need to specify in the configuration file:
- caller: the default variant/genotype caller to use (gatk, varscan, or freebayes).
- Settings for whichever caller you have chosen. For GATK, you will need to provide the path to the GATK jar file (GATK:); for VarScan, you will need to provide both the path to the VarScan jar file and the command to call SAMtools; for Freebayes, you will need to provide the command/path to call freebayes. GATK and VarScan will also require Java.
- VCF_file: the FULL PATH to the VCF file containing the variant loci to compare. Three VCF files are provided with BAM-matcher for human hg19 data.
- REFERENCE: the reference file used for mapping the reads in the input BAM files. This should also be the same version of genome reference as used for the VCF file.
- CACHE_DIR: the path to a directory with read and write permission for all users. This is used to store cached genotype data.
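Before a first run, it can help to confirm that the minimum settings listed above are filled in. The sketch below is a standalone check, not part of BAM-matcher. It is written for Python 3's `configparser` (the module is named `ConfigParser` under Python 2.7), assumes the file is INI-style as the ConfigParser dependency suggests, and deliberately ignores section names because they are not specified here. The function name is hypothetical.

```python
import configparser

# Keys named in the list of minimum settings.
REQUIRED_KEYS = ("caller", "VCF_file", "REFERENCE", "CACHE_DIR")
KNOWN_CALLERS = ("gatk", "varscan", "freebayes")


def check_minimum_settings(config_text):
    """Return a list of problems found in the configuration text (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(config_text)

    # Collect every non-empty key/value pair, regardless of its section.
    values = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            if value and value.strip():
                values[key] = value.strip()

    problems = [
        "missing or empty: " + key for key in REQUIRED_KEYS if key not in values
    ]
    caller = values.get("caller")
    if caller is not None and caller.lower() not in KNOWN_CALLERS:
        problems.append("unknown caller: " + caller)
    return problems
```

For a real file, read it first, for example `check_minimum_settings(open("bam-matcher.conf").read())`. The check does not verify the caller-specific settings or that the paths exist.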
Most configuration settings can also be overridden at run time.
For detailed instructions on how to set up a configuration file, see the wiki page.
If configured correctly, to compare two BAM files, you just need to run:
bam-matcher.py --bam1 sample1.bam --bam2 sample2.bam -o output_report.txt
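When many BAM files need to be checked against each other, the same command can be generated for every pair. The sketch below only uses the --bam1, --bam2 and -o arguments shown above and assumes bam-matcher.py is on your PATH. The helper names and the report naming scheme (`a_vs_b.txt`) are hypothetical choices made for this example.

```python
import itertools
import os
import subprocess


def build_command(bam1, bam2, report, script="bam-matcher.py"):
    """Build the argument list for one pairwise comparison."""
    return [script, "--bam1", bam1, "--bam2", bam2, "-o", report]


def pairwise_commands(bam_files):
    """Return one command per unordered pair of BAM files."""
    commands = []
    for bam1, bam2 in itertools.combinations(bam_files, 2):
        name1 = os.path.splitext(os.path.basename(bam1))[0]
        name2 = os.path.splitext(os.path.basename(bam2))[0]
        report = "%s_vs_%s.txt" % (name1, name2)  # hypothetical naming scheme
        commands.append(build_command(bam1, bam2, report))
    return commands


def run_all(bam_files):
    """Run every comparison in turn and return the exit codes."""
    return [subprocess.call(cmd) for cmd in pairwise_commands(bam_files)]
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.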
For detailed information on runtime arguments and parameters, see the wiki page.
See the tutorial on how to test BAM-matcher using the example data.
The code is released under the Creative Commons Attribution 4.0 licence (http://creativecommons.org/licenses/by/4.0/). You are free to use and modify it for any purpose (including commercial), so long as you include appropriate attribution.
BAM-matcher: a tool for rapid NGS sample matching
Paul P.S. Wang; Wendy T. Parker; Susan Branford; Andreas W. Schreiber. Bioinformatics, 2016.
doi: 10.1093/bioinformatics/btw239
Paul (paul.wang @ sa.gov.au)