Filter FASTQ files against all 1000 Genomes sequencing data using k-mers, keeping only reads that contain k-mers absent from 1000 Genomes.
rdxon is available as a pre-compiled, statically linked binary from rdxon's GitHub release page, as a Singularity container SIF file, or as a minimal Docker container. Alternatively, you can build rdxon from source:
git clone --recursive https://github.com/tobiasrausch/rdxon.git
cd rdxon/
make all
Download the 1000 Genomes k-mer maps here: http://gear-genomics.embl.de/data/rdxon/
To filter an input FASTQ file against the 1000 Genomes sequencing data, run:
rdxon filter -x kmer.x.map -y kmer.y.map -o <output.fq.gz> <input.fq.gz>
You can also write all rare k-mers that are absent from 1000 Genomes to a file:
rdxon filter -x kmer.x.map -y kmer.y.map -u <kmer.gz> -o <output.fq.gz> <input.fq.gz>
For paired-end data you can run Read1 and Read2 in parallel and then concatenate the output FASTQ files.
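The parallel approach can be sketched as a small shell function. This is only an illustration: the function name `filter_pair_parallel` and the output file names are hypothetical, and the sketch assumes `rdxon` is on your `PATH` with the k-mer maps in the current directory. It uses only the single-end `rdxon filter` call shown above.

```shell
# Sketch: filter Read1 and Read2 as two parallel jobs, then merge the results.
# The function name and the output file names are hypothetical.
filter_pair_parallel() {
    r1=$1; r2=$2; out=$3
    # One background job per read file, using the single-end filter call.
    rdxon filter -x kmer.x.map -y kmer.y.map -o "${out}.1.fq.gz" "$r1" &
    pid1=$!
    rdxon filter -x kmer.x.map -y kmer.y.map -o "${out}.2.fq.gz" "$r2" &
    pid2=$!
    # Wait for both jobs and stop if either one failed.
    wait "$pid1" || return 1
    wait "$pid2" || return 1
    # Concatenated gzip members form a valid gzip stream, so no recompression is needed.
    cat "${out}.1.fq.gz" "${out}.2.fq.gz" > "${out}.fq.gz"
}

# Usage: filter_pair_parallel read1.fq.gz read2.fq.gz sample.filtered
```

Because each read file is filtered independently, the merged output does not keep mates together.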
For certain downstream applications you may want to retain proper paired-ends. The paired-end mode of the filter subcommand is:
rdxon filter -x kmer.x.map -y kmer.y.map -o <outprefix> <read1.fq.gz> <read2.fq.gz>
For tumor-normal sequencing in cancer genomics, you can also filter for reads that contain rare and somatic k-mers:
rdxon somatic -x kmer.x.map -y kmer.y.map -o <output.fq.gz> <tumor.fq.gz> <control.fq.gz>
The somatic subcommand is also available in paired-end mode:
rdxon somatic -x kmer.x.map -y kmer.y.map -o <outprefix> <tumor.1.fq.gz> <tumor.2.fq.gz> <control.1.fq.gz> <control.2.fq.gz>
Approximate running time and memory usage:
Whole-exome sequencing: ~1 hour and ~4 GB RAM (single CPU, one job for Read1 and Read2)
Whole-genome sequencing: ~6 hours and ~4 GB RAM (single CPU, one job for Read1 and Read2)
The 1000 Genomes high-coverage data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1. All cell lines were obtained from the Coriell Institute for Medical Research and from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. More information regarding the 1000 Genomes high-coverage data and data reuse is available here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/.