Skip to content

Pipeline for gene candidate discovery based on mutagenesis. Featured in gene cloning research published in Nature (DOIs: 10.1038/s41467-020-14937-2, 10.1038/s41467-021-23738-0, 10.1038/s41477-021-00971-5), Molecular Plant (DOI: 10.1016/j.molp.2021.05.010) and New Phytologist (DOI: 10.1111/nph.17075).

Notifications You must be signed in to change notification settings

TC-Hewitt/MuTrigo

Repository files navigation

MuTrigo

This pipeline was optimised for mutagen induced SNP/indel discovery in short-read sequence data from exome capture experiments. However, flexible options allow wider application in most sequencing experiments involving alignment and variant calling against a reference. It provides a streamlined method to quickly identify contigs, transcripts or genomic regions on which multiple independent mutations can be detected across multiple individuals in whole exome or genome read mapping data.

Tool descriptions

Noisefinder.py

Regions rich in mismatches/poor coverage after read alignment can often signify misalignment or mixed alignment due to allelism, polyploidy, or presence/absence variation. Given a pileup file, noisefinder reports regions containing a density of SNVs above a user defined threshold over a given min length and min read depth (prints to STDOUT).

SNPlogger.py

Will parse an mpileup file and log all SNPs and indels that satisfy parameters. Final tally printed to STDOUT. Outfile is formatted as tab sep fields: seqid, position(1based), polymorphic-type, frequency(float). Compatible with STDIN.

SNPtracker.py

Finds sequence IDs/regions with coinciding polymorphic features across multiple SNPlogger generated files

NoiseOutStats.py

get statistics from output files of Noisefinder.py

Example Workflow

This specific workflow is designed to discover sequences/contigs that contain mutagen induced variation occuring independently across a number of mutants. In mutagenesis experiments for which single gene knockouts can be selected for phenotypically, such a finding is strongly indicative that the target gene has been isolated given a sufficient number of mutants. It is based on generating a de novo assembly from wild-type NGS reads followed by aligning mutant NGS reads independently against the wild-type assembly and recording any mismatches between each mutant and the wild-type. Ideally, the wild-type should be parental to the mutants and all be near-isogenic lines in order to minimise noise due to normal genetic variation. This pipeline is inspired by similar pipelines such as MutantHunter (https://github.com/steuernb/MutantHunter), but takes an alternate approach with added flexibility.

Prerequisites

Python 3.x.x

see https://www.python.org/downloads/

BWA or other suitable aligner

see http://bio-bwa.sourceforge.net/

Samtools 1.5 or later

see http://samtools.sourceforge.net/

De novo assembly software

can be of your choosing but should be relatively stringent to avoid collapsing highly homologous yet distinct sequences as a single consensus. The CLC assembly cell (https://www.qiagenbioinformatics.com/products/clc-assembly-cell/) or MaSuRCA assembler (http://www.genome.umd.edu/masurca.html) generate suitable assemblies.

Preprocessing

clean raw data

if data is not already cleaned/trimmed, software such as Trimmomatic can be used (http://www.usadellab.org/cms/?page=trimmomatic) for example if your data is paired-end then for each fastq file you would do something like:

trimmomatic PE -threads 8 -phred33 readsIn_1.fq.gz readsIn_2.fq.gz readsOut_1.clean.fq.gz readsOut_1.unpaired.fq.gz readsOut_2.clean.fq.gz readsOut_2.unpaired.fq.gz ILLUMINACLIP:adapter_seqs.fasta:2:30:10:8:TRUE LEADING:28 TRAILING:28 MINLEN:20

de novo assembly of wild-type

as stated earlier, use tool of your choice, but ensure a decent N50 as a quality assembly is the crux of the whole procedure. Note that some assemblers prefer the input to be raw, uncleaned data (MaSuRCA).

mapping to wild-type

as well as the mutants, the wild-type reads should also be mapped to their assembly to ensure greater accuracy in later steps. Again, choice of alignment software is at your discretion but BWA and samtools offer a straightforward process. Initially index the assembly:

bwa index WT_assembly.fasta
samtools faidx WT_assembly.fasta

then run the following steps for each mutant and wild-type:

bwa aln assembly.fasta read1.fastq > read1.aln
bwa aln assembly.fasta read2.fastq > read2.aln
bwa sampe assembly.fasta read1.aln read2.aln read1.fastq read2.fastq > raw.sam
samtools view -hub -o raw.bam raw.sam (or replace raw.sam with '-' if piping from STDIN)
samtools sort -o sorted.bam -O BAM raw.bam
samtools rmdup sorted.bam rmdup.bam
samtools index rmdup.bam

Mutant discovery steps

Using example files for wildtype "WT.rmdup.bam" plus mutants "mut1.rmdup.bam", "mut2.rmdup.bam", "mut3.rmdup.bam"...

1) create pileup of WT bam only.

samtools mpileup -BQ0 -a -f WT_assembly.fasta WT.rmdup.bam > WT.pileup

2) run Noisefinder on WT pileup.

python Noisefinder.py -i WT.pileup > WT.noise.log

3) run SNPlogger on WT and mutants using WT.noise.log to mask rubbish regions.

Run "python SNPlogger.py -h" to see additional options as noise.log files generated from mutants can be used as features themselves in later steps.

for WT, SNPlogger can be run on already created pileup:

python SNPlogger.py -i WT.pileup -b WT.noise.log -o WT.snp.log

for mutants, pileups can be created on the fly and piped directly to SNPlogger.

in a loop:

for i in mut{1..3}; do samtools mpileup -a -BQ0 -f WT_assembly.fasta ${i}.rmdup.bam | python SNPlogger.py -b WT.noise.log -o ${i}.snp.log; done

or in a parallel job

note that SNPlogger will print a summary of SNP statistics to the screen upon completion of each file. To save these stats, use ">" to redirect standard output to a file:

for i in m{1..3}; do samtools mpileup -a -BQ0 -f WT_assembly.fasta ${i}.bam | python SNPlogger.py -b WT.noise.log -o ${i}.snp.log > ${i}.stats.txt; done

4) run SNPtracker on snp.log files.

use "-w" for WT file(s), "-m" for mutant files. SNPtracker can still work without a WT or with >1 WT. This step is relatively fast and can complete in seconds:

python SNPtracker.py -w WT.snp.log -m mut1.snp.log mut2.snp.log mut3.snp.log

run "python SNPtracker.py -h" to see options. E.g. to only report polymorphisms that are C>T, G>A or indels, use "-s" option. To filter polymorphisms based on frequency (default=0.8), use "-f" option. To create detailed reports in addition to the summary report, use "-v T" option. To tolerate N mutants with identical SNPs (e.g. siblings), use "-t N" option:

python SNPtracker.py -w WT.snp.log -m mut1.snp.log mut2.snp.log mut3.snp.log -s C\>T G\>A indel -f 0.9 -v T -t 2

This will create a "SNPtracker_summary.html" report whose output might look like the following:

<summary>

<parameters>
wildtypes: WT.snp.log
mutants: mut1.snp.log, mut2.snp.log, mut3.snp.log
selected: C>T, G>A, indel
filtered: 0.9
proximal: OFF
tolerate: 2

<polymorphic in 3 mutants>
contig_5565     (mut1.snp.log, mut2.snp.log, mut3.snp.log)

<polymorphic in 2 mutants>
contig_8725     (mut1.snp.log, mut3.snp.log)
contig_1252     (mut2.snp.log, mut3.snp.log)

If the "-v" option is set to true, the detailed reports generated display the coordinate and type of polymorphisms found in each of the mutants for a given candidate contig/sequence. A report is generated for each N>1. i.e. if there are 5 mutants, SNPtracker will create an N5.report, N4.report, N3.report and N2.report. That is unless a lower limit is specified with the "-n" option. An example of an N2.report may look like the following:

<polymorphic in 2 mutants>

<contig_8725>
mut1.snp.log [(1200, G>A, 0.98)]
mut3.snp.log [(3210, C>T, 0.91), (4128, indel+5, 1.0)]

<contig_1252>
mut2.snp.log [(852, C>T, 0.95), (3069, G>A. 0.90)]
mut3.snp.log [(2567, indel-8, 1.0)]

Alignments for these candidate contigs can also be inspected visually upon loading the bam files into a genome browser such as IGV (http://software.broadinstitute.org/software/igv/).

SNPtracker also provides a "-p/--proximal" option that tells SNPtracker to find features that reside close to each other within a user defined window (min=1000 bases) rather than only coinciding on a particular contig. This is suitable if working with large scaffolds or pseudomolecules that may contain multiple genes. For example, to report features only within max 10kb of each other:

python SNPtracker.py -w WT.snp.log -m mut1.snp.log mut2.snp.log mut3.snp.log -p 10000

About

Pipeline for gene candidate discovery based on mutagenesis. Featured in gene cloning research published in Nature (DOIs: 10.1038/s41467-020-14937-2, 10.1038/s41467-021-23738-0, 10.1038/s41477-021-00971-5), Molecular Plant (DOI: 10.1016/j.molp.2021.05.010) and New Phytologist (DOI: 10.1111/nph.17075).

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages