For this pipeline, you do not need to have MySQL installed (Stacks is designed to run with it, but does not need it).
All of the Perl scripts written for this pipeline are coded so that if the name of the script is entered on the command line with no other arguments, the script will print its USAGE statement with an explanation of the required input options. For example:
$ perl 1_trimFastq.pl
1_trimFastq.pl <infile.fastq> <outfile.fastq> <length>
infile.fastq = The input fastq file
outfile.fastq = The name of a fastq file for the trimmed output
length = The length to trim all reads to
There are a couple of exceptions, where a script is designed to automatically run on all files in the current directory when it is called. These scripts are pointed out whenever they appear in these instructions.
- De-multiplex and clean up reads to get individual sequences
- Collapse sequences from the same dRAD site to get individual tag sequences
- Create a “Consensus Reference” sequence based on the tags from all individuals
- Align the individual sequence data back to the “Reference” genome
- Call the SNPs for each individual and merge across all individuals
Use process_radtags from the Stacks package to sort the raw data into individual files based on barcode.
- Create a directory for the output called "Pools_Dir" and a file with a list of the barcode sequences and sample names called "5bp.pools". The format for the barcode file is 1 line per barcode, with the barcode given first, then the corresponding sample name, separated by a tab (see the example after this list).
- Specify the restriction enzyme used, which determines the cut site next to the barcode (in this example -e ecoRI), and the type of quality score encoding used in the raw data file (this depends on the machine used to sequence the data; in this case it is -E phred33).
- Retain reads with a single error in the barcode or cut site and correct the error (-r); discard reads whose quality score drops below the cutoff (-q) within a sliding window covering 11% of the read length (-w .11). This will also remove reads with a low-quality mate (for paired-end data).
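For example, a 5bp.pools file with two hypothetical 5-bp barcodes might look like this (barcode and sample name separated by a tab):
ACGTA	Indiv_1
TGCAG	Indiv_2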
Single End Example:
process_radtags -f 'raw.data.txt' -o Pools_Dir/ -b '5bp.pools' -e ecoRI -E phred33 -r -q -w .11
Paired End Example:
process_radtags -1 'raw_data_1.txt' -2 'raw_data_2.txt' -o Pools_Dir/ -b '5bp.pools' -e ecoRI -E phred33 -r -q -w .11
As of the writing of these instructions, process_radtags required all barcodes to be the same length. This means that if you have barcodes of differing lengths, you will need to make a separate barcode file for each length and run process_radtags multiple times. After doing this, you need to merge the sorted files together for each individual, and then trim all reads to the same length:
Pools_Dir$ perl mergeRename.pl 'barcode.pools' SE/PE
Pools_Dir$ perl trimFastq.pl 'Individual_1.fq' 'Individual_1_t.fq' 94
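If you have many individuals, a bash loop like the one below will trim them all in one pass. This is only a sketch; it assumes the merged per-individual files in Pools_Dir end in .fq, so adjust the file names and trim length to match your data:
for file in Pools_Dir/*.fq
do
export samplename=`basename $file .fq`
perl trimFastq.pl $file "Pools_Dir/${samplename}_t.fq" 94
done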
After trimming ALL of the individual fastq files to be the same length, use ustacks (part of Stacks) to find all unique and nearly unique (99% identical) reads for EACH individual. Specify the file type (-t fastq), the individual fastq file to process (-f Indiv.fq), the directory for the output file (-o Output_Dir), and a database ID (-i 1) (note that you do not actually need this, but the program will not run without an ID specified). In this step, require a minimum depth of 1 (a higher coverage cutoff will be used in a later step; for now, the goal is to get all tags) (-m 1), and allow only 1 mismatch between reads to call a stack (-M 1 -N 1). Finally, specify the number of processors (-p 15).
ustacks -t fastq -f 'Indiv_1_t.fq' -o Output_Dir/ -i 1 -m 1 -M 1 -N 1 -p 15
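If you have many individuals, a loop like the one below is one way to run ustacks on each trimmed file with a unique -i value. This is only a sketch; it assumes the trimmed files end in _t.fq:
i=1
for file in Pools_Dir/*_t.fq
do
ustacks -t fastq -f $file -o Output_Dir/ -i $i -m 1 -M 1 -N 1 -p 15
i=$((i+1))
done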
Save only the tags.tsv file (you can ignore the snps.tsv and alleles.tsv files since SNPs will be called with a different program later), and process them such that you save the read depth for each consensus tag and convert the tag file into fasta format. Run the perl script that does this IN THE FOLDER WITH THE TSV FILES!
Tags_Dir$ perl get_stacks_depth.pl
Now filter out reads with excessive depth (since these might be PCR duplicates instead of distinct tag sites). Do this for EACH individual file:
Tags_Dir$ perl filter_by_depth.pl 'Indiv_1_stacks.fasta' 1 50 'Indiv_1_filter.fasta'
To determine what counts as "excessive" depth for your data, you can look at a histogram of the output from get_stacks_depth.pl.
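As a quick check, a shell one-liner like the one below tabulates how many tags occur at each depth. This is only a sketch, and it assumes (hypothetically) that the depth output is a tab-delimited file with the read depth in column 2; check the actual output format of get_stacks_depth.pl before using it:
cut -f2 Indiv_1_depth.txt | sort -n | uniq -c | less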
If you have a lot of individual files, here is an example of a bash loop that will automatically go through all of them:
for file in Tags_Dir/*_stacks.fasta
do
export samplename=`basename $file _stacks.fasta`
export outname="${samplename}_filter.fasta"
perl filter_by_depth.pl $file 1 50 $outname
done
For EACH “uniqued” and depth-filtered fasta file, run CAP3 with 90% identity to identify and remove any possible paralogs within the individual files.
cap3 'Indiv_1_filter.fasta' -p 90 -o 90
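To run cap3 on every depth-filtered file at once, a loop like this could be used (a sketch that assumes the filtered files end in _filter.fasta, as above):
for file in Tags_Dir/*_filter.fasta
do
cap3 $file -p 90 -o 90
done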
Save ONLY the reads without multiple hits; these are in the singlets files produced by cap3 (named with a .cap.singlets extension). After running cap3 on all individuals, run a script IN THE FOLDER WITH THESE FILES to process the singlets files into correctly formatted fasta files:
Singlets_Dir$ perl fixFasta.pl
Start by combining all of the “fixed” individual fasta files into 1 file using the unix command cat:
cat *.fixed.fasta >Combined.fasta
Run ustacks again to get the necessary tsv files for creating the catalog, then run cstacks on the ustacks output. For cstacks, specify a batch number to be incorporated into the names of all cstacks output files (-b 604), an output directory for the cstacks files (-o Output_Dir), a maximum of 10 mismatches between tags (-n 10), the prefix for all of the ustacks output files (this can include the path; -s Ustacks_prefix), the database ID number from the ustacks run (-S 101), and the number of processors (-p 45):
ustacks -t fasta -f 'Combined.fasta' -o /Output/ -i 101 -m 1 -M 10 -p 15
This will take about 24 hours; it is best to run it on a cluster.
cstacks -b 604 -o /Output_Dir/ -n 10 -s "Ustacks_prefix" -S 101 -p 45
This should run fairly quickly, but it is still best run on a cluster.
Again, save only the tags file from the cstacks output (and not the SNPs or alleles files). The final step is to convert the tags file into a fasta file:
perl tags2fasta.pl 'catalog.tags.tsv' 'Consensus.fasta'
This step uses bwa, and is best done on a cluster so that all individuals can be aligned simultaneously. Here is an example of the steps in bwa (note that indexing of the reference only needs to be done ONCE).
Single End Example:
bwa index -p ref_prefix -a is Consensus.fasta
bwa aln -n 10 ref_prefix Indiv_1.fq >Indiv_1.sai
bwa samse ref_prefix Indiv_1.sai Indiv_1.fq >Indiv_1.sam
Paired End Example:
bwa index -p ref_prefix -a is Consensus.fasta
bwa aln -n 10 ref_prefix Indiv_1.fq_1 >Indiv_1.sai_1
bwa aln -n 10 ref_prefix Indiv_1.fq_2 >Indiv_1.sai_2
bwa sampe ref_prefix Indiv_1.sai_1 Indiv_1.sai_2 Indiv_1.fq_1 Indiv_1.fq_2 >Indiv_1.sam
You can also use bwa mem for this step; see the BWA manual for details.
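If you use bwa mem, the single end and paired end alignment commands would look something like this (with default options; the same file names as above are assumed):
bwa mem ref_prefix Indiv_1.fq >Indiv_1.sam
bwa mem ref_prefix Indiv_1.fq_1 Indiv_1.fq_2 >Indiv_1.sam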
Samtools will convert the SAM format files into BAM format and sort them. Note that the reference sequence only needs to be indexed once, but each individual BAM file must be indexed separately.
samtools faidx Ref.fasta
samtools view -O BAM Indiv_1.sam -o Indiv_1.bam
samtools sort -o Indiv_1_sorted.bam -O BAM Indiv_1.bam
samtools index Indiv_1_sorted.bam
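Here is an example bash loop that converts, sorts, and indexes all of the SAM files in one pass (a sketch that assumes the SAM files are in the current directory):
for file in *.sam
do
samplename=`basename $file .sam`
samtools view -O BAM $file -o "$samplename.bam"
samtools sort -o "${samplename}_sorted.bam" -O BAM "$samplename.bam"
samtools index "${samplename}_sorted.bam"
done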
Steps 7-11 outline the process that was originally used to process the data, but if you are running this pipeline now I would strongly recommend skipping steps 7-11 (which relied on a much older version of samtools) and proceeding with steps 6a-6c instead.
These commands come from the Samtools version 1.5 "best practices" page.
Start by adding read group information tags to your individual BAM files, to be sure that the individual sample information is retained during the merging steps later. The ID tag should be the BAM file name (minus the .bam extension), the SM tag should be the sample name (this is what will show up as the label for the sample in the VCF file later).
samtools addreplacerg -r "ID:Indiv_1_sorted" -r "SM:Indiv_1" -o Indiv_1_wRg.bam Indiv_1_sorted.bam
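A loop like the following is one way to add read groups to every sorted BAM file, deriving the ID and SM tags from the file names (a sketch; adjust it if your sample names differ from your file names):
for file in *_sorted.bam
do
samplename=`basename $file _sorted.bam`
samtools addreplacerg -r "ID:${samplename}_sorted" -r "SM:$samplename" -o "${samplename}_wRg.bam" $file
done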
Next, run mpileup on ALL of your individual BAM files to get a BCF file that contains all of the positions in the genome:
samtools mpileup -go All_sites.bcf -f Ref.fasta *_wRg.bam
If you are not familiar with command line scripting, the * character is a wild card, so the above command will automatically run on all files ending in "_wRg.bam".
Once you have a .bcf file, run bcftools to find only the variant sites, and produce the raw VCF file:
bcftools call -vmO z -o All.vcf.gz All_sites.bcf
You can download the latest version of Vcftools from its website. Vcftools offers many different filtering options; the ones used for this pipeline are the maximum number of alleles (--max-alleles 2), the quality threshold (--minQ 20), and the minimum and maximum read depths per individual (--minDP 5 --maxDP 50).
vcftools --gzvcf All.vcf.gz --max-alleles 2 --minQ 20 --minDP 5 --maxDP 50 --recode --out Filtered.vcf
This step is optional, but if you'd like to use the scripts from the associated analysis pipeline, then you will need to convert to the .snps format, which I created to make differentiating between populations more straightforward.
perl vcf2snps.pl 'Filtered.vcf' 'Filtered.snps' 'populations.txt'
The "populations.txt" file is a tab-delimited file with 1 row per sample and 2 columns: column 1 should be a numeric population code, and column should be the sample name.
The next step creates a file with coverage information for every possible site in the reference “genome”, based on the individual pileup files generated by samtools. (Note that this step can be a bit slow, since files are being merged one at a time.)
perl makeRefpileup.pl 'Ref.fasta' 'Ref.pileup'
samtools mpileup -f Ref.fasta Indiv_1_sorted.bam >Indiv_1.pileup
perl run_covMerge.pl 'Ref.pileup' /Population_Pileup_Dir/
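The per-individual mpileup command (the second command above) can also be wrapped in a loop, for example (a sketch that assumes the sorted BAM files are in the current directory):
for file in *_sorted.bam
do
samplename=`basename $file _sorted.bam`
samtools mpileup -f Ref.fasta $file >"$samplename.pileup"
done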
Run covMerge separately for each population to make it easier to keep track of files and sample order. Be sure to check the path to covMerge.pl in the run_covMerge.pl script!
The last ‘.cov’ file made should contain the combined info. for all individuals in the population.
paste Pop_1.cov Pop_2.cov >All_Pops.cov
perl cleanup_cov.pl 'All_Pops.cov' 'Final_Cleaned.cov'
Now you will use the vcfutils.pl and bcftools utilities distributed with SAMtools to call SNPs for EACH individual. Use a quality threshold of 20 (-Q 20), a depth cut-off of 50 (-D 50), and a minimum of 1 alternate allele (-a 1). It is easiest to use a script that will run on an entire population directory at once:
perl runVCFutils.pl /Population_BAM_Dir/ 'Ref.fasta'
The above script is running the following samtools commands:
samtools mpileup -uD -f Ref.fasta $bamfile | bcftools view -bvcg - >$bcffile
bcftools view $bcffile | vcfutils.pl varFilter -Q 20 -D50 -a 1 >$vcffile
These new files will contain homozygous and heterozygous genotype calls (rather than the raw VCF format). The files are made individually, but you should run the script within the directory containing ALL of the VCF files:
Population_VCF$ perl ../Scripts/vcf2genotype.pl
These scripts first determine all possible SNP sites, and then get a collective genotypes file (now with a single genotype call per individual) for all individuals at every site:
perl getSNPpositions.pl 'All.sitelist' <NumIndivs> /VCF_Dir1/ /VCF_Dir2/ … /VCF_DirN/
perl ../Scripts/filterSNPs.pl 'All.sitelist' 'All.genotypes'
Finally, merge the collective genotypes file with the merged coverage file from step 7, to fill in reference base information for individuals with X’s:
sort -n All.genotypes >All_sorted.genotypes
sort -n Final_Cleaned.cov >All_Pops_sorted.cov
perl mergeSNPcov.pl 'All_sorted.genotypes' 'All_Pops_sorted.cov' 'All.snps' <NumInd>
Once you have created the BAM files in step 6, you can proceed with any SNP calling program of your choice, as they will all recognize the BAM format. BUT, if you did not align to an assembled reference genome and instead used the consensus tags as your reference as part of the de novo pipeline, then I would strongly recommend that you continue to use samtools and bcftools for SNP calling.
Other SNP callers that I have used, such as GATK, seem to have a hard time when there are a large number of "contigs" in the reference genome (regardless of how many total base pairs these add up to). It seems to have something to do with a position indexing step in those programs: with ~400K tags in the consensus sequence, GATK seemingly never gets past the reference processing step and runs out of time and/or memory before it ever starts calling SNPs.