Data folder on Linospadix: /data_vol/wolf/Dypsis_subtribes/
original_data
: raw read files with original naming, cf. sampling.xlsxoriginal_data_renamed
: renamed read files for compatibility with SECAPR (see 1. below). Contents deleted after trimming to save space on disk.fastqc_results
: results of fastqc check run via SECAPRraw
: fastqc results for raw reads (as inoriginal_data_renamed
)trimmed
: fastqc results after trimming (as intrimmed
)
trimmed
: trimmed readsassembly
: HybPiper resultscoverage
: output of coverage trimming stepseq_sets2
: sequence sets after coverage trimming and length filteringalignments
: aligned sequence setsalignments_exon
: aligned sequence sets with mapped exon sequences for partitioningoptrimal
: working directory for dynamic alignment trimming with optrimAliqtree
: initial gene trees (pre TreeShrink)speciestree
: speciestree
Repository location on GIS07: ~/scripts/dypsidinae
Analysis folder on Macbook: ~/Documents/WOLF/PROJECTS/65 Dypsis systematics paper/analysis
Rename read files to four-digit names for compatibility with SECAPR. NB this has now options for ingroup and outgroup - check before running.
-
Run
rename4secapr.py
to generate a bash scriptrename4secapr.sh
with file copy commands. Requiressampling.xls
(adjust path in script!). This is the reason why a bash script is generated rather than usingsubprocess
, as the sampling table is on my local computer but the renaming needs to be done on the server. -
Run
rename4secapr.sh
from the data folder (see above). This creates a renamed copy of all files inoriginal_data
inoriginal_data_renamed
.
SECAPR quality check (!has to be run from within secapr_env!)
secapr quality_check --input original_data_renamed --output fastqc_results/raw
This takes about 90 minutes on the server.
PDF results stored in repo in fastqc_results/raw
.
Run in original_data renamed
:
ls *R1* | parallel -j 4 ~/scripts/dypsidinae/trimmer.sh
Trimmomatic settings used: ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:1:true LEADING:3 TRAILING:3 MAXINFO:40:0.5 MINLEN:36
Trimmomatic v. 0.39
This takes <90min on the server.
Combine paired reads and singles again for comparability (created temporary directory trimmed_for_fastqc
- this is deleted again after this step to save space). Run from within trimmed
:
ls *READ1.fastq | parallel ~/scripts/dypsidinae/combine_posttrim_4_fastqc.sh
secapr quality_check --input trimmed_for_fastqc --output fastqc_results/trimmed
Run in trimmed
:
ls *1-single.fastq | parallel -j 16 ~/scripts/dypsidinae/single_combiner.sh
This merges ####_clean-READ1-single.fastq
and ####_clean-READ2-single.fastq
into a single file, ####_clean-READ12-single.fastq
.
Run in trimmed
:
ls *READ2.* > namelist.txt
sed -i'.old' -e 's/_clean-READ2.fastq//g' namelist.txt
mv namelist.txt ../assembly/
rm namelist.txt.old
Run ~/scripts/dypsidinae/piper.sh
from within assembly
.
NB: Check that the correct target file (PhyloPalms) is selected in piper.sh
!
From within assembly
run:
python /usr/local/bioinf/HybPiper/get_seq_lengths.py /data_vol/wolf/PhyloPalms/PhyloPalms_loci_renamed_794-176_HEYcorrected.fasta namelist_full.txt dna > test_seq_lengths.txt
python /usr/local/bioinf/HybPiper/hybpiper_stats.py test_seq_lengths.txt namelist.txt > test_stats.txt
Check for paralog warnings:
while read i
do
echo $i
python /usr/local/bioinf/HybPiper/paralog_investigator.py $i 2>> paralogreport.txt
done < namelist.txt
sed -i'.old' -e's/ paralogs written for /;/g' paralogreport.txt
In R:
data <- read.table("paralogreport.txt", sep=";")
table(as.vector(data$V2))
write.table(unique(as.vector(data$V2)), file="paralogs", row.names=FALSE)
gene | No. paralogs |
---|---|
EGU105059594 | 19 |
EGU105042168 | 12 |
EGU105043827 | 10 |
EGU105049690 | 7 |
EGU105044846 | 6 |
HEY362 | 6 |
EGU105046168 | 5 |
EGU105059636 | 5 |
EGU105057015 | 3 |
EGU105059479 | 3 |
EGU105044758 | 2 |
EGU105033626 | 1 |
EGU105058687 | 1 |
HEY125 | 1 |
HEY728 | 1 |
sed -i'.old' -e's/"//g' paralogs
sed -i '1d' paralogs
Exclude one sample (1012) with bad recovery:
sed -e '/1012/d' namelist.txt >> namelist_reduced.txt
Run intronerate.py
:
while read name
do
echo $name >> intronerate_out_dev.txt
python /usr/local/bioinf/HybPiper/intronerate_dev.py --prefix $name &>> intronerate_out_dev.txt
done < namelist_reduced.txt
NB: intronerate_dev.py
is the development version of this script, as the release version causes an error. See here.
Create directory coverage
for coverage trimming output.
In assembly
, run:
while read name; do ~/scripts/dypsidinae/coverage.py $name; done < namelist.txt
This script does the following:
- Gather all contigs from each sample in one fasta file:
coverage/sample.fasta
- Map paired and unpaired reads to that fasta using BWA mem
- Deduplicate reads using Picard
- Calculate depth using samtools
- Mask/strip any bases with coverage <2
- Generate a new trimmed sample-level fasta:
coverage/sample_trimmed.fasta
Then, in coverage
, run:
ls *trimmed.fasta > filelist.txt
~/scripts/dypsidinae/samples2genes.py > outstats.csv
This script does the following:
- Split the sample-level fasta files up and sorts their sequences into genes.
- Remove any sequences shorter than 150bp or 20% of the median sequence length of the gene
- Generate new gene fasta files in
seq_sets2
These are ready for alignment.
Create directory alignments
.
Run from seq_sets2
:
Clean up sequence names:
for f in *.FNA; do (sed -i'.old' -e $'s/-HEY[0-9]\+[p,n,s,e]* [0-9]\+-HEY[0-9]\+[p,n,s,e]*_HEY[0-9]\+[p,n,s,e]* [0-9]\+-HEY[0-9]\+[p,n,s,e]*//g' $f); done
for f in *.FNA; do (sed -i'.old' -e $'s/-EGU[0-9]\+[p,n,s,e]* [0-9]\+-EGU[0-9]\+[p,n,s,e]*_EGU[0-9]\+[p,n,s,e]* [0-9]\+-EGU[0-9]\+[p,n,s,e]*//g' $f); done
rm *.old
~/scripts/dypsidinae/occupancy_stats.py
Align:
for f in *.FNA; do (linsi --adjustdirectionaccurately --thread 16 $f > ../alignments/${f/.FNA}_aligned.fasta); done
In alignments
, run:
~/scripts/dypsidinae/exon_mapper.py
This creates new alignments in alignments_exon
that contain the original alignments plus the exon sequences of the two species that had the highest recovery success at each locus.
Copy alignments to new directory optrimal
(this is necessary as the alignments will get deleted):
mkdir optrimal
cp alignments_exon/*.fasta optrimal
In that directory, generate cutoff_trim.txt
with desired -gt
values to be tested.
Then, from optrimal
:
Prepare alignments:
# replace n's with gaps in alignmenets - this will otherwise trip up TrimAl
for f in *.fasta; do (sed -i'.old' -e 's/n/-/g' $f); done
# change back "exo" to "exon"
for f in *.fasta; do (sed -i'.old' -e 's/exo-/exon/g' $f); done
Run optrimal:
# create summary tables for all thresholds specified
~/scripts/dypsidinae/PASTA_taster.sh
# create summary table for the raw alignments
python3 /home/au265104/.local/lib/python3.6/site-packages/amas/AMAS.py summary -f fasta -d dna -i *.fasta
mv summary.txt summary_0.txt
rm *.fasta
Rscript --vanilla ~/scripts/dypsidinae/optrimAl.R
NB: optrimAL.R
was modified as to NOT discard alignments with data loss exceeding 30% (cf. Shee et al. 2020). Excessive "data loss" is probably an artefact of alignment error.
Create directory iqtree
and copy all trimmed alignments from optrimal
to this directory.
Remove paralogous loci:
while read l
do
rm ${l}_aligned.fasta
done < ../assembly/paralogs
Then run:
for f in *.fasta; do(sed -i'.old' -e 's/ [0-9]\+ bp//g' $f); done
rm *.old
~/scripts/dypsidinae/partitioner.py --smoother 10
for f in *_part.txt; do (cp $f ${f/_part.txt}_clean.part); done
ls *clean.fasta | parallel -j 6 ~/software/iqtree-2.0.6-Linux/bin/iqtree2 -s {} -T AUTO -ntmax 4 -p {.}.part -B 1000
NB: one gene (EGU105046518) had no intron, resulting in an empty intron partition. The tree for this had to be run manually:
~/software/iqtree-2.0.6-Linux/bin/iqtree2 -s EGU105046518_aligned_clean.fasta -T AUTO -ntmax 4 -B 1000
Create directory speciestree
.
Remove genetrees that cannot be rooted and thus cannot be used downstream (in iqtree
):
mkdir noroot
for f in *.treefile; do (~/scripts/dypsidinae/remove_noroot.py $f); done
From iqtree
, run:
for f in *.treefile
do
~/scripts/dypsidinae/rooter.py $f
nw_ed temp.tre 'i & (b<30)' o >> ../speciestree/genetrees.tre
rm temp.tre
done
Then, in speciestree
, run:
java -jar ~/software/Astral/astral.5.7.3.jar -i genetrees.tre -o astral_tree.tre 2> astral.log
~/scripts/dypsidinae/renamer.py ../rename.csv astral_tree.tre astral_tree_renamed.tre
java -jar ~/software/Astral/astral.5.7.3.jar -q astral_tree.tre -i genetrees.tre -o astral_tree_full_annot.tre -t 2 2> annotation.log
(NB: make sure to first remove the alignments of the genetrees that have been excluded due to lacking outgroup)
In iqtree
, run:
python3 /home/au265104/.local/lib/python3.6/site-packages/amas/AMAS.py summary -f fasta -d dna -i *_clean.fasta
mv summary.txt summary_all.txt
mkdir stats # copy alignments and partition files to separate dir for splitting into exons and introns
cp *.part stats
cp *_clean.fasta stats
cd stats
rm EGU105046518* # remove the one alignment that is exon only
for f in *.part # reformat partition files for AMAS
do
sed -i'.old' -e's/DNA, //g' $f
done
for f in *clean.fasta # split alignments into intron and exon
do
python3 /home/au265104/.local/lib/python3.6/site-packages/amas/AMAS.py split -f fasta -d dna -i $f -l ${f/_clean.fasta}_clean.part -u fasta
done
cp ../EGU105046518_aligned_clean.fasta EGU105046518_aligned_clean_exon-out.fas # "rescue" exon-only gene
python3 /home/au265104/.local/lib/python3.6/site-packages/amas/AMAS.py summary -f fasta -d dna -i *exon-out.fas
mv summary.txt summary_exon.txt
python3 /home/au265104/.local/lib/python3.6/site-packages/amas/AMAS.py summary -f fasta -d dna -i *intron-out.fas
mv summary.txt summary_intron.txt