Assembly steps

Data QC

# Run FastQC on every compressed FASTQ file in the current directory
for file in *.fastq.gz; do
  fastqc --threads 16 "$file"
done

MaSuRCA parameters and input files were configured in the config file sr_config.txt, and the assembly was run as shown in the masurca.slurm file.
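
For orientation, a MaSuRCA configuration file lists the read libraries in a DATA section and the run settings in a PARAMETERS section. The sketch below writes a minimal example to a separate file. The library prefixes, standard deviations, file paths, and parameter values are placeholders chosen for illustration, not the contents of the actual sr_config.txt.

```shell
# Write an illustrative MaSuRCA config (NOT the project's sr_config.txt).
# Library lines: two-letter prefix, mean insert size, std. dev., forward reads, reverse reads.
# All paths and numeric values below are hypothetical placeholders.
cat > sr_config.example.txt <<'EOF'
DATA
PE= pa 250 50 /path/to/pe_R1.fastq.gz /path/to/pe_R2.fastq.gz
JUMP= ja 9000 1000 /path/to/mp9k_R1.fastq.gz /path/to/mp9k_R2.fastq.gz
END

PARAMETERS
GRAPH_KMER_SIZE = auto
NUM_THREADS = 16
JF_SIZE = 18000000000
END
EOF

# Typical invocation (commented out so the sketch has no side effects):
# masurca sr_config.example.txt   # generates assemble.sh
# ./assemble.sh                   # runs the assembly
```

Writing to sr_config.example.txt keeps the sketch from overwriting the real configuration.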

The input data are summarized below:

Filename	Fragment size	Read length	Read pairs	Total reads	Coverage (X)
Sample_HUSA1-1	250bp	150bp	78,167,479	156,334,958	13.03
Sample_HUSA1-2	250bp	150bp	76,047,370	152,094,740	12.67
Sample_HUSA1-1-9K	9kb	150bp	35,573,143	71,146,286	5.93
Sample_HUSA1-2-9K	9kb	150bp	34,741,875	69,483,750	5.79
Sample_HUSA1-1-11K	11kb	150bp	33,067,935	66,135,870	5.51
Sample_HUSA1-2-11K	11kb	150bp	32,362,828	64,725,656	5.39
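
The coverage column can be reproduced from the total read count and the read length. The sketch below assumes a genome size of 1.8 Gb; that figure is not stated here, but it is the value that reproduces the listed coverages. The function name `coverage` is illustrative.

```shell
# coverage TOTAL_READS READ_LENGTH_BP GENOME_SIZE_BP
# Prints sequencing depth (X) = total reads * read length / genome size.
coverage() {
  awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.2f\n", n * len / g }'
}

GENOME_SIZE=1800000000   # assumed genome size (1.8 Gb), inferred from the table

coverage 156334958 150 "$GENOME_SIZE"   # Sample_HUSA1-1     -> 13.03
coverage 71146286 150 "$GENOME_SIZE"    # Sample_HUSA1-1-9K  -> 5.93
```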

Redundans was run using the script runRedundans.sh, submitted with the PBS file redundans.sub.

BUSCO was run to evaluate gene space completeness, as shown in assembly-busco.sh. The summary output is available here.

Repeats were first annotated with the EDTA program using the script assembly-edta.sh. After it completed, LTR_retriever was run using the script assembly-ltr_retreiver.sh.

BlobTools was used for contamination screening, in three steps:

Short reads were mapped to the assembly using the script runBWAmem.sh.
Reads were searched with BLAST against the NCBI RefSeq database using runMegablast.sh.
BlobTools was run on the above outputs and the genome assembly using the script blobtools.sh.
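
The three steps above typically chain together as in the sketch below. File names, the database name, thread counts, and BLAST settings are placeholders rather than the contents of the actual scripts. The sketch also assumes the BLAST query is the assembly itself, since `blobtools create` matches hits to sequence IDs in the assembly. Commands are only printed unless DRY_RUN=0 is set.

```shell
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them
run() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

ASM=assembly.fasta        # hypothetical assembly file name
R1=reads_R1.fastq.gz      # hypothetical read files
R2=reads_R2.fastq.gz

# 1. Map short reads to the assembly and produce a sorted, indexed BAM
run bwa index "$ASM"
run bwa mem -t 16 -o mapped.sam "$ASM" "$R1" "$R2"
run samtools sort -@ 16 -o mapped.bam mapped.sam
run samtools index mapped.bam

# 2. Megablast search; the first three columns (sequence ID, taxonomy ID,
#    bitscore) are the hit fields blobtools reads. refseq_db is a placeholder.
run blastn -task megablast -query "$ASM" -db refseq_db \
  -outfmt "6 qseqid staxids bitscore" -num_threads 16 -out hits.tsv

# 3. Build the BlobDB, then produce the summary table and plots
run blobtools create -i "$ASM" -b mapped.bam -t hits.tsv -o blob
run blobtools view -i blob.blobDB.json
run blobtools plot -i blob.blobDB.json
```

In dry-run mode the printed commands lose their quoting (for example around the `-outfmt` string), so they are meant for review rather than copy-paste.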