Skip to content

2 Prepare genomes workflow

Antton Alberdi edited this page Mar 20, 2021 · 16 revisions

An essential step before running the pre-processing workflow is to prepare the reference genomes we will use for mapping the metagenomic reads agains the host(s) using the HoloFlow preparegenomes (PRG) workflow. This workflow merges and indexes the reference genomes to be used in HoloFlow. In this examples we will combine the human and chicken genomes for splitting the raw reads between human, chicken and the metagenomic fraction.

2.1- Download reference genomes

You can find many reference genomes in https://www.ncbi.nlm.nih.gov/genome.

Before downloading any genome go to the genomes directory

cd ${workdir}/genomes

Use wget to retrieve genome sequences and store them in the genomes directory.

# Chicken genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/315/GCF_000002315.6_GRCg6a/GCF_000002315.6_GRCg6a_genomic.fna.gz
# Human genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz

2.2- Prepare HoloFlow PRG files

To run Holoflow PRG you will need to prepare two files.

  • PRG.txt contains the information about the genomes you want to prepare.
  • PRG.sh is the shell script you will use to launch HoloFlow PRG.

We advice to store PRG.txt file in the input folder, and the PRG.sh file in the jobs folder.

2.2.1- Prepare PRG input file (input/PRG.txt)

The PRG input file defines the names of the reference genomes, their location and their destination file in three different columns separated by spaces. One single PRG job can create multiple sets of reference genomes. Not that columns must be space-separated.

  • Column 1: name of the genome (this can be customised but should be different for each genome). This is the information that will be then used by the genomic workflow to split the reads mapped to different genomes.
  • Column 2: absolute path to the genome file.
  • Column 3: final destination file name of combined genomes. It is possible to create two sets of reference genomes at the same time by using different names in column 3. This name cannot be the same as any name declared in column 1.

The document can be created and edited with the text editor nano.

nano ${workdir}/input/PRG.txt

Prepare the contents in a simple text editor (Atom, Brackets, Textedit...) and paste it in the console. Do not directly copy from enriched text editors such as Word, Google Docs or Excel, as these can drag hidden format information that can interfere with the scripts. The document should look like this:

GRCg6a /home/projects/ku-cbd/people/antalb/holoproject/genomes/GCF_000002315.6_GRCg6a_genomic.fna.gz GgalHsap
GRCh38 /home/projects/ku-cbd/people/antalb/holoproject/genomes/GCF_000001405.39_GRCh38.p13_genomic.fna.gz GgalHsap

Type "Ctrl+X", then "Y" and finally "enter" to save the file.

Check the input file has been properly saved and review its contents.

cat ${workdir}/input/PRG.txt

2.2.2- Prepare PRG job file (jobs/PRG.txt)

The document can be created and edited again with the text editor nano.

nano ${workdir}/jobs/PRG.sh

Prepare the contents in a simple text editor (Atom, Brackets, Textedit...) and paste it in the console.

python ${holodir}/preparegenomes.py -f ${workdir}/input/PRG.txt -d ${workdir} -t 40

Check the job file has been properly saved and review its contents

cat ${workdir}/jobs/PRG.sh

2.3- Launch the HoloFlow PRG job using qsub

Whenever you run HoloFlow it is important to unload modules to avoid conflicts with the modules that HoloFlow will automatically load. To check whether you have any module loaded use the following command:

module list

Ideally, this should prompt:

No Modulefiles Currently Loaded.

If that is not the case, use module unload XXX/X.X.X to unload the modules. E.g.:

module unload samtools/1.9.0

Repeat that operation until all modules but anaconda3/4.4.0 have been unloaded. If anaconda3 is not loaded, load it before launching the job using qsub. Usually this job will take a few hours, so a walltime of 1 day (walltime=1:00:00:00) will suffice. If you are processing multiple big genomes at once, consider extending the walltime to 2-3 days (e.g. walltime=3:00:00:00).

module load tools anaconda3/4.4.0
qsub -V -A ku-cbd -W group_list=ku-cbd -v "workdir=${workdir},holodir=${holodir}" -d `pwd` -e ${workdir}/jobs/PRG.err -o ${workdir}/jobs/PRG.out -l nodes=1:ppn=40,mem=180gb,walltime=2:00:00:00 -N PRG ${workdir}/jobs/PRG.sh

To check the job status use qstat. If the job is running, column S will show "R". When the job is completed, column S will show "C". Genome merging and indexing will take a few hours to be completed. Genomes will be compressed in a tarball (genomename.tar.gz) in your project directory. Error and output contents will be redirected to ${workdir}/jobs/PRG.err and ${workdir}/jobs/PRG.out files. You should not start the preprocessing workflow before finishing the PRG workflow!