Database updating tutorial: adding genomes

This is a tutorial for updating a Struo2-generated custom database.

Please first read the README for instructions on general setup of Struo2.

Setup

This tutorial will use the following:

Database

This tutorial assumes that you've placed all of the necessary Struo2 data files in the ./data/ directory.

# Create a directory to hold all of the necessary Struo2 data files
# The directory location can be anywhere on your file system
OUTDIR=./data/
mkdir -p $OUTDIR

Reference genomes

These are the genomes that you will add to the existing databases.

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genomes/GTDBr95_n5.tar.gz
tar -pzxvf $OUTDIR/GTDBr95_n5.tar.gz --directory $OUTDIR
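
To confirm that the archive extracted correctly, you can list the genome fasta files. This assumes the genomes are gzipped fasta files (*.fna.gz), which is what the samples table below expects:

# optional check: list the extracted genome fasta files
find $OUTDIR -name "*.fna.gz"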

Samples table

This table lists all of the reference genomes to add; a minimal example is shown after the column list below.

Note: you only need the following columns in the table (all others are ignored):

  • samples_col: 'ncbi_organism_name'
  • accession_col: 'accession'
  • fasta_file_path_col: 'fasta_file_path'
  • taxID_col: 'gtdb_taxid'
  • taxonomy_col: 'gtdb_taxonomy'
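
For illustration, a minimal tab-delimited samples table might look like the following (the accession, file path, and taxid values are hypothetical placeholders):

ncbi_organism_name	accession	fasta_file_path	gtdb_taxid	gtdb_taxonomy
Escherichia coli str. K-12	GCF_000005845.2	data/GTDBr95_n5/GCF_000005845.2.fna.gz	562	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli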

Config

The snakemake pipeline config is a bit different for updating a database versus generating a new database.

Below is an example config:

#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None

#-- databases to update --#
# Replace "Create" with "Skip" to skip creation of any of these
# Note that bracken relies on the kraken2 database
databases:
  kraken2: Create
  bracken: Create
  genes: Create     # "Skip" to skip adding genes from the new genomes to the "genes" database
  humann3_bowtie2: Create  # "Skip" to skip adding genes from the new genomes to the "humann3_bowtie2" database
  humann3_diamond: Create  # "Skip" to skip adding genes from the new genomes to the "humann3_diamond" database

#-- Input --#
#--- If just a set of gene sequences to add ---#
# If you have nucleotide/amino-acid gene sequences formatted for humann
# If translate = True, missing nucleotide or amino-acid seqs will be (rev)translated from the other; otherwise those seqs are not used
new_genes:   
  amino_acid: data/UniRef50/genome_reps_filtered.faa.gz
  nucleotide: data/UniRef50/genome_reps_filtered.fna.gz
  metadata: data/genome_reps_filtered.txt.gz
  translate: False

#--- If a set of genomes to add ---#
# file listing samples and associated data
samples_file: data/GTDBr95_n10/GTDBr95_n5.tsv

## column names in samples table
samples_col: 'ncbi_organism_name'
accession_col: 'accession'
fasta_file_path_col: 'fasta_file_path'
taxID_col: 'gtdb_taxid'          # or 'ncbi_species_taxid'
taxonomy_col: 'gtdb_taxonomy'    # or 'ncbi_taxonomy' 

# Saved databases that will be updated
kraken2_db:
  library:  tests/output/GTDBr95_n10/kraken2/library/
  taxonomy: tests/output/GTDBr95_n10/kraken2/taxonomy/
genes_db:
  genes:
    mmseqs_db:  tests/output/GTDBr95_n10/genes/genes_db.tar.gz
    amino_acid: tests/output/GTDBr95_n10/genes/genome_reps_filtered.faa.gz
    nucleotide: tests/output/GTDBr95_n10/genes/genome_reps_filtered.fna.gz
    metadata:   tests/output/GTDBr95_n10/genes/genome_reps_filtered.txt.gz
  cluster:
    mmseqs_db:  tests/output/GTDBr95_n10/genes/cluster/clusters_db.tar.gz    
humann_db:
  query:
    hits: tests/output/GTDBr95_n10/humann3/annotation_hits.gz
  cluster:
    reps: tests/output/GTDBr95_n10/genes/cluster/clusters_reps.faa.gz
    membership: tests/output/GTDBr95_n10/genes/cluster/clusters_membership.tsv.gz

#-- Output --#
# output location
output_dir: tests/output/GTDBr95_n10-n5/

# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended
uniref_name: uniref50
# Name of the humann3 diamond database to create
## This must match naming allowed by humann3
dmnd_name: uniref50_201901.dmnd   # UniRef90 is recommended
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## Skip if annotating with UniRef50
cluster_idx: data/uniref50-90.pkl

# temporary file directory (your username will be added automatically)
tmp_dir: tmp/db_update_tmp/

#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp

#-- keep intermediate files required for re-creating DBs (e.g., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `tmp_dir`
keep_intermediate: True

#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# for humann3, use either mmseqs or diamond (mmseqs takes priority if neither is skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >=2
params:
  ionice: -c 3
  bracken:
    build_kmer: 35
    build_read_lens:
      - 100
      - 150
  genes:
    prodigal: ""
    vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
    mmseqs_cluster_update: --min-seq-id 0.9 -c 0.8 -s 4.0    
  humann3:
    batches: 2
    filter_existing: --min-pident 0  # any existing genes w/ pident < cutoff will be re-queried
    mmseqs_search:
      db: data/UniRef90/uniref90       
      index: -s 6
      run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
    diamond:
      db: Skip # data/uniref90_ec-filtered/uniref90_ec_filt_201901.dmnd
      run: --evalue 1e-3 --query-cover 80 --id 90 --max-target-seqs 1 --block-size 4 --index-chunks 2
    propagate_annotations: --min-cov 80 --min-pident 90

#-- snakemake pipeline --#
pipeline:
  snakemake_folder: ./
  script_folder: ./bin/scripts/
  name: Struo2_db-update
  config: update

See the general notes about the config, which apply regardless of whether you are creating or updating a database.

Notes specific to database updating:

  • kraken2_db:, genes_db:, and humann_db: specify the locations for the existing database files
  • WARNING: use a different output_dir: than where the existing databases are located; otherwise, the database files may be overwritten! A quick pre-flight check is sketched below.
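
For example, with the paths from the example config above, a minimal pre-flight check could be:

# make sure output_dir does not resolve to the existing database location
DB_DIR=tests/output/GTDBr95_n10/
OUT_DIR=tests/output/GTDBr95_n10-n5/
if [ "$(readlink -f "$DB_DIR")" = "$(readlink -f "$OUT_DIR")" ]; then
  echo "WARNING: output_dir would overwrite the existing databases!"
fi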

Pipeline run

See the snakemake docs for general instructions.

First, a dry run:

snakemake --use-conda -j 1 -Fqn

Now, an actual run with 2 cores:

snakemake --use-conda -j 2 -F

See the README for running snakemake on a cluster (recommended).
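
As a generic illustration only (the profile name below is a placeholder for whatever cluster profile you have configured; see the README for the recommended setup):

snakemake --use-conda -j 100 -F --profile my_cluster_profile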

Output

See the README for details on the output.

A quick sanity check is to compare the sizes of the updated databases versus the original database files; the updated files should be larger.
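
For example, using the paths from the example config above:

# compare on-disk sizes of the original vs. updated databases
du -sh tests/output/GTDBr95_n10/ tests/output/GTDBr95_n10-n5/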