Database updating tutorial: adding genomes

This is a tutorial for updating a Struo2-generated custom database.

Please first read the README for instructions on general setup of Struo2.

Setup

This tutorial will use the following:

Database

This tutorial assumes that you've placed all of the necessary Struo2 data files in the ./data/ directory.

# Create a directory to hold all of the necessary Struo2 data files
# The directory location can be anywhere on your file system
OUTDIR=./data/
mkdir -p $OUTDIR

Reference genomes

These are the genomes that you will add to the existing databases.

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genomes/GTDBr95_n5.tar.gz
tar -pzxvf $OUTDIR/GTDBr95_n5.tar.gz --directory $OUTDIR
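
To confirm that the archive extracted correctly, you can list the genome fasta files. This assumes the genomes are gzipped fasta files (*.fna.gz), which is what the samples table below expects:

# optional check: list the extracted genome fasta files
find $OUTDIR -name "*.fna.gz"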

Samples table

This table lists all of the reference genomes to add; a minimal example is shown after the column list below.

Note: you only need the following columns in the table (all others are ignored):

  • samples_col: 'ncbi_organism_name'
  • accession_col: 'accession'
  • fasta_file_path_col: 'fasta_file_path'
  • taxID_col: 'gtdb_taxid'
  • taxonomy_col: 'gtdb_taxonomy'
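
For illustration, a minimal tab-delimited samples table might look like the following (the accession, file path, and taxid values are hypothetical placeholders):

ncbi_organism_name	accession	fasta_file_path	gtdb_taxid	gtdb_taxonomy
Escherichia coli str. K-12	GCF_000005845.2	data/GTDBr95_n5/GCF_000005845.2.fna.gz	562	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli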

Config

The snakemake pipeline config is a bit different for updating a database versus generating a new database.

Below is an example config:

#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None

#-- databases to update --#
# Replace "Create" with "Skip" to skip creation of any of these
# Note that bracken relies on the kraken2 database
databases:
  kraken2: Create
  bracken: Create
  genes: Create     # "Skip" to skip adding genes from the new genomes to the "genes" database
  humann3_bowtie2: Create  # "Skip" to skip adding genes from the new genomes to the "humann3_bowtie2" database
  humann3_diamond: Create  # "Skip" to skip adding genes from the new genomes to the "humann3_diamond" database

#-- Input --#
#--- If just a set of gene sequences to add ---#
# If you have nucleotide/amino-acid gene sequences formatted for humann
# If translate = True, missing nucleotide or amino-acid seqs will be (rev)translated from the other; otherwise those seqs are not used
new_genes:   
  amino_acid: data/UniRef50/genome_reps_filtered.faa.gz
  nucleotide: data/UniRef50/genome_reps_filtered.fna.gz
  metadata: data/genome_reps_filtered.txt.gz
  translate: False

#--- If a set of genomes to add ---#
# file listing samples and associated data
samples_file: data/GTDBr95_n10/GTDBr95_n5.tsv

## column names in samples table
samples_col: 'ncbi_organism_name'
accession_col: 'accession'
fasta_file_path_col: 'fasta_file_path'
taxID_col: 'gtdb_taxid'          # or 'ncbi_species_taxid'
taxonomy_col: 'gtdb_taxonomy'    # or 'ncbi_taxonomy' 

# Saved databases that will be updated
kraken2_db:
  library:  tests/output/GTDBr95_n10/kraken2/library/
  taxonomy: tests/output/GTDBr95_n10/kraken2/taxonomy/
genes_db:
  genes:
    mmseqs_db:  tests/output/GTDBr95_n10/genes/genes_db.tar.gz
    amino_acid: tests/output/GTDBr95_n10/genes/genome_reps_filtered.faa.gz
    nucleotide: tests/output/GTDBr95_n10/genes/genome_reps_filtered.fna.gz
    metadata:   tests/output/GTDBr95_n10/genes/genome_reps_filtered.txt.gz
  cluster:
    mmseqs_db:  tests/output/GTDBr95_n10/genes/cluster/clusters_db.tar.gz    
humann_db:
  query:
    hits: tests/output/GTDBr95_n10/humann3/annotation_hits.gz
  cluster:
    reps: tests/output/GTDBr95_n10/genes/cluster/clusters_reps.faa.gz
    membership: tests/output/GTDBr95_n10/genes/cluster/clusters_membership.tsv.gz

#-- Output --#
# output location
output_dir: tests/output/GTDBr95_n10-n5/

# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended
uniref_name: uniref50
# Name of the humann3 diamond database to create
## This must match naming allowed by humann3
dmnd_name: uniref50_201901.dmnd   # UniRef90 is recommended
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## Skip if annotating with UniRef50
cluster_idx: data/uniref50-90.pkl

# temporary file directory (your username will be added automatically)
tmp_dir: tmp/db_update_tmp/

#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp

#-- keep intermediate files required for re-creating DBs (e.g., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `tmp_dir`
keep_intermediate: True

#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# for humann3, use either mmseqs or diamond (mmseqs takes priority if neither is skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >=2
params:
  ionice: -c 3
  bracken:
    build_kmer: 35
    build_read_lens:
      - 100
      - 150
  genes:
    prodigal: ""
    vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
    mmseqs_cluster_update: --min-seq-id 0.9 -c 0.8 -s 4.0    
  humann3:
    batches: 2
    filter_existing: --min-pident 0  # any existing genes w/ pident < cutoff will be re-queried
    mmseqs_search:
      db: data/UniRef90/uniref90       
      index: -s 6
      run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
    diamond:
      db: Skip # data/uniref90_ec-filtered/uniref90_ec_filt_201901.dmnd
      run: --evalue 1e-3 --query-cover 80 --id 90 --max-target-seqs 1 --block-size 4 --index-chunks 2
    propagate_annotations: --min-cov 80 --min-pident 90

#-- snakemake pipeline --#
pipeline:
  snakemake_folder: ./
  script_folder: ./bin/scripts/
  name: Struo2_db-update
  config: update

See the general notes about the config, which apply regardless of whether you are creating or updating a database.

Notes specific to database updating:

  • kraken2_db:, genes_db:, and humann_db: specify the locations for the existing database files
  • WARNING: use a different output_dir: than where the existing databases are located; otherwise, the database files may be overwritten! A quick pre-flight check is sketched below.
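
For example, with the paths from the example config above, a minimal pre-flight check could be:

# make sure output_dir does not resolve to the existing database location
DB_DIR=tests/output/GTDBr95_n10/
OUT_DIR=tests/output/GTDBr95_n10-n5/
if [ "$(readlink -f "$DB_DIR")" = "$(readlink -f "$OUT_DIR")" ]; then
  echo "WARNING: output_dir would overwrite the existing databases!"
fi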

Pipeline run

See the snakemake docs for general instructions.

First, a dry run:

snakemake --use-conda -j 1 -Fqn

Now, an actual run with 2 cores:

snakemake --use-conda -j 2 -F

See the README for running snakemake on a cluster (recommended).
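
As a generic illustration only (the profile name below is a placeholder for whatever cluster profile you have configured; see the README for the recommended setup):

snakemake --use-conda -j 100 -F --profile my_cluster_profile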

Output

See the README for details on the output.

A quick sanity check is to compare the sizes of the updated databases versus the original database files; the updated files should be larger.
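
For example, using the paths from the example config above:

# compare on-disk sizes of the original vs. updated databases
du -sh tests/output/GTDBr95_n10/ tests/output/GTDBr95_n10-n5/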