Skip to content
martinghunt edited this page Jan 3, 2024 · 24 revisions

Installation

The recommended method is to use a pre-built Docker or Singularity container.

Both the Docker and Singularity container have the main script viridian_workflow installed.

Docker

Get a Docker image of the latest release:

docker pull ghcr.io/iqbal-lab-org/viridian_workflow:latest

All Docker images are listed in the packages page.

To build a docker container, clone this repository and then from its root run:

docker build --network=host .

(without --network=host you will likely get pip install timing out and the build failing).

Singularity

Releases include a Singularity image to download. Each release has a singularity image file called viridian_workflow_vX.Y.Z.img, where X.Y.Z is the release version.

To build a singularity container, clone this repository and then from its root run:

singularity build viridian_workflow.img Singularity.def

Basic Usage

The examples below will run the default pipeline, using the built-in SARS-CoV-2 amplicon schemes (at the time of writing) ampliseq, ARTIC V3-5, and Midnight-1200. The pipeline automatically detects the scheme that best matches the input reads. To use your own amplicon scheme and/or force the choice of scheme, please read the amplicon schemes page.

Viridian needs one of the following input sequencing data:

  • Illumina reads, can be paired or unpaired
  • Oxford Nanopore reads
  • Ion Torrent reads, can be paired on unpaired

To run on paired Illumina reads:

viridian run_one_sample \
  --tech illumina
  --reads1 reads_1.fastq.gz \
  --reads2 reads_2.fastq.gz \
  --outdir OUT

To run on unpaired nanopore reads:

viridian run_one_sample \
  --tech ont
  --reads reads.fastq.gz \
  --outdir OUT

The allowed values of --tech are illumina, iontorrent, ont. Nanopore reads must be unpaired (ie use the --reads option). Illumina and iontorrent reads can be paired or unpaired.

The option --outdir is required - Viridian will make a new output directory with the give value. In the above examples, this is called OUT. Some combination of reads options is required, as explained in detail below.

Output files

The default files in the output directory are:

  • consensus.fa.gz: a gzipped FASTA file of the final masked consensus sequence.
  • variants.vcf: a VCF file of the identified variants between the consensus sequence and the reference genome.
  • log.json.gz: a gzipped JSON file of logging information for the viridian workflow run. This is described in detail in the JSON output file page.
  • qc.tsv.gz: a gzipped tab-delimited file containing per-base QC information about the consensus sequence. This is described in detail in the QC TSV file page.
  • scheme_id.depth_across_genome.pdf: a plot of the read depth across the genome, with amplicons coloured in the background.
  • scheme_id.score_plot.pdf: a plot of the scoring for amplicon scheme identification.

Other output file options:

  • --keep_bam is used, then a sorted BAM file of the reads mapped to the reference will also be present, called reference_mapped.bam (and its index file reference_mapped.bam.bai).
  • --write_msa MSA_TYPE: this is useful for building trees (for example, input to usher). All MSA variations are always in the file log.json, but this option lets you write some of them to FASTA files. It can be used more than once. Each sequence is written to separate FASTA file. Choose from: indel_as_N and/or indel_as_ref. Both options output the consensus sequence, but any insertions with respect to the reference are removed so that the consensus sequence length is the same as the reference length. Deletions (ie gaps) in the consensus can be replaced with N using indel_as_N, or with the reference sequence using indel_as_ref.

Command line options

Input reads

The reads can be in FASTA, FASTQ or BAM format. FASTA/Q files can optionally be gzipped. Use the following options depending on the format:

  • Unpaired FASTA/FASTQ: --reads my_reads.fastq (or --reads my_reads.fastq.gz)
  • Paired FASTA/FASTQ: --reads1 reads_1.fastq --reads2 reads_.fastq (or --reads1 reads_1.fastq.gz --reads2 reads_.fastq.gz)
  • Paired or unpaired in a sorted by coordinate and indexed BAM file: --reads_bam mapped.bam. "Indexed" means that the file mapped.bam.bai must exist (made with samtools index mapped.bam).

The sequencing technology of the reads must be specified with the required option --tech, with allowed values illumina, iontorrent, or ont. Nanopore (ont) reads must be unpaired. Illumuna and Ion Torrent reads can be paired or unpaired.

Alternatively you can provide a run accession, of the form ERR12345678, using the option --ena_run ERR12345678. The reads will be downloaded and assembled, as long as they are from a supported sequencing technology. If you use this option, then the --reads* and --tech options are not required. There is also the option --keep_ena_reads, which will keep reads downloaded from the ENA. By default they are deleted at the end of the Viridian run.

Basic options

  • --sample_name MY_NAME: use this to change the sample name (default is "sample") that is put in the final FASTA file, BAM file, and VCF file.
  • --keep_bam: use this option to keep the BAM file of original input reads mapped to the reference genome.
  • --force: use with caution - it will overwrite the output directory if it already exists.
  • --tmp_dir DIRNAME: directory in which to put a temporary directory for intermediate files that are not kept. The dir given by --tmp_dir must already exist. Default is platform dependent, and uses python's tempdir default locations (eg for linux, looks for /tmp first). If you use the --debug option, then the --tmp_dir option is ignored and a directory called Processing is used in the output directory.
  • --no_gzip: use this to not gzip the final output FASTA file and JSON log file
  • --debug: produces very verbose logging output, and does not clean up intermediate files. Also affects temporary files directory (see --tmp_dir).

Decontamination

The reads can be decontaminated using readItAndKeep using the flag --decontam COVID, which uses the SARS-CoV-2 reference genome with the polyA tail removed. This option is incompatible with --reads_bam. The reads are decontaminated up front, before the usual start of the pipeline that maps reads to the reference genome.

Amplicon scheme options

The basic options related to amplicon schemes are listed below. If you want to use your own scheme, please see the detailed amplicon schemes page

  • --detect_scheme_only: map the reads and detect the amplicon scheme, then stop the pipeline instead of making a consensus etc
  • --min_scheme_score INT: minimum score required for a matching scheme. If all schemes are less than this cutoff, the pipeline is stopped unless --force_amp_scheme is used. Default is 250.
  • --max_scheme_ratio FLOAT: maximum allowed value of (second best scheme score) / (best scheme score). Default is 0.5.

Assembly, QC and masking options

We do not recommend using these options unless you know what you are doing.

Initial read depth QC options:

  • --coverage_min_x X, coverage_min_pc Y: after initial read mapping to the reference, if less than Y percent of the genome has at least X read coverage, then the pipeline is stopped. Default is to require at least 50 percent of the genome with at least 20X read depth

Assembly option:

  • --assemble_depth INT: Target read depth for each amplicon when assembling. Default is 100. This means reads are sampled to 100X within each amplicon. If an amplicon has mean depth less than 100, then all reads are kept for that amplicon.

After assembly, reads are mapped to the consensus sequence to perform more QC and mask low quality positions. The relevant options are:

  • --qc_depth INT: target coverage for each amplicon, when running QC and masking. Default is 1000. Similarly to the --assemble_depth option, the default of 1000 means reads for each amplicon are randomly sampled to 1000X, or if the mean depth is less than 1000 then all reads are kept.
  • --masking_min_depth INT: consensus positions with total reads supporting the consensus call less than this number are masked. Default is 20.
  • --masking_min_frs FLOAT. Must be a value between 0 and 1. Consensus positions with a fraction of read support (FRS) less than the given number are masked. Default is 0.5, meaning if less than half of the mapped reads agree with the consensus, then the position is masked with an N.
  • --het_min_pc FLOAT: cutoff for when to use ambiguous IUPAC codes. With --het_min_pc X, if two or more alleles each have support of more than X percent of reads, then an ambiguous IUPAC code is used and the position is given the 'HET' filter. Default is 20.0.

Other QC options:

  • --max_percent_amps_fail FLOAT: maximum percent of amplicons allowed to fail during read sampling or making consensus for each amplicon. The pipeline is stopped as soon as too many failed amplicons are detected. Default is 50.0.
  • --max_cons_n_percent FLOAT: maximum allowed percentage of Ns in the consensus sequence. Pipeline is stopped as soon as too many Ns in current consensus. Default is 50.0.
Clone this wiki locally