Skip to content

ChaissonLab/SegDupAnnotation2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Snakemake Workflow: SegDupAnnotation2

Snakemake

Overview

A snakemake workflow for counting gene duplications given PacBio reads, a genome assembly, and a gene model.
Successor to SegDupAnnotation.

❗ Caution: This workflow is under active development. The main branch should run without any errors, but feel free reach out for assistance or submit a bug report in case you run into any problems or notice any errors.

Usage

Usage guide is listed in order of ease of use and portability.
Internet access is required for snakemake to download the docker image and conda environments.

Singularity

  • Install Snakemake and Singularity.
  • Clone github repository.
  • Create configuration file per config/README.md specifications.
    • If necessary don't forget to use the 'override_mem' and 'override_num_cores' parameters.
  • From repository root, run (which will automatically download and run the dockerfile from within singularity):
    • snakemake -c 1 -j 250 --use-singularity --use-conda --singularity-args " --bind \<path to an input file\>\[,\path to another input file\] "
    • Don't forget to bind paths of input files per the given config file.

Bare Metal with Conda

  • Install Snakemake and Mamba.
  • Clone github repository.
  • Create configuration file per config/README.md specifications.
  • From repository root, run (which will automatically download and use pre-defined conda environments):
    • snakemake -c 1 -j 250 --use-conda

Bare Metal with Conda and SLURM

  • Install Snakemake and Mamba.
  • Clone github repository.
  • Create configuration file per config/README.md specifications.
  • From repository root, run (which will automatically download and use pre-defined conda environments):
    • snakemake -c 1 -j 250 --use-conda --slurm --default-resources slurm_account=\<your SLURM account\> slurm_partition=\<your SLURM partition\>, or
    • snakemake -c 1 -j 250 --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=\<your SLURM account\> --partition=\<your SLURM partition\>"

Bare Metal

  • Ensure all dependencies are installed on the system. For a list reference 'workflow/envs'.
  • Clone github repository.
  • Create configuration file per config/README.md specifications.
  • From repository root, run:
    • snakemake -c 1 -j 250

My Bare Metal with SLURM commands

  • conda activate sda
  • Then I run:
    • snakemake -c 1 -j 250 -k --slurm --default-resources slurm_account=mchaisso_100 slurm_partition=qcb, or
    • snakemake -c 1 -j 250 -k --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=mchaisso_100 --partition=qcb --output=slurm-logs/slurm-%j.out"

Salient Output File Specifications

results/G01_dups_<isoform_grouping_type>.bed

Column Description
#chr Gene copy's position in assembly.
start ^
end ^
gene Gene name.
orig_chr Position of gene copy's original copy in assembly.
orig_start ^
orig_end ^
strand Strand on which gene copy is on: 0 for 'Original', '1' for reverse.
haplotype The haplotype of the hit. This requires 'haplotype1' or 'haplotype2' to be explicitly stated in the chromosome/scaffold/contig name.
p_identity Similarity to 'Original' gene calculated as: #matches/(#matches+#mismatches+#insertion_events+#deletion_events)
p_accuracy Similarity to 'Original' gene calculated as : #matches/(#matches+#mismatches+#insertions+#deletions)
identity Notes whether copy is the 'Original' or a resolved 'Copy'.
depth Mean gene depth over mean assembly depth.
depth_stdev Standard deviation of gene depth over mean assembly depth calculated using 100 bp bins.
copy_num Rounded depth value.
depth_by_vcf Copy number as determined by hmm's vcf output.

results/G03_per_gene_counts_<isoform_grouping_type>.tsv

Column Description
gene Gene name.
depthNormalizedToAsm Mean of depth across all gene copies divided by mean assembly depth.
depth Sum of depth across all gene copies.
measuredCov depth rounded to nearest whole number.
copyCount Number of total gene copies - resolved plus collapsed.
resolvedCount Number of resolved gene copies.

results/G04_summary_stats_<isoform_grouping_type>.tsv
Contains tab delimited summary statistics.

Tips

If rerunning pipeline using the same input assembly and reads, retain and copy over files A01_assembly.fasta and A04_assembly.bam as aligning reads to assembly is the longest step.