Snakemake Workflow: `SegDupAnnotation2`

Overview

A snakemake workflow for counting gene duplications given PacBio reads, a genome assembly, and a gene model.
Successor to SegDupAnnotation.

❗ Caution: This workflow is under active development. The main branch should run without any errors, but feel free reach out for assistance or submit a bug report in case you run into any problems or notice any errors.

Usage

Usage guide is listed in order of ease of use and portability.
Internet access is required for snakemake to download the docker image and conda environments.

Singularity

Install Snakemake and Singularity.
Clone github repository.
Create configuration file per config/README.md specifications.
- If necessary don't forget to use the 'override_mem' and 'override_num_cores' parameters.
From repository root, run (which will automatically download and run the dockerfile from within singularity):
- snakemake -c 1 -j 250 --use-singularity --use-conda --singularity-args " --bind \<path to an input file\>\[,\path to another input file\] "
- Don't forget to bind paths of input files per the given config file.

Bare Metal with Conda

Install Snakemake and Mamba.
Clone github repository.
Create configuration file per config/README.md specifications.
From repository root, run (which will automatically download and use pre-defined conda environments):
- snakemake -c 1 -j 250 --use-conda

Bare Metal with Conda and SLURM

Install Snakemake and Mamba.
Clone github repository.
Create configuration file per config/README.md specifications.
From repository root, run (which will automatically download and use pre-defined conda environments):
- snakemake -c 1 -j 250 --use-conda --slurm --default-resources slurm_account=\<your SLURM account\> slurm_partition=\<your SLURM partition\>, or
- snakemake -c 1 -j 250 --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=\<your SLURM account\> --partition=\<your SLURM partition\>"

Bare Metal

Ensure all dependencies are installed on the system. For a list reference 'workflow/envs'.
Clone github repository.
Create configuration file per config/README.md specifications.
From repository root, run:
- snakemake -c 1 -j 250

My Bare Metal with SLURM commands

conda activate sda
Then I run:
- snakemake -c 1 -j 250 -k --slurm --default-resources slurm_account=mchaisso_100 slurm_partition=qcb, or
- snakemake -c 1 -j 250 -k --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=mchaisso_100 --partition=qcb --output=slurm-logs/slurm-%j.out"

Salient Output File Specifications

results/G01_dups_<isoform_grouping_type>.bed

Column	Description
#chr	Gene copy's position in assembly.
start	^
end	^
gene	Gene name.
orig_chr	Position of gene copy's original copy in assembly.
orig_start	^
orig_end	^
strand	Strand on which gene copy is on: 0 for 'Original', '1' for reverse.
haplotype	The haplotype of the hit. This requires 'haplotype1' or 'haplotype2' to be explicitly stated in the chromosome/scaffold/contig name.
p_identity	Similarity to 'Original' gene calculated as: #matches/(#matches+#mismatches+#insertion_events+#deletion_events)
p_accuracy	Similarity to 'Original' gene calculated as : #matches/(#matches+#mismatches+#insertions+#deletions)
identity	Notes whether copy is the 'Original' or a resolved 'Copy'.
depth	Mean gene depth over mean assembly depth.
depth_stdev	Standard deviation of gene depth over mean assembly depth calculated using 100 bp bins.
copy_num	Rounded depth value.
depth_by_vcf	Copy number as determined by hmm's vcf output.

results/G03_per_gene_counts_<isoform_grouping_type>.tsv

Column	Description
gene	Gene name.
depthNormalizedToAsm	Mean of depth across all gene copies divided by mean assembly depth.
depth	Sum of depth across all gene copies.
measuredCov	depth rounded to nearest whole number.
copyCount	Number of total gene copies - resolved plus collapsed.
resolvedCount	Number of resolved gene copies.

results/G04_summary_stats_<isoform_grouping_type>.tsv
Contains tab delimited summary statistics.

Tips

If rerunning pipeline using the same input assembly and reads, retain and copy over files A01_assembly.fasta and A04_assembly.bam as aligning reads to assembly is the longest step.

Name		Name	Last commit message	Last commit date
Latest commit History 119 Commits
config		config
workflow		workflow
.gitignore		.gitignore
.gitmodules		.gitmodules
.snakemake-workflow-catalog.yml		.snakemake-workflow-catalog.yml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Snakemake Workflow: `SegDupAnnotation2`

Overview

Usage

Singularity

Bare Metal with Conda

Bare Metal with Conda and SLURM

Bare Metal

My Bare Metal with SLURM commands

Salient Output File Specifications

Tips

About

Releases 2

Packages

Languages

License

ChaissonLab/SegDupAnnotation2

Folders and files

Latest commit

History

Repository files navigation

Snakemake Workflow: SegDupAnnotation2

Overview

Usage

Singularity

Bare Metal with Conda

Bare Metal with Conda and SLURM

Bare Metal

My Bare Metal with SLURM commands

Salient Output File Specifications

Tips

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Snakemake Workflow: `SegDupAnnotation2`

Packages