A snakemake workflow for counting gene duplications given PacBio reads, a genome assembly, and a gene model.
Successor to SegDupAnnotation.
❗ Caution: This workflow is under active development. The main branch should run without any errors, but feel free reach out for assistance or submit a bug report in case you run into any problems or notice any errors.
Usage guide is listed in order of ease of use and portability.
Internet access is required for snakemake to download the docker image and conda environments.
- Install Snakemake and Singularity.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- If necessary don't forget to use the 'override_mem' and 'override_num_cores' parameters.
- From repository root, run (which will automatically download and run the dockerfile from within singularity):
snakemake -c 1 -j 250 --use-singularity --use-conda --singularity-args " --bind \<path to an input file\>\[,\path to another input file\] "
- Don't forget to bind paths of input files per the given config file.
- Install Snakemake and Mamba.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- From repository root, run (which will automatically download and use pre-defined conda environments):
snakemake -c 1 -j 250 --use-conda
- Install Snakemake and Mamba.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- From repository root, run (which will automatically download and use pre-defined conda environments):
snakemake -c 1 -j 250 --use-conda --slurm --default-resources slurm_account=\<your SLURM account\> slurm_partition=\<your SLURM partition\>
, orsnakemake -c 1 -j 250 --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=\<your SLURM account\> --partition=\<your SLURM partition\>"
- Ensure all dependencies are installed on the system. For a list reference 'workflow/envs'.
- Clone github repository.
- Create configuration file per config/README.md specifications.
- From repository root, run:
snakemake -c 1 -j 250
conda activate sda
- Then I run:
snakemake -c 1 -j 250 -k --slurm --default-resources slurm_account=mchaisso_100 slurm_partition=qcb
, orsnakemake -c 1 -j 250 -k --use-conda --cluster "sbatch -c {resources.cpus_per_task} --mem={resources.mem_mb}MB --time={resources.runtime} --account=mchaisso_100 --partition=qcb --output=slurm-logs/slurm-%j.out"
results/G01_dups_<isoform_grouping_type>.bed
Column | Description |
---|---|
#chr | Gene copy's position in assembly. |
start | ^ |
end | ^ |
gene | Gene name. |
orig_chr | Position of gene copy's original copy in assembly. |
orig_start | ^ |
orig_end | ^ |
strand | Strand on which gene copy is on: 0 for 'Original', '1' for reverse. |
haplotype | The haplotype of the hit. This requires 'haplotype1' or 'haplotype2' to be explicitly stated in the chromosome/scaffold/contig name. |
p_identity | Similarity to 'Original' gene calculated as: #matches/(#matches+#mismatches+#insertion_events+#deletion_events) |
p_accuracy | Similarity to 'Original' gene calculated as : #matches/(#matches+#mismatches+#insertions+#deletions) |
identity | Notes whether copy is the 'Original' or a resolved 'Copy'. |
depth | Mean gene depth over mean assembly depth. |
depth_stdev | Standard deviation of gene depth over mean assembly depth calculated using 100 bp bins. |
copy_num | Rounded depth value. |
depth_by_vcf | Copy number as determined by hmm's vcf output. |
results/G03_per_gene_counts_<isoform_grouping_type>.tsv
Column | Description |
---|---|
gene | Gene name. |
depthNormalizedToAsm | Mean of depth across all gene copies divided by mean assembly depth. |
depth | Sum of depth across all gene copies. |
measuredCov | depth rounded to nearest whole number. |
copyCount | Number of total gene copies - resolved plus collapsed. |
resolvedCount | Number of resolved gene copies. |
results/G04_summary_stats_<isoform_grouping_type>.tsv
Contains tab delimited summary statistics.
If rerunning pipeline using the same input assembly and reads, retain and copy over files A01_assembly.fasta and A04_assembly.bam as aligning reads to assembly is the longest step.