This is my adaptation of the original forked repo's rna-seq-star-deseq2 workflow to run on AWS ParallelCluster, via the pcluster-slurm snakemake executor.
This workflow performs a differential gene expression analysis with STAR and DESeq2.
From the head node, follow the steps below.
- Have conda activated.
git clone git@github.com:Daylily-Informatics/rna-seq-star-deseq2.git
cd rna-seq-star-deseq2
conda create -n snakemake -c conda-forge snakemake==8.24 snakedeploy tabulate yaml
conda activate snakemake
pip install snakemake-executor-plugin-pcluster-slurm==0.0.25
conda activate snakemake
snakemake --version
8.20.6
You are advised to run the following in a tmux or screen session.
conda activate snakemake
# Set your cache dir for saving resources that are useful across other jobs. Snakemake uses this when the `--cache` flag is set.
mkdir -p /fsx/resources/environments/containers/ubuntu/rnaseq_cache/
export SNAKEMAKE_OUTPUT_CACHE=/fsx/resources/environments/containers/ubuntu/rnaseq_cache/
# I set a partition relevant to my install, but if you specify nothing, you will get an error along the lines of <could not find appropriate nodes>.
snakemake --use-conda --use-singularity --singularity-prefix /fsx/resources/environments/containers/ubuntu/ --singularity-args " -B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD" --conda-prefix /fsx/resources/environments/containers/ubuntu/ --executor pcluster-slurm --default-resources slurm_partition=i64,i96,i192 --cache -p --verbose -k --max-threads 20000 --cores 20000 -j 14 -n --conda-create-envs-only
- There seems to be a bug that requires you to run with `--conda-create-envs-only` first.
- Another bug in how snakemake detects the maximum allowed threads per job limits threads to the `nproc` of your head node. Setting `--max-threads 20000 --cores 20000` works around this crudely.
- Remove the `-n` flag to run outside dry-run mode. `-j` sets the maximum number of jobs Slurm will allow active at one time.
- Watch your running nodes/jobs using `squeue` (also, `q` cluster commands work, but not reliably, and are not supported).
- Note: a bug in this example seems to cause a few jobs to fail (investigating).
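Putting the notes above together, the real run is the dry-run command with `-n` removed. The sketch below also drops `--conda-create-envs-only`, assuming the environments were already built by an earlier pass with that flag. Paths and partitions are copied from the command above; `CONTAINER_PREFIX`, `SMK_ARGS`, and `run_workflow` are hypothetical helper names, not part of snakemake or the plugin.

```shell
#!/usr/bin/env bash
# Sketch only: paths and partitions are copied from the dry-run command above.
CONTAINER_PREFIX=/fsx/resources/environments/containers/ubuntu/

# Arguments shared by every invocation (no -n, no --conda-create-envs-only).
SMK_ARGS=(
  --use-conda
  --use-singularity
  --singularity-prefix "$CONTAINER_PREFIX"
  --singularity-args " -B /tmp:/tmp -B /fsx:/fsx -B /home/$USER:/home/$USER -B $PWD/:$PWD"
  --conda-prefix "$CONTAINER_PREFIX"
  --executor pcluster-slurm
  --default-resources slurm_partition=i64,i96,i192
  --cache -p --verbose -k
  --max-threads 20000 --cores 20000
  -j 14
)

# Hypothetical helper: any extra flags (for example -n) are appended after the shared ones.
run_workflow() {
  snakemake "${SMK_ARGS[@]}" "$@"
}
```

With this in place, `run_workflow -n` repeats the dry run and `run_workflow` with no arguments submits jobs to Slurm.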
Use `sinfo` to learn about your cluster (note, `sinfo` reports on all potential and active compute nodes. Read the docs to interpret which are active, which are not yet requested spot instances, etc). Below is what the daylily AWS ParallelCluster looks like.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
i8* up infinite 12 idle~ i8-dy-gb64-[1-12]
i64 up infinite 16 idle~ i64-dy-gb256-[1-8],i64-dy-gb512-[1-8]
i96 up infinite 16 idle~ i96-dy-gb384-[1-8],i96-dy-gb768-[1-8]
i128 up infinite 28 idle~ i128-dy-gb256-[1-8],i128-dy-gb512-[1-10],i128-dy-gb1024-[1-10]
i192 up infinite 30 idle~ i192-dy-gb384-[1-10],i192-dy-gb768-[1-10],i192-dy-gb1536-[1-10]
a192 up infinite 30 idle~ a192-dy-gb384-[1-10],a192-dy-gb768-[1-10],a192-dy-gb1536-[1-10]
- As I look at this, it is possible that, if unset, the partition will default to `i8` in the output above. Maybe.
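Slurm marks the cluster's default partition with a trailing `*` in the `PARTITION` column of `sinfo`. A minimal sketch for extracting that name follows; `default_partition` is a hypothetical helper, and whether the pcluster-slurm plugin actually falls back to this partition when none is given is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read sinfo output on stdin and print the partition
# whose name carries the trailing "*" that Slurm uses for the default.
default_partition() {
  awk '$1 ~ /\*$/ { sub(/\*$/, "", $1); print $1; exit }'
}

# Example with two rows copied from the sinfo listing above.
printf '%s\n' \
  'i8* up infinite 12 idle~ i8-dy-gb64-[1-12]' \
  'i64 up infinite 16 idle~ i64-dy-gb256-[1-8],i64-dy-gb512-[1-8]' \
  | default_partition
```

On the cluster itself, the equivalent call would be `sinfo | default_partition`.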
I make extensive use of cost allocation tags with AWS ParallelCluster in the daylily omics analysis framework ($3 30x WGS analysis) to track AWS cluster usage costs in real time, and to impose limits where appropriate (by user and project). This works by overriding the `--comment` flag to hold `project/budget` tags, which are applied to ephemeral AWS resources, thus enabling cost tracking and controls.
- To change the `--comment` flag in v0.0.8 of the pcluster-slurm plugin, set the comment flag value in the environment variable `SMK_SLURM_COMMENT=RandD` (`RandD` is the default).
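A minimal sketch of tagging a run, assuming the environment variable behaves as described above. The value `my_project` is a hypothetical tag, and the `squeue` check is guarded so the snippet does nothing harmful off-cluster.

```shell
#!/usr/bin/env bash
# Hypothetical project tag; replace with a tag your budget tracking expects.
export SMK_SLURM_COMMENT=my_project

# Once jobs are submitted, %k in a squeue format string prints each job's comment,
# which should match the tag above if the plugin picked it up.
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$USER" -o "%.18i %.9P %.8T %k"
fi
```

Export the variable in the same shell (or tmux session) before launching snakemake so the submitted jobs inherit it.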