diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000..09cb074
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,56 @@
+# This CITATION.cff file was generated with cffinit.
+# Visit https://bit.ly/cffinit to generate yours today!
+
+cff-version: 1.2.0
+title: HPC Workflow Management with Snakemake
+message: >-
+  If you use this software, please cite it using the
+  metadata from this file.
+type: software
+authors:
+  - given-names: Alan
+    family-names: O'Cais
+    email: alan.ocais@cecam.org
+    affiliation: University of Barcelona
+    orcid: 'https://orcid.org/0000-0002-8254-8752'
+repository-code: 'https://github.com/carpentries-incubator/hpc-workflows'
+url: 'https://carpentries-incubator.github.io/hpc-workflows/'
+abstract: >-
+  When using HPC resources, it's very common to need to
+  carry out the same set of tasks over a set of data
+  (commonly called a workflow or pipeline). In this lesson
+  we will run an experiment that takes an application which
+  runs in parallel and investigate its scalability. To do
+  that we will need to gather data; in this case that means
+  running the application multiple times with different
+  numbers of CPU cores and recording the execution time.
+  Once we've done that, we need to create a visualisation of
+  the data to see how it compares against the ideal case.
+
+
+  We could do all of this manually, but there are useful
+  tools to help us manage data analysis pipelines like we
+  have in our experiment. In the context of this lesson,
+  we'll learn about one of those: Snakemake.
+keywords: + - HPC + - Carpentries + - Lesson + - Workflow + - Pipeline +license: CC-BY-4.0 +references: + - authors: + - family-names: Collins + given-names: Daniel + title: "Getting Started with Snakemake" + type: software + repository-code: 'https://github.com/carpentries-incubator/workflows-snakemake/' + url: 'https://carpentries-incubator.github.io/workflows-snakemake/' + - authors: + - family-names: Booth + given-names: Tim + title: "Snakemake for Bioinformatics" + type: software + repository-code: 'https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/' + url: 'https://carpentries-incubator.github.io/snakemake-novice-bioinformatics' \ No newline at end of file diff --git a/config.yaml b/config.yaml index ec8758f..13b4362 100644 --- a/config.yaml +++ b/config.yaml @@ -58,22 +58,22 @@ contact: 'maintainers-hpc@lists.carpentries.org' # - another-learner.md # Order of episodes in your lesson -episodes: -- amdahl_foundation.md -- snakemake_single.md -- snakemake_multiple.md -- snakemake_cluster.md -- snakemake_profiles.md -- amdahl_snakemake.md +episodes: +- 01-introduction.md +- 02-snakemake_on_the_cluster.md +- 03-placeholders.md +- 04-snakemake_and_mpi.md +- 05-chaining_rules.md +- 06-expansion.md # Information for Learners -learners: +learners: # Information for Instructors -instructors: +instructors: # Learner Profiles -profiles: +profiles: # Customisation --------------------------------------------- # diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md new file mode 100644 index 0000000..159154e --- /dev/null +++ b/episodes/01-introduction.md @@ -0,0 +1,176 @@ +--- +title: "Running commands with Snakemake" +teaching: 30 +exercises: 30 +--- + +::: questions +- "How do I run a simple command with Snakemake?" +::: + +:::objectives +- "Create a Snakemake recipe (a Snakefile)" +::: + + +## What is the workflow I'm interested in? 
In this lesson we will run an experiment that takes an application which runs
+in parallel and investigate its scalability. To do that we will need to gather
+data; in this case that means running the application multiple times with
+different numbers of CPU cores and recording the execution time. Once we've
+done that, we need to create a visualisation of the data to see how it compares
+against the ideal case.
+
+From the visualisation we can then decide at what scale it
+makes most sense to run the application in production to maximise the use of
+our CPU allocation on the system.
+
+We could do all of this manually, but there are useful tools to help us manage
+data analysis pipelines like we have in our experiment. Today we'll learn about
+one of those: Snakemake.
+
+In order to get started with Snakemake, let's begin by taking a simple command
+and seeing how we can run that via Snakemake. Let's choose the command
+`hostname`, which prints out the name of the host where the command is executed:
+
+```bash
+[ocaisa@node1 ~]$ hostname
+```
+```output
+node1.int.jetstream2.hpc-carpentry.org
+```
+
+That prints out the result, but Snakemake relies on files to know the status of
+your workflow, so let's redirect the output to a file:
+
+```bash
+[ocaisa@node1 ~]$ hostname > hostname_login.txt
+```
+
+## Making a Snakefile
+
+Edit a new text file named `Snakefile`.
+
+Contents of `Snakefile`:
+
+```python
+rule hostname_login:
+    output: "hostname_login.txt"
+    input:
+    shell:
+        "hostname > hostname_login.txt"
+```
+
+::: callout
+
+## Key points about this file
+
+1. The file is named `Snakefile` - with a capital `S` and no file extension.
+1. Some lines are indented. Indents must be with space characters, not tabs. See
+   the setup section for how to make your text editor do this.
+1. The rule definition starts with the keyword `rule` followed by the rule name,
+   then a colon.
+1. We named the rule `hostname_login`.
You may use letters, numbers or + underscores, but the rule name must begin with a letter and may not be a + keyword. +1. The keywords `input`, `output`, `shell` are all followed by a colon. +1. The file names and the shell command are all in `"quotes"`. +1. The output filename is given before the input filename. In fact, Snakemake + doesn't care what order they appear in but we give the output first + throughout this course. We'll see why soon. +1. In this use case there is no input file for the command so we leave this + blank. + +::: + +Back in the shell we'll run our new rule. At this point, if there were any +missing quotes, bad indents, etc. we may see an error. + +```bash +$ snakemake -j1 -p hostname_login +``` + +::: callout + +## `bash: snakemake: command not found...` + +If your shell tells you that it cannot find the command `snakemake` then we need +to make the software available somehow. In our case, this means searching for +the module that we need to load: +```bash +module spider snakemake +``` + +```output +[ocaisa@node1 ~]$ module spider snakemake + +-------------------------------------------------------------------------------------------------------- + snakemake: +-------------------------------------------------------------------------------------------------------- + Versions: + snakemake/8.2.1-foss-2023a + snakemake/8.2.1 (E) + +Names marked by a trailing (E) are extensions provided by another module. + + +-------------------------------------------------------------------------------------------------------- + For detailed information about a specific "snakemake" package (including how to load the modules) use the module's full name. + Note that names that have a trailing (E) are extensions provided by other modules. 
For example:
+
+     $ module spider snakemake/8.2.1
+--------------------------------------------------------------------------------------------------------
+
+```
+
+Now we want the module, so let's load it to make the package available:
+
+```bash
+[ocaisa@node1 ~]$ module load snakemake
+```
+
+and then make sure we have the `snakemake` command available:
+
+```bash
+[ocaisa@node1 ~]$ which snakemake
+```
+```output
+/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/snakemake/8.2.1-foss-2023a/bin/snakemake
+```
+:::
+
+::: challenge
+## Running Snakemake
+
+Run `snakemake --help | less` to see the help for all available options.
+What does the `-p` option in the `snakemake` command above do?
+
+1. Protects existing output files
+1. Prints the shell commands that are being run to the terminal
+1. Tells Snakemake to only run one process at a time
+1. Prompts the user for the correct input file
+
+*Hint: you can search in the text by pressing `/`, and quit back to the shell
+with `q`*
+
+:::::: solution
+(2) Prints the shell commands that are being run to the terminal
+
+This is such a useful thing we don't know why it isn't the default! The `-j1`
+option is what tells Snakemake to only run one process at a time, and we'll
+stick with this for now as it makes things simpler. Answer 4 is a total
+red herring, as Snakemake never prompts interactively for user input.
+::::::
+:::
+
+::: keypoints
+
+- "Before running Snakemake you need to write a Snakefile"
+- "A Snakefile is a text file which defines a list of rules"
+- "Rules have inputs, outputs, and shell commands to be run"
+- "You tell Snakemake what file to make and it will run the shell command
+  defined in the appropriate rule"
+
+:::
diff --git a/episodes/02-snakemake_on_the_cluster.md b/episodes/02-snakemake_on_the_cluster.md
new file mode 100644
index 0000000..30eba35
--- /dev/null
+++ b/episodes/02-snakemake_on_the_cluster.md
@@ -0,0 +1,248 @@
+---
+title: "Running Snakemake on the cluster"
+teaching: 30
+exercises: 20
+---
+
+::: objectives
+
+- "Define rules to run locally and on the cluster"
+
+:::
+
+::: questions
+
+- "How do I run my Snakemake rule on the cluster?"
+
+:::
+
+What happens when we want to make our rule run on the cluster rather than the
+login node? The cluster we are using runs Slurm, and it happens that Snakemake
+has built-in support for Slurm; we just need to tell it that we want to use it.
+
+Snakemake uses the `executor` option to allow you to select the plugin that you
+wish to use to execute the rule. The quickest way to apply this to your
+Snakefile is to define this on the command line. Let's try it out:
+
+```bash
+[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_login
+```
+
+```output
+Building DAG of jobs...
+Retrieving input from storage.
+Nothing to be done (all requested files are present and up to date).
+```
+
+Nothing happened! Why not? When it is asked to build a target, Snakemake checks
+the 'last modification time' of both the target and its dependencies. If any
+dependency has been updated since the target, then the actions are re-run to
+update the target. Using this approach, Snakemake knows to only rebuild the
+files that, either directly or indirectly, depend on the file that changed. This
+is called an _incremental build_.
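The timestamp comparison Snakemake performs can be sketched in a few lines of
plain Python. This is only a simplified illustration of the idea, not
Snakemake's actual implementation (the real scheduler walks a whole graph of
rules):

```python
import os

def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True  # no output yet, so the rule must run
    target_mtime = os.path.getmtime(target)
    # Re-run only if a dependency was modified after the target was built
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

Applied over every rule in the workflow, a check like this is what lets
Snakemake skip work that is already up to date.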
+::: callout
+## Incremental Builds Improve Efficiency
+
+By only rebuilding files when required, Snakemake makes your processing
+more efficient.
+:::
+
+::: challenge
+## Running on the cluster
+
+We need another rule now that executes `hostname` on the _cluster_. Create
+a new rule in your Snakefile and try to execute it on the cluster by passing
+the option `--executor slurm` to `snakemake`.
+
+:::::: solution
+The rule is almost identical to the previous rule save for the rule name and
+output file:
+
+```python
+rule hostname_remote:
+    output: "hostname_remote.txt"
+    input:
+    shell:
+        "hostname > hostname_remote.txt"
+```
+You can then execute the rule with
+```bash
+[ocaisa@node1 ~]$ snakemake -j1 -p --executor slurm hostname_remote
+```
+```output
+Building DAG of jobs...
+Retrieving input from storage.
+Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
+Provided remote nodes: 1
+Job stats:
+job                count
+---------------  -------
+hostname_remote        1
+total                  1
+
+Select jobs to execute...
+Execute 1 jobs...
+
+[Mon Jan 29 18:03:46 2024]
+rule hostname_remote:
+    output: hostname_remote.txt
+    jobid: 0
+    reason: Missing output files: hostname_remote.txt
+    resources: tmpdir=
+
+hostname > hostname_remote.txt
+No SLURM account given, trying to guess.
+Guessed SLURM account: def-users
+No wall time information given. This might or might not work on your cluster. If not, specify the resource runtime in your rule or as a reasonable default via --default-resources.
+No job memory information ('mem_mb' or 'mem_mb_per_cpu') is given - submitting without. This might or might not work on your cluster.
+Job 0 has been submitted with SLURM jobid 326 (log: /home/ocaisa/.snakemake/slurm_logs/rule_hostname_remote/326.log).
+[Mon Jan 29 18:04:26 2024]
+Finished job 0.
+1 of 1 steps (100%) done
+Complete log: .snakemake/log/2024-01-29T180346.788174.snakemake.log
+```
+Note all the warnings that Snakemake is giving us about the fact that the rule
+may not be able to execute on our cluster as we may not have given enough
+information. Luckily for us, this actually works on our cluster and we can take
+a look in the output file the new rule creates, `hostname_remote.txt`:
+```bash
+[ocaisa@node1 ~]$ cat hostname_remote.txt
+```
+```output
+tmpnode1.int.jetstream2.hpc-carpentry.org
+```
+::::::
+
+:::
+
+## Snakemake profile
+
+Adapting Snakemake to a particular environment can entail many flags and
+options. Therefore, it is possible to specify a configuration profile to be used
+to obtain default options. This looks like
+```bash
+snakemake --profile myprofileFolder ...
+```
+The profile folder must contain a file called `config.yaml` which is what will
+store our options. The folder may also contain other files necessary for the
+profile. Let's create the file `cluster_profile/config.yaml` and insert some of
+our existing options:
+
+```yaml
+printshellcmds: True
+jobs: 3
+executor: slurm
+```
+
+We should now be able to rerun our workflow by pointing to the profile rather
+than listing out the options. To force our workflow to rerun, we first need to
+remove the output file `hostname_remote.txt`, and then we can try out our new
+profile:
+```bash
+[ocaisa@node1 ~]$ rm hostname_remote.txt
+[ocaisa@node1 ~]$ snakemake --profile cluster_profile hostname_remote
+```
+
+The profile is extremely useful in the context of our cluster, as the Slurm
+executor has lots of options, and sometimes you need to use them to be able to
+submit jobs to the cluster you have access to.
Unfortunately, the names of the
+options in Snakemake are not _exactly_ the same as those of Slurm, so we need
+the help of a translation table:
+
+| SLURM             | Snakemake         | Description                                                    |
+|-------------------|-------------------|----------------------------------------------------------------|
+| `--partition`     | `slurm_partition` | the partition a rule/job is to use                             |
+| `--time`          | `runtime`         | the walltime per job in minutes                                |
+| `--constraint`    | `constraint`      | may hold features on some clusters                             |
+| `--mem`           | `mem`, `mem_mb`   | memory a cluster node must provide (`mem`: string with unit, `mem_mb`: int) |
+| `--mem-per-cpu`   | `mem_mb_per_cpu`  | memory per reserved CPU                                        |
+| `--ntasks`        | `tasks`           | number of concurrent tasks / ranks                             |
+| `--cpus-per-task` | `cpus_per_task`   | number of cpus per task (in case of SMP, rather use `threads`) |
+| `--nodes`         | `nodes`           | number of nodes                                                |
+
+The warnings given by Snakemake hinted that we may need to provide these
+options. One way to do this is to provide them as part of the Snakemake rule
+using the keyword `resources`,
+e.g.,
+```python
+rule:
+    input: ...
+    output: ...
+    resources:
+        partition:
+        runtime:
+```
+and we can also use the profile to define default values for these options to
+use with our project, using the keyword `default-resources`. For example, the
+available memory on our cluster is about 4GB per core, so we can add that to our
+profile:
+```yaml
+printshellcmds: True
+jobs: 3
+executor: slurm
+default-resources:
+  - mem_mb_per_cpu=3600
+```
+
+:::challenge
+We know that our problem runs in a very short time. Change the default length of
+our jobs to two minutes for Slurm.
+
+::::::solution
+
+```yaml
+printshellcmds: True
+jobs: 3
+executor: slurm
+default-resources:
+  - mem_mb_per_cpu=3600
+  - runtime=2
+```
+::::::
+
+:::
+
+There are various `sbatch` options not directly supported by the resource
+definitions in the table above.
You may use the `slurm_extra` resource to
+specify any of these additional flags to `sbatch`:
+
+```python
+rule myrule:
+    input: ...
+    output: ...
+    resources:
+        slurm_extra="--mail-type=ALL --mail-user="
+```
+
+## Local rule execution
+
+Our initial rule was to
+get the hostname of the login node. We always want to run that rule on the login
+node for that to make sense. If we tell Snakemake to run all rules via the
+Slurm executor (which is what we are doing via our new profile) this
+won't happen any more. So how do we force the rule to run on
+the login node?
+
+Well, in the case where a Snakemake rule performs a trivial task, job submission
+might be overkill (e.g., less than 1 minute's worth of compute time). As in
+our case, it would be a better
+idea to have these rules execute locally (i.e. where the `snakemake` command is
+run) instead of as a job. Snakemake lets you indicate which rules should always
+run locally with the `localrules` keyword. Let's define `hostname_login` as a
+local rule near the top of our `Snakefile`.
+
+```python
+localrules: hostname_login
+```
+
+::: keypoints
+
+- "Snakemake generates and submits its own batch scripts for your scheduler."
+- "You can store default configuration settings in a Snakemake profile"
+- "`localrules` defines rules that are executed locally, and never submitted to a cluster."
+
+:::
diff --git a/episodes/03-placeholders.md b/episodes/03-placeholders.md
new file mode 100644
index 0000000..8e93283
--- /dev/null
+++ b/episodes/03-placeholders.md
@@ -0,0 +1,79 @@
+---
+title: "Placeholders"
+teaching: 40
+exercises: 30
+---
+
+::: questions
+- "How do I make a generic rule?"
+:::
+
+::: objectives
+- "See how Snakemake deals with some errors"
+:::
+
+Our Snakefile has some duplication. For example, the names of text
+files are repeated in places throughout the Snakefile rules. Snakefiles are
+a form of code and, in any code, repetition can lead to problems (e.g.
we rename +a data file in one part of the Snakefile but forget to rename it elsewhere). + +::: callout +## D.R.Y. (Don't Repeat Yourself) + +In many programming languages, the bulk of the language features are +there to allow the programmer to describe long-winded computational +routines as short, expressive, beautiful code. Features in Python, +R, or Java, such as user-defined variables and functions are useful in +part because they mean we don't have to write out (or think about) +all of the details over and over again. This good habit of writing +things out only once is known as the "Don't Repeat Yourself" +principle or D.R.Y. +::: + +Let us set about removing some of the repetition from our Snakefile. + +## Placeholders + +To make a more general-purpose rule we need **placeholders**. Let's take a look +at what a placeholder looks like + +```python +rule hostname_remote: + output: "hostname_remote.txt" + input: + shell: + "hostname > {output}" + +``` + +As a reminder, here's the previous version from the last episode: + +```python +rule hostname_remote: + output: "hostname_remote.txt" + input: + shell: + "hostname > hostname_remote.txt" + +``` + +The new rule has replaced explicit file names with things in `{curly brackets}`, +specifically `{output}` (but it could also have been `{input}`...if that had +a value and were useful). + + +### `{input}` and `{output}` are **placeholders** + +Placeholders are used in the `shell` section of a rule, and Snakemake will +replace them with appropriate values - `{input}` with the full name of the input +file, and +`{output}` with the full name of the output file -- before running the command. + +`{resources}` is also a placeholder, and we can access a named element of the +`{resources}` with the notation `{resources.runtime}` (for example). 
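If it helps to have a mental model, placeholder substitution works much like
Python's own `str.format`. The snippet below is only an analogy (Snakemake does
the substitution for you; you never call `format` yourself in a Snakefile), but
it shows both a plain and a dotted placeholder being filled in:

```python
from types import SimpleNamespace

# Stand-ins for the values Snakemake would supply for one particular job
output = "hostname_remote.txt"
resources = SimpleNamespace(mpi="mpiexec", runtime=2)

# A plain placeholder, as in the shell section of a rule
command = "hostname > {output}".format(output=output)

# A dotted placeholder, like {resources.runtime} in a rule
note = "runtime limit: {resources.runtime} minutes".format(resources=resources)

print(command)  # hostname > hostname_remote.txt
print(note)     # runtime limit: 2 minutes
```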
+:::keypoints
+- "Snakemake rules are made more generic with placeholders"
+- "Placeholders in the shell part of the rule are replaced with values based on the chosen
+  wildcards"
+:::
diff --git a/episodes/04-snakemake_and_mpi.md b/episodes/04-snakemake_and_mpi.md
new file mode 100644
index 0000000..0e7b41a
--- /dev/null
+++ b/episodes/04-snakemake_and_mpi.md
@@ -0,0 +1,442 @@
+---
+title: "MPI applications and Snakemake"
+teaching: 30
+exercises: 20
+---
+
+::: objectives
+
+- "Define rules to run an MPI application on the cluster"
+
+:::
+
+::: questions
+
+- "How do I run an MPI application via Snakemake on the cluster?"
+
+:::
+
+Now it's time to start getting back to our real workflow. We can execute a
+command on the cluster, but what about executing the MPI application we are
+interested in? Our application is called `amdahl` and is available as an
+environment module.
+
+::: challenge
+
+Locate and load the `amdahl` module and then _replace_ our `hostname_remote`
+rule with a version that runs `amdahl`. (Don't worry about parallel MPI just
+yet, run it with a single CPU, `mpiexec -n 1 amdahl`).
+
+Does your rule execute correctly? If not, look through the log files to find
+out why.
+
+::::::solution
+
+```bash
+module spider amdahl
+module load amdahl
+```
+will locate and then load the `amdahl` module. We can then update/replace our
+rule to run the `amdahl` application:
+```python
+rule amdahl_run:
+    output: "amdahl_run.txt"
+    input:
+    shell:
+        "mpiexec -n 1 amdahl > {output}"
+```
+However, when we try to execute the rule we get an error (unless you already
+have a different version of `amdahl` available in your path). Snakemake
+reports the
+location of the logs and if we look inside we can (eventually) find
+```output
+...
+mpiexec -n 1 amdahl > amdahl_run.txt
+--------------------------------------------------------------------------
+mpiexec was unable to find the specified executable file, and therefore
+did not launch the job.
This error was first reported for process
+rank 0; it may have occurred for other processes as well.
+
+NOTE: A common cause for this error is misspelling a mpiexec command
+      line parameter option (remember that mpiexec interprets the first
+      unrecognized command line token as the executable).
+
+Node:       tmpnode1
+Executable: amdahl
+--------------------------------------------------------------------------
+...
+```
+So, even though we loaded the module before running the workflow, our
+Snakemake rule didn't find the executable. That's because the Snakemake rule
+is running in a clean _runtime environment_, and we need to somehow tell it to
+load the necessary environment module before trying to execute the rule.
+
+::::::
+:::
+
+## Snakemake and environment modules
+
+Our application is called `amdahl` and is available on the system via an
+environment module, so we need to
+tell Snakemake to load the module before it tries to execute the rule. Snakemake
+is aware of environment modules, and these can be specified via (yet another)
+option:
+```python
+rule amdahl_run:
+    output: "amdahl_run.txt"
+    input:
+    envmodules:
+        "mpi4py",
+        "amdahl"
+    shell:
+        "mpiexec -n 1 amdahl > {output}"
+```
+
+Adding these lines is not enough to make the rule execute, however. Not only do
+you have to tell Snakemake what modules to load, but you also have to tell it to
+use environment modules in general (since the use of environment modules is
+considered to make your runtime environment less reproducible as the available
+modules may differ from cluster to cluster).
This requires you to give Snakemake
+an additional option
+```bash
+snakemake --profile cluster_profile --use-envmodules amdahl_run
+```
+
+::: challenge
+
+We'll be using environment modules throughout the rest of the tutorial, so make
+that a default option of our profile (by setting its value to `True`)
+
+::::::solution
+
+Update our cluster profile to
+```yaml
+printshellcmds: True
+jobs: 3
+executor: slurm
+default-resources:
+  - mem_mb_per_cpu=3600
+  - runtime=2
+use-envmodules: True
+```
+If you want to test it, you need to erase the output file of the rule and rerun
+Snakemake.
+
+::::::
+
+:::
+
+## Snakemake and MPI
+
+We didn't really run an MPI application in the last section as we only ran on
+one core. How do we request to run on multiple cores for a single rule?
+
+Snakemake has general support for MPI, but the only executor that currently
+explicitly supports MPI is the Slurm executor (lucky for us!). If we look back
+at our Slurm to Snakemake translation table we notice the relevant options
+appear near the bottom:
+
+| SLURM             | Snakemake         | Description                                                    |
+|-------------------|-------------------|----------------------------------------------------------------|
+| ...               | ...               | ...                                                            |
+| `--ntasks`        | `tasks`           | number of concurrent tasks / ranks                             |
+| `--cpus-per-task` | `cpus_per_task`   | number of cpus per task (in case of SMP, rather use `threads`) |
+| `--nodes`         | `nodes`           | number of nodes                                                |
+
+The one we are interested in is `tasks`, as we are only going to increase the
+number of ranks. We can define these in a `resources` section of our rule and
+refer to them using placeholders:
+```python
+rule amdahl_run:
+    output: "amdahl_run.txt"
+    input:
+    envmodules:
+        "amdahl"
+    resources:
+        mpi='mpiexec',
+        tasks=2
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
+```
+
+That worked but now we have a bit of an issue.
We want to do this for a few
+different values of `tasks`, which would mean we would need a different output
+file for every run. It would be great if we could somehow indicate in the
+`output` the value that we want to use for `tasks`...and have Snakemake pick
+that up.
+
+We could use a _wildcard_ in the `output` to allow us to
+define the `tasks` we wish to use. The syntax for such a wildcard looks like
+```python
+output: "amdahl_run_{parallel_tasks}.txt"
+```
+where `parallel_tasks` is our wildcard.
+
+::: callout
+## Wildcards
+
+Wildcards are used in the `input` and `output` lines of the rule to represent
+parts of filenames.
+Much like the `*` pattern in the shell, the wildcard can stand in for any text
+in order to make up
+the desired filename. As with naming your rules, you may choose any name you
+like for your
+wildcards, so here we used `parallel_tasks`. Using the same wildcards in the
+input and output is what tells Snakemake how to match input files to output
+files.
+
+If two rules use a wildcard with the same name then Snakemake will treat them as
+different entities - rules in Snakemake are self-contained in this way.
+
+In the `shell` line you can reference the wildcard with
+`{wildcards.parallel_tasks}`
+:::
+
+## Snakemake order of operations
+
+We're only just getting started with some simple rules, but it's worth thinking
+about exactly what Snakemake is doing when you run it. There are three distinct
+phases:
+
+1. Prepares to run:
+    1. Reads in all the rule definitions from the Snakefile
+1. Plans what to do:
+    1. Sees what file(s) you are asking it to make
+    1. Looks for a matching rule by looking at the `output`s of all the rules it knows
+    1. Fills in the wildcards to work out the `input` for this rule
+    1. Checks that this input file (if required) is actually available
+1. Runs the steps:
+    1. Creates the directory for the output file, if needed
+    1. Removes the old output file if it is already there
+    1.
Only then, runs the shell command with the placeholders replaced
+    1. Checks that the command ran without errors *and* made the new output file
+       as expected
+
+::: callout
+## Dry-run (`-n`) mode
+
+It's often useful to run just the first two phases, so that Snakemake will plan
+out the jobs to run, and print them to the screen, but never actually run them.
+This is done with the `-n` flag, e.g.:
+
+```bash
+$ snakemake -n ...
+```
+:::
+
+The amount of checking may seem pedantic right now, but as the workflow gains
+more steps this will become very useful to us indeed.
+
+## Using wildcards in our rule
+
+We would like to use a wildcard in the `output` to allow us to
+define the number of `tasks` we wish to use. Based on what we've seen so far,
+you might imagine this could look like
+```python
+rule amdahl_run:
+    output: "amdahl_run_{parallel_tasks}.txt"
+    input:
+    envmodules:
+        "amdahl"
+    resources:
+        mpi="mpiexec",
+        tasks="{parallel_tasks}"
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
+```
+but there are two problems with this:
+
+* The only way for Snakemake to know the value of the wildcard is for the user
+  to explicitly request a concrete output file (rather than call the rule):
+  ```bash
+  snakemake --profile cluster_profile amdahl_run_2.txt
+  ```
+  This is perfectly valid, as Snakemake can figure out that it has a rule that
+  can match that filename.
+* The bigger problem is that even doing that does not work; it seems we cannot
+  use a wildcard for `tasks`:
+  ```output
+  WorkflowError:
+  SLURM job submission failed. The error message was sbatch: error: Invalid numeric value "{parallel_tasks}" for --ntasks.
+  ```
+
+Unfortunately for us, there is no direct way for us to access the wildcards
+for `tasks`. The
+reason for this is that Snakemake tries to use the value of `tasks` during its
+initialisation stage, which is before we know the value of the wildcard. We need
+to defer the determination of `tasks` to later on. This can be achieved by
This can be achieved by +specifying an input function instead of a value for this +scenario. The solution then is to write a one-time use function to manipulate +Snakemake into doing this for us. Since the function is specifically for the +rule, we can use a one-line function without a name. These kinds of functions +are called either anonymous functions or lamdba functions (both mean the same +thing), and are a feature of Python (and other programming languages). + +To define a lambda function in python, the general syntax is as follows: +```python +lambda x: x + 54 +``` +Since our function _can_ take the wildcards as arguments, we can use that to set +the value for `tasks`: +```python +rule amdahl_run: + output: "amdahl_run_{parallel_tasks}.txt" + input: + envmodules: + "amdahl" + resources: + mpi="mpiexec", + # No direct way to access the wildcard in tasks, so we need to do this + # indirectly by declaring a short function that takes the wildcards as an + # argument + tasks=lambda wildcards: int(wildcards.parallel_tasks) + input: + shell: + "{resources.mpi} -n {resources.tasks} amdahl > {output}" +``` + +Now we have a rule that can be used to generate output from runs of an +arbitrary number of parallel tasks. + +::: callout + +## Comments in Snakefiles + +In the above code, the line beginning `#` is a comment line. Hopefully you are already in the +habit of adding comments to your own scripts. Good comments make any script more readable, and +this is just as true with Snakefiles. + +::: + +Since our rule is now capable of generating an arbitrary number of output files +things could get very crowded in our current directory. It's probably best then +to put the runs into a separate folder to keep things tidy. 
We can add the
+folder directly to our `output` and Snakemake will take care of directory
+creation for us:
+
+```python
+rule amdahl_run:
+    output: "runs/amdahl_run_{parallel_tasks}.txt"
+    input:
+    envmodules:
+        "amdahl"
+    resources:
+        mpi="mpiexec",
+        # No direct way to access the wildcard in tasks, so we need to do this
+        # indirectly by declaring a short function that takes the wildcards as an
+        # argument
+        tasks=lambda wildcards: int(wildcards.parallel_tasks)
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl > {output}"
+```
+
+::: challenge
+
+Create an output file (under the `runs` folder) for the case where we have 6
+parallel tasks
+
+(HINT: Remember that Snakemake needs to be able to match the requested file to
+the `output` from a rule)
+
+:::::: solution
+
+```bash
+snakemake --profile cluster_profile runs/amdahl_run_6.txt
+```
+
+::::::
+
+:::
+
+Another thing about our application `amdahl` is that we ultimately want to
+process the output to generate our scaling plot. The output right now is useful
+for reading but makes processing harder. `amdahl` has an option that actually
+makes this easier for us. To see the `amdahl` options we can use
+```bash
+[ocaisa@node1 ~]$ module load amdahl
+[ocaisa@node1 ~]$ amdahl --help
+```
+```output
+usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e]
+
+options:
+  -h, --help            show this help message and exit
+  -p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
+                        Parallel proportion should be a float between 0 and 1
+  -w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
+                        Total seconds of workload, should be an integer greater than 0
+  -t, --terse           Enable terse output
+  -e, --exact           Disable random jitter
+```
+The option we are looking for is `--terse`, and that will make `amdahl` print
+output in a format that is much easier to process, JSON.
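To see why JSON output is easier to process, here is how one run's results
could later be read back in Python. Note that the field names in this sample
are made up for illustration; inspect a real output file to see what `amdahl`
actually emits:

```python
import json

# Hypothetical contents of one terse amdahl run; the real
# field names may differ from these illustrative ones.
raw = '{"nproc": 2, "parallel_proportion": 0.8, "execution_time": 12.3}'

result = json.loads(raw)  # parse the JSON text into a Python dict
print(result["nproc"], result["execution_time"])  # 2 12.3
```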
Files in JSON format
+typically use the file extension `.json`, so let's add that option to our
+`shell` command _and_ change the file format of the `output` to match our new
+command:
+
+```python
+rule amdahl_run:
+    output: "runs/amdahl_run_{parallel_tasks}.json"
+    input:
+    envmodules:
+        "amdahl"
+    resources:
+        mpi="mpiexec",
+        # No direct way to access the wildcard in tasks, so we need to do this
+        # indirectly by declaring a short function that takes the wildcards as an
+        # argument
+        tasks=lambda wildcards: int(wildcards.parallel_tasks)
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl --terse > {output}"
+```
+
+There was another parameter for `amdahl` that caught my eye. `amdahl` has an
+option `--parallel-proportion` (or `-p`) which we might be interested in
+changing as it changes the behaviour of the code, and therefore has an impact on
+the values we get in our results. Let's add
+another directory layer to our output format to reflect a particular choice for
+this value. We can use a wildcard so we don't have to choose the value right
+away:
+
+```python
+rule amdahl_run:
+    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
+    input:
+    envmodules:
+        "amdahl"
+    resources:
+        mpi="mpiexec",
+        # No direct way to access the wildcard in tasks, so we need to do this
+        # indirectly by declaring a short function that takes the wildcards as an
+        # argument
+        tasks=lambda wildcards: int(wildcards.parallel_tasks)
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
+```
+
+::: challenge
+
+Create an output file for a value of `-p` of 0.999 (the default value is 0.8)
+for the case where we have 6 parallel tasks.
+
+:::::: solution
+
+```bash
+snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json
+```
+
+::::::
+
+:::
+
+
+::: keypoints
+
+- "Snakemake chooses the appropriate rule by replacing wildcards such that the
+  output matches the target"
+- "Snakemake checks for various error conditions and will stop if it sees a
+  problem"
+
+:::
diff --git a/episodes/05-chaining_rules.md b/episodes/05-chaining_rules.md
new file mode 100644
index 0000000..b8cdbfb
--- /dev/null
+++ b/episodes/05-chaining_rules.md
@@ -0,0 +1,191 @@
+---
+title: "Chaining rules"
+teaching: 40
+exercises: 30
+---
+
+::: questions
+- "How do I combine rules into a workflow?"
+- "How do I make a rule with multiple inputs and outputs?"
+:::
+
+::: objectives
+- ""
+:::
+
+## A pipeline of multiple rules
+
+We now have a rule that can generate output for any value of `-p` and any number
+of tasks; we just need to call Snakemake with the parameters that we want:
+```bash
+snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json
+```
+
+That's not exactly convenient though: to generate a full dataset we have to run
+Snakemake lots of times with different output file targets. Instead, let's
+create a rule that can generate those files for us.
+
+Chaining rules in Snakemake is a matter of choosing filename patterns that
+connect the rules.
+There's something of an art to it - most times there are several options that
+will work:
+
+```python
+rule generate_run_files:
+    output: "p_{parallel_proportion}_runs.txt"
+    input: "p_{parallel_proportion}/runs/amdahl_run_6.json"
+    shell:
+        "echo {input} done > {output}"
+```
+
+::: challenge
+
+The new rule does no real work; it just makes sure we create the file we want.
+It's not worth executing on the cluster. How do we ensure it runs on the
+login node only?
+
+:::::: solution
+
+We need to add the new rule to our `localrules`:
+```python
+localrules: hostname_login, generate_run_files
+```
+
+::::::
+
+:::
+
+Now let's run the new rule (remember we need to request the output file by name
+as the `output` in our rule contains a wildcard pattern):
+```bash
+[ocaisa@node1 ~]$ snakemake --profile cluster_profile/ p_0.999_runs.txt
+```
+```output
+Using profile cluster_profile/ for setting default command line arguments.
+Building DAG of jobs...
+Retrieving input from storage.
+Using shell: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
+Provided remote nodes: 3
+Job stats:
+job                   count
+------------------  -------
+amdahl_run                1
+generate_run_files        1
+total                     2
+
+Select jobs to execute...
+Execute 1 jobs...
+
+[Tue Jan 30 17:39:29 2024]
+rule amdahl_run:
+    output: p_0.999/runs/amdahl_run_6.json
+    jobid: 1
+    reason: Missing output files: p_0.999/runs/amdahl_run_6.json
+    wildcards: parallel_proportion=0.999, parallel_tasks=6
+    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, mem_mb_per_cpu=3600, runtime=2, mpi=mpiexec, tasks=6
+
+mpiexec -n 6 amdahl --terse -p 0.999 > p_0.999/runs/amdahl_run_6.json
+No SLURM account given, trying to guess.
+Guessed SLURM account: def-users
+Job 1 has been submitted with SLURM jobid 342 (log: /home/ocaisa/.snakemake/slurm_logs/rule_amdahl_run/342.log).
+[Tue Jan 30 17:47:31 2024]
+Finished job 1.
+1 of 2 steps (50%) done
+Select jobs to execute...
+Execute 1 jobs...
+ +[Tue Jan 30 17:47:31 2024] +localrule generate_run_files: + input: p_0.999/runs/amdahl_run_6.json + output: p_0.999_runs.txt + jobid: 0 + reason: Missing output files: p_0.999_runs.txt; Input files updated by another job: p_0.999/runs/amdahl_run_6.json + wildcards: parallel_proportion=0.999 + resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, mem_mb_per_cpu=3600, runtime=2 + +echo p_0.999/runs/amdahl_run_6.json done > p_0.999_runs.txt +[Tue Jan 30 17:47:31 2024] +Finished job 0. +2 of 2 steps (100%) done +Complete log: .snakemake/log/2024-01-30T173929.781106.snakemake.log +``` + +Look at the logging messages that Snakemake prints in the terminal. What has happened here? + +1. Snakemake looks for a rule to make `p_0.999_runs.txt` +1. It determines that "generate_run_files" can make this if + `parallel_proportion=0.999` +1. It sees that the input needed is therefore `p_0.999/runs/amdahl_run_6.json` +

+1. Snakemake looks for a rule to make `p_0.999/runs/amdahl_run_6.json` +1. It determines that "amdahl_run" can make this if `parallel_proportion=0.999` + and `parallel_tasks=6` +

+1. Now that Snakemake has reached an available input file (in this case, no
+   input file is actually required), it runs both steps to get the final output
+
+This, in a nutshell, is how we build workflows in Snakemake.
+
+1. Define rules for all the processing steps
+1. Choose `input` and `output` naming patterns that allow Snakemake to link the
+   rules
+1. Tell Snakemake to generate the final output file(s)
+
+If you are used to writing regular scripts, this takes a little
+getting used to. Rather than listing steps in order of execution, you are always
+**working backwards** from the final desired result. The order of operations is
+determined by applying the pattern matching rules to the filenames, not by the
+order of the rules in the Snakefile.
+
+::: callout
+
+## Outputs first?
+
+The Snakemake approach of working backwards from the desired output to determine
+the workflow is why we're putting the `output` lines first in all our rules - to
+remind us that these are what Snakemake looks at first!
+
+Many users of Snakemake, and indeed the official documentation, prefer to have
+the `input` first, so in practice you should use whatever order makes sense to
+you.
+
+:::
+
+::: callout
+
+## `log` outputs in Snakemake
+
+Snakemake has a dedicated rule field for outputs that are
+[log files](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files),
+and these are mostly treated as regular outputs except that log files are not
+removed if the job produces an error. This means you can look at the log to help
+diagnose the error. In a real workflow this can be very useful, but in terms of
+learning the fundamentals of Snakemake we'll stick with regular `input` and
+`output` fields here.
+
+:::
+
+::: callout
+
+## Errors are normal
+
+Don't be disheartened if you see errors when first testing
+your new Snakemake pipelines.
There is a lot that can go wrong when writing a
+new workflow, and you'll normally need several iterations to get things just
+right. One advantage of the Snakemake approach compared to regular scripts is
+that Snakemake fails fast when there is a problem, rather than ploughing on
+and potentially running junk calculations on partial or corrupted data. Another
+advantage is that when a step fails we can safely resume from where we left off.
+
+:::
+
+
+
+::: keypoints
+- "Snakemake links rules by iteratively looking for rules that make missing
+  inputs"
+- "Rules may have multiple named inputs and/or outputs"
+- "If a shell command does not yield an expected output then Snakemake will
+  regard that as a failure"
+:::
+
diff --git a/episodes/06-expansion.md b/episodes/06-expansion.md
new file mode 100644
index 0000000..e332dbe
--- /dev/null
+++ b/episodes/06-expansion.md
@@ -0,0 +1,194 @@
+---
+title: "Processing lists of inputs"
+teaching: 50
+exercises: 30
+---
+
+::: questions
+- "How do I process multiple files at once?"
+- "How do I combine multiple files together?"
+:::
+
+::: objectives
+- "Use Snakemake to process all our samples at once"
+- "Make a scalability plot that brings our results together"
+:::
+
+We created a rule that can generate a single output file, but we're not going to
+create multiple rules for every output file. We want to generate all of the run
+files with a single rule if we can, and Snakemake can indeed take a list of
+input files:
+
+```python
+rule generate_run_files:
+    output: "p_{parallel_proportion}_runs.txt"
+    input: "p_{parallel_proportion}/runs/amdahl_run_2.json", "p_{parallel_proportion}/runs/amdahl_run_6.json"
+    shell:
+        "echo {input} done > {output}"
+```
+
+That's great, but we don't want to have to list all of the files we're
+interested in individually. How can we do this?
+
+## Defining a list of samples to process
+
+To do this, we can define some lists as Snakemake **global variables**.
+
+Global variables should be added before the rules in the Snakefile.
+
+```python
+# Task sizes we wish to run
+NTASK_SIZES = [1, 2, 3, 4, 5]
+```
+
+* Unlike with variables in shell scripts, we can put spaces around the `=` sign,
+  but they are not mandatory.
+* The list of values is enclosed in square brackets and
+  comma-separated. If you know any Python you'll recognise this as Python list
+  syntax.
+* A good convention is to use capitalized names for these variables, but this is
+  not mandatory.
+* Although these are referred to as variables, you can't actually change the
+  values once the workflow is running, so lists defined this way are more like
+  constants.
+
+## Using a Snakemake rule to define a batch of outputs
+
+Now let's update our Snakefile to leverage the new global variable and create a
+list of files:
+```python
+rule generate_run_files:
+    output: "p_{parallel_proportion}_runs.txt"
+    input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
+    shell:
+        "echo {input} done > {output}"
+```
+
+The `expand(...)` function in this rule generates a list of filenames, by taking
+the first thing in its parentheses as a template and replacing `{count}`
+with all the `NTASK_SIZES`. Since there are 5 elements in the list, this will
+yield 5 files we want to make. Note that we had to protect our wildcard in a
+second set of braces so it wouldn't be interpreted as something that needed
+to be expanded.
+
+In our current case we still rely on the file name to define the value of the
+wildcard `parallel_proportion`, so we can't call the rule directly; we still
+need to request a specific file:
+
+```bash
+snakemake --profile cluster_profile/ p_0.999_runs.txt
+```
+
+If you don't specify a target rule name or any file names on the command line
+when running Snakemake, the default is to use **the first rule** in the
+Snakefile as the target.
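The brace doubling used by `expand()` behaves just like the escaping in Python's own `str.format`, which gives us an easy way to see what list of filenames is produced. This plain-Python sketch only mimics the substitution (the real `expand()` is provided by Snakemake):

```python
# Mimic expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json",
#              count=NTASK_SIZES)
# with str.format. The doubled braces collapse to single braces, leaving the
# {parallel_proportion} wildcard intact for Snakemake to fill in later.
NTASK_SIZES = [1, 2, 3, 4, 5]

template = "p_{{parallel_proportion}}/runs/amdahl_run_{count}.json"
filenames = [template.format(count=n) for n in NTASK_SIZES]

print(filenames[0])  # p_{parallel_proportion}/runs/amdahl_run_1.json
print(len(filenames))  # 5
```

Running this shows one entry per task size, each still carrying the literal `{parallel_proportion}` wildcard.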
+
+::: callout
+## Rules as targets
+
+Giving the name of a rule to Snakemake on the command line only works when that
+rule has *no wildcards* in the outputs, because Snakemake has no way to know
+what the desired wildcards might be. You will see the error "Target rules may
+not contain wildcards." This can also happen when you don't supply any explicit
+targets on the command line at all, and Snakemake tries to run the first rule
+defined in the Snakefile.
+
+:::
+
+## Rules that combine multiple inputs
+
+Our `generate_run_files` rule is a rule which takes a list of input files. The
+length of that list is not fixed by the rule, but can change based on
+`NTASK_SIZES`.
+
+In our workflow the final step is to take all the generated files and combine
+them into a plot. To do that, we can use a Python
+library called `matplotlib`. It's beyond the scope of this tutorial to write
+the Python script to create a final plot, so we provide you with the script as
+part of this lesson. You can download it with:
+```bash
+curl -O https://ocaisa.github.io/hpc-workflows/files/plot_terse_amdahl_results.py
+```
+
+The script `plot_terse_amdahl_results.py` needs a command line that looks like:
+```bash
+python plot_terse_amdahl_results.py <output image> <1st input file> <2nd input file> ...
+```
+Let's introduce that into our `generate_run_files` rule:
+
+```python
+rule generate_run_files:
+    output: "p_{parallel_proportion}_runs.txt"
+    input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
+    shell:
+        "python plot_terse_amdahl_results.py {output} {input}"
+```
+
+::: challenge
+
+This script relies on `matplotlib`. Is it available as an environment module?
+Add this requirement to our rule.
+ +:::::: solution + +```python +rule generate_run_files: + output: "p_{parallel_proportion}_scalability.jpg" + input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES) + envmodules: + "matplotlib" + shell: + "python plot_terse_amdahl_results.py {output} {input}" +``` + +:::::: + +::: + +Now we finally get to generate a scaling plot! Run the final Snakemake command +```bash +snakemake --profile cluster_profile/ p_0.999_scalability.jpg +``` + +::: challenge + +Generate the scalability plot for all values from 1 to 10 cores. + +:::::: solution + +```python +NTASK_SIZES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] +``` + +:::::: + +::: + +::: challenge + +Rerun the workflow for a `p` value of 0.8 + +:::::: solution + +```bash +snakemake --profile cluster_profile/ p_0.8_scalability.jpg +``` + +:::::: + +::: + +::: challenge +## Bonus round + +Create a final rule that can be called directly and generates a scaling plot for +3 different values of `p`. + +::: + +::: keypoints +- "Use the `expand()` function to generate lists of filenames you want to combine" +- "Any `{input}` to a rule can be a variable-length list" +::: + diff --git a/episodes/amdahl_foundation.md b/episodes/amdahl_foundation.md deleted file mode 100644 index fb0a532..0000000 --- a/episodes/amdahl_foundation.md +++ /dev/null @@ -1,126 +0,0 @@ ---- -title: "Running a Parallel Application on the Cluster" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- What output does the Amdahl code generate? -- Why does parallelizing the amdahl code make it faster? 
- -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Run the amdahl parallel code on the cluster -- Note what output is generated, and where it goes -- Predict the trend of execution time vs parallelism - -:::::::::::::::::::::::::::::::::::::::: - -## Introduction - -A high-performance computing cluster offers powerful -computational resources to its users, but taking advantage -of these resources is not always straightforward. The -cluster system does not work in the same way as systems -you may be more familiar with. - -The software we will use in this lesson is a model of -the kind of parallel task that is well-adapted to -high-performance computing resources. It's called "amdahl", -named for Eugene Amdahl, a famous computer scientist who -coined "Amdahl's Law", which is about the advantages and -limitations of parallelism in code execution. - -:::::::::::::::::::::::::::::::: callout - -[Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law) is -a statement about how much benefit you can expect to get by -parallelizing a computer program. - -The limitation arises from the fact that, in any application, -there is some fraction of the work to be done which is inherently -serial, and some fraction which is amenable to parallelization. -The law is a quantitative expression of the fact that, by -parallelizing the code, you can only ever make the parallel -part faster, you cannot reduce the execution time of the -serial part. - -As a practical matter, this means that developer effort spent -on parallelization has diminishing returns on the overall -reduction in execution time. - -:::::::::::::::::::::::::::::::::::::::: - -## The Amdahl Code - -Download it and install it, via pip. -Note that `amdahl` depends on MPI, -so make sure that's also available. 
- -On the HPC Carpentry cluster: - -``` shell -[user@login1 ~]$ module load OpenMPI -[user@login1 ~]$ module load Python -[user@login1 ~]$ pip install amdahl -``` - -## Running It on the Cluster - -Use the `sacct` command to see the run-time. -The run-time is also recorded in the output itself. - -``` shell -[user@login1 ~]$ nano amdahl_1.sh -``` - -``` bash -#!/bin/bash -#SBATCH -t 00:01 # max 1 minute -#SBATCH -p smnodes # max 4 cores -#SBATCH -n 1 # use 1 core -#SBATCH -o amdahl-np1.out # record result - -module load OpenMPI -module load Python - -mpirun amdahl -``` - -``` shell -[user@login1 ~]$ sbatch amdahl_1.sh -``` - -:::::::::::::::::::::::::::::: challenge - -Run the amdhal code with a few (small!) levels -of parallelism. Make a quantitative estimate of -how much faster the code will run with 3 processors -than 2. The naive estimate would be that it would run -1.5× the speed, or equivalently, that it would -complete in 2/3 the time. - -:::::::::::::::: solution - -``` shell -[user@login1 ~]$ sbatch amdahl_1.sh # serial job ~ 25 sec -[user@login1 ~]$ sbatch amdahl_2.sh # 2-way parallel ~ 20 sec -[user@login1 ~]$ sbatch amdahl_3.sh # 3-way parallel ~ 16 sec -``` - -The amdahl code runs faster with 3 processors than with -2, but the speed-up is less than 1.5×. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- The amdahl code is a model of a parallel application -- The execution speed depends on the degree of parallelism - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/amdahl_snakemake.md b/episodes/amdahl_snakemake.md deleted file mode 100644 index 4686339..0000000 --- a/episodes/amdahl_snakemake.md +++ /dev/null @@ -1,61 +0,0 @@ ---- -title: "Amdahl Parallel Runs" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- How can we collect data on Amdahl run times? 
- -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Collect systematic data on the runtime of the amdahl code - -:::::::::::::::::::::::::::::::::::::::: - -## Systematic Data Collection - -Using what we have learned so far, including Snakemake -profiles and rules, we will now compose a Snakefile -that runs the Amdahl example code over a range of -parallel widths. This workflow will generate the -data we will use in the next module to demonstrate -the diminishing returns of increasing parallelism. - -## Write a File - -Compose the Snakemake file that does what we want. - -We can put the widths in a list and iterate over -them. We will use the profile generated previously -to ensure that the jobs run on the cluster. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -Our example has a single paramter, the parallelism, -that we vary. How would you generalize this to arbitrary -parameters? - -:::::::::::::::: solution - -Arbitrary parameters are still finite, so you could -just generate a flat list of all the combinations, and iterate -over that. Or you could generate two lists and do a nested -loop. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- A relatively compact snakemake file collects interesting data. 
- -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/files/Snakefile_amdahl_cluster b/episodes/files/Snakefile_amdahl_cluster deleted file mode 100644 index eca4d3e..0000000 --- a/episodes/files/Snakefile_amdahl_cluster +++ /dev/null @@ -1,8 +0,0 @@ -rule one: - input: - output: 'amdahl_cluster.txt' - resources: - mpi="mpirun", - tasks=3 - shell: - "module load OpenMPI; mpirun -np {resources.tasks} amdahl > amdahl_cluster.txt" diff --git a/episodes/files/Snakefile_cluster b/episodes/files/Snakefile_cluster deleted file mode 100644 index ac60d86..0000000 --- a/episodes/files/Snakefile_cluster +++ /dev/null @@ -1,4 +0,0 @@ -rule: - input: - output: 'host.txt' - shell: 'hostname > host.txt' diff --git a/episodes/files/Snakefile_cluster_iteration b/episodes/files/Snakefile_cluster_iteration deleted file mode 100644 index 41a94a2..0000000 --- a/episodes/files/Snakefile_cluster_iteration +++ /dev/null @@ -1,24 +0,0 @@ -# -# Run a bunch of Amdahl jobs and aggregate the output. -# -WIDTHS=[1,2] -# -def getwidth(wildcards): - return wildcards.sample - -rule plot: - input: expand('{size}.out',size=WIDTHS) - output: 'done.out' - resources: - mpi="mpirun", - tasks=1 - shell: 'echo "{WIDTHS}, Done!" > done.out' -rule iterate: - input: - output: '{sample}.out' - resources: - mpi="mpirun", - tasks=getwidth - shell: - "module load OpenMPI; mpirun -np {resources.tasks} amdahl > {wildcards.sample}.out" - diff --git a/episodes/files/Snakefile_hello b/episodes/files/Snakefile_hello deleted file mode 100644 index 0b94a00..0000000 --- a/episodes/files/Snakefile_hello +++ /dev/null @@ -1,4 +0,0 @@ -rule: - input: - output: 'hello.txt' - shell: 'echo "Hello there, world!" >> hello.txt' diff --git a/episodes/files/Snakefile_iterative b/episodes/files/Snakefile_iterative deleted file mode 100644 index 8fe13f8..0000000 --- a/episodes/files/Snakefile_iterative +++ /dev/null @@ -1,13 +0,0 @@ -# -# Iterative example. 
-# -NAMES=['one','two','three'] -# -rule done: - input: expand('{name}.out',name=NAMES) - output: 'done.out' - shell: 'echo "Done!" > done.out' -rule iterate: - input: - output: '{sample}.out' - shell: 'echo {output} > {output}' diff --git a/episodes/files/Snakefile_tworules b/episodes/files/Snakefile_tworules deleted file mode 100644 index 66558a6..0000000 --- a/episodes/files/Snakefile_tworules +++ /dev/null @@ -1,9 +0,0 @@ -rule last: - input: 'lower.txt' - output: 'upper.txt' - shell: 'cat lower.txt | tr a-z A-Z > upper.txt' - -rule first: - input: - output: 'lower.txt' - shell: 'echo "Hello, world!" > lower.txt' diff --git a/episodes/files/plot_terse_amdahl_results.py b/episodes/files/plot_terse_amdahl_results.py new file mode 100644 index 0000000..a85425f --- /dev/null +++ b/episodes/files/plot_terse_amdahl_results.py @@ -0,0 +1,49 @@ +import sys +import json +import matplotlib.pyplot as plt +import numpy as np + +def process_files(file_list, output="plot.jpg"): + value_tuples=[] + for filename in file_list: + # Open the JSON file and load data + with open(filename, 'r') as file: + data = json.load(file) + value_tuples.append((data['nproc'], data['execution_time'])) + + # Sort the tuples + sorted_list = sorted(value_tuples) + + # Unzip the sorted list into two lists + x, y = zip(*sorted_list) + + # Create a line plot + plt.plot(x, y, marker='o') + + # Adding the y=1/x line + x_line = np.linspace(1, max(x), 100) # Create x values for the line + y_line = (y[0]/x[0]) / x_line # Calculate corresponding (scaled) y values + + plt.plot(x_line, y_line, linestyle='--', color='red', label='Perfect scaling') + + # Adding title and labels + plt.title("Scaling plot") + plt.xlabel("Number of cores") + plt.ylabel("Wallclock time (seconds)") + + # Show the legend + plt.legend() + + # Save the plot to a JPEG file + plt.savefig(output, format='jpeg') + +if __name__ == "__main__": + # The first command-line argument is the script name itself, so we skip it + output = 
sys.argv[1] + filenames = sys.argv[2:] + + if filenames: + process_files(filenames, output=output) + else: + print("No files provided.") + diff --git a/episodes/files/queuing_config.yaml b/episodes/files/queuing_config.yaml deleted file mode 100644 index 7db5043..0000000 --- a/episodes/files/queuing_config.yaml +++ /dev/null @@ -1,6 +0,0 @@ -# snakemake -j 3 --cluster "sbatch -N 1 -n {resources.tasks} -p node" -cluster: - sbatch - --partition=node - --nodes=1 - --tasks={resources.tasks} diff --git a/episodes/files/snakefiles/Snakefile_ep01 b/episodes/files/snakefiles/Snakefile_ep01 new file mode 100644 index 0000000..32de8e2 --- /dev/null +++ b/episodes/files/snakefiles/Snakefile_ep01 @@ -0,0 +1,5 @@ +rule hostname_login: + output: "hostname_login.txt" + input: + shell: + "hostname > hostname_login.txt" diff --git a/episodes/files/snakefiles/Snakefile_ep02 b/episodes/files/snakefiles/Snakefile_ep02 new file mode 100644 index 0000000..6957cfb --- /dev/null +++ b/episodes/files/snakefiles/Snakefile_ep02 @@ -0,0 +1,13 @@ +localrules: hostname_login + +rule hostname_login: + output: "hostname_login.txt" + input: + shell: + "hostname > hostname_login.txt" + +rule hostname_remote: + output: "hostname_remote.txt" + input: + shell: + "hostname > hostname_remote.txt" diff --git a/episodes/files/snakefiles/Snakefile_ep04 b/episodes/files/snakefiles/Snakefile_ep04 new file mode 100644 index 0000000..b8c9897 --- /dev/null +++ b/episodes/files/snakefiles/Snakefile_ep04 @@ -0,0 +1,22 @@ +localrules: hostname_login + +rule hostname_login: + output: "hostname_login.txt" + input: + shell: + "hostname > hostname_login.txt" + +rule amdahl_run: + output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json" + input: + envmodules: + "amdahl" + resources: + mpi="mpiexec", + # No direct way to access the wildcard in tasks, so we need to do this + # indirectly by declaring a short function that takes the wildcards as an + # argument + tasks=lambda wildcards: 
int(wildcards.parallel_tasks)
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
diff --git a/episodes/files/snakefiles/Snakefile_ep05 b/episodes/files/snakefiles/Snakefile_ep05
new file mode 100644
index 0000000..93ec684
--- /dev/null
+++ b/episodes/files/snakefiles/Snakefile_ep05
@@ -0,0 +1,28 @@
+localrules: hostname_login, generate_run_files
+
+rule hostname_login:
+    output: "hostname_login.txt"
+    input:
+    shell:
+        "hostname > hostname_login.txt"
+
+rule generate_run_files:
+    output: "p_{parallel_proportion}_runs.txt"
+    input: "p_{parallel_proportion}/runs/amdahl_run_6.json"
+    shell:
+        "echo {input} done > {output}"
+
+rule amdahl_run:
+    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
+    input:
+    envmodules:
+        "amdahl"
+    resources:
+        mpi="mpiexec",
+        # No direct way to access the wildcard in tasks, so we need to do this
+        # indirectly by declaring a short function that takes the wildcards as an
+        # argument
+        tasks=lambda wildcards: int(wildcards.parallel_tasks)
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
diff --git a/episodes/files/snakefiles/Snakefile_ep06 b/episodes/files/snakefiles/Snakefile_ep06
new file mode 100644
index 0000000..73c17e4
--- /dev/null
+++ b/episodes/files/snakefiles/Snakefile_ep06
@@ -0,0 +1,32 @@
+NTASK_SIZES = [1, 2, 3, 4, 5]
+
+localrules: hostname_login, generate_run_files
+
+rule hostname_login:
+    output: "hostname_login.txt"
+    input:
+    shell:
+        "hostname > hostname_login.txt"
+
+rule generate_run_files:
+    output: "p_{parallel_proportion}_scalability.jpg"
+    input: expand("p_{{parallel_proportion}}/runs/amdahl_run_{count}.json", count=NTASK_SIZES)
+    envmodules:
+        "matplotlib"
+    shell:
+        "python plot_terse_amdahl_results.py {output} {input}"
+
+rule amdahl_run:
+    output: "p_{parallel_proportion}/runs/amdahl_run_{parallel_tasks}.json"
+    input:
+    envmodules:
+        "amdahl"
+    resources:
+        mpi="mpiexec",
+        # No direct way to access the wildcard in tasks, so we need to do this
+        # indirectly by declaring a short function that takes the wildcards as an
+        # argument
+        tasks=lambda wildcards: int(wildcards.parallel_tasks)
+    shell:
+        "{resources.mpi} -n {resources.tasks} amdahl --terse -p {wildcards.parallel_proportion} > {output}"
diff --git a/episodes/files/snakefiles/cluster_profile_ep02/config.yaml b/episodes/files/snakefiles/cluster_profile_ep02/config.yaml
new file mode 100644
index 0000000..60685b5
--- /dev/null
+++ b/episodes/files/snakefiles/cluster_profile_ep02/config.yaml
@@ -0,0 +1,6 @@
+printshellcmds: True
+jobs: 3
+executor: slurm
+default-resources:
+  - mem_mb_per_cpu=3600
+  - runtime=2
diff --git a/episodes/files/snakefiles/cluster_profile_ep04/config.yaml b/episodes/files/snakefiles/cluster_profile_ep04/config.yaml
new file mode 100644
index 0000000..2fbcb60
--- /dev/null
+++ b/episodes/files/snakefiles/cluster_profile_ep04/config.yaml
@@ -0,0 +1,7 @@
+printshellcmds: True
+jobs: 3
+executor: slurm
+default-resources:
+  - mem_mb_per_cpu=3600
+  - runtime=2
+use-envmodules: True
diff --git a/episodes/snakemake_cluster.md b/episodes/snakemake_cluster.md
deleted file mode 100644
index c157a55..0000000
--- a/episodes/snakemake_cluster.md
+++ /dev/null
@@ -1,63 +0,0 @@
----
-title: "Snakemake and the Cluster"
-teaching: 10
-exercises: 2
----
-
-:::::::::::::::::::::::::::::: questions
-
-- How can we express a one-task cluster operation in Snakemake?
-
-::::::::::::::::::::::::::::::::::::::::
-
-::::::::::::::::::::::::::::: objectives
-
-- Write a Snakefile that executes a job on the cluster
-- Use MPI options to ensure the job runs in parallel
-
-::::::::::::::::::::::::::::::::::::::::
-
-## Snakemake and the Cluster
-
-Snakemake has provisions for operating on an HPC cluster.
- -Various command-line arguments can be provided to tell -Snakemake not to run things locally, but do run things -via the queuing system instead. - -In this lesson, we will repeat the first module, running -the admahl code on the cluster, but will use snakemake -to make it happen. - -## Write a cluster Snakemake rule file - -Open your favorite editor, do the thing. -Specify resources. Provide command line arguments -to do the cluster operations by hand. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -How can you control the degree of parallelism -of your cluster task? - -:::::::::::::::: solution - -Use the "mpi" option in the resource block of -the Snakemake rule, and specify the number of tasks. -This will be mapped to the `-n` argument of the -equivalent `sbatch` command. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake rule files can submit cluster jobs. -- There are a lot of options. - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/snakemake_multiple.md b/episodes/snakemake_multiple.md deleted file mode 100644 index 9967018..0000000 --- a/episodes/snakemake_multiple.md +++ /dev/null @@ -1,77 +0,0 @@ ---- -title: "More Complicated Snakefiles" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- What is a task graph? -- How does the Snakemake file express a task graph? - -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Write a multiple-rule Snakefile with dependent rules -- Translate between a task graph and rule set - -:::::::::::::::::::::::::::::::::::::::: - -## Snakemake and Workflow - -A Snakefile can contain multiple rules. In the trivial -case, there will be no dependencies between the rules, and -they can all run concurrently. - -A more interesting case is when there are dependencies between -the rules, e.g. 
when one rule takes the output of another rule -as its input. In this case, the dependent rule (the one that needs -another rule's output) cannot run until the rule it depends on -has completed. - -It's possible to express this relationship by means of -a task graph, whose nodes are tasks, and whose arcs are -input-output relationships between the tasks. - -A Snakemake file is a textual description of a task -graph. - -## Write a multi-rule Snakemake rule file - -Open your favorite editor, do the thing. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -Draw the task graph for your Snakefile. - -Given an example task graph, write a Snakefile that -implements it. - -:::::::::::::::: solution - -The rules in the Snakefile are nodes in the task -graph. Two rules are connected by an arc in the task -graph if the output of one rule is the input to the -other. The task graph is directed, so the arc points -from the rule that generates a file as output to the rule -that consumes the same file as input. - -A rule with an output that no other rule consumes is -a terminal rule. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake rule files can be mapped to task graphs -- Tasks are executed as required in dependency order -- Where possible, tasks may run concurrently. - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/snakemake_profiles.md b/episodes/snakemake_profiles.md deleted file mode 100644 index 27c6702..0000000 --- a/episodes/snakemake_profiles.md +++ /dev/null @@ -1,67 +0,0 @@ ---- -title: "Snakemake Profiles" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- How can we encapsulate our desired Snakemake configuration? -- How do we balance non-repetition and customizability?
- -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Write a Snakemake profile for the cluster -- Run the amdahl code with varying degrees of parallelism - with the cluster profile. - -:::::::::::::::::::::::::::::::::::::::: - -## Snakemake Profiles - -Snakemake has a provision for profiles, which allow users -to collect various common settings together in a special -file that Snakemake examines when it runs. This lets users -avoid repetition and possible errors of omission for common -settings, and encapsulates some of the cluster complexity -we encountered in the previous module. - -Not all settings should be in the profile. Users can -choose which ones to make static and which ones to make -adjustable. In our case, we will want to have the freedom -to choose the degree of parallelism, but most of the -cluster arguments will not change, and so can be static -in the profile. - -## Write a Profile - -Do the thing. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -Write a profile that allows you to choose a -different partition, in addition to the level of -parallelism. - -:::::::::::::::: solution - -A profile can refer to values taken from -the rule file, and in particular can refer to -resources declared in a rule. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake profiles encapsulate cluster complexity. -- Retaining operational flexibility is also important. - -:::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/snakemake_single.md b/episodes/snakemake_single.md deleted file mode 100644 index f9a47e4..0000000 --- a/episodes/snakemake_single.md +++ /dev/null @@ -1,69 +0,0 @@ ---- -title: "Introduction to Snakemake" -teaching: 10 -exercises: 2 ---- - -:::::::::::::::::::::::::::::: questions - -- What are Snakemake rules? -- Why do Snakemake rules not always run?
- -:::::::::::::::::::::::::::::::::::::::: - -::::::::::::::::::::::::::::: objectives - -- Write a single-rule Snakefile and execute it with Snakemake -- Predict whether the rule will run or not - -:::::::::::::::::::::::::::::::::::::::: - -## Snakemake - -Snakemake is a workflow tool. It takes as input -a description of the work that you would like the computer -to do, and when run, does the work that you have -asked for. - -The description of the work takes the form of a -series of rules, written in a special format in a -Snakefile. Rules have outputs, and the Snakefile -and generated output files make up the system state. - -## Write a Snakemake rule file - -Open your favorite editor, do the thing. - -## Run Snakemake - -Throw the switch! - -:::::::::::::::::::::::::::::: challenge - -Remove the output file, and run Snakemake. Then -run it again. Edit the output file, and run it -a third time. For which of these invocations -does Snakemake do non-trivial work? - -:::::::::::::::: solution - -The rule does not get executed the second time. The -Snakemake infrastructure is stateful, and knows that -the required outputs are up to date. - -The rule also does not get executed the third time. -The edited file is no longer the output that the rule -produced, but the Snakemake infrastructure doesn't know -that; it only checks the file time-stamp. Editing -Snakemake-manipulated files can get you into an -inconsistent state. - -::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::: - -:::::::::::::::::::::::::::::: keypoints - -- Snakemake is an indirect way of running executables -- Snakemake has a notion of system state, and can be fooled. - -::::::::::::::::::::::::::::::::::::::::
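
The single-rule episode above can be made concrete with a minimal Snakefile; the rule and file names here are illustrative, not taken from the lesson:

```
# A minimal single-rule Snakefile: one output, no inputs.
rule hello:
    output: "hello.txt"
    shell:
        "echo 'Hello from Snakemake' > {output}"
```

Running `snakemake --cores 1 hello.txt` creates the file; running the same command again does nothing, because Snakemake sees that the requested output already exists and is up to date — the behaviour explored in the challenge above.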