From 4d7d7cb15a0e0d5ae202db5c64cf0983e7efd2b2 Mon Sep 17 00:00:00 2001 From: Alan O'Cais Date: Tue, 30 Jan 2024 17:22:52 +0100 Subject: [PATCH] Tweak all episodes --- episodes/01-introduction.md | 49 +++++---- episodes/02-snakemake_on_the_cluster.md | 41 ++++---- episodes/03-placeholders.md | 5 +- episodes/04-snakemake_and_mpi.md | 133 +++++++++++++----------- episodes/05-chaining_rules.md | 45 ++++---- episodes/06-expansion.md | 43 +++++--- 6 files changed, 173 insertions(+), 143 deletions(-) diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md index d4940ba..159154e 100644 --- a/episodes/01-introduction.md +++ b/episodes/01-introduction.md @@ -67,33 +67,37 @@ rule hostname_login: ## Key points about this file 1. The file is named `Snakefile` - with a capital `S` and no file extension. -1. Some lines are indented. Indents must be with space characters, not tabs. See the - setup section for how to make your text editor do this. -1. The rule definition starts with the keyword `rule` followed by the rule name, then a colon. -1. We named the rule `hostname`. You may use letters, numbers or underscores, but the rule name - must begin with a letter and may not be a keyword. +1. Some lines are indented. Indents must be with space characters, not tabs. See + the setup section for how to make your text editor do this. +1. The rule definition starts with the keyword `rule` followed by the rule name, + then a colon. +1. We named the rule `hostname_login`. You may use letters, numbers or + underscores, but the rule name must begin with a letter and may not be a + keyword. 1. The keywords `input`, `output`, `shell` are all followed by a colon. 1. The file names and the shell command are all in `"quotes"`. -1. The output filename is given before the input filename. In fact, Snakemake doesn't care what - order they appear in but we give the output first throughout this course. We'll see why soon. -1. 
In this use case there is no input file for the command so we leave this blank. +1. The output filename is given before the input filename. In fact, Snakemake + doesn't care what order they appear in but we give the output first + throughout this course. We'll see why soon. +1. In this use case there is no input file for the command so we leave this + blank. ::: -Back in the shell we'll run our new rule. At this point, if there were any missing quotes, bad -indents, etc. we may see an error. +Back in the shell we'll run our new rule. At this point, if there were any +missing quotes, bad indents, etc. we may see an error. ```bash -$ snakemake -j1 -p hostname_login.txt +$ snakemake -j1 -p hostname_login ``` ::: callout ## `bash: snakemake: command not found...` -If your shell tells you that it cannot find the command `snakemake` then we need to make the -software available somehow. In our case, this means searching for the module that we need to -load: +If your shell tells you that it cannot find the command `snakemake` then we need +to make the software available somehow. In our case, this means searching for +the module that we need to load: ```bash module spider snakemake ``` @@ -148,17 +152,16 @@ What does the `-p` option in the `snakemake` command above do? 1. Tells Snakemake to only run one process at a time 1. Prompts the user for the correct input file -*Hint: you can search in the text by pressing `/`, and quit back to the shell with `q`* +*Hint: you can search in the text by pressing `/`, and quit back to the shell +with `q`* :::::: solution - (2) Prints the shell commands that are being run to the terminal -This is such a useful thing we don't know why it isn't the default! The `-j1` option is what -tells Snakemake to only run one process at a time, and we'll stick with this for now as it -makes things simpler. The `-F` option tells Snakemake to always overwrite output files, and -we'll learn about protected outputs much later in the course. 
Answer 4 is a total red-herring, -as Snakemake never prompts interactively for user input. +This is such a useful thing we don't know why it isn't the default! The `-j1` +option is what tells Snakemake to only run one process at a time, and we'll +stick with this for now as it makes things simpler. Answer 4 is a total +red-herring, as Snakemake never prompts interactively for user input. :::::: ::: @@ -167,7 +170,7 @@ as Snakemake never prompts interactively for user input. - "Before running Snakemake you need to write a Snakefile" - "A Snakefile is a text file which defines a list of rules" - "Rules have inputs, outputs, and shell commands to be run" -- "You tell Snakemake what file to make and it will run the shell command defined in the - appropriate rule" +- "You tell Snakemake what file to make and it will run the shell command + defined in the appropriate rule" ::: diff --git a/episodes/02-snakemake_on_the_cluster.md b/episodes/02-snakemake_on_the_cluster.md index c7dd3ae..30eba35 100644 --- a/episodes/02-snakemake_on_the_cluster.md +++ b/episodes/02-snakemake_on_the_cluster.md @@ -35,12 +35,11 @@ Nothing to be done (all requested files are present and up to date). ``` Nothing happened! Why not? When it is asked to build a target, Snakemake checks -the 'last modification -time' of both the target and its dependencies. If any dependency has been -updated since the target, then the actions are re-run to update the target. -Using this approach, Snakemake knows to only rebuild the files that, either -directly or indirectly, depend on the file that changed. This is called an -_incremental build_. +the 'last modification time' of both the target and its dependencies. If any +dependency has been updated since the target, then the actions are re-run to +update the target. Using this approach, Snakemake knows to only rebuild the +files that, either directly or indirectly, depend on the file that changed. This +is called an _incremental build_. 
::: callout ## Incremental Builds Improve Efficiency @@ -53,12 +52,11 @@ more efficient. ::: challenge ## Running on the cluster -We need another rule now that executes the `hostname` on the cluster. Create the -rule in your Snakefile and try to execute it on cluster with the options -`--executor slurm` to `snakemake` +We need another rule now that executes the `hostname` on the _cluster_. Create +a new rule in your Snakefile and try to execute it on the cluster with the option +`--executor slurm` to `snakemake`. :::::: solution - The rule is almost identical to the previous rule save for the rule name and output file: @@ -109,14 +107,13 @@ Complete log: .snakemake/log/2024-01-29T180346.788174.snakemake.log Note all the warnings that Snakemake is giving us about the fact that the rule may not be able to execute on our cluster as we may not have given enough information. Luckily for us, this actually works on our cluster and we can take -a look in the output file we asked for, `hostname_remote.txt`: +a look in the output file the new rule creates, `hostname_remote.txt`: ```bash [ocaisa@node1 ~]$ cat hostname_remote.txt ``` ```output tmpnode1.int.jetstream2.hpc-carpentry.org ``` - :::::: ::: @@ -167,8 +164,10 @@ the help of a translation table: | `--cpus-per-task` | `cpus_per_task` | number of cpus per task (in case of SMP, rather use `threads`) | | `--nodes` | `nodes` | number of nodes | -The warnings given by Snakemake hinted that we need to provide these options. -One way to do it is to provide them is as part of the Snakemake rule, e.g., +The warnings given by Snakemake hinted that we may need to provide these +options. One way to do it is to provide them as part of the Snakemake rule +using the keyword `resources`, +e.g., ```python rule: input: ... @@ -178,8 +177,9 @@ rule: runtime: ``` and we can also use the profile to define default values for these options to -use with our project. 
For example, the available memory on our cluster is about -4GB per core, so we can add that to our profile: +use with our project, using the keyword `default-resources`. For example, the +available memory on our cluster is about 4GB per core, so we can add that to our +profile: ```yaml printshellcmds: True jobs: 3 @@ -189,7 +189,7 @@ default-resources: ``` :::challenge -We know that our problem runs in a very short time. Make the default length of +We know that our problem runs in a very short time. Change the default length of our jobs to two minutes for Slurm. ::::::solution @@ -227,10 +227,9 @@ Slurm executor (which is what we are doing via our new profile) this won't happen any more. So how do we force the rule to run on the login node? -Well, it's no surprise that some Snakemake rules perform trivial tasks where job -submission might be -overkill (e.g., less than 1 minute worth of compute time). Similar to our case, -it would be a better +Well, in the case where a Snakemake rule performs a trivial task job submission +might be overkill (e.g., less than 1 minute worth of compute time). Similar to +our case, it would be a better idea to have these rules execute locally (i.e. where the `snakemake` command is run) instead of as a job. Snakemake lets you indicate which rules should always run locally with the `localrules` keyword. Let's define `hostname_login` as a diff --git a/episodes/03-placeholders.md b/episodes/03-placeholders.md index 9dde975..8e93283 100644 --- a/episodes/03-placeholders.md +++ b/episodes/03-placeholders.md @@ -6,11 +6,9 @@ exercises: 30 ::: questions - "How do I make a generic rule?" -- "How does Snakemake decide what rule to run?" 
::: ::: objectives -- "Understand the basic steps Snakemake goes through when running a workflow" - "See how Snakemake deals with some errors" ::: @@ -71,6 +69,9 @@ replace them with appropriate values - `{input}` with the full name of the input file, and `{output}` with the full name of the output file -- before running the command. +`{resources}` is also a placeholder, and we can access a named element of the +`{resources}` with the notation `{resources.runtime}` (for example). + :::keypoints - "Snakemake rules are made more generic with placeholders" - "Placeholders in the shell part of the rule are replaced with values based on the chosen diff --git a/episodes/04-snakemake_and_mpi.md b/episodes/04-snakemake_and_mpi.md index 6fb3aad..0e7b41a 100644 --- a/episodes/04-snakemake_and_mpi.md +++ b/episodes/04-snakemake_and_mpi.md @@ -23,9 +23,9 @@ environment module. ::: challenge -Locate and load the `amdahl` module and then replace our `hostname_remote` rule -with a version that runs `amdahl`. (Don't worry about parallel MPI just yet, run -it with a single CPU, `mpiexec -n 1 amdahl`). +Locate and load the `amdahl` module and then _replace_ our `hostname_remote` +rule with a version that runs `amdahl`. (Don't worry about parallel MPI just +yet, run it with a single CPU, `mpiexec -n 1 amdahl`). Does your rule execute correctly? If not look through the log files to find out why? @@ -43,7 +43,7 @@ rule amdahl_run: output: "amdahl_run.txt" input: shell: - "mpiexec -n 1 amdahl > amdahl_run.txt" + "mpiexec -n 1 amdahl > {output}" ``` However, when we try to execute the rule we get an error (unless you already have a different version of `amdahl` available in your path). Snakemake @@ -68,7 +68,7 @@ Executable: amdahl ``` So, even though we loaded the module before running the workflow, our Snakemake rule didn't find the executable. 
That's because the Snakemake rule -is running in a clean runtime environment, and we need to somehow tell it to +is running in a clean _runtime environment_, and we need to somehow tell it to load the necessary environment module before trying to execute the rule. :::::: @@ -97,7 +97,7 @@ Adding these lines are not enough to make the rule execute however. Not only do you have to tell Snakemake what modules to load, but you also have to tell it to use environment modules in general (since the use of environment modules is considered to make your runtime environment less reproducible as the available -modules may differ from cluster to cluster). This require you to give Snakemake +modules may differ from cluster to cluster). This requires you to give Snakemake an additonal option ```bash snakemake --profile cluster_profile --use-envmodules amdahl_run @@ -167,7 +167,7 @@ file for every run. It would be great if we can somehow indicate in the `output` the value that we want to use for `tasks`...and have Snakemake pick that up. We could use a _wildcard_ in the `output` to allow us to -define `tasks` we wish to use. The syntax for such a wildcard looks like +define the `tasks` we wish to use. The syntax for such a wildcard looks like ```python output: "amdahl_run_{parallel_tasks}.txt" ``` @@ -187,15 +187,49 @@ input and output is what tells Snakemake how to match input files to output files. If two rules use a wildcard with the same name then Snakemake will treat them as -different entities -- rules in Snakemake are self-contained in this way. +different entities - rules in Snakemake are self-contained in this way. In the `shell` line you can reference the wildcard with `{wildcards.parallel_tasks}` ::: -We could use a wildcard in the `output` to allow us to -define `tasks` we wish to use. 
This could look like +## Snakemake order of operations + +We're only just getting started with some simple rules, but it's worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases: + +1. Prepares to run: + 1. Reads in all the rule definitions from the Snakefile +1. Plans what to do: + 1. Sees what file(s) you are asking it to make + 1. Looks for a matching rule by looking at the `output`s of all the rules it knows + 1. Fills in the wildcards to work out the `input` for this rule + 1. Checks that this input file (if required) is actually available +1. Runs the steps: + 1. Creates the directory for the output file, if needed + 1. Removes the old output file if it is already there + 1. Only then, runs the shell command with the placeholders replaced + 1. Checks that the command ran without errors *and* made the new output file as expected + +::: callout +## Dry-run (`-n`) mode + +It's often useful to run just the first two phases, so that Snakemake will plan out the jobs to +run, and print them to the screen, but never actually run them. This is done with the `-n` +flag, eg: + +```bash +> $ snakemake -n ... +``` +::: + +The amount of checking may seem pedantic right now, but as the workflow gains more steps this will +become very useful to us indeed. + +## Using wildcards in our rule + +We would like to use a wildcard in the `output` to allow us to +define the number of `tasks` we wish to use. 
Based on what we've seen so far, +you might imagine this could look like ```python rule amdahl_run: output: "amdahl_run_{parallel_tasks}.txt" @@ -212,10 +246,12 @@ rule amdahl_run: but there are two problems with this: * The only way for Snakemake to know the value of the wildcard is for the user - to explicitly request a concrete output file: + to explicitly request a concrete output file (rather than call the rule): ```bash snakemake --profile cluster_profile amdahl_run_2.txt ``` + This is perfectly valid, as Snakemake can figure out that it has a rule that + can match that filename. * The bigger problem is that even doing that does not work, it seems we cannot use a wildcard for `tasks`: ```output @@ -223,21 +259,23 @@ but there are two problems with this: SLURM job submission failed. The error message was sbatch: error: Invalid numeric value "{parallel_tasks}" for --ntasks. ``` -Unfortunately for us, there is no direct way for us to access the wildcards. The +Unfortunately for us, there is no direct way for us to access the wildcards +for `tasks`. The reason for this is that Snakemake tries to use the value of `tasks` during it's initialisation stage, which is before we know the value of the wildcard. We need to defer the determination of `tasks` to later on. This can be achieved by specifying an input function instead of a value for this -scenario. The solution then is to write a one-time use function that -has no name to manipulate Snakmake. These kinds of functions are called either -anonymous functions or lamdba functions (both mean the same thing), and are a -feature of Python (and other programming languages). +scenario. The solution then is to write a one-time use function to manipulate +Snakemake into doing this for us. Since the function is specifically for the +rule, we can use a one-line function without a name. 
These kinds of functions +are called either anonymous functions or lambda functions (both mean the same +thing), and are a feature of Python (and other programming languages). To define a lambda function in python, the general syntax is as follows: ```python lambda x: x + 54 ``` -Since a function _can_ take the wildcards as arguments, we can use that to set +Since our function _can_ take the wildcards as arguments, we can use that to set the value for `tasks`: ```python rule amdahl_run: @@ -271,8 +309,9 @@ this is just as true with Snakefiles. Since our rule is now capable of generating an arbitrary number of output files things could get very crowded in our current directory. It's probably best then -to put the runs into a separate folder. We can just add the folder directly to -our `output`: +to put the runs into a separate folder to keep things tidy. We can add the +folder directly to our `output` and Snakemake will take care of directory creation +for us: ```python rule amdahl_run: @@ -293,9 +332,12 @@ rule amdahl_run: ::: challenge -Create an output file (under the `run` folder) for the case where we have 6 +Create an output file (under the `runs` folder) for the case where we have 6 parallel tasks +(HINT: Remember that Snakemake needs to be able to match the requested file to +the `output` from a rule) + :::::: solution ```bash @@ -328,8 +370,9 @@ options: ``` The option we are looking for is `--terse`, and that will make `amdahl` print output in a format that is much easier to process, JSON. JSON format in a file -typically uses the file extension `.json` so let's add that option to our shell command -and change the file format of the output: +typically uses the file extension `.json` so let's add that option to our +`shell` command _and_ change the file format of the `output` to match our new +command: ```python rule amdahl_run: @@ -349,8 +392,9 @@ rule amdahl_run: ``` There was another parameter for `amdahl` that caught my eye. 
`amdahl` has an -option `--parallel-proportion` (or `-p`)which we might be interested in -changing. This has an impact on the values we get in our results so let's add +option `--parallel-proportion` (or `-p`) which we might be interested in +changing as it changes the behaviour of the code, and therefore has an impact on +the values we get in our results. Let's add another directory layer to our output format to reflect a particular choice for this value. We can use a wildcard so we done have to choose the value right away: @@ -387,43 +431,12 @@ snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json ::: -## Snakemake order of operations - -We're only just getting started with some simple rules, but it's worth thinking about exactly what Snakemake is doing when you run it. There are three distinct phases: - -1. Prepares to run: - 1. Reads in all the rule definitions from the Snakefile -1. Plans what to do: - 1. Sees what file(s) you are asking it to make - 1. Looks for a matching rule by looking at the `output`s of all the rules it knows - 1. Fills in the wildcards to work out the `input` for this rule - 1. Checks that this input file (if required) is actually available -1. Runs the steps: - 1. Creates the directory for the output file, if needed - 1. Removes the old output file if it is already there - 1. Only then, runs the shell command with the placeholders replaced - 1. Checks that the command ran without errors *and* made the new output file as expected - -::: callout -## Dry-run (`-n`) mode - -It's often useful to run just the first two phases, so that Snakemake will plan out the jobs to -run, and print them to the screen, but never actually run them. This is done with the `-n` -flag, eg: - -```bash -> $ snakemake -n ... -``` -::: - -The amount of checking may seem pedantic right now, but as the workflow gains more steps this will -become very useful to us indeed. 
- ::: keypoints -- "Snakemake chooses the appropriate rule by replacing wildcards such that the output matches - the target" -- "Snakemake checks for various error conditions and will stop if it sees a problem" +- "Snakemake chooses the appropriate rule by replacing wildcards such that the + output matches the target" +- "Snakemake checks for various error conditions and will stop if it sees a + problem" ::: diff --git a/episodes/05-chaining_rules.md b/episodes/05-chaining_rules.md index 878ab13..b8cdbfb 100644 --- a/episodes/05-chaining_rules.md +++ b/episodes/05-chaining_rules.md @@ -10,15 +10,13 @@ exercises: 30 ::: ::: objectives -- "Use Snakemake to filter and then count the lines in a FASTQ file" -- "Add an RNA quantification step in the data analysis" -- "See how Snakemake deals with missing outputs" +- "" ::: ## A pipeline of multiple rules -We now have a rule that can generate output for any value of `p` and any number -tasks, we just need to call Snakemake with the parameters that we want: +We now have a rule that can generate output for any value of `-p` and any number +of tasks, we just need to call Snakemake with the parameters that we want: ```bash snakemake --profile cluster_profile p_0.999/runs/amdahl_run_6.json ``` @@ -57,7 +55,8 @@ localrules: hostname_login, generate_run_files ::: -Now let's run the new rule: +Now let's run the new rule (remember we need to request the output file by name +as the `output` in our rule contains a wildcard pattern): ```bash [ocaisa@node1 ~]$ snakemake --profile cluster_profile/ p_0.999_runs.txt ``` @@ -128,7 +127,8 @@ Look at the logging messages that Snakemake prints in the terminal. What has hap This, in a nutshell, is how we build workflows in Snakemake. 1. Define rules for all the processing steps -1. Choose `input` and `output` naming patterns that allow Snakemake to link the rules +1. Choose `input` and `output` naming patterns that allow Snakemake to link the + rules 1. 
Tell Snakemake to generate the final output file(s) If you are used to writing regular scripts this takes a little @@ -157,34 +157,35 @@ you. Snakemake has a dedicated rule field for outputs that are [log files](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#log-files), -and these are mostly treated as regular outputs except that log files are not removed if the job -produces an error. This means you can look at the log to help diagnose the error. In a real -workflow this can be very useful, but in terms of learning the fundementals of Snakemake we'll -stick with regular `input` and `output` fields here. +and these are mostly treated as regular outputs except that log files are not +removed if the job produces an error. This means you can look at the log to help +diagnose the error. In a real workflow this can be very useful, but in terms of +learning the fundamentals of Snakemake we'll stick with regular `input` and +`output` fields here. ::: - - ::: callout ## Errors are normal -Don't be disheartened if you see errors like the one above when first testing your new Snakemake -pipelines. There is a lot that can go wrong when writing a new workflow, and you'll normally need -several iterations to get things just right. One advantage of the Snakemake approach compared to -regular scripts is that Snakemake fails fast when there is a problem, rather than ploughing on -and potentially running junk calculations on partial or corrupted data. Another advantage is that -when a step fails we can safely resume from where we left off, as we'll see in the next episode. +Don't be disheartened if you see errors when first testing +your new Snakemake pipelines. There is a lot that can go wrong when writing a +new workflow, and you'll normally need several iterations to get things just +right. 
One advantage of the Snakemake approach compared to regular scripts is +that Snakemake fails fast when there is a problem, rather than ploughing on +and potentially running junk calculations on partial or corrupted data. Another +advantage is that when a step fails we can safely resume from where we left off. ::: ::: keypoints -- "Snakemake links rules by iteratively looking for rules that make missing inputs" +- "Snakemake links rules by iteratively looking for rules that make missing + inputs" - "Rules may have multiple named inputs and/or outputs" -- "If a shell command does not yield an expected output then Snakemake will regard that as a - failure" +- "If a shell command does not yield an expected output then Snakemake will + regard that as a failure" ::: diff --git a/episodes/06-expansion.md b/episodes/06-expansion.md index 6c2b49c..e332dbe 100644 --- a/episodes/06-expansion.md +++ b/episodes/06-expansion.md @@ -41,17 +41,20 @@ Global variables should be added before the rules in the Snakefile. NTASK_SIZES = [1, 2, 3, 4, 5] ``` -* Unlike with variables in shell scripts, we can put spaces around the `=` sign, but they are +* Unlike with variables in shell scripts, we can put spaces around the `=` sign, + but they are not mandatory. +* The lists of quoted strings are enclosed in square brackets and + comma-separated. If you know any Python you'll recognise this as Python list + syntax. +* A good convention is to use capitalized names for these variables, but this is not mandatory. -* The lists of quoted strings are enclosed in square brackets and comma-separated. If you know any - Python you'll recognise this as Python list syntax. -* A good convention is to use capitalized names for these variables, but this is not mandatory. -* Although these are referred to as variables, you can't actually change the values once the - workflow is running, so lists defined this way are more like constants. 
+* Although these are referred to as variables, you can't actually change the + values once the workflow is running, so lists defined this way are more like + constants. ## Using a Snakemake rule to define a batch of outputs -Now let's update our Snakefile to leverage the new global variable to create a +Now let's update our Snakefile to leverage the new global variable and create a list of files: ```python rule generate_run_files: @@ -76,23 +79,25 @@ to request a specific file: snakemake --profile cluster_profile/ p_0.999_runs.txt ``` -If you don't specify a target rule name or any file names on the command line when running -Snakemake, the default is to use **the first rule** in the Snakefile as the target. +If you don't specify a target rule name or any file names on the command line +when running Snakemake, the default is to use **the first rule** in the +Snakefile as the target. ::: callout ## Rules as targets -Giving the name of a rule to Snakemake on the command line only works when that rule has -*no wildcards* in the outputs, because Snakemake has no way to know what the desired wildcards -might be. You will see the error "Target rules may not contain wildcards." This can also happen -when you don't supply any explicit targets on the command line at all, and Snakemake tries to run -the first rule defined in the Snakefile. +Giving the name of a rule to Snakemake on the command line only works when that +rule has *no wildcards* in the outputs, because Snakemake has no way to know +what the desired wildcards might be. You will see the error "Target rules may +not contain wildcards." This can also happen when you don't supply any explicit +targets on the command line at all, and Snakemake tries to run the first rule +defined in the Snakefile. ::: ## Rules that combine multiple inputs -Our *`generate_run_files`* rule is a rule which takes a list of input files. The +Our `generate_run_files` rule is a rule which takes a list of input files. 
The length of that list is not fixed by the rule, but can change based on `NTASK_SIZES`. @@ -174,6 +179,14 @@ snakemake --profile cluster_profile/ p_0.8_scalability.jpg ::: +::: challenge +## Bonus round + +Create a final rule that can be called directly and generates a scaling plot for +3 different values of `p`. + +::: + ::: keypoints - "Use the `expand()` function to generate lists of filenames you want to combine" - "Any `{input}` to a rule can be a variable-length list"