Move (nextstrain) alignment into separate profiles
This breaks each of our main nextstrain profiles into two profiles each:
one to perform alignment (etc.) and a second to run the phylogenetics.
This is part of a wider effort to generate alignments as soon as new
sequences are available.

Using GISAID as an example (also implemented for open; read the following
but replace `gisaid` with `open`):

The first profile `nextstrain_profiles/nextstrain-gisaid-preprocess`
takes sequences and metadata and produces three files which we upload
to S3, `results/filtered_gisaid.fasta.xz`,
`results/masked_gisaid.fasta.xz` and `results/aligned_gisaid.fasta.xz`.
While this is a separate profile, the `builds.yaml` file is only around a
dozen lines. I chose this approach rather than adding another
`--configfile` to the existing profile because the way Snakemake overlays
configs (or doesn't) isn't intuitive, and stubby profiles are easier to
reason about.
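For context (not part of this change): when Snakemake overlays config files it merges dictionaries key-by-key but replaces lists wholesale, so a second `--configfile` cannot extend, say, an existing `inputs` list. A sketch of the pitfall, using hypothetical file names:

```yaml
# base.yaml (hypothetical)
inputs:
  - name: gisaid
    sequences: "s3://nextstrain-ncov-private/sequences.fasta.xz"

# overlay.yaml (hypothetical), passed as a second --configfile:
# this `inputs` list REPLACES the one above, silently dropping `sequences`
inputs:
  - name: gisaid
    filtered: "s3://nextstrain-ncov-private/filtered.fasta.xz"
```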

The second profile, `nextstrain_profiles/nextstrain-gisaid`, is
relatively unchanged, except that we now start from
`results/filtered_gisaid.fasta.xz` and should thus be much faster to
run. Note that the `upload` rule here will no longer upload the files
which are now handled by the preprocessing profile. It is still possible
to start this workflow from (unaligned) sequences; however, there should
be no reason to do so.

The sets of uploaded files are defined by `config["upload"]`. This allows
a profile to be created which uploads both preprocessing files and build
files, if desired.
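Such a profile's `builds.yaml` would simply request both sets (a hypothetical sketch; no such profile is added by this commit):

```yaml
upload:
  - preprocessing-files
  - build-files
```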

An introduction to the different profiles, and the exact commands to run
each profile, have been added to docs/dev_docs.md.

Note that both profiles will run `rule sanitize_metadata`, as each
depends on its output, which is not uploaded as an intermediate file.

None of these changes should affect non-nextstrain-core builds /
profiles.
jameshadfield committed Oct 5, 2021
1 parent 878fa8e commit c4934d9
Showing 10 changed files with 160 additions and 25 deletions.
61 changes: 59 additions & 2 deletions docs/dev_docs.md
@@ -68,13 +68,70 @@ Prior to merging a pull request that introduces a new backward incompatible chan
We do not release new minor versions for new features, but you should document new features in the change log as part of the corresponding pull request under a heading for the date those features are merged.


## Running Core Nextstrain Builds

The "core" nextstrain builds consist of a global analysis and six regional analyses, performed independently for GISAID data and open data (currently open data is GenBank data).
Stepping back, the process can be broken into three steps:
1. Ingest and curation of raw data. This is performed by the [ncov-ingest](https://github.com/nextstrain/ncov-ingest/) repo and resulting files are uploaded to S3 buckets.
2. Preprocessing of data (alignment, masking and QC filtering). This is performed by the profiles `nextstrain_profiles/nextstrain-open-preprocess` and `nextstrain_profiles/nextstrain-gisaid-preprocess`. The resulting files are uploaded to S3 buckets by the `upload` rule.
3. Phylogenetic builds, which start from the files produced by the previous step. This is performed by the profiles `nextstrain_profiles/nextstrain-open` and `nextstrain_profiles/nextstrain-gisaid`. The resulting files are uploaded to S3 buckets by the `upload` rule.


### Manually running preprocessing

To run these pipelines without uploading the results:
```sh
snakemake -pf results/filtered_open.fasta.xz --profile nextstrain_profiles/nextstrain-open-preprocess
snakemake -pf results/filtered_gisaid.fasta.xz --profile nextstrain_profiles/nextstrain-gisaid-preprocess
```

If you wish to upload the resulting information, you should run the `upload` rule.
Optionally, you may wish to define a specific `S3_DST_BUCKET` to avoid overwriting the files already present in the S3 buckets:
```sh
snakemake -pf upload --profile nextstrain_profiles/nextstrain-open-preprocess \
--config S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME
snakemake -pf upload --profile nextstrain_profiles/nextstrain-gisaid-preprocess \
--config S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME
```
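To check what a trial run uploaded, you can list the destination prefix (assuming you have AWS credentials and the `aws` CLI installed; `TRIAL_NAME` as above):

```sh
aws s3 ls s3://nextstrain-staging/files/ncov/open/trial/TRIAL_NAME/
```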

### Manually running phylogenetic builds

To run these pipelines locally, without uploading the results:
```sh
snakemake -pf all --profile nextstrain_profiles/nextstrain-open
snakemake -pf all --profile nextstrain_profiles/nextstrain-gisaid
```
You can replace `all` with, for instance, `auspice/ncov_open_global.json` to avoid building all regions.
The resulting dataset(s) can be visualised in the browser by running `auspice view --datasetDir auspice`.
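For example, to produce just the open global build and view it locally (a sketch; assumes `auspice` is installed):

```sh
snakemake -pf auspice/ncov_open_global.json --profile nextstrain_profiles/nextstrain-open
auspice view --datasetDir auspice
```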

If you wish to upload the resulting information, you should run the `upload` and/or `deploy` rules.
The `upload` rule uploads the resulting files, including intermediate files, to specific S3 buckets; this rule uses the `S3_DST_BUCKET` config parameter.
The `deploy` rule uploads the dataset files such that they are accessible via nextstrain URLs (e.g. nextstrain.org/ncov/gisaid/global); this rule uses the `deploy_url` and `auspice_json_prefix` parameters.
You may wish to override these parameters for your local runs to avoid overwriting data which is already present.
For instance, here are the commands used by the trial builds action (see below):
```sh
snakemake -pf upload deploy \
--profile nextstrain_profiles/nextstrain-open \
--config \
S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME \
deploy_url=s3://nextstrain-staging/ \
auspice_json_prefix=ncov_open_trial_TRIAL_NAME
snakemake -pf upload deploy \
--profile nextstrain_profiles/nextstrain-gisaid \
--config \
S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME \
deploy_url=s3://nextstrain-staging/ \
auspice_json_prefix=ncov_gisaid_trial_TRIAL_NAME
```


## Triggering routine builds

Typically, everything’s triggered from the `ncov-ingest` pipeline’s `trigger` command.
-After updating the intermediate files, that command will run this `ncov` pipeline force-requiring the rules `deploy` and `upload`.
+After updating the intermediate files, that command will run the phylogenetic `ncov` pipelines (step 3, above) force-requiring the rules `deploy` and `upload`.

## Triggering trial builds

-This repository contains a GitHub Action `trial-build` which is manually run [via github.com](https://github.com/nextstrain/ncov/actions/workflows/trial-build.yml).
+This repository contains a GitHub Action `trial-build` which is manually run [via github.com](https://github.com/nextstrain/ncov/actions/workflows/trial-build.yml) and runs both of the phylogenetic build pipelines (gisaid + open).
This will ask for a “trial name” and upload intermediate files to `nextstrain-ncov-private/trial/$TRIAL_NAME` and `nextstrain-staging/files/ncov/open/trial/$TRIAL_NAME`.
Auspice datasets for visualisation will be available at `https://nextstrain.org/staging/ncov/gisaid/trial/$TRIAL_NAME/$BUILD_NAME` and `https://nextstrain.org/staging/ncov/open/trial/$TRIAL_NAME/$BUILD_NAME`.
22 changes: 22 additions & 0 deletions nextstrain_profiles/nextstrain-gisaid-preprocess/builds.yaml
@@ -0,0 +1,22 @@
custom_rules:
  - workflow/snakemake_rules/export_for_nextstrain.smk

# These parameters are only used by the `export_for_nextstrain` rule and shouldn't need to be modified.
# To modify the s3 _source_ bucket, specify this directly in the `inputs` section of the config.
# P.S. These are intentionally set as top-level keys as this allows command-line overrides.
S3_DST_BUCKET: "nextstrain-ncov-private"
S3_DST_COMPRESSION: "xz"
S3_DST_ORIGINS: ["gisaid"]

upload:
  - preprocessing-files

inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.gz"
    sequences: "s3://nextstrain-ncov-private/sequences.fasta.xz"

# Deploy and Slack options are related to Nextstrain live builds and don't need to be modified for local builds
deploy_url: s3://nextstrain-data
slack_token: ~
slack_channel: "#ncov-gisaid-updates"
10 changes: 10 additions & 0 deletions nextstrain_profiles/nextstrain-gisaid-preprocess/config.yaml
@@ -0,0 +1,10 @@
configfile:
  - defaults/parameters.yaml
  - nextstrain_profiles/nextstrain-gisaid-preprocess/builds.yaml

keep-going: True
printshellcmds: True
show-failed-logs: True
restart-times: 1
reason: True
stats: stats.json
15 changes: 6 additions & 9 deletions nextstrain_profiles/nextstrain-gisaid/builds.yaml
@@ -11,22 +11,19 @@ custom_rules:
S3_DST_BUCKET: "nextstrain-ncov-private"
S3_DST_COMPRESSION: "xz"
S3_DST_ORIGINS: ["gisaid"]
+upload:
+  - build-files

genes: ["ORF1a", "ORF1b", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF9b"]
use_nextalign: true

-# NOTE for shepherds -- there are commented out inputs here, you can
-# uncomment them to start the pipeline at that stage.
-# E.g. if you uncomment `filtered` then the pipeline
-# will start by downloading that file and proceeding straight to
-# subsampling
+# Note: we have a separate profile for aligning GISAID sequences. This is triggered
+# as soon as new sequences are available. This workflow is thus intended to be
+# started from the filtered alignment. james, sept 2021
inputs:
  - name: gisaid
    metadata: "s3://nextstrain-ncov-private/metadata.tsv.gz"
-    sequences: "s3://nextstrain-ncov-private/sequences.fasta.xz"
-    # aligned: "s3://nextstrain-ncov-private/aligned.fasta.xz"
-    # masked: "s3://nextstrain-ncov-private/masked.fasta.xz"
-    # filtered: "s3://nextstrain-ncov-private/filtered.fasta.xz"
+    filtered: "s3://nextstrain-ncov-private/filtered.fasta.xz"

# Define locations for which builds should be created.
# For each build we specify a subsampling scheme via an explicit key.
22 changes: 22 additions & 0 deletions nextstrain_profiles/nextstrain-open-preprocess/builds.yaml
@@ -0,0 +1,22 @@
custom_rules:
  - workflow/snakemake_rules/export_for_nextstrain.smk

# These parameters are only used by the `export_for_nextstrain` rule and shouldn't need to be modified.
# To modify the s3 _source_ bucket, specify this directly in the `inputs` section of the config.
# P.S. These are intentionally set as top-level keys as this allows command-line overrides.
S3_DST_BUCKET: "nextstrain-data/files/ncov/open"
S3_DST_COMPRESSION: "xz"
S3_DST_ORIGINS: ["open"]

upload:
  - preprocessing-files

inputs:
  - name: open
    metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.gz"
    sequences: "s3://nextstrain-data/files/ncov/open/sequences.fasta.xz"

# Deploy and Slack options are related to Nextstrain live builds and don't need to be modified for local builds
deploy_url: s3://nextstrain-data
slack_token: ~
slack_channel: "#ncov-genbank-updates"
10 changes: 10 additions & 0 deletions nextstrain_profiles/nextstrain-open-preprocess/config.yaml
@@ -0,0 +1,10 @@
configfile:
  - defaults/parameters.yaml
  - nextstrain_profiles/nextstrain-open-preprocess/builds.yaml

keep-going: True
printshellcmds: True
show-failed-logs: True
restart-times: 1
reason: True
stats: stats.json
15 changes: 6 additions & 9 deletions nextstrain_profiles/nextstrain-open/builds.yaml
@@ -11,19 +11,16 @@ custom_rules:
S3_DST_BUCKET: "nextstrain-data/files/ncov/open"
S3_DST_COMPRESSION: "xz"
S3_DST_ORIGINS: ["open"]
+upload:
+  - build-files

-# NOTE for shepherds -- there are commented out inputs here, you can
-# uncomment them to start the pipeline at that stage.
-# E.g. if you uncomment `filtered` then the pipeline
-# will start by downloading that file and proceeding straight to
-# subsampling
+# Note: we have a separate profile for aligning open sequences. This is triggered
+# as soon as new sequences are available. This workflow is thus intended to be
+# started from the filtered alignment. james, sept 2021
inputs:
  - name: open
    metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.gz"
-    sequences: "s3://nextstrain-data/files/ncov/open/sequences.fasta.xz"
-    # aligned: "s3://nextstrain-data/files/ncov/open/aligned.fasta.xz"
-    # masked: "s3://nextstrain-data/files/ncov/open/masked.fasta.xz"
-    # filtered: "s3://nextstrain-data/files/ncov/open/filtered.fasta.xz"
+    filtered: "s3://nextstrain-data/files/ncov/open/filtered.fasta.xz"

# Define locations for which builds should be created.
# For each build we specify a subsampling scheme via an explicit key.
9 changes: 9 additions & 0 deletions workflow/schemas/config.schema.yaml
@@ -61,3 +61,12 @@ properties:
    type: string
    # A similar pattern is used in the workflow's wildcard constraints.
    pattern: "^[a-zA-Z0-9-]+$"
+
+  upload:
+    type: array
+    minItems: 1
+    items:
+      type: string
+      enum:
+        - preprocessing-files
+        - build-files
16 changes: 13 additions & 3 deletions workflow/snakemake_rules/common.smk
@@ -152,15 +152,16 @@ def _get_upload_inputs(wildcards):
    origin = config["S3_DST_ORIGINS"][0]

    # mapping of remote → local filenames
-    uploads = {
+    preprocessing_files = {
        f"aligned.fasta.xz": f"results/aligned_{origin}.fasta.xz", # from `rule align`
        f"masked.fasta.xz": f"results/masked_{origin}.fasta.xz", # from `rule mask`
        f"filtered.fasta.xz": f"results/filtered_{origin}.fasta.xz", # from `rule filter`
        f"mutation-summary.tsv.xz": f"results/mutation_summary_{origin}.tsv.xz", # from `rule mutation_summary`
    }

+    build_files = {}
    for build_name in config["builds"]:
-        uploads.update({
+        build_files.update({
            f"{build_name}/sequences.fasta.xz": f"results/{build_name}/{build_name}_subsampled_sequences.fasta.xz", # from `rule combine_samples`
            f"{build_name}/metadata.tsv.xz": f"results/{build_name}/{build_name}_subsampled_metadata.tsv.xz", # from `rule combine_samples`
            f"{build_name}/aligned.fasta.xz": f"results/{build_name}/aligned.fasta.xz", # from `rule build_align`
@@ -170,4 +171,13 @@ def _get_upload_inputs(wildcards):
f"{build_name}/{build_name}_root-sequence.json": f"auspice/{config['auspice_json_prefix']}_{build_name}_root-sequence.json"
})

-    return uploads
+    req_upload = config.get("upload", [])
+    if "preprocessing-files" in req_upload and "build-files" in req_upload:
+        return {**preprocessing_files, **build_files}
+    elif "preprocessing-files" in req_upload:
+        return preprocessing_files
+    elif "build-files" in req_upload:
+        return build_files
+    else:
+        raise Exception("The upload rule requires an 'upload' parameter in the config.")

5 changes: 3 additions & 2 deletions workflow/snakemake_rules/export_for_nextstrain.smk
@@ -146,6 +146,7 @@ from os import environ

SLACK_TOKEN = environ["SLACK_TOKEN"] = config["slack_token"] or ""
SLACK_CHANNEL = environ["SLACK_CHANNEL"] = config["slack_channel"] or ""
+BUILD_DESCRIPTION = f"Build to upload {' & '.join(config.get('upload', ['nothing']))}"

try:
    deploy_origin = (
@@ -197,7 +198,7 @@ rule upload:
shell("./scripts/upload-to-s3 {local:q} s3://{params.s3_bucket:q}/{remote:q} | tee -a {log:q}")

onstart:
-    slack_message = f"Build {deploy_origin} started."
+    slack_message = f"{BUILD_DESCRIPTION} {deploy_origin} started."

    if SLACK_TOKEN and SLACK_CHANNEL:
        shell(f"""
@@ -210,7 +211,7 @@ onstart:
        """)

onerror:
-    slack_message = f"Build {deploy_origin} failed."
+    slack_message = f"{BUILD_DESCRIPTION} {deploy_origin} failed."

    if SLACK_TOKEN and SLACK_CHANNEL:
        shell(f"""