Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update nextclade rules in ingest [#21] #32

Merged
merged 4 commits into from
Jan 9, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 24 additions & 1 deletion ingest/defaults/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -131,5 +131,28 @@ curate:
- url
nextclade:
dataset_name: "nextstrain/yellow-fever/prM-E"
field_map: "defaults/nextclade_field_map.tsv"
field_map:
seqName: "seqName"
clade: "clade"
coverage: "coverage"
totalMissing: "missing_data"
totalSubstitutions: "divergence"
totalNonACGTNs: "nonACGTN"
qc.overallStatus: "QC_overall"
qc.missingData.status: "QC_missing_data"
qc.mixedSites.status: "QC_mixed_sites"
qc.privateMutations.status: "QC_rare_mutations"
qc.snpClusters.status: "QC_snp_clusters"
qc.frameShifts.status: "QC_frame_shifts"
qc.stopCodons.status: "QC_stop_codons"
frameShifts: "frame_shifts"
privateNucMutations.reversionSubstitutions: "private_reversion_substitutions"
privateNucMutations.labeledSubstitutions: "private_labeled_substitutions"
privateNucMutations.unlabeledSubstitutions: "private_unlabeled_substitutions"
privateNucMutations.totalReversionSubstitutions: "private_total_reversion_substitutions"
privateNucMutations.totalLabeledSubstitutions: "private_total_labeled_substitutions"
privateNucMutations.totalUnlabeledSubstitutions: "private_total_unlabeled_substitutions"
privateNucMutations.totalPrivateSubstitutions: "private_total_private_substitutions"
qc.snpClusters.clusteredSNPs: "private_snp_clusters"
qc.snpClusters.totalSNPs: "private_total_snp_clusters"
id_field: "seqName"
28 changes: 0 additions & 28 deletions ingest/defaults/nextclade_field_map.tsv

This file was deleted.

2 changes: 1 addition & 1 deletion ingest/rules/curate.smk
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ rule curate:

rule add_genbank_url:
input:
metadata=temp("data/all_metadata_intermediate.tsv"),
metadata="data/all_metadata_intermediate.tsv",
output:
metadata="data/all_metadata.tsv",
log:
Expand Down
59 changes: 36 additions & 23 deletions ingest/rules/nextclade.smk
Original file line number Diff line number Diff line change
Expand Up @@ -45,11 +45,35 @@ rule run_nextclade:
"""


rule join_metadata_and_nextclade:
rule nextclade_metadata:
input:
nextclade="results/nextclade.tsv",
output:
nextclade_metadata=temp("results/nextclade_metadata.tsv"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to mark this as temp? Would it not be useful for debugging along with all the other intermediate files in results/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's marked as temp because the next rule (join_metadata_and_nextclade) merges this file and the other metadata (data/subset_metadata.tsv) together, so all the data ends up in that file anyway. Any debugging would most likely happen using that file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we want to debug augur curate rename in this rule and join_metadata_and_nextclade fails or we simply don't want to run it?

My argument is mostly based on consistency, since the same could be said about results/nextclade.tsv: why not make that temp because it eventually makes its way into results/metadata.tsv?

Consider this non-blocking since it works fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we want to debug augur curate rename in this rule and join_metadata_and_nextclade fails or we simply don't want to run it?

In such a situation, Snakemake's --notemp option could be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My argument is mostly based on consistency, since the same could be said about results/nextclade.tsv: why not make that temp because it eventually makes its way into results/metadata.tsv?

results/nextclade.tsv and results/nextclade_metadata.tsv are effectively isotopes of the same underlying data; we don't need to retain both of them, because if you have one, you can tell what had to have been in the other one. I marked the latter one as temp because it's the newer of the two, so it was less change.

params:
nextclade_id_field=config["nextclade"]["id_field"],
nextclade_field_map=[f"{old}={new}" for old, new in config["nextclade"]["field_map"].items()],
nextclade_fields=",".join(config["nextclade"]["field_map"].values()),
log:
"logs/nextclade_metadata.txt",
benchmark:
"benchmarks/nextclade_metadata.tsv",
shell:
r"""
augur curate rename \
--metadata {input.nextclade:q} \
--id-column {params.nextclade_id_field:q} \
--field-map {params.nextclade_field_map:q} \
--output-metadata - \
| csvtk cut --tabs --fields {params.nextclade_fields:q} \
> {output.nextclade_metadata:q} 2> {log:q}
"""


rule join_metadata_and_nextclade:
input:
metadata="data/subset_metadata.tsv",
nextclade_field_map=config["nextclade"]["field_map"],
nextclade_metadata="results/nextclade_metadata.tsv",
output:
metadata="results/metadata.tsv",
params:
Expand All @@ -61,25 +85,14 @@ rule join_metadata_and_nextclade:
"benchmarks/join_metadata_and_nextclade.txt",
shell:
r"""
(
export SUBSET_FIELDS=`grep -v '^#' {input.nextclade_field_map} | awk '{{print $1}}' | tr '\n' ',' | sed 's/,$//g'`

csvtk -t cut -f $SUBSET_FIELDS \
{input.nextclade} \
| csvtk -t rename2 \
-F \
-f '*' \
-p '(.+)' \
-r '{{kv}}' \
-k {input.nextclade_field_map} \
| tsv-join -H \
--filter-file - \
--key-fields {params.nextclade_id_field} \
--data-fields {params.metadata_id_field} \
--append-fields '*' \
--write-all ? \
{input.metadata} \
| tsv-select -H --exclude {params.nextclade_id_field} \
> {output.metadata}
) 2>{log:q}
augur merge \
--metadata \
metadata={input.metadata:q} \
nextclade={input.nextclade_metadata:q} \
--metadata-id-columns \
metadata={params.metadata_id_field:q} \
nextclade={params.nextclade_id_field:q} \
--output-metadata {output.metadata:q} \
--no-source-columns \
&> {log:q}
"""
Loading