Incorporate use of datasets-cli into genome prepare pipeline #375

ens-LCampbell · 2024-05-22T16:12:03Z

Aim:
Make use of a container with the datasets-cli tool to pull both metadata and data files from NCBI.

Changes:

Added a bespoke Nextflow module process (download_asm_with_datasets) to work with datasets-cli; the old module process (download_asm_data) is retained for now.
Updated the Nextflow config so that the process uses the container, via withLabel: 'datasets_cli'.
Various updates to the genome_metadata/extend and seq_region/prepare modules to align with the change in assembly report meta keys.
Added the latest (well, it was; it is now the semi-latest, 16.17.1) definition file for datasets-cli SIF creation.
Updated the test suite of the genome_prepare pipeline, including data differences and MD5 checksums.
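
For the MD5 checksums mentioned in the last item, a minimal sketch of how an expected value could be recomputed for a changed test output file. The helper name `file_md5` is hypothetical and not part of this PR; it only uses the Python standard library.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper (not part of the PR), shown only to illustrate
    how an md5sum entry in a test definition could be regenerated.
    """
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared with, or pasted into, the md5sum field of the corresponding test entry.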

@MatBarba I have tested this with the following; if you can think of anything else, let me know, thanks:

inputs with and without annotation
with and without brc_mode enabled
using '-stub'

Update lcampbell/datasets_gbff with change in main

JAlvarezJarreta

Great improvement to GenomIO 👍 Good to see datasets is in good shape to be used for production.

Since v16.17.3 is already out, I'm suggesting we use the latest to avoid having to do another update at a later stage (the definition file will need to be renamed as well).

containers/ncbi_datasets_v16.17.1.def

pipelines/nextflow/modules/download/download_asm_with_datasets.nf

pipelines/nextflow/subworkflows/genome_prepare/main.nf

pipelines/nextflow/workflows/nextflow.config

Update to latest version of datasets Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

Add indents to process commands Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

add indents to process commands Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

fix typo to nextflow Module name Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

MatBarba

One important typo, and we need to check the output of seq_region.json (the new version includes some empty values, and has seemingly lost another).

pipelines/nextflow/modules/download/download_asm_with_datasets.nf

MatBarba · 2024-05-23T17:07:05Z

pipelines/nextflow/tests/workflows/test_genome_prepare.yml

-      md5sum: 28518b0c7cbc19a2890a6b347367a82f
+      md5sum: 6a45dc461c53e33dde33807c6def7b63


The content of the file has significantly changed. Here's a diff:

    "synonyms": [                             "synonyms": [
      {                                         {
        "name": "CM029948.1",                     "name": "CM029948.1",
        "source": "GenBank"                       "source": "GenBank"
      },                                        },
      {                                         {
        "name": "Mitochondrion",          |       "name": "MT",
        "source": "INSDC"                         "source": "INSDC"
      },                                        },
      {                                         {
        "name": "",                       |       "name": "HcG217B07",
        "source": "INSDC_submitted_name"          "source": "INSDC_submitted_name"
      },                                  <
      {                                   <
        "name": "",                       <
        "source": "RefSeq"                <
      }                                         }
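
As a rough illustration of how this kind of regression could be checked automatically, here is a minimal sketch that compares two synonym lists per source and flags empty names. The function names are hypothetical; the sample values are taken from the diff above, with the list containing the empty names treated as the new output.

```python
from typing import Dict, List

Synonym = Dict[str, str]


def index_by_source(synonyms: List[Synonym]) -> Dict[str, str]:
    """Map each synonym source to its name (assumes one synonym per source)."""
    return {syn["source"]: syn["name"] for syn in synonyms}


def compare_synonyms(old: List[Synonym], new: List[Synonym]) -> List[str]:
    """Describe per-source differences between two synonym lists."""
    old_idx, new_idx = index_by_source(old), index_by_source(new)
    report = []
    for source in sorted(set(old_idx) | set(new_idx)):
        before, after = old_idx.get(source), new_idx.get(source)
        if before is None:
            report.append(f"{source}: added {after!r}")
        elif after is None:
            report.append(f"{source}: removed {before!r}")
        elif before != after:
            report.append(f"{source}: {before!r} -> {after!r}")
    return report


def empty_sources(synonyms: List[Synonym]) -> List[str]:
    """Return the sources whose synonym name is empty or whitespace only."""
    return [syn["source"] for syn in synonyms if not syn["name"].strip()]


# Sample values copied from the diff in this comment.
old = [
    {"name": "CM029948.1", "source": "GenBank"},
    {"name": "MT", "source": "INSDC"},
    {"name": "HcG217B07", "source": "INSDC_submitted_name"},
]
new = [
    {"name": "CM029948.1", "source": "GenBank"},
    {"name": "Mitochondrion", "source": "INSDC"},
    {"name": "", "source": "INSDC_submitted_name"},
    {"name": "", "source": "RefSeq"},
]

for line in compare_synonyms(old, new):
    print(line)
print("empty names:", empty_sources(new))
```

A check like `empty_sources` could be turned into a test assertion so that empty synonym names fail the test suite rather than only changing a checksum.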

MatBarba · 2024-05-23T17:19:14Z

Another general comment: we're generating an assembly report file derived from the jsonl file, but it has different attributes from the usual one that we get from FTP. Would it not make more sense to parse the JSON file directly (and leave the current seq_region/extend.py as deprecated/legacy, to parse the assembly report from FTP)?
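
To make the suggestion concrete, a minimal sketch of reading a JSON Lines sequence report directly and building synonyms from it, skipping empty values. The key names (`genbankAccession`, `refseqAccession`, `chrName`) and their mapping to synonym sources are assumptions for illustration, not taken from this PR, and should be checked against the actual datasets-cli output.

```python
import json
from typing import Dict, Iterable, Iterator, List

# ASSUMPTION: these report keys and their target synonym sources are
# illustrative only; verify them against the real datasets-cli output.
DEFAULT_KEY_TO_SOURCE = {
    "genbankAccession": "GenBank",
    "refseqAccession": "RefSeq",
    "chrName": "INSDC",
}


def parse_sequence_report(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one record per non-blank line of a JSON Lines report."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def synonyms_from_record(
    record: Dict, key_to_source: Dict[str, str] = DEFAULT_KEY_TO_SOURCE
) -> List[Dict[str, str]]:
    """Build the synonym list for one record, dropping empty names."""
    synonyms = []
    for key, source in key_to_source.items():
        # Missing keys and null values are both treated as empty.
        name = str(record.get(key) or "").strip()
        if name:
            synonyms.append({"name": name, "source": source})
    return synonyms
```

With a real file, `parse_sequence_report` can be fed an open file handle, since iterating a text file yields its lines.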

Fix another typo to nxf process name Co-authored-by: Matthieu Barba <mbarba@ebi.ac.uk>

ens-LCampbell · 2024-05-24T09:39:31Z

> Another general comment: we're generating an assembly report file derived from the jsonl file, but it has different attributes from the usual one that we get from FTP. Would it not make more sense to parse the JSON file directly (and leave the current seq_region/extend.py as deprecated/legacy, to parse the assembly report from FTP)?

This would indeed make more sense! Basically, it would remove the extra step of converting seq_region.jsonl to TSV. The current approach does work, albeit with slightly more processing overhead; the issue is the loss of INSDC_submitted_name in the datasets world. If I'm not mistaken, this is incoming soonish, is that right @JAlvarezJarreta?

JAlvarezJarreta · 2024-05-24T10:05:40Z

Yes, although Nuala has not provided a timeline, so if this element is critical for our system we may refrain from merging this PR until that time.

We also have to be careful about legacy elements and where they sit (although this one, whilst legacy, is useful and its presence justifiable).

ens-LCampbell and others added 5 commits May 22, 2024 14:43

Inital incorporation of datasets into genome_prepare

e3a2f61

Update ncbi cli singularity Def file v16.17.1

d58786a

Merge pull request #374 from Ensembl/main

3e531c9

Update lcampbell/datasets_gbff with change in main

Fix prepare seq region in brc mode with datasets-cli

698da1b

Update genome_prepare nxf test suite

db3d2ae

ens-LCampbell requested review from JAlvarezJarreta and MatBarba May 22, 2024 16:12

Blacked seq_region prepare, fixed pytest test+data

5ff99af

JAlvarezJarreta requested changes May 23, 2024

ens-LCampbell and others added 6 commits May 23, 2024 11:36

Update containers/ncbi_datasets_v16.17.1.def

ecf8c87

Update to latest version of datasets Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

Update containers/ncbi_datasets_v16.17.1.def

60a2cda

Update to latest version of datasets Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

Update pipelines/nextflow/modules/download/download_asm_with_datasets.nf

5a3a41e

Add indents to process commands Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

Update pipelines/nextflow/modules/download/download_asm_with_datasets.nf

8df793a

add indents to process commands Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

Update pipelines/nextflow/subworkflows/genome_prepare/main.nf

8a3a14e

fix typo to nextflow Module name Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

Update pipelines/nextflow/subworkflows/genome_prepare/main.nf

7eb2044

fix typo to nextflow Module name Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>

ens-LCampbell requested a review from JAlvarezJarreta May 23, 2024 10:46

Update container version file name to 16.7.3

f0b97be

JAlvarezJarreta approved these changes May 23, 2024

MatBarba requested changes May 23, 2024

Update pipelines/nextflow/modules/download/download_asm_with_datasets.nf

481ce8c

Fix another typo to nxf process name Co-authored-by: Matthieu Barba <mbarba@ebi.ac.uk>

Update datasets docker endpoint to ensemblorg

3889e89

ens-LCampbell closed this Jul 4, 2024

ens-LCampbell deleted the lcampbell/datasets_gbff branch October 7, 2024 13:07
