Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate use of datasets-cli into genome prepare pipeline #375

Closed
wants to merge 15 commits into from

Conversation

ens-LCampbell
Copy link
Member

Aim::
Make use of container with datasets-cli tool to pull both metadata and data file from NCBI.

Changes:

  • Addition of bespoke nextflow module (download_asm_with_datasets) process to work with datasets-cli, retained old module process (download_asm_data) for now.
  • Update nextflow config to make use of container in process with use of withlabel: 'datasets_cli'
  • Various updates to genome_metadata/extend and seq_region/prepare modules to align with change in assembly report meta keys
  • Added the latest (ok it was, now is semi latest 16.17.1) definition file for datasets-cli SIF creation
  • Update the test suite of genome_prepare pipeline, including data differences and MD5 checksums

@MatBarba I have tested this over, if you can think of anything else let me know thanks

  • inputs with and without annotation
  • with and without brc_mode enabled
  • using '-stub'

Copy link
Contributor

@JAlvarezJarreta JAlvarezJarreta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement to GenomIO 👍 Good to see datasets is in good shape to be used for production.

Since v16.17.3 is already out, I'm suggesting we use the latest to avoid having to do another update at a later stage (the definition file will need to be renamed as well).

containers/ncbi_datasets_v16.17.1.def Outdated Show resolved Hide resolved
containers/ncbi_datasets_v16.17.1.def Outdated Show resolved Hide resolved
pipelines/nextflow/subworkflows/genome_prepare/main.nf Outdated Show resolved Hide resolved
pipelines/nextflow/subworkflows/genome_prepare/main.nf Outdated Show resolved Hide resolved
pipelines/nextflow/workflows/nextflow.config Outdated Show resolved Hide resolved
ens-LCampbell and others added 6 commits May 23, 2024 11:36
Update to latest version of datasets

Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
Update to latest version of datasets

Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
Add indents to process commands

Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
add indents to process commands

Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
fix typo to nextflow Module name

Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
fix typo to nextflow Module name

Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
Copy link
Contributor

@MatBarba MatBarba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One important typo, and need to check the output of seq_region.json (the new version includes some empty values, and has seemingly lost another)

md5sum: 28518b0c7cbc19a2890a6b347367a82f
md5sum: 6a45dc461c53e33dde33807c6def7b63
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content of the file has significantly changed. Here's a diff:

        "synonyms": [                                                   "synonyms": [
            {                                                               {
                "name": "CM029948.1",                                           "name": "CM029948.1",
                "source": "GenBank"                                             "source": "GenBank"
            },                                                              },
            {                                                               {
                "name": "Mitochondrion",                      |                 "name": "MT",
                "source": "INSDC"                                               "source": "INSDC"
            },                                                              },
            {                                                               {
                "name": "",                                   |                 "name": "HcG217B07",
                "source": "INSDC_submitted_name"                                "source": "INSDC_submitted_name"
            },                                                <
            {                                                 <
                "name": "",                                   <
                "source": "RefSeq"                            <
            }                                                               }

@MatBarba
Copy link
Contributor

Another general comment: we're generating an assembly report file derived from the jsonl file, but it has different attributes than the usual one that we get from ftp. Then would it not make more sense to parse the json file directly (and leave the current seq_region/extend.py as deprecated/legacy to parse assembly report from ftp)?

Fix another typo to nxf process name

Co-authored-by: Matthieu Barba <mbarba@ebi.ac.uk>
@ens-LCampbell
Copy link
Member Author

Another general comment: we're generating an assembly report file derived from the jsonl file, but it has different attributes than the usual one that we get from ftp. Then would it not make more sense to parse the json file directly (and leave the current seq_region/extend.py as deprecated/legacy to parse assembly report from ftp)?

This would make more sense Indeed ! Basically remove the extra step of converting to TSV from seq_region.jsonl. However it does work albeit slightly extra processing overhead, the issue is loss of INSDC_submitted_name in the datasets world. If Im not mistake, this is incoming soonish is that right @JAlvarezJarreta ?

@JAlvarezJarreta
Copy link
Contributor

Yes, although Nuala has not provided a timeline so if this elements is critical for our system we may refrain from merging this PR until that time.

We have also to be careful about legacy elements and where they sit (although this one, whilst legacy, its presence is useful and justifiable).

@ens-LCampbell ens-LCampbell deleted the lcampbell/datasets_gbff branch October 7, 2024 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants