-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorporate use of datasets-cli into genome prepare pipeline #375
Conversation
Update lcampbell/datasets_gbff with change in main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great improvement to GenomIO 👍 Good to see datasets
is in good shape to be used for production.
Since v16.17.3 is already out, I'm suggesting we use the latest to avoid having to do another update at a later stage (the definition file will need to be renamed as well).
pipelines/nextflow/modules/download/download_asm_with_datasets.nf
Outdated
Show resolved
Hide resolved
pipelines/nextflow/modules/download/download_asm_with_datasets.nf
Outdated
Show resolved
Hide resolved
Update to latest version of datasets Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
Update to latest version of datasets Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
Add indents to process commands Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
add indents to process commands Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
fix typo to nextflow Module name Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
fix typo to nextflow Module name Co-authored-by: J. Alvarez-Jarreta <jalvarez@ebi.ac.uk>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One important typo, and need to check the output of seq_region.json (the new version includes some empty values, and has seemingly lost another)
pipelines/nextflow/modules/download/download_asm_with_datasets.nf
Outdated
Show resolved
Hide resolved
md5sum: 28518b0c7cbc19a2890a6b347367a82f | ||
md5sum: 6a45dc461c53e33dde33807c6def7b63 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The content of the file has significantly changed. Here's a diff:
"synonyms": [ "synonyms": [
{ {
"name": "CM029948.1", "name": "CM029948.1",
"source": "GenBank" "source": "GenBank"
}, },
{ {
"name": "Mitochondrion", | "name": "MT",
"source": "INSDC" "source": "INSDC"
}, },
{ {
"name": "", | "name": "HcG217B07",
"source": "INSDC_submitted_name" "source": "INSDC_submitted_name"
}, <
{ <
"name": "", <
"source": "RefSeq" <
} }
Another general comment: we're generating an assembly report file derived from the jsonl file, but it has different attributes than the usual one that we get from ftp. Then would it not make more sense to parse the json file directly (and leave the current seq_region/extend.py as deprecated/legacy to parse assembly report from ftp)? |
Fix another typo to nxf process name Co-authored-by: Matthieu Barba <mbarba@ebi.ac.uk>
This would make more sense Indeed ! Basically remove the extra step of converting to TSV from seq_region.jsonl. However it does work albeit slightly extra processing overhead, the issue is loss of INSDC_submitted_name in the datasets world. If Im not mistake, this is incoming soonish is that right @JAlvarezJarreta ? |
Yes, although Nuala has not provided a timeline so if this elements is critical for our system we may refrain from merging this PR until that time. We have also to be careful about legacy elements and where they sit (although this one, whilst legacy, its presence is useful and justifiable). |
Aim::
Make use of container with datasets-cli tool to pull both metadata and data file from NCBI.
Changes:
@MatBarba I have tested this over, if you can think of anything else let me know thanks