Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: report checksum file download failures #92

Merged
merged 11 commits into from
Sep 30, 2024

Conversation

bluegenes
Copy link
Collaborator

@bluegenes bluegenes commented Sep 6, 2024

This PR reports any failures brought about by issues downloading or parsing the md5sum file within gbsketch. These cases were not previously included in the failures file. This is the issue in #70.

Now that we have a separate output file, I thought it might be useful to also pull out cases where the md5sum does not match expectation, and write those to the md5sum failure file as well (added in 6755296). These were previously just written to the download failures file, with no explanation for why the download failed (e.g. whether it was a download issue or an md5sum matching issue).

There is no straightforward way to test the checksum failures file via gbsketch, but we use the mismatch md5sum case to test it via urlsketch.

There is a "major" change this PR brings: the --checksum-fail file is now required for gbsketch. I think it's not possible to make this optional with the current strategy, because otherwise the errors for failed md5sum file download/parsing would not be properly captured anywhere. I have made --checksum-fail optional for urlsketch, because there folks need to manually provide expected md5sums, so we don't have md5sum file download issues and often we won't even have md5sums to check.

@bluegenes
Copy link
Collaborator Author

@ccbaumler do you have an accession where the md5sum file download or parsing definitely fails? My assumption is that the majority of the issues here happened when NCBI was updating files (and so shouldn't happen most of the time), but I'd like to figure out a way to test the failed checksum file output.

cc @ctb

@ctb
Copy link
Contributor

ctb commented Sep 7, 2024

looking at the log messages, the checksum file that is being returned is almost certainly some kind of HTML error message.

Hmm, I wonder - are yoweu checking to see if the HTTP status is 200? I would expect the NCBI server to return a non-200 in the case where it's failing for any reason. That might be a simple and easy way to determine if you're getting the expected file...

@ccbaumler
Copy link

ccbaumler commented Sep 7, 2024

do you have an accession where the md5sum file download or parsing definitely fails?

No, I think your intuition is spot on.

An example set from the log file generated during the linked issue:

^MESC[K
== This is sourmash version 4.8.9. ==
^MESC[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

^MESC[K=> sourmash_plugin_directsketch 0.3.2
^MESC[Kparams: ['dna,k=21,k=31,k=51,scaled=1000,abund']
^MESC[Kdownloading and sketching all accessions in '/group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20240712/data/missing-genomes.20240712-viral.csv using 5 retries and 4 threads
No protein signature templates provided, and --keep-fasta is not set.
Downloading and sketching genomes only.
Error: Invalid checksum line format in URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/535/875/GCA_023535875.1_ASM2353587v1/md5checksums.txt: <head>
Error: Invalid checksum line format in URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/533/615/GCA_023533615.1_ASM2353361v1/md5checksums.txt: <head>
Error: Invalid checksum line format in URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/533/665/GCA_023533665.1_ASM2353366v1/md5checksums.txt: <head>

@ctb had pointed out that <head> was most likely from a html return stating either:

  • ncbi is down (which happened at least twice recently)
  • I over-requested and ncbi refused service :(

@bluegenes
Copy link
Collaborator Author

Ah, good point @ctb - e883aad adds a check for the response status for the md5sum file.

I modified the checksum output file to contain the full error. This will provide both the calculated and expected md5sums in the event we have a mismatch. This output file cannot be read in as input to urlsketch like the download failures, but I think that's ok because we probably want to 1. try again to download the md5sum file, or 2. investigate what led to the md5sum mismatch.

I've added the failed checksum as optional output to urlsketch, which allows me to test the output format by using the mismatched md5sum case.

@bluegenes
Copy link
Collaborator Author

ok @ctb this is ready for review!

@bluegenes bluegenes changed the title WIP: report checksum download failures MRG: report checksum download failures Sep 7, 2024
@ctb
Copy link
Contributor

ctb commented Sep 7, 2024

thanks @bluegenes - have you had a chance to run it on something that generates the errors? Is this something @ccbaumler could/should do before merge? Or would you prefer merge on the code review and then see what happens with the release?

I could build a wheel for @ccbaumler to install and run out on farm, if you want someone to try "at scale", which is where I think this error shows up.

@bluegenes bluegenes changed the title MRG: report checksum download failures MRG: report checksum file download failures Sep 7, 2024
src/directsketch.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great! I don't quite understand (on a first pass) exactly how you are testing this, but the tests are all there so 👍

@bluegenes
Copy link
Collaborator Author

Overall this looks great! I don't quite understand (on a first pass) exactly how you are testing this, but the tests are all there so 👍

thanks! I mean, unfortunately I am not completely testing it within gbsketch, because I don't have examples where it will certainly fail. But the same functionality is used across both gbsketch and urlsketch for checksum failures, so the new struct + output is at least partly tested through the md5sum mismatch test I added for urlsketch.

I have not run it at scale -- would be great if @ccbaumler wanted to give it a run.

@ctb
Copy link
Contributor

ctb commented Sep 8, 2024

@ccbaumler if you wanted to give this a try, the release wheel is at farm:/group/datalabgrp/ctbrown/sourmash_plugin_directsketch/target/wheels/sourmash_plugin_directsketch-0.3.2-cp312-cp312-manylinux_2_34_x86_64.whl and you should be able to run something like:

module load mamba
mamba create -y -n directsketch_test sourmash python=3.12
mamba activate directsketch_test
pip install /group/datalabgrp/ctbrown/sourmash_plugin_directsketch/target/wheels/sourmash_plugin_directsketch-0.3.2-cp312-cp312-manylinux_2_34_x86_64.whl

and then go about your normal business of running directsketch.

@ccbaumler
Copy link

The script works as expected. I do not know how to force a checksum error. Any suggestions?

Test

Get genbank data

curl -L https://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/assembly_summary.txt > viral-assembly.txt
curl -L https://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/assembly_summary_historical.txt > viral-history.txt

Get sourmash manifest

sourmash sig manifest --no-rebuild /group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-viral-k21.zip -o viral-mf.csv

Make a csv file for directsketch

$ ./scripts/update_sourmash_dbs.py viral-mf.csv -a viral-assembly.txt -b viral-history.txt --updated-version viral-update.csv --missing-genomes viral-missing.csv

Loading assembly summary from 'viral-assembly.txt'
Reading assembly file content...
Loaded 192613 identifiers with their individual comparison, gca_version, gcf_version, refseq_acc information

Loading historical summary from 'viral-history.txt'
Reading assembly file content...
Loaded 2031 identifiers with 2070 total versions across the identifiers

Trying to load Sourmash Manifest from viral-mf.csv...
Loaded manifest with 47951 rows

Writing link files...
Reading assembly summary file...
Found 145338 missing sequences...
Writing from GCA_list now...
Wrote 145338/145338 rows in GCA_list now...      38

...Wrote viral-missing.csv: Line 145338 of 145338
Expected 91 sequences to update
Actual found sequences to update: 91
GenBank (GCA) = 91 | RefSeq (GCF) = 0
Missing sequences? 0
Writing from GCA_list now...
Wrote 91/91 rows in GCA_list now...      1
Writing from GCF_list now...
Wrote 91/91 rows in GCF_list now...

...Wrote viral-update.csv: Line 91 of 91


From 'viral-assembly.txt:
Loaded 192613 identifiers
Each contains a unique comparison, gca_version, gcf_version, refseq_acc

From 'viral-history.txt':
Loaded 2031 identifiers
Loaded 2070 total versions across the identifiers

File assembly databases created on Mon Sep  9 10:41:38 2024
File assembly databases last modified Mon Sep  9 10:41:38 2024

From 'viral-mf.csv':
Kept 47184 of 47951 (98.40%) identifiers.

New manifest 'None':
Kept 47184 identifiers.
Included 145338 new genomes by new identifiers.
Removed 767 total identifiers.
Removed 91 identifiers because of version change.
Removed 676 identifiers because of suspected suspension of the genome.

Run this branch version of directsketch

$ sourmash scripts gbsketch viral-update.csv -o viral-test.zip --failed viral-fail.csv --checksum-fail viral-checksum.csv --param-string "dna,k=21,scaled=1000,abund" -r 5 -g 2> viral-log.txt
Loaded 91 rows (including 91 rows with valid URL).
Starting accession 1/91 (1%)
Starting accession 2/91 (2%)
Starting accession 3/91 (3%)
Starting accession 20/91 (22%)
Starting accession 21/91 (23%)
Starting accession 22/91 (24%)
Starting accession 23/91 (25%)

....

Starting accession 87/91 (96%)
Starting accession 88/91 (97%)
Starting accession 89/91 (98%)
Starting accession 90/91 (99%)
Starting accession 91/91 (100%)
Writing manifest

Check zip file output

$ sourmash sig summarize viral-test.zip

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

** loading from 'viral-test.zip'
path filetype: ZipFileLinearIndex
location: /home/baumlerc/database-releases/genbank-workflow/viral-test.zip
is database? yes
has manifest? yes
num signatures: 91
** examining manifest...
total hashes: 8441
summary of sketches:
   91 sketches with DNA, k=21, scaled=1000, abund     8441 total hashes

@bluegenes
Copy link
Collaborator Author

bluegenes commented Sep 30, 2024

@ctb @ccbaumler -- any objections to me merging this one?

@ctb
Copy link
Contributor

ctb commented Sep 30, 2024

go for it - but if you could create an issue saying that we should try forcing a checksum error somehow to test this better, I'd appreciate it ;)

@bluegenes bluegenes merged commit fce0368 into main Sep 30, 2024
1 check passed
@bluegenes bluegenes deleted the report-checksum-download-failures branch September 30, 2024 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gbsketch does not report sequences failed to sketch if checksum fails
3 participants