-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: report checksum file download failures #92
Conversation
@ccbaumler do you have an accession where the cc @ctb |
looking at the log messages, the checksum file that is being returned is almost certainly some kind of HTML error message. Hmm, I wonder - are yoweu checking to see if the HTTP status is 200? I would expect the NCBI server to return a non-200 in the case where it's failing for any reason. That might be a simple and easy way to determine if you're getting the expected file... |
No, I think your intuition is spot on. An example set from the log file generated during the linked issue:
@ctb had pointed out that
|
Ah, good point @ctb - e883aad adds a check for the response status for the md5sum file. I modified the checksum output file to contain the full error. This will provide both the calculated and expected md5sums in the event we have a mismatch. This output file cannot be read in as input to I've added the failed checksum as optional output to urlsketch, which allows me to test the output format by using the mismatched md5sum case. |
ok @ctb this is ready for review! |
thanks @bluegenes - have you had a chance to run it on something that generates the errors? Is this something @ccbaumler could/should do before merge? Or would you prefer merge on the code review and then see what happens with the release? I could build a wheel for @ccbaumler to install and run out on farm, if you want someone to try "at scale", which is where I think this error shows up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks great! I don't quite understand (on a first pass) exactly how you are testing this, but the tests are all there so 👍
thanks! I mean, unfortunately I am not completely testing it within I have not run it at scale -- would be great if @ccbaumler wanted to give it a run. |
@ccbaumler if you wanted to give this a try, the release wheel is at farm:
and then go about your normal business of running directsketch. |
The script works as expected. I do not know how to force a checksum error. Any suggestions? TestGet genbank data
Get sourmash manifest
Make a csv file for directsketch
Run this branch version of directsketch
Check zip file output
|
@ctb @ccbaumler -- any objections to me merging this one? |
go for it - but if you could create an issue saying that we should try forcing a checksum error somehow to test this better, I'd appreciate it ;) |
gbsketch
does not report sequences failed to sketch if checksum fails #70This PR reports any failures brought about by issues downloading or parsing the md5sum file within
gbsketch
. These cases were not previously included in the failures file. This is the issue in #70.Now that we have a separate output file, I thought it might be useful to also pull out cases where the md5sum does not match expectation, and write those to the md5sum failure file as well (added in 6755296). These were previously just written to the download failures file, with no explanation for why the download failed (e.g. whether it was a download issue or an md5sum matching issue).
There is no straightforward way to test the checksum failures file via
gbsketch
, but we use the mismatch md5sum case to test it viaurlsketch
.There is a "major" change this PR brings: the
--checksum-fail
file is now required forgbsketch
. I think it's not possible to make this optional with the current strategy, because otherwise the errors for failed md5sum file download/parsing would not be properly captured anywhere. I have made--checksum-fail
optional forurlsketch
, because there folks need to manually provide expected md5sums, so we don't have md5sum file download issues and often we won't even have md5sums to check.