Analyze whether checksum robot steps can be made more efficient (i.e. take less time to run) #1074
Comments
One more bit of context: there are checksum steps in the preservationIngestWF. These tend to run more quickly than …
We think we've done sufficient analysis and made one key improvement in #1100, so we're going to say this is done, other than @andrewjbtw and @ndushay chatting about whether there's work left to be done as part of …
sdr-ingest-transfer invokes the DSA preserve endpoint. That in turn calls PreserveJob, which calls PreservationIngestService.transfer and then starts the preservation workflow. PreservationIngestService creates the bag. I don't believe that fixities are generated or checked in this step, but it is quite possible that some process in creating the bag is slow.
Somewhere between sdr-ingest-transfer (in accessionWF) and transfer-object (in preservationIngestWF), the following things happen:
From my observation of accessioning progress, a job called "PreserveJob" runs and is visible in the Sidekiq UI for DSA. However, this job appears to complete before preservationIngestWF is initiated. When sdr-ingest-transfer is finished, the "duration" is set to 1 second. It definitely does not take 1 second. For example, if you look at the timestamps in the XML, 9 days elapsed between the publish step and the completion of sdr-ingest-transfer on this druid: https://argo.stanford.edu/view/druid:sz929gx7593

It is possible that the names of things make it confusing to determine what each step does around the preservation handoff. I can only speak to what it looks like in the workflow view.
Additionally, fixities are definitely checked at sdr-ingest-transfer because occasional fixity errors have been reported at that step. I can probably deliberately make that happen by corrupting a file.
Example of a checksum validation failure at sdr-ingest-transfer. I deliberately changed …
Thanks for all of the details @andrewjbtw. Based on what you reported, the culprit is the unhelpfully named normalized_signature. However, it calls FileSignature.from_file, which is optimized for a single read. The sdr-transfer / bagger code is a warren, so it's possible that there is some other inefficiency.
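For illustration only, here is a sketch of what "optimized for a single read" means in practice. It is not the Ruby FileSignature.from_file implementation, just the general single-pass idea expressed at the shell level (bash, using process substitution; the output paths are arbitrary and the example path is the one from the lsof output further down):

# Sketch: stream the file from disk once and fan it out to three digest
# programs, instead of reading it once per digest.
file=/dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv  # example path
tee >(md5sum  > /tmp/md5.out) \
    >(sha1sum > /tmp/sha1.out) \
    < "$file" | sha256sum > /tmp/sha256.out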
I think we may want to have a storytime to come to a shared understanding of the full process from preassembly through final accessioning: where checksums are computed and under which conditions, and whether we can reduce the number of times we do it.
I am closing this ticket in favor of sul-dlss/dor-services-app#4594. Andrew has indicated he is unaware of any other efficiencies we can chase.
Background
There are two steps in the SDR accessioning workflows that can take an extraordinary amount of time when processing large (multi-TB) content. By extraordinary, I mean 1-5 days for a single robot step. Both are steps that check and/or calculate checksums:
- checksum-compute in the assemblyWF
- sdr-ingest-transfer in the accessionWF

Analysis requested
This ticket is for analysis and investigation into whether these steps are running as efficiently as they can given our hardware and network bandwidth constraints, or whether they can be sped up.
I'm very curious whether these steps are reading each file in a druid more than once. For example, the checksum-compute step may carry out multiple tasks. If the robot step is reading each file more than once, would it be possible to modify the step to carry out only one read? There's no way to generate checksums with less than one full read of each file, but it may not be necessary to read more than once in order to carry out the step.
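To make the one-read idea concrete, here is a hypothetical sketch (bash; not the checksum-compute robot's actual code) in which a checksum supplied at deposit time is verified and additional digests are computed from the same single pass over the file. The arguments are placeholders:

# Hypothetical sketch: verify a supplied MD5 and compute SHA-1/SHA-256 with
# exactly one read of the file. Requires bash for process substitution.
file="$1"           # placeholder: path to the content file
supplied_md5="$2"   # placeholder: checksum provided with the deposit
computed_md5=$(tee >(sha1sum   > /tmp/sha1.out) \
                   >(sha256sum > /tmp/sha256.out) \
                   < "$file" | md5sum | awk '{print $1}')
if [ "$computed_md5" = "$supplied_md5" ]; then
  echo "md5 verified for $file"
else
  echo "md5 MISMATCH for $file" >&2
fi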
Some evidence that the checksum-compute step may be reading files multiple times

I've been monitoring the progress of checksum-compute on a druid with 5 TB of data. The tool I've been using is lsof, which lists open files. By grepping for the druid, I get a list of files that are being processed that are associated with that druid.

$ lsof -o | grep sz929gx7593
ruby    1900060                    lyberadmin  9r   REG  0,57  0x99d4fb4000  161997549  /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby    1900060  1901089 Timeout   lyberadmin  9r   REG  0,57  0x99d4f90000  161997549  /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby    1900060  1901090 sidekiq.h lyberadmin  9r   REG  0,57  0x99d4f98000  161997549  /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby    1900060  1901093 sidekiq.d lyberadmin  9r   REG  0,57  0x99d4f9c000  161997549  /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby    1900060  1901132 worker.rb lyberadmin  9r   REG  0,57  0x99d4fa8000  161997549  /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby    1900060  2647395 launcher. lyberadmin  9r   REG  0,57  0x99d4fb0000  161997549  /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
screen  3892847                    lyberadmin  cwd  DIR  0,57                152134624  /dor/assembly/sz/929/gx/7593/sz929gx7593/content (sf6-webapp:/sul_lyberservices_prod_01)
The lsof -o option outputs the byte offset for the opened file, which I'm interpreting as an indicator of how much of the file has been read so far. Repeatedly running lsof -o is a way to track progress because you can see the byte offset increase until the whole file is read.

I wrote a bit of bash to translate the file offset into GB, so it would be easier to monitor progress. This outputs one line every five minutes with the number of GB read and the filepath of the currently open file.
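The script itself isn't reproduced here, so the following is a rough reconstruction of the kind of polling loop described. It is hypothetical (not the author's actual script) and assumes gawk for the hex-offset conversion:

# Hypothetical reconstruction of the monitoring loop described above.
# Every five minutes, find an open file for the druid in `lsof -o` output,
# convert its hex byte offset to GB, and print it with the file path.
druid=sz929gx7593
while true; do
  lsof -o 2>/dev/null | grep "$druid" | awk '
    { for (i = 1; i <= NF; i++)
        if ($i ~ /^0x/) { printf "%.1f GB  %s\n", strtonum($i) / 2^30, $(i + 2); exit } }'
  sleep 300
done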
After monitoring accessioning for a couple of days, I've noticed from the output of this script that every file appears to be read twice. The byte offset increases until the full size of the file is reached, then starts over again at 0.
If this is accurate, it suggests that there could be significant time savings if we could reduce the number of reads to a single pass.
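For a rough sense of scale (the throughput figure here is an assumption, not a measurement): at a sustained read rate of 100 MB/s, a single pass over 5 TB takes on the order of 14 hours, so eliminating a second full read would roughly halve the wall-clock time this step spends on I/O.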