Analyze whether checksum robot steps can be made more efficient (i.e. take less time to run) #1074

Closed

andrewjbtw opened this issue Jul 14, 2023 · 10 comments

Labels: needs analysis (cannot proceed with this issue without analysis), tech debt

@andrewjbtw commented Jul 14, 2023

Background

There are two steps in the SDR accessioning workflows that can take an extraordinary amount of time when processing large (multi-TB) content. By extraordinary, I mean 1-5 days for a single robot step. Both steps check and/or calculate checksums:

checksum-compute in the assemblyWF
sdr-ingest-transfer in the accessionWF

Analysis requested

This ticket is for analysis and investigation into whether these steps are running as efficiently as they can given our hardware and network bandwidth constraints, or whether they can be sped up.

I'm very curious whether these steps are reading each file in a druid more than once. For example, the checksum-compute step may carry out multiple tasks:

  • generate MD5 checksum for every file
  • generate SHA1 checksum for every file
  • verify the user-supplied MD5 checksum for every file

If the robot step is reading each file more than once, would it be possible to modify the step to carry out only one read? There's no way to generate checksums with less than one full read of each file, but it may not be necessary to read more than once in order to carry out the step.
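
For illustration, here is a minimal Ruby sketch of that single-read approach (this is not the actual robot code; the method name, keyword argument, and chunk size are invented for the example). Each chunk read from disk feeds both digest objects, and the user-supplied MD5 is verified from the same pass.

# Sketch only, not the checksum-compute implementation: read the file once,
# feeding every digest from the same chunk, then verify the supplied MD5.
require 'digest'

def single_pass_checksums(path, expected_md5: nil)
  md5  = Digest::MD5.new
  sha1 = Digest::SHA1.new
  File.open(path, 'rb') do |io|
    while (chunk = io.read(1024 * 1024)) # one pass over the file, 1 MiB at a time
      md5  << chunk
      sha1 << chunk
    end
  end
  if expected_md5 && md5.hexdigest != expected_md5
    raise "MD5 mismatch for #{path}: expected #{expected_md5}, got #{md5.hexdigest}"
  end
  { md5: md5.hexdigest, sha1: sha1.hexdigest }
end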

Some evidence that the checksum-compute step may be reading files multiple times

I've been monitoring the progress of checksum-compute on a druid with 5 TB of data. The tool I've been using is lsof, which lists open files. By grepping for the druid, I get a list of the open files associated with that druid.

$ lsof -o | grep sz929gx7593
ruby      1900060                        lyberadmin    9r      REG               0,57 0x99d4fb4000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901089 Timeout        lyberadmin    9r      REG               0,57 0x99d4f90000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901090 sidekiq.h      lyberadmin    9r      REG               0,57 0x99d4f98000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901093 sidekiq.d      lyberadmin    9r      REG               0,57 0x99d4f9c000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901132 worker.rb      lyberadmin    9r      REG               0,57 0x99d4fa8000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 2647395 launcher.      lyberadmin    9r      REG               0,57 0x99d4fb0000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
screen    3892847                        lyberadmin  cwd       DIR               0,57              152134624 /dor/assembly/sz/929/gx/7593/sz929gx7593/content (sf6-webapp:/sul_lyberservices_prod_01)

The lsof -o option outputs the byte offset for the opened file, which I'm interpreting as an indicator of how much of the file has been read so far. Repeatedly running lsof -o is a way to track progress because you can see the byte offset increase until the whole file is read.

I wrote a bit of bash to translate the file offset into GB, so it would be easier to monitor progress.

while true
do 
	fileinfo=$(lsof -o | grep sz929gx7593 | head -n1 | tr -s ' ') # take the first matching line of lsof output and squeeze repeated spaces so the fields can be cut below
	filepath=$(echo "$fileinfo" | cut -d ' ' -f9)
	bytecount=$(echo "$fileinfo" | cut -d ' ' -f7)
	echo "$(printf "%d\n" $bytecount) / 1073741824" | bc | tr '\n' ' ' # translate the hex to decimal, then convert bytes to GB
	echo "$filepath"
	sleep 300
done

This outputs one line every five minutes with the number of GB read and the filepath of the currently open file:

407 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
418 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
427 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv

After monitoring accessioning for a couple of days, I've noticed from the output of this script that every file appears to be read twice. The byte offset increases until the full size of the file is reached, then starts over again at 0.

585 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
596 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv 
0 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
7 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
17 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
...
614 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv
630 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv
13 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv
24 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv

If this is accurate, it suggests that there could be significant time savings if we could reduce the number of reads to a single pass.

@andrewjbtw (Author)

One more bit of context: there are checksum steps in the preservationIngestWF. These tend to run more quickly than checksum-compute and sdr-ingest-transfer on the same content. It's not clear whether that's due to differences in hardware (they run after files have reached preservation storage), to implementation strategy, or to both.

@ndushay added the "needs analysis" and "tech debt" labels on Sep 12, 2023
@justinlittman (Contributor)

  • common accessioning checksum-compute uses Assembly::ObjectFile, which performs individual digests (meaning the file is read multiple times).
  • pres cat checksum validator uses Moab::FileSignature which computes multiple digests from a single read.
  • pres robots validate bag uses DepositBagValidator which computes multiple digests from a single read.
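
To make the difference concrete, here is a schematic Ruby comparison based on the descriptions above (not the gems' actual code; the path and chunk size are placeholders).

require 'digest'

path = 'content/example_pm.mkv' # placeholder path

# Per-digest reads, as described for Assembly::ObjectFile: one full pass per digest.
md5_only  = Digest::MD5.file(path).hexdigest   # pass 1 over the file
sha1_only = Digest::SHA1.file(path).hexdigest  # pass 2 over the file

# Single read, as described for Moab::FileSignature and DepositBagValidator:
# one pass over the file feeding several digests at once.
digests = { md5: Digest::MD5.new, sha1: Digest::SHA1.new, sha256: Digest::SHA256.new }
File.open(path, 'rb') do |io|
  while (chunk = io.read(8 * 1024 * 1024))
    digests.each_value { |d| d << chunk }
  end
end
checksums = digests.transform_values(&:hexdigest)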

@mjgiarlo (Member) commented Sep 18, 2023

We think we've done sufficient analysis and made one key improvement in #1100, so we're going to say this is done, other than @andrewjbtw and @ndushay chatting about whether there's work left to be done as part of sdr-ingest-transfer.

@justinlittman (Contributor)

sdr-ingest-transfer invokes the DSA preserve endpoint. That in turn calls PreserveJob, which calls PreservationIngestService.transfer and then starts the preservation workflow.

PreservationIngestService creates the bag. I don't believe that fixities are generated or checked in this step, but it is quite possible that some process in creating the bag is slow.

@andrewjbtw (Author) commented Sep 18, 2023

Somewhere between sdr-ingest-transfer (in accessionWF) and transfer-object (in preservationIngestWF), the following things happen:

  • checksums are validated (MD5, SHA1)
  • a new checksum (SHA256) is computed and added to Cocina
  • a bag is created in /dor/export with checksum manifests for MD5, SHA1, and SHA256, plus symlinks to the content

From my observation of accessioning progress, a job called "PreserveJob" runs and is visible in the Sidekiq UI for DSA. However, this job appears to complete before preservationIngestWF is initiated.

When sdr-ingest-transfer is finished, the "duration" is set to 1 second. It definitely does not take 1 second. For example, if you look at the timestamps in the XML, 9 days elapsed between the publish step and the completion of sdr-ingest-transfer on this druid: https://argo.stanford.edu/view/druid:sz929gx7593

It is possible that the names of things make it confusing to determine what each step does around the preservation handoff. I can only speak to what it looks like in the workflow view.

@andrewjbtw (Author)

Additionally, fixities are definitely checked at sdr-ingest-transfer because occasional fixity errors have been reported at that step. I can probably deliberately make that happen by corrupting a file.

@andrewjbtw (Author)

Example of a checksum validation failure at sdr-ingest-transfer. I deliberately changed file.txt after checksum-compute and before sdr-ingest-transfer:

problem with sdr-ingest-transfer (BackgroundJob: 678912): {:errors=>[{:title=>"Preservation error", :detail=>"Signature inconsistent between inventory and file for /dor/workspace/vy/213/xd/2348/vy213xd2348/content/file.txt: {"size"=>{:old=>13, :new=>"13"}, "md5"=>{:old=>"76f77e3250372614c9f9b0861f9e5ce6", :new=>"18519bfbd592b4e6cb238c5ccdbc209f"}, "sha1"=>{:old=>"d468a0236957a752c47bc89f86581da5e90a2b12", :new=>"8d905cdc9e9957824ab1c8f574733bf48a8ba46b"}, "sha256"=>{:old=>"f355877682a4d122a5a880b676a933f77558d70a1a46099d62e56c0dcd62fa32", :new=>nil}}"}]}

https://argo-stage.stanford.edu/view/druid:vy213xd2348
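
For reference, one hypothetical way to reproduce this kind of mismatch on a test object (this is a sketch, not necessarily the exact step taken above; the path is copied from the error, and any in-place byte change made between checksum-compute and sdr-ingest-transfer should have a similar effect):

# Hypothetical reproduction step: overwrite the first byte of a staged file in
# place so its MD5/SHA1 no longer match the values recorded at checksum-compute,
# without changing the file size.
path = '/dor/workspace/vy/213/xd/2348/vy213xd2348/content/file.txt'
File.open(path, 'r+b') do |f|
  first_byte = f.read(1)
  f.seek(0)
  f.write(first_byte == 'X' ? 'Y' : 'X')
end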

@justinlittman (Contributor)

Thanks for all of the details @andrewjbtw. Based on what you reported, the culprit is the unhelpfully named normalized_signature. However, it calls FileSignature.from_file, which is optimized for a single read.

The sdr-transfer / bagger code is a warren, so it's possible that there is some other inefficiency.

@ndushay (Contributor) commented Sep 19, 2023

I think we may want to have a storytime to come to a shared understanding of the full process from preassembly through final accessioning: where checksums are computed and under which conditions, and whether we can reduce the number of times we do it.

@ndushay (Contributor) commented Sep 20, 2023

I am closing this ticket in favor of sul-dlss/dor-services-app#4594. Andrew has indicated he is unaware of any other efficiencies we can chase.

@ndushay closed this as completed Sep 20, 2023