Analyze whether checksum robot steps can be made more efficient (i.e. take less time to run) #1074

Closed

andrewjbtw opened this issue Jul 14, 2023 · 10 comments

Labels: needs analysis (cannot proceed with this issue without analysis), tech debt

@andrewjbtw commented Jul 14, 2023

Background

There are two steps in the SDR accessioning workflows that can take an extraordinary amount of time when processing large (multi-TB) content. By extraordinary, I mean 1-5 days for a single robot step. Both steps check and/or calculate checksums:

checksum-compute in the assemblyWF
sdr-ingest-transfer in the accessionWF

Analysis requested

This ticket is for analysis and investigation into whether these steps are running as efficiently as they can given our hardware and network bandwidth constraints, or whether they can be sped up.

I'm very curious whether these steps are reading each file in a druid more than once. For example, the checksum-compute step may carry out multiple tasks:

  • generate MD5 checksum for every file
  • generate SHA1 checksum for every file
  • verify the user-supplied MD5 checksum for every file

If the robot step is reading each file more than once, would it be possible to modify the step to carry out only one read? There's no way to generate checksums with less than one full read of each file, but it may not be necessary to read more than once in order to carry out the step.
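
For illustration, here is a minimal Ruby sketch of that single-read approach (this is not the actual robot code; the method name, keyword argument, and chunk size are invented for the example). Each chunk read from disk feeds both digest objects, and the user-supplied MD5 is verified from the same pass.

# Sketch only, not the checksum-compute implementation: read the file once,
# feeding every digest from the same chunk, then verify the supplied MD5.
require 'digest'

def single_pass_checksums(path, expected_md5: nil)
  md5  = Digest::MD5.new
  sha1 = Digest::SHA1.new
  File.open(path, 'rb') do |io|
    while (chunk = io.read(1024 * 1024)) # one pass over the file, 1 MiB at a time
      md5  << chunk
      sha1 << chunk
    end
  end
  if expected_md5 && md5.hexdigest != expected_md5
    raise "MD5 mismatch for #{path}: expected #{expected_md5}, got #{md5.hexdigest}"
  end
  { md5: md5.hexdigest, sha1: sha1.hexdigest }
end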

Some evidence that the checksum-compute step may be reading files multiple times

I've been monitoring the progress of checksum-compute on a druid with 5 TB of data. The tool I've been using is lsof, which lists open files. By grepping for the druid, I get a list of the open files associated with that druid.

$ lsof -o | grep sz929gx7593
ruby      1900060                        lyberadmin    9r      REG               0,57 0x99d4fb4000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901089 Timeout        lyberadmin    9r      REG               0,57 0x99d4f90000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901090 sidekiq.h      lyberadmin    9r      REG               0,57 0x99d4f98000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901093 sidekiq.d      lyberadmin    9r      REG               0,57 0x99d4f9c000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 1901132 worker.rb      lyberadmin    9r      REG               0,57 0x99d4fa8000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
ruby      1900060 2647395 launcher.      lyberadmin    9r      REG               0,57 0x99d4fb0000 161997549 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv (sf6-webapp:/sul_lyberservices_prod_01)
screen    3892847                        lyberadmin  cwd       DIR               0,57              152134624 /dor/assembly/sz/929/gx/7593/sz929gx7593/content (sf6-webapp:/sul_lyberservices_prod_01)

The lsof -o option outputs the byte offset for the opened file, which I'm interpreting as an indicator of how much of the file has been read so far. Repeatedly running lsof -o is a way to track progress because you can see the byte offset increase until the whole file is read.

I wrote a bit of bash to translate the file offset into GB, so it would be easier to monitor progress.

while true
do 
	fileinfo=$(lsof -o | grep sz929gx7593 | head -n1 | tr -s ' ') # take the first matching line of lsof output and squeeze repeated spaces so the fields can be cut below
	filepath=$(echo "$fileinfo" | cut -d ' ' -f9)
	bytecount=$(echo "$fileinfo" | cut -d ' ' -f7)
	echo "$(printf "%d\n" $bytecount) / 1073741824" | bc | tr '\n' ' ' # translate the hex to decimal, then convert bytes to GB
	echo "$filepath"
	sleep 300
done

This outputs one line every five minutes with the number of GB read and the filepath of the currently open file:

407 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
418 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
427 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv

After monitoring accessioning for a couple of days, I've noticed from the output of this script that every file appears to be read twice. The byte offset increases until the full size of the file is reached, then starts over again at 0.

585 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
596 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv 
0 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
7 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
17 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_4_pm.mkv
...
614 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv
630 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv
13 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv
24 /dor/assembly/sz/929/gx/7593/sz929gx7593/content/sz929gx7593_5_pm.mkv

If this is accurate, it suggests that there could be significant time savings if we could reduce the number of reads to a single pass.

@andrewjbtw (Author)

One more bit of context: there are checksum steps in the preservationIngestWF. These tend to run more quickly than checksum-compute and sdr-ingest-transfer on the same content. It's not clear whether that's due to differences in hardware (they run after files have reached preservation storage), to implementation strategy, or to both.

@ndushay added the "needs analysis" and "tech debt" labels on Sep 12, 2023
@justinlittman (Contributor)

  • common accessioning checksum-compute uses Assembly::ObjectFile, which performs individual digests (meaning the file is read multiple times).
  • pres cat checksum validator uses Moab::FileSignature which computes multiple digests from a single read.
  • pres robots validate bag uses DepositBagValidator which computes multiple digests from a single read.
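
To make the difference concrete, here is a schematic Ruby comparison based on the descriptions above (not the gems' actual code; the path and chunk size are placeholders).

require 'digest'

path = 'content/example_pm.mkv' # placeholder path

# Per-digest reads, as described for Assembly::ObjectFile: one full pass per digest.
md5_only  = Digest::MD5.file(path).hexdigest   # pass 1 over the file
sha1_only = Digest::SHA1.file(path).hexdigest  # pass 2 over the file

# Single read, as described for Moab::FileSignature and DepositBagValidator:
# one pass over the file feeding several digests at once.
digests = { md5: Digest::MD5.new, sha1: Digest::SHA1.new, sha256: Digest::SHA256.new }
File.open(path, 'rb') do |io|
  while (chunk = io.read(8 * 1024 * 1024))
    digests.each_value { |d| d << chunk }
  end
end
checksums = digests.transform_values(&:hexdigest)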

@mjgiarlo (Member) commented Sep 18, 2023

We think we've done sufficient analysis and made one key improvement in #1100, so we're going to say this is done, other than @andrewjbtw and @ndushay chatting about whether there's work left to be done as part of sdr-ingest-transfer.

@justinlittman (Contributor)

sdr-ingest-transfer invokes the DSA preserve endpoint. That in turn calls PreserveJob, which calls PreservationIngestService.transfer and then starts the preservation workflow.

PreservationIngestService creates the bag. I don't believe that fixities are generated or checked in this step, but it is quite possible that some process in creating the bag is slow.

@andrewjbtw (Author) commented Sep 18, 2023

Somewhere between sdr-ingest-transfer (in accessionWF) and transfer-object (in preservationIngestWF), the following things happen:

  • checksums are validated (MD5, SHA1)
  • a new checksum (SHA256) is computed and added to Cocina
  • a bag is created in /dor/export with checksum manifests for MD5, SHA1, and SHA256, plus symlinks to the content

From my observation of accessioning progress, a job called "PreserveJob" runs and is visible in the Sidekiq UI for DSA. However, this job appears to complete before preservationIngestWF is initiated.

When sdr-ingest-transfer is finished, the "duration" is set to 1 second. It definitely does not take 1 second. For example, if you look at the timestamps in the XML, 9 days elapsed between the publish step and the completion of sdr-ingest-transfer on this druid: https://argo.stanford.edu/view/druid:sz929gx7593

It is possible that the names of things make it confusing to determine what each step does around the preservation handoff. I can only speak to what it looks like in the workflow view.

@andrewjbtw (Author)

Additionally, fixities are definitely checked at sdr-ingest-transfer because occasional fixity errors have been reported at that step. I can probably deliberately make that happen by corrupting a file.

@andrewjbtw (Author)

Example of a checksum validation failure at sdr-ingest-transfer. I deliberately changed file.txt after checksum-compute and before sdr-ingest-transfer:

problem with sdr-ingest-transfer (BackgroundJob: 678912): {:errors=>[{:title=>"Preservation error", :detail=>"Signature inconsistent between inventory and file for /dor/workspace/vy/213/xd/2348/vy213xd2348/content/file.txt: {"size"=>{:old=>13, :new=>"13"}, "md5"=>{:old=>"76f77e3250372614c9f9b0861f9e5ce6", :new=>"18519bfbd592b4e6cb238c5ccdbc209f"}, "sha1"=>{:old=>"d468a0236957a752c47bc89f86581da5e90a2b12", :new=>"8d905cdc9e9957824ab1c8f574733bf48a8ba46b"}, "sha256"=>{:old=>"f355877682a4d122a5a880b676a933f77558d70a1a46099d62e56c0dcd62fa32", :new=>nil}}"}]}

https://argo-stage.stanford.edu/view/druid:vy213xd2348
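
For reference, one hypothetical way to reproduce this kind of mismatch on a test object (this is a sketch, not necessarily the exact step taken above; the path is copied from the error, and any in-place byte change made between checksum-compute and sdr-ingest-transfer should have a similar effect):

# Hypothetical reproduction step: overwrite the first byte of a staged file in
# place so its MD5/SHA1 no longer match the values recorded at checksum-compute,
# without changing the file size.
path = '/dor/workspace/vy/213/xd/2348/vy213xd2348/content/file.txt'
File.open(path, 'r+b') do |f|
  first_byte = f.read(1)
  f.seek(0)
  f.write(first_byte == 'X' ? 'Y' : 'X')
end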

@justinlittman (Contributor)

Thanks for all of the details @andrewjbtw. Based on what you reported, the culprit is the unhelpfully named normalized_signature. However, it calls FileSignature.from_file, which is optimized for a single read.

The sdr-transfer / bagger code is a warren, so it's possible that there is some other inefficiency.

@ndushay (Contributor) commented Sep 19, 2023

I think we may want to have a storytime to come to a shared understanding of the full process from preassembly through final accessioning: where checksums are computed and under which conditions, and whether we can reduce the number of times we do it.

@ndushay (Contributor) commented Sep 20, 2023

I am closing this ticket in favor of sul-dlss/dor-services-app#4594. Andrew has indicated he is unaware of any other efficiencies we can chase.

@ndushay closed this as completed Sep 20, 2023