Validate moab step fails during preservationIngestWF
Note: this remediation involves working with the preservation-robots server rather than prescat but it is documented in the prescat wiki because
- the issue happens at the point where a moab is created and handed off for prescat to pick up
- there's no wiki for preservation robots and it seems easier to put the document here for now
The validate-moab step stops at an error because one or more files in the most recent version folder in the moab do not match the checksums on the moab manifest. The error will look something like this:
"Problem with Moab validation run on preservation-catalog-prod-03.stanford.edu: [{"druid:gp362tm4122-v0001: version_additions: file_differences"=>{"digital_object_id"=>"druid:gp362tm4122|", "difference_count"=>1, "basis"=>"v1", "other"=>"/pres-03/sdr2objects/gp/362/tm/4122/gp362tm4122/v0001/data/content|/pres-03/sdr2objects/gp/362/tm/4122/gp362tm4122/v0001/data/metadata", "report_datetime"=>"2021-12-09T05:55:51Z", "group_differences"=>{"content"=>{"group_id"=>"content", "difference_count"=>1, "identical"=>4, "modified"=>1, "subsets"=>{"modified"=>{"change"=>"modified", "count"=>1, "files"=>{0=>{"change"=>"modified", "basis_path"=>"36105062472159-gb-jp2.zip", "other_path"=>"same", "signatures"=>{0=>{"size"=>376008908, "md5"=>"136e0ba1ad9aa1a3755a6a93fc389f0e", "sha1"=>"7098f91cec01490451398be324a2430256fb9be1", "sha256"=>"5737e26ab4c87dc347762d58aed5c0e3c8d347e1c6ab71e197e1d555baeacdc3"}, 1=>{"size"=>376008908, "md5"=>"8fe5db9211ea97dcbb05020afb316f06", "sha1"=>nil, "sha256"=>nil}}}}}}}, "metadata"=>{"group_id"=>"metadata", "difference_count"=>0, "identical"=>7}}}}]"
The problem here is that one or more of the files has been corrupted while being copied into the new moab version for that druid. When this has happened, it has corresponded to an issue with Ceph that required Ops to reboot one or more preservation-related servers.
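For triage, it can help to pull the druid, version, and affected file names out of a message like the one above. The following is a minimal Ruby sketch, assuming the message always follows the shape of that sample. The helper name summarize_validation_error and its regular expressions are illustrative assumptions, not part of preservation_catalog or the robots code, and the sample string is an abbreviated copy of the error shown earlier.

```ruby
# Hypothetical helper: extract the druid, version, and modified file names
# from a validate-moab error message. The patterns are guesses based on the
# single sample message in this page, not on a documented format.
def summarize_validation_error(message)
  match = message.match(/druid:([a-z]{2}\d{3}[a-z]{2}\d{4})-v(\d{4})/)
  return nil unless match

  {
    druid: match[1],                 # bare druid, e.g. "gp362tm4122"
    version: match[2].to_i,          # integer version, e.g. 1
    version_dir: "v#{match[2]}",     # moab folder name, e.g. "v0001"
    modified_files: message.scan(/"basis_path"=>"([^"]+)"/).flatten.uniq
  }
end

# Abbreviated copy of the sample error message.
sample = '[{"druid:gp362tm4122-v0001: version_additions: file_differences"=>' \
         '{"group_differences"=>{"content"=>{"subsets"=>{"modified"=>{"files"=>' \
         '{0=>{"change"=>"modified", "basis_path"=>"36105062472159-gb-jp2.zip", ' \
         '"other_path"=>"same"}}}}}}}}]'

summary = summarize_validation_error(sample)
puts summary.inspect
```

The version_dir value is the folder name that the clean-up steps later in this page refer to.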
Remediating this issue requires cleaning up (deleting) some files from the preservation storage system, so it's worth reviewing how the files got there. The preservationIngestWF does the following to create a moab:
- transfer-object: copies the files to be stored in the latest version of a druid from the /dor/export/{druid} folder in common-accessioning to a deposit folder on the appropriate preservation mount (e.g. /pres-##/deposit/{druid}). The files in this folder are structured according to the BagIt specification.
- validate-bag: runs BagIt validation to make sure that all files in the bag were copied correctly.
- update-moab: transforms the data contained in the bag (including some of the checksum info in the manifests) into the next moab version of the druid. At this point, data is copied from /pres-##/deposit/{druid} to the new version folder for the moab, following this pattern: /pres-##/sdr2objects/{druidtree}/{druid}/v####. This copy is made by hard linking the files in the deposit folder to the new version folder in the moab.
  The intention seems to have been to carry out this operation as a "move", which doesn't involve writing new bytes, but the current preservation filesystem processes this operation as a "copy" and writes to the new location, making it necessary to validate the moab in the following step. A side effect of this hard linking seems to be that the files in /pres-##/deposit/{druid} get replaced with the data written to the moab folder, so if a file gets corrupted in this process, it's corrupted at both file paths since it's the same file.
- validate-moab: validates that the files copied into the new moab version at /pres-##/sdr2objects/{druidtree}/{druid}/v#### are valid using the checksums in the moab manifests. If the validation passes, the deposit folder is cleaned up, along with the hard links between /pres-##/deposit/{druid} and the moab folder.
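The "same file at both paths" point can be illustrated with a small Ruby sketch. It shows ordinary hard link behavior on a local POSIX filesystem. The directory names are stand-ins created in a temp directory, and the way the Ceph-backed preservation storage actually handles the operation may differ, as noted above.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Illustration only: "deposit" and "moab/v0001" are stand-in directories inside
# a throwaway temp dir, not the real /pres-## mounts.
def hard_link_demo
  Dir.mktmpdir do |root|
    deposit_file = File.join(root, 'deposit', 'content.bin')
    moab_file = File.join(root, 'moab', 'v0001', 'content.bin')
    FileUtils.mkdir_p(File.dirname(deposit_file))
    FileUtils.mkdir_p(File.dirname(moab_file))

    File.write(deposit_file, 'original bytes')
    expected_md5 = Digest::MD5.file(deposit_file).hexdigest

    # Hard link: a second directory entry pointing at the same inode.
    File.link(deposit_file, moab_file)
    same_inode = File.stat(deposit_file).ino == File.stat(moab_file).ino

    # Simulate corruption by overwriting the first byte via the moab path.
    File.open(moab_file, 'r+') { |f| f.write('X') }

    {
      same_inode: same_inode,
      deposit_still_valid: Digest::MD5.file(deposit_file).hexdigest == expected_md5,
      moab_valid: Digest::MD5.file(moab_file).hexdigest == expected_md5
    }
  end
end

result = hard_link_demo
puts result.inspect
```

Because both paths name one inode, damage introduced through the moab path is visible through the deposit path too, which is why the deposit bag cannot be used as a clean source after a failed validation.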
What this means is that when validate-moab fails, it's because some files were corrupted when the new moab version was created. The corruption could be to one or more of the content files, the metadata files, or the manifest files. Unfortunately, the deposit folder bag is usually also corrupted at this point, possibly because of the way the robots use hard links to copy the files between folders.
Since the moab is not valid and you generally can't trust the files in the bag either, remediating this issue means re-running the whole preservationIngestWF from the start, to re-copy the files from the /dor/export/{druid} folder, which should have been unaffected by any corruption in the preservation system.
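If you want to check a particular copy of a file (for example the one under /dor/export/{druid}) against a signature you trust, a sketch along these lines may help. It computes the same kind of signature that appears in the error message (size plus md5, sha1, and sha256). The helper names file_signature and signature_mismatches are hypothetical, the demo uses a temp file rather than real preservation data, and which of the two signatures in the error message is the trustworthy one is something you would need to determine yourself.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: size and checksums for a file on disk.
def file_signature(path)
  {
    size: File.size(path),
    md5: Digest::MD5.file(path).hexdigest,
    sha1: Digest::SHA1.file(path).hexdigest,
    sha256: Digest::SHA256.file(path).hexdigest
  }
end

# Returns the keys whose values differ. Fields that are nil in the expected
# signature are skipped, since the sample error has nil sha1/sha256 on one side.
def signature_mismatches(expected, actual)
  expected.reject { |_key, value| value.nil? }
          .select { |key, value| actual[key] != value }
          .keys
end

# Demo against a temp file containing the five bytes "hello".
actual = Tempfile.create('signature-demo') do |file|
  file.write('hello')
  file.flush
  file_signature(file.path)
end

# A deliberately wrong md5 to show a mismatch being reported.
expected = { size: 5, md5: '0' * 32, sha1: nil, sha256: nil }
mismatches = signature_mismatches(expected, actual)
puts mismatches.inspect
```

An empty list of mismatches means every field present in the expected signature agrees with the file on disk.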
To do this, you first need to remove the invalid moab version folder, i.e. the one that update-moab created and validate-moab rejected. The steps are:
- ssh into the preservation-robots server. You have to use this server because it mounts the preservation system read-write. You can't use prescat because the mounts are read-only.
- cd into the moab folder for the affected druid, for example
  cd /pres-03/sdr2objects/yy/889/cc/1416/yy889cc1416/
- Delete the folder for the invalid moab version. Be careful not to delete the whole moab, just the one (the most recent) version that failed validation. This is a risky operation since it's just you typing rm commands, but there may not be a way to make it safer without updating the pres-robots code to handle this type of failure in a different way.
  So far the items requiring this remediation have all been the initial versions of objects, so the v0001 folder was the one to delete. But if the version that failed validation was version 3, then you'd remove only the v0003 folder. The earlier version folders should be fine because that data was at rest when the problem occurred and should not have been corrupted. Only the in-flight data should be at risk during the update-moab step.
- Once the moab version has been cleaned up, pull up a Rails console on the workflow-server-rails VM and restart the preservationIngestWF from the beginning by re-setting all steps to waiting and then running the workflow again. transfer-object will re-copy the data from /dor/export/{druid} and, assuming that data is still valid, the workflow should complete on its own.
  - You can use an update statement like:
    WorkflowStep.where(druid: druid, workflow: 'preservationIngestWF', version: version).update_all(status: 'waiting')
    (where druid is the prefixed druid, and version is the version being remediated for the mid-ingest error)
  - Once you've run the database update, you'll need to get the workflow started. The easiest way to do this is to pull up the druid in Argo, confirm that the start-ingest step of preservationIngestWF is waiting, and then hit "Set to completed" on that start-ingest step. The start-ingest step is a no-op triggered when accessionWF hands off to preservationIngestWF, and marking it completed will kick off the rest of the workflow. To do this programmatically, see the code in the workflow:step rake task in the workflow-server-rails codebase.
- To be extra safe, you may want to run a preservation audit on the druid rather than wait for the automated check to run in 90 days. But if it passed the whole preservationIngestWF cleanly the second time, it should be as valid as any other newly-accessioned data. To run the audit, pull up a Rails console on a preservation_catalog VM, and run
  MoabRecord.by_druid(druid).validate_checksums!
  where druid is the unprefixed ("bare") druid, unlike on WFS. This will queue a checksum validation job. If there is no backlog of checksum validation jobs, it should get worked immediately. If an error is detected, a Honeybadger alert will be fired, the preservationAuditWF status will be updated in WFS and reflected in Argo, and the status will be updated in prescat's DB.
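The steps above switch between prefixed druids (WFS), bare druids (prescat), druid tree paths, and v#### folder names, and they depend on deleting only the most recent version folder. The Ruby sketch below gathers those conversions in one place. All helper names are hypothetical, the druid pattern is a simplified assumption (two letters, three digits, two letters, four digits), and nothing here deletes anything. It only builds strings and lists directory names so that a path can be double-checked before any rm is typed.

```ruby
require 'fileutils'
require 'tmpdir'

# Simplified druid shape, with an optional "druid:" prefix (an assumption).
DRUID_PATTERN = /\A(?:druid:)?([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/.freeze

def druid_parts(druid)
  match = DRUID_PATTERN.match(druid)
  raise ArgumentError, "not a druid: #{druid.inspect}" unless match

  match.captures
end

# "druid:yy889cc1416" or "yy889cc1416" -> "yy889cc1416" (form used on prescat)
def bare_druid(druid)
  druid_parts(druid).join
end

# -> "druid:yy889cc1416" (form used on WFS)
def prefixed_druid(druid)
  "druid:#{bare_druid(druid)}"
end

# -> "yy/889/cc/1416"
def druid_tree(druid)
  druid_parts(druid).join('/')
end

# 3 -> "v0003"
def version_dir(version)
  format('v%04d', version)
end

def moab_version_path(storage_root, druid, version)
  File.join(storage_root, 'sdr2objects', druid_tree(druid), bare_druid(druid), version_dir(version))
end

# Highest v#### folder name directly under a moab folder, or nil if none.
def highest_version_dir(moab_dir)
  Dir.children(moab_dir).grep(/\Av\d{4}\z/).max
end

# True only when the version you intend to remove is the most recent one.
def latest_version?(moab_dir, version)
  highest_version_dir(moab_dir) == version_dir(version)
end

puts prefixed_druid('yy889cc1416')
puts moab_version_path('/pres-03', 'druid:yy889cc1416', 1)
```

A check like latest_version? mirrors the caution in the delete step: if it returns false for the version you are about to remove, stop and investigate instead of deleting.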
Work in a screen session, in case you lose your connection while working.
- Go to the Argo workflow grid: https://argo.stanford.edu/report/workflow_grid
- Scroll down to preservationIngestWF and click the link for the validate-moab errors facet: https://argo.stanford.edu/catalog?f%5Bwf_wps_ssim%5D%5B%5D=preservationIngestWF%3Avalidate-moab%3Aerror
- Click the Columns button and select only the Druid and Status columns.
- Click Download, giving the .csv a descriptive name.
- Using your favorite approach to text wrangling: remove the header row of the CSV, and turn the "v1 In accessioning ..." style strings in the second column into v0001 style Moab version directory names (using e.g. sed, multi-selection and edit in your favorite text editor, etc). Confirm that the file you saved uses Unix-style line breaks (you might get Mac style line breaks if you used Excel on Mac to do your text wrangling)! dos2unix can fix your line break problem. Note: if we switch away from using Bash for the parts of this where we use Bash, we can probably be less particular about some of this CSV generation.
- Script for removing only specified Moab versions from preservation storage
  #!/bin/bash
  # Usage: pass the path to a CSV of "druid,version" lines, e.g. gp362tm4122,v0001
  druid_list=$1

  while IFS= read -r line
  do
    druid=$(echo "$line" | cut -d ',' -f1)
    version=$(echo "$line" | cut -d ',' -f2)
    # Build the druid tree path, e.g. gp362tm4122 becomes gp/362/tm/4122
    druid_tree=$(echo "$druid" | sed -r 's/([0-9])([a-zA-Z])/\1\/\2/g; s/([a-zA-Z])([0-9])/\1\/\2/g')
    echo "$druid,$druid_tree,$version"

    # Find the highest version folder that exists for this druid
    max_moab_version_path=$(find /pres-0*/sdr2objects/"$druid_tree/$druid" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1)
    max_moab_version=$(basename "$max_moab_version_path")

    # Only delete when the listed version is the most recent one on disk
    if [ "$max_moab_version" == "$version" ]
    then
      echo "versions match"
      rm -rv "$max_moab_version_path"
    else
      echo "version mismatch, quitting"
      exit 1
    fi
  done < "$druid_list"
- Resetting/rewinding preservationIngestWF for the druid versions from the report. TODO: add the line of code for getting druid_list from your report (the code below iterates over it as druid_versions, a list of druid and version pairs)
  RabbitFactory.start_global

  druid_versions.each do |druid_version|
    druid = druid_version[0]
    version = druid_version[1]

    # Reset every preservationIngestWF step for this druid and version to waiting
    query = WorkflowStep.where(druid: druid, workflow: 'preservationIngestWF', version: version)
    # puts query.order(:druid, :workflow, :process).pluck(:druid, :workflow, :process, :status, :version)
    query.update_all(status: 'waiting')

    # Mark the no-op start-ingest step completed to get the workflow moving again
    step = WorkflowStep.find_by(
      druid: druid,
      workflow: 'preservationIngestWF',
      process: 'start-ingest',
      version: version
    )
    step.update(status: 'completed')

    next_step = WorkflowStep.find_by(
      druid: druid,
      workflow: 'preservationIngestWF',
      process: 'transfer-object',
      version: version
    )
    NextStepService.enqueue_next_steps(step: next_step)
    SendUpdateMessage.publish(step: step)
  end
- watching for issues/running audits
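The CSV clean-up step in the list above (dropping the header row and turning "v1 In accessioning ..." into v0001) could also be done in Ruby, which sidesteps most of the line-break worries. The sketch below is one possible approach, not the procedure the team uses. It assumes the report has a header row followed by druid,status rows, that the druid column contains no commas, and that the status starts with v and a number. The helper names, the made-up sample rows, and the assumption that WFS wants a prefixed druid with an integer version are all illustrative.

```ruby
# Parse report text into [bare_druid, integer_version] pairs.
# Tolerates Unix and Windows line endings, an optional "druid:" prefix,
# and optional double quotes around either field.
def parse_report(text)
  rows = text.each_line.map(&:strip).reject(&:empty?).drop(1) # drop header row
  rows.map do |line|
    druid, status = line.split(',', 2)
    version = status.to_s[/\A"?v(\d+)/, 1]
    raise ArgumentError, "no version found in: #{line}" unless version

    [druid.delete('"').strip.delete_prefix('druid:'), Integer(version, 10)]
  end
end

# Lines in the "druid,v0001" shape that the Bash removal script reads.
def bash_input_lines(pairs)
  pairs.map { |druid, version| format('%s,v%04d', druid, version) }
end

# Pairs in the shape the Rails console loop iterates over
# (assumption: prefixed druid, integer version).
def druid_versions_for_wfs(pairs)
  pairs.map { |druid, version| ["druid:#{druid}", version] }
end

# Made-up sample rows for illustration only.
report = <<~CSV
  Druid,Status
  druid:gp362tm4122,v1 In accessioning ...
  yy889cc1416,"v3 In accessioning ..."
CSV

pairs = parse_report(report)
puts bash_input_lines(pairs)
puts druid_versions_for_wfs(pairs).inspect
```

Writing bash_input_lines(pairs) to a file with File.write(path, lines.join("\n") + "\n") would give Unix line endings regardless of how the report was downloaded.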