# Fixing an Incomplete Moab Upload
You may be dealing with an incomplete Moab upload if, for example:

- There are entries in the delivery failure queues (`*_delivery_failed`, e.g. `s3_us_east_1_delivery_failed`). The most common cause of this is a network blip when trying to reach the AWS or IBM S3 endpoint(s).
- A large Moab (>10GB) has only uploaded a portion of its archive files.
- `PartReplicationAuditJob` detects a problem with a replicated Moab (e.g. https://app.honeybadger.io/projects/54415/faults/64996758).
- The `Settings.zip_storage` volume (e.g. `sdr-transfers`) ran out of space during transfers/upload.
- The `parts_count` value of the uploaded segments is incorrect.
First: if you see an alert about a delivery failure, retry the failed delivery from the Resque web console, assuming the failure is less than 7 days old (the threshold defined by `Settings.zip_cache_expiry_time`). After `zip_cache_expiry_time` has elapsed, the archive zip will have been automatically cleaned from the zip generation temp space (and retrying the delivery job won't re-create it, because zip creation is an earlier step in the replication pipeline).
If the replication is complete after successfully retrying the failed delivery attempt, you are done. You can use the info here to help determine whether something was fully/successfully replicated: Investigating a druid with replication errors
If retrying failed delivery jobs is not an option for resolving the issue (either because the zip files have aged out of temp space, or because the issue stems at least in part from something other than failed delivery attempts), read on...
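As a quick mental model for the retry window above, here is a minimal sketch in plain Ruby. The constant and method name are illustrative (not part of preservation_catalog), and the 7-day value is assumed from the `Settings.zip_cache_expiry_time` threshold described above:

```ruby
# Assumed value of Settings.zip_cache_expiry_time (7 days), in seconds.
ZIP_CACHE_EXPIRY = 7 * 24 * 60 * 60

# Returns true if the temp-space zip for this failed delivery should still
# exist, i.e. retrying the delivery job from the Resque console can succeed.
# Hypothetical helper for illustration only.
def retryable?(failed_at, now: Time.now)
  (now - failed_at) < ZIP_CACHE_EXPIRY
end

retryable?(Time.now - (2 * 24 * 60 * 60))  # failed 2 days ago  => true
retryable?(Time.now - (10 * 24 * 60 * 60)) # failed 10 days ago => false
```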
1. Delete the `zip_parts` and `zipped_moab_versions` database records for failed/partial replication attempts. ⚠️ Please use the rake task described below, since it has safeguards against overly broad deletions. If you'd like to remediate in bulk from the Rails console, please use the `CatalogRemediator` class method that the rake task wraps.
2. Delete any/all remaining pieces of the Moab's zip from `/sdr-transfers` (or wherever `Settings.zip_storage` points).
   - This can be done by dev or ops from any of the pres cat prod worker boxes, i.e. any other than -01.
   - Alternatively, just wait 7 days for the auto-cleanup to purge any remaining pieces; the rake task used below won't act on any ZMVs younger than `zip_cache_expiry_time` anyway.
3. Delete any/all pieces of the failed Moab-version upload that made it to an S3 endpoint from that endpoint. (ops task; delete access is restricted)
4. Re-trigger the creation of any missing `zipped_moab_versions` and `zip_parts` once the above steps have been completed for a mis-replicated druid. (dev task, see below)
```
pres@preservation-catalog-prod-02:~/preservation_catalog/current$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9]
pruned zipped moab version 9 on aws_s3_west_2
pruned zipped moab version 9 on aws_s3_east_1
pruned zipped moab version 9 on ibm_us_south
pres@preservation-catalog-prod-02:~/preservation_catalog/current$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[zy046vr4617,1]
pruned zipped moab version 1 on ibm_us_south
```
You can pass along the above info to ops, e.g.: "Please delete the zip parts for version 1 of zy046vr4617 from ibm_us_south, if any exist."
If you want to see more detail from the logs about what was cleaned up, you can do something like:
```
pres@preservation-catalog-prod-02:~/preservation_catalog/current$ grep -B1 'Destroying zip parts' log/production.log
I, [2022-03-16T08:27:09.586829 #3477151]  INFO -- : Replication failure error(s) found with bb001zc5754 (v9): [{:zip_parts_not_created=>"9 on aws_s3_west_2: no zip_parts exist yet for this ZippedMoabVersion"}]
Destroying zip parts ([]) and zipped moab version (20706447)
I, [2022-03-16T08:27:09.756150 #3477151]  INFO -- : Replication failure error(s) found with bb001zc5754 (v9): [{:zip_parts_not_created=>"9 on aws_s3_east_1: no zip_parts exist yet for this ZippedMoabVersion"}]
Destroying zip parts ([]) and zipped moab version (20706448)
I, [2022-03-16T08:27:09.764788 #3477151]  INFO -- : Replication failure error(s) found with bb001zc5754 (v9): [{:zip_parts_not_created=>"9 on ibm_us_south: no zip_parts exist yet for this ZippedMoabVersion"}]
Destroying zip parts ([]) and zipped moab version (20706449)
--
I, [2022-03-16T09:22:40.579252 #3483337]  INFO -- : Replication failure error(s) found with zy046vr4617 (v1): [{:zip_parts_not_all_replicated=>"1 on ibm_us_south: not all ZippedMoabVersion parts are replicated yet: [#<ZipPart id: 13381599, size: 10737418240, zipped_moab_version_id: 13294700, created_at: \"2019-03-26 15:30:13.268487000 +0000\", updated_at: \"2019-03-26 15:30:13.268487000 +0000\", md5: \"438ebd78b335f8015ec8895cb9fb1346\", create_info: \"{:zip_cmd=>\\\"zip -r0X -s 10g /sdr-transfers/zy/046/...\", parts_count: 34, suffix: \".z30\", status: \"unreplicated\", last_existence_check: nil, last_checksum_validation: nil>]"}]
Destroying zip parts ([13381434, 13381411, 13381418, 13381426, 13381627, 13381438, 13381441, 13381445, 13381453, 13381457, 13381460, 13381464, 13381470, 13381483, 13381491, 13381499, 13381507, 13381521, 13381528, 13381531, 13381534, 13381538, 13381541, 13381544, 13381552, 13381561, 13381577, 13381585, 13381590, 13381599, 13381604, 13381612, 13381622, 13381408]) and zipped moab version (13294700)
```
In the above example, you can see that zy046vr4617 v1 is a 34-part druid version (note `parts_count: 34`) of which one part failed to replicate successfully. Once ops cleans up the other 33 (😢), it can be pushed through replication again (see below). bb001zc5754 can be pushed through immediately, as there were no zip parts pushed to S3 in the first place.
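If you are triaging many druids, you can pull the druid/version pairs out of those log lines programmatically. A small sketch, where `failed_replications` is a hypothetical helper (not part of the app) whose regex matches the "Replication failure error(s) found with <druid> (v<version>)" format shown in the log excerpt above:

```ruby
# Matches lines like:
#   "... Replication failure error(s) found with bb001zc5754 (v9): ..."
LINE_RE = /Replication failure error\(s\) found with (?<druid>\w+) \(v(?<version>\d+)\)/

# Returns unique [druid, version] pairs found in the given log lines.
def failed_replications(log_lines)
  log_lines.filter_map do |line|
    m = LINE_RE.match(line)
    [m[:druid], m[:version].to_i] if m
  end.uniq
end

lines = [
  'I, [2022-03-16T08:27:09 #3477151]  INFO -- : Replication failure error(s) found with bb001zc5754 (v9): ...',
  'I, [2022-03-16T09:22:40 #3483337]  INFO -- : Replication failure error(s) found with zy046vr4617 (v1): ...'
]
failed_replications(lines) # => [["bb001zc5754", 9], ["zy046vr4617", 1]]
```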
NOTE: if there were any `zip_parts` records that were cleaned up for the druid version, confirm that any partial replication for the druid version has been cleaned up from S3. The most common way this situation occurs is when a large (> 10 GB) druid ran into network issues on delivery attempts for some but not all zip parts, e.g. if the failure queues show entries for a `.z02` zip part but not for e.g. the `.zip` or `.z01` parts.
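To reason about which suffixes should exist for a given `parts_count`: `zip -s` (as in the `zip_cmd` shown in the log above) names split segments `.z01` through `.z<n-1>` and the final segment `.zip`. A sketch under that assumption, with illustrative helper names (not part of preservation_catalog), for diffing expected suffixes against those actually seen on an endpoint or in the failure queues:

```ruby
# Full set of segment suffixes zip -s produces for a split archive with
# `parts_count` parts: .z01 .. .z<n-1>, plus the final .zip segment.
def expected_suffixes(parts_count)
  (1...parts_count).map { |i| format('.z%02d', i) } + ['.zip']
end

# Suffixes that should exist but were not observed.
def missing_suffixes(parts_count, found)
  expected_suffixes(parts_count) - found
end

expected_suffixes(3)                  # => [".z01", ".z02", ".zip"]
missing_suffixes(3, ['.z01', '.zip']) # => [".z02"]
```

For the 34-part zy046vr4617 example above, the unreplicated part's suffix `.z30` would show up in exactly this kind of diff.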
```
[1] pry(main)> PreservedObject.find_by(druid: 'dc156hp0190').create_zipped_moab_versions!
```
The `create_zipped_moab_versions!` call forces PresCat to re-create any missing `zipped_moab_versions` for the druid, including anything that was cleaned up by the remediation described above. This will also trigger the rest of the replication pipeline, causing fresh archive zips to be generated and pushed to the S3 endpoints (hence why any partial uploads must be cleaned up from S3 first: pres cat can't overwrite zip parts that have already been uploaded to S3).