-
Notifications
You must be signed in to change notification settings - Fork 2
Replication errors
Each moab has multiple versions. A zip of each versions should be replicated to 3 endpoints. The zip for a version may be split into multiple parts depending on size, meaning a single version could have multiple zip files.
Run Audit::ReplicationSupport.zip_part_debug_info
using the rake task to get the replication state:
RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]
- If for a given endpoint and version, all are "ok" then the replication is correct.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
- If for a given endpoint and version, all or some are "unreplicated" then it is unreplicated for that endpoint and version.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
- If for a given version, an endpoint is missing then it is "unreplicated" for that endpoint and version.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
- **Sometimes the problem may resolve on its own.
To verify this:,
Re-run diagnose_replication
:
RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]
OR
Run a Moab Replication Audit** (which will report issues to Honeybadger):
RAILS_ENV=production bin/rake prescat:audit:replication_single[fd812vz8360]
Diagnosis can also be performed in bulk from the Rails console using a list of druids. The following gets debug info for unreplicated zip parts:
druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
debug_infos = Audit::ReplicationSupport.zip_part_debug_info(druids)
# Wait for it ...
CSV.open('debug_info.csv', 'wb') {|csv| debug_infos.each {|debug_info| csv << debug_info }}
- If any of the zip parts are unreplicated but found at the zip endpoint, then request that Ops delete the zips. (Only Ops has permissions to delete. Request deletes by filing an Ops ticket, providing the relevant lines from
zip_part_debug_info
.) - Run
Replication::FailureRemediator
using the rake task to prune the database:
RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9]
By default, this task does not prune any recent ZippedMoabVersions to avoid deleting records which may have jobs currently in process. To override this:
RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9,true]
- Backfill the pruned database records and initiate replication:
RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]
Replication jobs may take some amount of time to complete.
- Replication errors
- Validate moab step fails during preservationIngestWF
- ZipmakerJob failures
- Moab Audit Failures
- Ceph Errors
- Job queues
- Deposit bag was missing
- ActiveRecord and Replication intro
- 2018 Work Cycle Documentation
- Fixing a stuck Moab
- Adding a new cloud provider
- Audits (how to run as needed)
- Extracting segmented zipfiles
- AWS credentials, S3 configuration
- Zip Creation
- Storage Migration Additional Information
- Useful ActiveRecord queries
- IO against Ceph backed preservation storage is hanging indefinitely (steps to address IO problems, and follow on cleanup)