
Replication errors

Naomi Dushay edited this page Apr 14, 2023 · 38 revisions

Overview

Each moab has one or more versions. A zip of each version should be replicated to 3 endpoints. Depending on its size, the zip for a version may be split into multiple parts, so a single version can have multiple zip files.

Diagnosis

Run Audit::ReplicationSupport.zip_part_debug_info using the rake task to get the replication state:

RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]
  • If all zip parts for a given endpoint and version are "ok", then that version is correctly replicated to that endpoint.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • If any zip part for a given endpoint and version is "unreplicated", then that version is unreplicated for that endpoint.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • If an endpoint has no rows at all for a given version, then that version is also "unreplicated" for that endpoint.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
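The three rules above can be sketched as a small Ruby helper that reads the CSV printed by the rake task and reports a state per version and endpoint. This is only an illustration: replication_states is a hypothetical name (not part of preservation_catalog), the endpoint list is copied from the sample output and may differ in other deployments, and any status other than "ok" or "unreplicated" is reported as :unknown because this page does not describe it.

```ruby
require 'csv'

# Assumption: the three endpoints shown in the sample output above.
EXPECTED_ENDPOINTS = %w[aws_s3_east_1 aws_s3_west_2 ibm_us_south].freeze

# Hypothetical helper: maps [zipped moab version, endpoint] to
# :ok, :unreplicated, :missing (no rows for that endpoint, which also
# counts as unreplicated), or :unknown (a status not covered here).
def replication_states(csv_text)
  rows = CSV.parse(csv_text, headers: true)
  by_version = rows.group_by { |row| row['zipped moab version'] }

  by_version.each_with_object({}) do |(version, version_rows), states|
    EXPECTED_ENDPOINTS.each do |endpoint|
      statuses = version_rows
                 .select { |row| row['endpoint'] == endpoint }
                 .map { |row| row['zip part status'] }

      states[[version, endpoint]] =
        if statuses.empty?
          :missing
        elsif statuses.all?('ok')
          :ok
        elsif statuses.include?('unreplicated')
          :unreplicated
        else
          :unknown
        end
    end
  end
end
```

For example, after saving the rake output to a file, replication_states(File.read(path)) returns a hash keyed by version and endpoint, and anything other than :ok needs a closer look.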

Sometimes the problem may resolve on its own. To verify this, run a Moab Replication Audit (which will report issues to Honeybadger):

RAILS_ENV=production bin/rake prescat:audit:replication_single[fd812vz8360]

Bulk diagnosis

Diagnosis can also be performed in bulk from the Rails console using a list of druids. The following gets debug info for unreplicated zip parts:

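require 'csv' # harmless if the console has already loaded it

# Druids that have at least one unreplicated zip part
druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
# This can take a while for a long list of druids
debug_infos = Audit::ReplicationSupport.zip_part_debug_info(druids)
# Write one CSV row per debug info entry
CSV.open('debug_info.csv', 'wb') { |csv| debug_infos.each { |debug_info| csv << debug_info } }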
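Once debug_info.csv exists, it can help to pull out the rows that are unreplicated in the database yet present at the endpoint, since those are the lines an Ops delete request needs. The following is a minimal sketch, assuming each row has the same column order as the header printed by prescat:diagnose_replication (status in column 5, endpoint status in column 12). rows_needing_ops_delete is a hypothetical helper name, not something provided by the application.

```ruby
require 'csv'

# Assumption: zero-based column positions taken from the rake task's header row.
STATUS_COL = 4            # "zip part status"
ENDPOINT_STATUS_COL = 11  # "zip part endpoint status"

# Hypothetical helper: keeps rows whose zip part is unreplicated in the
# database even though a zip was found at the endpoint.
def rows_needing_ops_delete(rows)
  rows.select do |row|
    row[STATUS_COL].to_s == 'unreplicated' &&
      row[ENDPOINT_STATUS_COL].to_s == 'found at endpoint'
  end
end

# Usage, with the file written by the snippet above:
#   rows_needing_ops_delete(CSV.read('debug_info.csv')).each { |row| puts row.to_csv }
```

A header row, if one is present in the file, does not match either condition and is skipped.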

Troubleshooting

  1. If any of the zip parts are unreplicated but found at the zip endpoint, then request that Ops delete the zips. (Only Ops has permissions to delete. Request deletes by filing an Ops ticket, providing the relevant lines from zip_part_debug_info.)
  2. Run Replication::FailureRemediator using the rake task to prune the database:
RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9]

By default, this task does not prune any recent ZippedMoabVersions to avoid deleting records which may have jobs currently in process. To override this:

RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9,true]
  3. Backfill the pruned database records and initiate replication:
RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]

Replication jobs may take some amount of time to complete.
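When several druids need the same treatment, the prune and backfill commands can be generated rather than typed by hand. The sketch below only builds the command strings shown in the steps above and executes nothing. remediation_commands is a hypothetical helper, and reading the second rake argument as the version is an assumption based on the examples on this page.

```ruby
# Hypothetical helper: returns the prune and backfill commands for one druid.
# prune_recent: true appends the third argument that overrides the default
# protection of recent ZippedMoabVersions.
def remediation_commands(druid, version, prune_recent: false)
  prune_args = [druid, version]
  prune_args << true if prune_recent

  [
    "RAILS_ENV=production bin/rake prescat:prune_failed_replication[#{prune_args.join(',')}]",
    "RAILS_ENV=production bin/rake prescat:backfill[#{druid}]"
  ]
end

# Print the commands for review before running them by hand.
puts remediation_commands('bb001zc5754', 9)
```

Printing instead of executing keeps a person in the loop, which matters because pruning should only happen after Ops has deleted any zips found at the endpoint.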
