Replication errors

Naomi Dushay edited this page Apr 14, 2023 · 38 revisions

Overview

Each moab has multiple versions. A zip of each version should be replicated to 3 endpoints. The zip for a version may be split into multiple parts depending on its size, so a single version can have multiple zip files.

Diagnosis for a Single Druid

Run Audit::ReplicationSupport.zip_part_debug_info using the rake task to get the replication state:

RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]
  • If all zip parts for a given endpoint and version are "ok", then replication is correct for that endpoint and version.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • If some or all zip parts for a given endpoint and version are "unreplicated", then the version is unreplicated at that endpoint.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • If an endpoint has no row at all for a given version, then the version is "unreplicated" at that endpoint.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • Sometimes the problem may resolve on its own. To verify this:

Re-run diagnose_replication:

RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]

OR

Run a Moab Replication Audit (which will report issues to Honeybadger):

RAILS_ENV=production bin/rake prescat:audit:replication_single[fd812vz8360]
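
The three rules above can be applied mechanically to saved diagnose_replication output. The following is a minimal sketch, not part of the application: replication_summary and EXPECTED_ENDPOINTS are hypothetical names, the endpoint list is taken from the sample output on this page, and any zip part status other than "ok" is assumed to mean the version is not replicated at that endpoint.

```ruby
require 'csv'

# Endpoints each version should reach; names taken from the sample output above.
EXPECTED_ENDPOINTS = %w[aws_s3_east_1 aws_s3_west_2 ibm_us_south].freeze

# Hypothetical helper, not part of the application.
# Returns { version => { endpoint => 'ok' | 'unreplicated' | 'missing' } }.
def replication_summary(csv_text, expected_endpoints: EXPECTED_ENDPOINTS)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  by_version = rows.group_by { |row| row['zipped moab version'].to_i }
  by_version.to_h do |version, version_rows|
    states = expected_endpoints.to_h do |endpoint|
      parts = version_rows.select { |row| row['endpoint'] == endpoint }
      state = if parts.empty?
                'missing'       # no row at all for this endpoint
              elsif parts.all? { |row| row['zip part status'] == 'ok' }
                'ok'            # every zip part is ok
              else
                'unreplicated'  # at least one zip part is not ok (assumption)
              end
      [endpoint, state]
    end
    [version, states]
  end
end

# Sample: the "missing endpoint" output from above (no aws_s3_west_2 row).
output = <<~OUTPUT
  druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
  dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
OUTPUT

replication_summary(output).each do |version, states|
  states.each { |endpoint, state| puts "v#{version} #{endpoint}: #{state}" }
end
```

For the sample, version 1 is reported as ok at aws_s3_east_1 and ibm_us_south and as missing at aws_s3_west_2.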

Diagnosis for a List of Druids

Diagnosis can also be performed in bulk from the Rails console using a list of druids. The following gets debug info for unreplicated zip parts:

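require 'csv'

# Collect the druids of all objects that have at least one unreplicated zip part.
druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
# Build one debug-info row per zip part. Wait for it ...
debug_infos = Audit::ReplicationSupport.zip_part_debug_info(druids)
# Write the rows to a CSV file (no header row is written).
CSV.open('debug_info.csv', 'wb') { |csv| debug_infos.each { |debug_info| csv << debug_info } }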

The CSV written above has no header row; the columns, in order, are listed below (from app/services/audit/replication_support.rb):

  • druid
  • current_version
  • zipped_moab_version.version
  • endpoint_name
  • zip_part.status
  • zip_part.suffix (anything other than .zip means the zip is multi-part; the .zip file is itself one of the parts)
  • zip_part.parts_count (number of parts to the zip)
  • zip_part.size
  • zip_part.md5
  • zip_part.id
  • zip_part.created_at
  • zip_part.updated_at
  • zip_part.s3_key
  • 'found at endpoint' OR 'not found at endpoint'
  • s3_part.metadata['checksum_md5']
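
Because the console output has no header row, it can help to write one explicitly. The sketch below is a hedged example, not application code: debug_info_csv and DEBUG_INFO_HEADERS are hypothetical names, and the header strings are copied from the header row that prescat:diagnose_replication prints in the Case 1 example on this page, on the assumption that they line up one-to-one with the 15 columns listed above.

```ruby
require 'csv'

# Column names copied from the header row printed by prescat:diagnose_replication.
# Assumption: they correspond, in order, to the 15 columns listed above.
DEBUG_INFO_HEADERS = [
  'druid', 'preserved object version', 'zipped moab version', 'endpoint',
  'zip part status', 'zip part suffix', 'zipped moab parts count', 'zip part size',
  'zip part md5', 'zip part id', 'zip part created at', 'zip part updated at',
  'zip part s3 key', 'zip part endpoint status', 'zip part endpoint md5'
].freeze

# Hypothetical helper: serialises debug-info rows with a leading header row.
def debug_info_csv(debug_infos, headers: DEBUG_INFO_HEADERS)
  CSV.generate do |csv|
    csv << headers
    debug_infos.each { |debug_info| csv << debug_info }
  end
end

# One row shaped like the console output (values from the Case 1 example).
sample_row = [
  'bn435ff2092', 2, 2, 'aws_s3_east_1', 'unreplicated', '.zip', 1, 45_248,
  '00500cdeb23c3412cbd300296e7df0d4', 262_892_115,
  '2023-04-13 04:05:30 UTC', '2023-04-13 04:05:30 UTC',
  'bn/435/ff/2092/bn435ff2092.v0002.zip', 'not found at endpoint', nil
]

puts debug_info_csv([sample_row])
```

In the Rails console, the same helper could be used as File.write('debug_info.csv', debug_info_csv(debug_infos)).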

Troubleshooting

Case 1: ALL of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"

The output from rake prescat:diagnose_replication might look like this:

$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bn435ff2092]
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
bn435ff2092,2,1,aws_s3_east_1,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468306,2022-05-23 20:48:49 UTC,2023-04-13 22:31:49 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,aws_s3_west_2,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468305,2022-05-23 20:48:49 UTC,2023-04-13 22:31:48 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,ibm_us_south,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468308,2022-05-23 20:48:49 UTC,2023-04-13 22:31:50 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,2,aws_s3_east_1,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892115,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,aws_s3_west_2,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892109,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,ibm_us_south,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892112,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,

Note that in this case, zipped moab version 1 has status "ok" at all endpoints. Zipped moab version 2 has status "unreplicated" at all endpoints, and its zip part endpoint status is "not found at endpoint" at all of them as well.

Case 1 Remediation: Step 1 - Prune the database records

Prune the database records for the zip_parts that are not at the endpoints. The rake task takes three arguments: the druid, the zip part version, and whether to assume that an existing record should be left alone (because it might be a replication in progress or in the retry queue on Sidekiq).

$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bn435ff2092,2,false]
pruned zipped moab version 2 on aws_s3_west_2
pruned zipped moab version 2 on ibm_us_south
pruned zipped moab version 2 on aws_s3_east_1

Note that pruning provides a message for each record pruned.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.

Case 1 Remediation: Step 2 - Rerun replication audit

Re-run the replication audit for the druid. We have backfill on by default, so once the database records are removed, the audit will do the replication.

$ RAILS_ENV=production bin/rake prescat:audit:replication_single[bn435ff2092]

This will report errors to Honeybadger.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.
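
The Case 1 sequence (prune, confirm, audit, confirm) can be written out as a dry run that only prints the commands. This is a minimal sketch under stated assumptions: rake_command, case1_commands and DRUID_PATTERN are hypothetical names, the druid pattern is a loose shape inferred from the examples on this page, and the third prune argument is false as in the example above. The bracketed task name is shell-escaped because some shells (zsh, for example) otherwise treat the brackets as a glob pattern.

```ruby
require 'shellwords'

# Loose druid shape inferred from the examples on this page (an assumption).
DRUID_PATTERN = /\A[a-z]{2}\d{3}[a-z]{2}\d{4}\z/

# Builds one rake invocation; escaping the brackets keeps shells such as zsh
# from treating them as a glob pattern.
def rake_command(task, *args)
  invocation = "#{task}[#{args.join(',')}]"
  "RAILS_ENV=production bin/rake #{Shellwords.escape(invocation)}"
end

# Hypothetical helper: the Case 1 commands, in the order described above.
# Nothing is executed; the commands are only returned as strings.
def case1_commands(druid, version)
  raise ArgumentError, "unexpected druid: #{druid}" unless druid.match?(DRUID_PATTERN)

  [
    rake_command('prescat:prune_failed_replication', druid, version, false),
    rake_command('prescat:diagnose_replication', druid),      # confirm the pruning
    rake_command('prescat:audit:replication_single', druid),  # backfill and replicate
    rake_command('prescat:diagnose_replication', druid)       # confirm the replication
  ]
end

puts case1_commands('bn435ff2092', 2)
```

Printing rather than executing keeps a human in the loop: the diagnose output should be read between steps before moving on.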

Case 2: SOME of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"

In this case, we only want to remediate replication for the missing endpoint(s) for the version; we can leave the zip_parts with "ok" status and "found at endpoint" alone.

Case 2 Remediation: Step 1 - Prune the database records

Prune the database records for the zip_parts that are not at the endpoints. The rake task takes three arguments: the druid, the zip part version, and whether to assume that an existing record should be left alone (because it might be a replication in progress or in the retry queue on Sidekiq).

$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bn435ff2092,2,false]
pruned zipped moab version 2 on aws_s3_west_2
pruned zipped moab version 2 on ibm_us_south
pruned zipped moab version 2 on aws_s3_east_1

Note that pruning provides a message for each record pruned.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.

Case 3: a zip_part status is "unreplicated" but it is "found at endpoint" and the checksums match

Case 4: a zip_part status is "unreplicated" but it is "found at endpoint" and the checksums don't match

  1. If any of the zip parts are unreplicated but found at the zip endpoint, then request that Ops delete the zips. (Only Ops has permissions to delete. Request deletes by filing an Ops ticket, providing the relevant lines from zip_part_debug_info.)
  2. Run Replication::FailureRemediator using the rake task to prune the database:
RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9]

By default, this task does not prune any recent ZippedMoabVersions to avoid deleting records which may have jobs currently in process. To override this:

RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9,true]
  3. Backfill the pruned database records and initiate replication:
RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]

Replication jobs may take some amount of time to complete.
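
The four troubleshooting cases can be told apart mechanically from the diagnose_replication rows for a single zipped moab version. The sketch below is illustrative only: troubleshooting_cases is a hypothetical helper, rows are assumed to be hashes (or CSV rows) keyed by the header names printed in the Case 1 example, and a version with mixed problems is reported under every case that applies.

```ruby
# Hypothetical helper: returns the troubleshooting case numbers (1-4) that
# apply to the rows of ONE zipped moab version; an empty array means no case applies.
def troubleshooting_cases(rows)
  unreplicated = rows.select { |row| row['zip part status'] == 'unreplicated' }
  found, not_found = unreplicated.partition do |row|
    row['zip part endpoint status'] == 'found at endpoint'
  end

  cases = []
  # Case 1: every zip part is unreplicated and absent; Case 2: only some are.
  cases << (not_found.size == rows.size ? 1 : 2) unless not_found.empty?
  # Cases 3 and 4: unreplicated in the database but present at the endpoint.
  found.each do |row|
    cases << (row['zip part md5'] == row['zip part endpoint md5'] ? 3 : 4)
  end
  cases.uniq.sort
end

# Rows reduced to the columns the helper reads (values from the Case 1 example).
missing = { 'zip part status' => 'unreplicated',
            'zip part endpoint status' => 'not found at endpoint',
            'zip part md5' => '00500cdeb23c3412cbd300296e7df0d4',
            'zip part endpoint md5' => nil }

p troubleshooting_cases([missing, missing, missing])
```

For three rows that are all unreplicated and not found at the endpoint, as in the Case 1 example, only case 1 is reported.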
