
Replication errors

Naomi Dushay edited this page Apr 17, 2023 · 38 revisions

Overview

Each moab has multiple versions. A zip of each version should be replicated to 3 endpoints. The zip for a version may be split into multiple parts depending on size, so a single version can have multiple zip files.
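
As an illustration of how a druid and version map to a zip file, the sketch below builds the S3 key for a zip part. The key layout is inferred only from the sample keys shown on this page (for example dn/073/hg/2651/dn073hg2651.v0001.zip). The helper name zip_part_s3_key and the druid pattern are hypothetical and are not taken from the application code.

```ruby
# Hypothetical helper, not part of the application.
# Assumption: a druid is 2 letters, 3 digits, 2 letters, 4 digits, and the key
# is the druid split into those four groups, followed by "<druid>.v<NNNN><suffix>".
DRUID_PATTERN = /\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/

def zip_part_s3_key(druid, version, suffix = ".zip")
  match = DRUID_PATTERN.match(druid)
  raise ArgumentError, "unexpected druid format: #{druid.inspect}" unless match

  tree = match.captures.join("/")                 # e.g. "dn/073/hg/2651"
  format("%s/%s.v%04d%s", tree, druid, version, suffix)
end
```

For example, zip_part_s3_key("bn435ff2092", 2) returns "bn/435/ff/2092/bn435ff2092.v0002.zip", which matches the key in the sample output for that druid.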

Finding Which Druids Are Impacted:

Run this from the rails console:

druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq

Diagnosis for a Single Druid

Run the prescat:diagnose_replication rake task, passing the druid as the argument, to get the replication state:

RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]
  • If for a given endpoint and version, all are "ok" then the replication is correct.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • If for a given endpoint and version, all or some are "unreplicated" then it is unreplicated for that endpoint and version.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • If for a given version, an endpoint is missing then it is "unreplicated" for that endpoint and version.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
  • Sometimes the problem may resolve on its own. To verify this:

Re-run diagnose_replication:

RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]

OR

Run a Moab Replication Audit (which will report issues to Honeybadger):

RAILS_ENV=production bin/rake prescat:audit:replication_single[dn073hg2651]
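
The three rules above can be sketched as a small script that reads the diagnose_replication output and reports a state per version and endpoint. This is a minimal sketch, not part of the application: the method name replication_summary is hypothetical, the endpoint list is assumed from the sample output on this page, and any zip part status other than "ok" is treated as unreplicated.

```ruby
require "csv"

# Assumption: these are the three endpoints, as seen in the sample output.
EXPECTED_ENDPOINTS = %w[aws_s3_east_1 aws_s3_west_2 ibm_us_south].freeze

# Returns { zipped moab version => { endpoint => :ok | :unreplicated | :missing } }.
# csv_text is the diagnose_replication output, including its header line.
def replication_summary(csv_text, endpoints = EXPECTED_ENDPOINTS)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  rows.group_by { |row| row["zipped moab version"] }.transform_values do |version_rows|
    endpoints.to_h do |endpoint|
      parts = version_rows.select { |row| row["endpoint"] == endpoint }
      state =
        if parts.empty?
          :missing        # no rows at all for this endpoint and version
        elsif parts.all? { |row| row["zip part status"] == "ok" }
          :ok             # every zip part is "ok"
        else
          :unreplicated   # simplification: any non-"ok" part counts
        end
      [endpoint, state]
    end
  end
end
```

Assuming the rake task prints only the CSV, its output could be redirected to a file and passed in with replication_summary(File.read("diagnosis.csv")).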

Diagnosis for a list of Druids

Diagnosis can also be performed in bulk from the Rails console using a list of druids. The following gets debug info for unreplicated zip parts:

druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
debug_infos = Audit::ReplicationSupport.zip_part_debug_info(druids)
# Wait for it ...
CSV.open('debug_info.csv', 'wb') {|csv| debug_infos.each {|debug_info| csv << debug_info }}

The output has no header row. The column names are defined in app/services/audit/replication_support.rb and are the same as for the diagnose_replication task:

druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
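
Because the debug info rows carry no header, it can help to write one when saving the file. The sketch below assumes each debug info entry is an array of values in the column order listed above, as the CSV.open line implies. The constant and method names are hypothetical, and the header list is copied from this page, so it should be checked against app/services/audit/replication_support.rb before use.

```ruby
require "csv"

# Copied from this page; the authoritative list lives in
# app/services/audit/replication_support.rb and may have changed.
DEBUG_INFO_HEADERS = [
  "druid", "preserved object version", "zipped moab version", "endpoint",
  "zip part status", "zip part suffix", "zipped moab parts count",
  "zip part size", "zip part md5", "zip part id", "zip part created at",
  "zip part updated at", "zip part s3 key", "zip part endpoint status",
  "zip part endpoint md5"
].freeze

# Writes a header line followed by one line per debug info row.
def write_debug_info_csv(path, debug_infos, headers = DEBUG_INFO_HEADERS)
  CSV.open(path, "wb") do |csv|
    csv << headers
    debug_infos.each { |debug_info| csv << debug_info }
  end
end
```

In the Rails console this would be called as write_debug_info_csv('debug_info.csv', debug_infos), and the file can then be read back with CSV.read('debug_info.csv', headers: true).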

Troubleshooting

Case 1: ALL of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"

The output from rake prescat:diagnose_replication might look like this:

$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bn435ff2092]
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
bn435ff2092,2,1,aws_s3_east_1,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468306,2022-05-23 20:48:49 UTC,2023-04-13 22:31:49 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,aws_s3_west_2,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468305,2022-05-23 20:48:49 UTC,2023-04-13 22:31:48 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,ibm_us_south,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468308,2022-05-23 20:48:49 UTC,2023-04-13 22:31:50 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,2,aws_s3_east_1,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892115,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,aws_s3_west_2,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892109,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,ibm_us_south,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892112,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,

Note that in this case, zipped moab version 1 has status "ok" for all endpoints. Zipped moab version 2 has zip part status "unreplicated" and zip part endpoint status "not found at endpoint" for all endpoints.

Case 1 Remediation: Step 1 - Prune the database records

Prune the database records for the zip_parts that are not at the endpoints. There are three arguments to the rake task: the druid, the zip part version, and whether to assume that an existing record means we shouldn't mess with it (because it might be a replication in progress or in the retry queue on Sidekiq).

$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bn435ff2092,2,false]
pruned zipped moab version 2 on aws_s3_west_2
pruned zipped moab version 2 on ibm_us_south
pruned zipped moab version 2 on aws_s3_east_1

Note that pruning provides a message for each record pruned.
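
Since there is one message per pruned record, the messages can be compared against the endpoints that were expected to be pruned. The following is a minimal sketch that assumes the message format is exactly as in the sample output above; the method names parse_prune_output and unpruned_endpoints are hypothetical.

```ruby
# Assumption: each message looks like "pruned zipped moab version 2 on aws_s3_west_2".
PRUNE_MESSAGE = /\Apruned zipped moab version (\d+) on (\S+)\z/

# Returns [[version, endpoint], ...] for every line that matches the message format.
def parse_prune_output(output)
  output.each_line.filter_map do |line|
    match = PRUNE_MESSAGE.match(line.strip)
    [Integer(match[1]), match[2]] if match
  end
end

# Returns the endpoints that were expected to be pruned for the version
# but have no corresponding message in the output.
def unpruned_endpoints(output, version, expected_endpoints)
  pruned = parse_prune_output(output).select { |v, _endpoint| v == version }.map(&:last)
  expected_endpoints - pruned
end
```

An empty result from unpruned_endpoints means every expected endpoint was reported as pruned. It does not replace re-running diagnose_replication.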

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.

Case 1 Remediation: Step 2 - Rerun replication audit

Re-run the replication audit for the druid. We have backfill on by default, so once the database records are removed, the audit will do the replication.

$ RAILS_ENV=production bin/rake prescat:audit:replication_single[bn435ff2092]

This will report errors to Honeybadger.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.

Case 2: SOME of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"

In this case, we only want to remediate replication for the missing endpoint(s) for the version; we can leave the zip_parts with "ok" status and "found at endpoint" alone.

The output from RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651] might look like this:

druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint

Note that 1 of the endpoints has a zip_part status of "unreplicated" and an endpoint status of "not found at endpoint", while the other two have a status of "ok" (and "found at endpoint").

Case 2 Remediation: Step 1 - Prune the database record(s)

Prune the database records for the zip_parts that are not at the endpoints.

[TBD: I HAVE NOT TESTED THE RAKE TASK FOR THIS CASE - DOES IT LEAVE THE ONES WITH "ok" STATUS ALONE?] There are three arguments to the rake task: the druid, the zip part version, and whether to assume that an existing record means we shouldn't mess with it (because it might be a replication in progress or in the retry queue on sidekiq).

$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[dn073hg2651,1,true]
pruned zipped moab version 1 on aws_s3_east_1

Note that pruning provides a message for each record pruned.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.

Case 2 Remediation: Step 2 - Rerun replication audit

Re-run the replication audit for the druid. We have backfill on by default, so once the database records are removed, the audit will do the replication.

$ RAILS_ENV=production bin/rake prescat:audit:replication_single[dn073hg2651]

This will report errors to Honeybadger.

Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.

Case 3: a zip_part status is "unreplicated" but it is "found at endpoint" and the checksums match

If a ZipPart status is "unreplicated" but it IS "found at endpoint" and the checksum at the endpoint matches the checksum stored in ZipPart, then we can just change the ZipPart status to "ok."

# in Rails console
> druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
 => ["bh230tw3168", "bk773sj6920", "bk815hr0743"] 

# in shell, at rails root, look up the diagnosis for the druid:
$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bh230tw3168]
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
bh230tw3168,2,1,aws_s3_east_1,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796826,2022-10-31 09:32:40 UTC,2023-04-13 22:30:28 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,1,aws_s3_west_2,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796824,2022-10-31 09:32:39 UTC,2023-04-13 22:30:25 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,1,ibm_us_south,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796825,2022-10-31 09:32:39 UTC,2023-04-13 22:30:31 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,2,aws_s3_east_1,unreplicated,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805720,2023-04-13 01:25:54 UTC,2023-04-13 01:25:54 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
bh230tw3168,2,2,aws_s3_west_2,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805718,2023-04-13 01:25:54 UTC,2023-04-13 22:30:26 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
bh230tw3168,2,2,ibm_us_south,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805719,2023-04-13 01:25:54 UTC,2023-04-13 22:30:33 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243

# back in rails console, compare the checksums above (have a computer do it, not your eyeballs):
> s1 = 'ef96d8272c1e18a36723e68fb9b34243' # <- cut and pasted from ZipPart checksum column
> s2 = 'ef96d8272c1e18a36723e68fb9b34243' # <- cut and pasted from endpoint checksum column
>  s1 == s2
 => true 
# Yay - we are in this case.

# Now:  find the desired ZipPart:
# this is a clunky way, but it works for a small number of druids:
> zz = ZipPart.unreplicated.to_a # find ALL unreplicated ZipParts
> zz
 => 
[#<ZipPart:0x000055ea936f9a18
  id: 262805720,
  size: 45844,
  zipped_moab_version_id: 33360781,
  created_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
  updated_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
  md5: "ef96d8272c1e18a36723e68fb9b34243",
  create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/bh/230/tw/3168/bh230tw3168.v0002.zip bh230tw3168/v0002\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
  parts_count: 1,
  suffix: ".zip",
  status: "unreplicated",
  last_existence_check: nil,
  last_checksum_validation: nil>,
... (more unreplicated zip parts)
> z = ZipPart.find(262805720)  # the ZipPart of interest
> z
 => 
#<ZipPart:0x000055ea91b55de8
 id: 262805720,
 size: 45844,
 zipped_moab_version_id: 33360781,
 created_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
 updated_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
 md5: "ef96d8272c1e18a36723e68fb9b34243",
 create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/bh/230/tw/3168/bh230tw3168.v0002.zip bh230tw3168/v0002\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
 parts_count: 1,
 suffix: ".zip",
 status: "unreplicated",
 last_existence_check: nil,
 last_checksum_validation: nil> 

# update the status to ok

> z.status = "ok"
 => "ok" 
> z.save
 => true 
# double check the object has been updated
> z
 => 
#<ZipPart:0x000055ea91b55de8
 id: 262805720,
 size: 45844,
 zipped_moab_version_id: 33360781,
 created_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
 updated_at: Mon, 17 Apr 2023 18:19:38.085802759 UTC +00:00,
 md5: "ef96d8272c1e18a36723e68fb9b34243",
 create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/bh/230/tw/3168/bh230tw3168.v0002.zip bh230tw3168/v0002\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
 parts_count: 1,
 suffix: ".zip",
 status: "ok",
 last_existence_check: nil,
 last_checksum_validation: nil> 

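# double check with rake task (in shell, at rails root):
$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bh230tw3168]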
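
As an alternative to listing every unreplicated ZipPart and comparing checksums by hand, the Case 3 rows can be picked out of the diagnose_replication output directly. This is a minimal sketch in plain Ruby, assuming the 15-column output that includes the "zip part md5", "zip part id" and "zip part endpoint md5" columns; the method name case_3_zip_part_ids is hypothetical.

```ruby
require "csv"

# Returns the "zip part id" of every row that matches Case 3:
# status "unreplicated", found at the endpoint, and identical md5 values.
def case_3_zip_part_ids(csv_text)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  matching = rows.select do |row|
    row["zip part status"] == "unreplicated" &&
      row["zip part endpoint status"] == "found at endpoint" &&
      !row["zip part md5"].to_s.empty? &&                 # never match on blank checksums
      row["zip part md5"] == row["zip part endpoint md5"]
  end
  matching.map { |row| Integer(row["zip part id"]) }
end
```

Each returned id can then be looked up in the Rails console with ZipPart.find(id), as in the transcript above.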


Case 4: a zip_part status is "unreplicated" but it is "found at endpoint" and the checksums don't match

  1. If any of the zip parts are unreplicated but found at the zip endpoint, then request that Ops delete the zips. (Only Ops has permissions to delete. Request deletes by filing an Ops ticket, providing the relevant lines from zip_part_debug_info.)
  2. Run Replication::FailureRemediator using the rake task to prune the database:
RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9]

By default, this task does not prune any recent ZippedMoabVersions to avoid deleting records which may have jobs currently in process. To override this:

RAILS_ENV=production bin/rake prescat:prune_failed_replication[bb001zc5754,9,true]
  3. Backfill the pruned database records and initiate replication:
RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]

Replication jobs may take some amount of time to complete.
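
The four cases above can be told apart mechanically from the diagnose_replication output. The sketch below is one possible way to do that and is not part of the application: the method names are hypothetical, it assumes the 15-column output with md5 columns, and it returns :unknown whenever the checksums needed for Case 3 or Case 4 are not available.

```ruby
require "csv"

NOT_FOUND = "not found at endpoint"
FOUND = "found at endpoint"

def unreplicated?(row)
  row["zip part status"] == "unreplicated"
end

# Maps one unreplicated row to a case; whole_version_missing is true when
# every row of that version is unreplicated and not found at its endpoint.
def case_for(row, whole_version_missing)
  case row["zip part endpoint status"]
  when NOT_FOUND
    whole_version_missing ? :case_1 : :case_2
  when FOUND
    expected = row["zip part md5"]
    actual = row["zip part endpoint md5"]
    return :unknown if expected.nil? || actual.nil?

    expected == actual ? :case_3 : :case_4
  else
    :unknown
  end
end

# Returns { [version, endpoint, suffix] => case } for each unreplicated zip part.
def triage(csv_text)
  rows = CSV.parse(csv_text, headers: true).map(&:to_h)
  rows.group_by { |row| row["zipped moab version"] }.flat_map do |version, version_rows|
    whole_version_missing = version_rows.all? do |row|
      unreplicated?(row) && row["zip part endpoint status"] == NOT_FOUND
    end
    version_rows.select { |row| unreplicated?(row) }.map do |row|
      [[version, row["endpoint"], row["zip part suffix"]], case_for(row, whole_version_missing)]
    end
  end.to_h
end
```

The result is only a hint about which remediation section to read. Endpoints that have no row at all for a version do not appear in it.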
