Replication errors
Each moab has multiple versions. A zip of each version should be replicated to 3 endpoints. The zip for a version may be split into multiple parts depending on size, so a single version can have multiple zip files.
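As a rough sketch of how these records relate in the Rails console (the has_many association names are inferred from the ZipPart query used below; verify them in the models before relying on them):
po = PreservedObject.find_by(druid: 'dn073hg2651')  # one record per druid
po.zipped_moab_versions                  # one ZippedMoabVersion per (version, endpoint) pair
po.zipped_moab_versions.first.zip_parts  # one ZipPart per zip file (or segment)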
Diagnosis
First, from the Rails console, get the list of druids with unreplicated zip parts:
druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
Then get the replication state for each druid using the rake task:
RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]
- If, for a given endpoint and version, all parts are "ok", then replication is correct. (A sketch that applies these three rules to the CSV output programmatically follows the examples below.)
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
- If, for a given endpoint and version, all or some parts are "unreplicated", then that version is unreplicated for that endpoint.
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
- If, for a given version, an endpoint is missing from the output entirely, then that version is "unreplicated" for that endpoint (in the example below, aws_s3_west_2 is absent).
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
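These three rules can be applied mechanically rather than by eyeball. A minimal sketch, assuming the diagnose_replication output has been saved without its header row to a file named diagnosis.csv, and that the three endpoints are the ones shown above:
require 'csv'

ENDPOINTS = %w[aws_s3_east_1 aws_s3_west_2 ibm_us_south]

# column positions per the output above: 0=druid, 2=zipped moab version,
# 3=endpoint, 4=zip part status
CSV.read('diagnosis.csv').group_by { |row| [row[0], row[2]] }.each do |(druid, version), rows|
  ENDPOINTS.each do |endpoint|
    statuses = rows.select { |row| row[3] == endpoint }.map { |row| row[4] }
    verdict = if statuses.empty?
                'unreplicated (endpoint missing from output)'
              elsif statuses.all?('ok')
                'ok'
              else
                'unreplicated'
              end
    puts "#{druid} v#{version} #{endpoint}: #{verdict}"
  end
end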
Sometimes the problem resolves on its own. To verify this, re-run diagnose_replication:
RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651]
OR run a Moab Replication Audit (which will report problems to Honeybadger):
RAILS_ENV=production bin/rake prescat:audit:replication_single[fd812vz8360]
Diagnosis for a list of Druids
Diagnosis can also be performed in bulk from the Rails console, using a list of druids. The following gets debug info for unreplicated zip parts:
druids = ZipPart.unreplicated.joins(zipped_moab_version: :preserved_object).pluck(:druid).uniq
debug_infos = Audit::ReplicationSupport.zip_part_debug_info(druids)
# Wait for it ...
CSV.open('debug_info.csv', 'wb') {|csv| debug_infos.each {|debug_info| csv << debug_info }}
The output does not include a header row; the column names are defined in app/services/audit/replication_support.rb and are the same as for the diagnose_replication task:
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
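If you want a header row in the file, a small variation on the snippet above (column names copied from that list):
headers = ['druid', 'preserved object version', 'zipped moab version', 'endpoint',
           'zip part status', 'zip part suffix', 'zipped moab parts count', 'zip part size',
           'zip part md5', 'zip part id', 'zip part created at', 'zip part updated at',
           'zip part s3 key', 'zip part endpoint status', 'zip part endpoint md5']
CSV.open('debug_info.csv', 'wb') do |csv|
  csv << headers   # header row first
  debug_infos.each { |debug_info| csv << debug_info }
end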
Note that with Sidekiq, retries of replication failures are automatic. If a ZipPart is "not found at endpoint" per the diagnose_replication rake task, and the unreplicated ZipParts are "recent" (is this 1 week? how long do Sidekiq retries keep trying?), then the problem is likely to resolve itself. The ZipPart timestamps indicate how recently each record was updated.
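To surface only the cases that are old enough that automatic retries have probably given up, a console sketch (the one-week threshold is a guess; adjust it to however long your Sidekiq retries actually run):
# druids whose unreplicated zip parts are older than the assumed retry window
ZipPart.unreplicated
       .where('zip_parts.created_at < ?', 1.week.ago)
       .joins(zipped_moab_version: :preserved_object)
       .pluck(:druid)
       .uniq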
Case 1: ALL of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"
The output from rake prescat:diagnose_replication might look like this:
$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bn435ff2092]
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
bn435ff2092,2,1,aws_s3_east_1,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468306,2022-05-23 20:48:49 UTC,2023-04-13 22:31:49 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,aws_s3_west_2,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468305,2022-05-23 20:48:49 UTC,2023-04-13 22:31:48 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,1,ibm_us_south,ok,.zip,1,266315221,f3df854dced7c26a1ac31285a1cf3cad,259468308,2022-05-23 20:48:49 UTC,2023-04-13 22:31:50 UTC,bn/435/ff/2092/bn435ff2092.v0001.zip,found at endpoint,f3df854dced7c26a1ac31285a1cf3cad
bn435ff2092,2,2,aws_s3_east_1,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892115,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,aws_s3_west_2,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892109,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
bn435ff2092,2,2,ibm_us_south,unreplicated,.zip,1,45248,00500cdeb23c3412cbd300296e7df0d4,262892112,2023-04-13 04:05:30 UTC,2023-04-13 04:05:30 UTC,bn/435/ff/2092/bn435ff2092.v0002.zip,not found at endpoint,
Note that in this case, zipped moab version 1 has status "ok" for all endpoints, while zipped moab version 2 has status "unreplicated" for all endpoints and its zip endpoint status is "not found at endpoint" everywhere.
Prune the database records for the zip_parts that are not at the endpoints.
The rake task takes three arguments: the druid, the zip part version, and whether to assume that an existing record means we shouldn't touch it (it might be a replication in progress or in the Sidekiq retry queue). By default, this task does not prune any recent ZippedMoabVersions, to avoid deleting records whose jobs may still be in flight. To override this, pass "false" as the last argument:
$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bn435ff2092,2,false]
pruned zipped moab version 2 on aws_s3_west_2
pruned zipped moab version 2 on ibm_us_south
pruned zipped moab version 2 on aws_s3_east_1
Note that pruning provides a message for each record pruned.
Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.
Re-run the replication audit for the druid. We have backfill on by default, so once the database records are removed, the audit will do the replication.
$ RAILS_ENV=production bin/rake prescat:audit:replication_single[bn320qt6030]
This will report errors to Honeybadger.
NOTE: If your object has many versions, or the files are large (there is a size column in the diagnose_replication output), you may want to run just the needed backfill with this rake task:
RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]
Replication jobs may take some time to complete.
Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.
Case 2: SOME of the zips for a version are "not found at endpoint" and the zip part status is "unreplicated"
In this case, we only want to remediate replication for the missing endpoint(s) for the version; we can leave the zip_parts with "ok" status and "found at endpoint" alone.
The output from RAILS_ENV=production bin/rake prescat:diagnose_replication[dn073hg2651] might look like this:
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status
dn073hg2651,1,1,aws_s3_east_1,unreplicated,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,not found at endpoint
dn073hg2651,1,1,aws_s3_west_2,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
dn073hg2651,1,1,ibm_us_south,ok,.zip,1,26567,2023-01-12 21:10:44 UTC,2023-01-12 21:10:44 UTC,dn/073/hg/2651/dn073hg2651.v0001.zip,found at endpoint
Note that one of the endpoints has a ZipPart status of "unreplicated" and an endpoint status of "not found at endpoint", while the other two have status "ok" (and "found at endpoint").
[TBD: THIS MAY BE THE SAME AS CASE 1 IF THE DATABASE PRUNING AND THE BACKFILL TASKS LEAVE THE 'ok' ZIP_PARTS ALONE]
Prune the database records for the zip_parts that are not at the endpoints.
[TBD: I HAVE NOT TESTED THE RAKE TASK FOR THIS CASE - WILL IT LEAVE THE ONES WITH "ok" STATUS ALONE?]
The rake task takes three arguments: the druid, the zip part version, and whether to assume that an existing record means we shouldn't touch it (it might be a replication in progress or in the Sidekiq retry queue). By default, this task does not prune any recent ZippedMoabVersions, to avoid deleting records whose jobs may still be in flight. To override this, pass "false" as the last argument:
$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[dn073hg2651,1,false]
pruned zipped moab version 1 on aws_s3_east_1
Note that pruning provides a message for each record pruned.
Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.
Re-run the replication audit for the druid. We have backfill on by default, so once the database records are removed, the audit will do the replication.
$ RAILS_ENV=production bin/rake prescat:audit:replication_single[dn073hg2651]
This will report errors to Honeybadger.
[TBD: I HAVE NOT TESTED THE RAKE TASK BELOW FOR THIS CASE - WILL IT LEAVE THE ONES WITH "ok" STATUS ALONE?]
NOTE: If your object has many versions, or the files are large (there is a size column in the diagnose_replication output), you may want to run just the needed backfill with this rake task:
RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]
Replication jobs may take some time to complete.
Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.
Case 3: a ZipPart status is "unreplicated" but it IS "found at endpoint" and the checksums match
If a ZipPart status is "unreplicated" but it IS "found at endpoint", and the checksum at the endpoint matches the checksum stored on the ZipPart, then we can simply change the ZipPart status to "ok".
# in shell, at rails root, look up the diagnosis for the druid:
$ RAILS_ENV=production bin/rake prescat:diagnose_replication[bh230tw3168]
druid,preserved object version,zipped moab version,endpoint,zip part status,zip part suffix,zipped moab parts count,zip part size,zip part md5,zip part id,zip part created at,zip part updated at,zip part s3 key,zip part endpoint status,zip part endpoint md5
bh230tw3168,2,1,aws_s3_east_1,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796826,2022-10-31 09:32:40 UTC,2023-04-13 22:30:28 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,1,aws_s3_west_2,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796824,2022-10-31 09:32:39 UTC,2023-04-13 22:30:25 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,1,ibm_us_south,ok,.zip,1,38285937,c2caf028a9da60b0b30a4a8a27cb4b04,261796825,2022-10-31 09:32:39 UTC,2023-04-13 22:30:31 UTC,bh/230/tw/3168/bh230tw3168.v0001.zip,found at endpoint,c2caf028a9da60b0b30a4a8a27cb4b04
bh230tw3168,2,2,aws_s3_east_1,unreplicated,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805720,2023-04-13 01:25:54 UTC,2023-04-13 01:25:54 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
bh230tw3168,2,2,aws_s3_west_2,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805718,2023-04-13 01:25:54 UTC,2023-04-13 22:30:26 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
bh230tw3168,2,2,ibm_us_south,ok,.zip,1,45844,ef96d8272c1e18a36723e68fb9b34243,262805719,2023-04-13 01:25:54 UTC,2023-04-13 22:30:33 UTC,bh/230/tw/3168/bh230tw3168.v0002.zip,found at endpoint,ef96d8272c1e18a36723e68fb9b34243
Use a computer to compare the checksums, not your eyeballs. You can do it in the Rails console:
> s1 = 'ef96d8272c1e18a36723e68fb9b34243' # <- cut and pasted from ZipPart checksum column
> s2 = 'ef96d8272c1e18a36723e68fb9b34243' # <- cut and pasted from endpoint checksum column
> s1 == s2
=> true
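To compare several rows at once without copy/paste, here is a sketch using the bulk debug info described in the Diagnosis section (column positions are taken from the header list there; spot-check them before trusting the output):
# 0=druid, 2=zipped moab version, 3=endpoint, 8=zip part md5, 14=endpoint md5
Audit::ReplicationSupport.zip_part_debug_info(['bh230tw3168']).each do |row|
  # a nil endpoint md5 means the part was not found at the endpoint at all
  verdict = row[8] == row[14] ? 'md5 match' : 'MD5 MISMATCH'
  puts "#{row[0]} v#{row[2]} #{row[3]}: #{verdict}"
end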
If the checksums do not match, it is not a Case 3 problem - go to Case 4.
The ZipPart id is in the diagnose_replication output for the druid. Let's say it is 262805720.
From the rails console:
> z = ZipPart.find(262805720)
=>
#<ZipPart:0x000055ea91b55de8
id: 262805720,
size: 45844,
zipped_moab_version_id: 33360781,
created_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
updated_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
md5: "ef96d8272c1e18a36723e68fb9b34243",
create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/bh/230/tw/3168/bh230tw3168.v0002.zip bh230tw3168/v0002\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
parts_count: 1,
suffix: ".zip",
status: "unreplicated",
last_existence_check: nil,
last_checksum_validation: nil>
> z.status = "ok" # <-- updating the status to 'ok'
=> "ok"
> z.save # <-- saving the updated status
=> true
> z = ZipPart.find(262805720) # <-- double checking the object has been updated
=>
#<ZipPart:0x000055ea91b55de8
id: 262805720,
size: 45844,
zipped_moab_version_id: 33360781,
created_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
updated_at: Thu, 13 Apr 2023 01:25:54.267985000 UTC +00:00,
md5: "ef96d8272c1e18a36723e68fb9b34243",
create_info: "{:zip_cmd=>\"zip -r0X -s 10g /sdr-transfers/bh/230/tw/3168/bh230tw3168.v0002.zip bh230tw3168/v0002\", :zip_version=>\"Zip 3.0 (July 5th 2008)\"}",
parts_count: 1,
suffix: ".zip",
status: "ok",
last_existence_check: nil,
last_checksum_validation: nil>
You can also double check with the diagnose_replication rake task.
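Equivalently, the find/assign/save steps above can be collapsed into a single console call (same ZipPart id as in the example):
> ZipPart.find(262805720).update(status: "ok")
=> true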
Case 4: a ZipPart status is "unreplicated" but it is "found at endpoint" and the checksums don't match
If any of the zip parts are unreplicated but found at the zip endpoint, then request that Ops delete the zips. Only Ops has permissions to delete. Request deletes by filing an Ops ticket, providing the relevant lines from zip_part_debug_info. The instructions for getting debug_info are in the Diagnosis section above, under the heading 'Diagnosis for a list of Druids'.
Prune the database records for the zip_parts that are not at the endpoints.
There is a rake task for this, but it deletes the ZipPart and ZippedMoabVersion records for ALL the endpoints. If you don't wish to do that, you will need to chase the code to determine how to remove the database records for only the endpoint(s) needed.
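An untested console sketch of that per-endpoint removal, assuming the association names used elsewhere on this page and a zip_endpoints.endpoint_name column; verify against the prune_failed_replication task source before running it in production:
po = PreservedObject.find_by(druid: 'dn073hg2651')
zmv = po.zipped_moab_versions
        .joins(:zip_endpoint)
        .find_by(version: 1, zip_endpoints: { endpoint_name: 'aws_s3_east_1' })
zmv.zip_parts.destroy_all   # remove the parts for just this endpoint's copy
zmv.destroy                 # then the ZippedMoabVersion record itself
If pruning all endpoints is acceptable, the rake task is simpler.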
The rake task takes three arguments: the druid, the zip part version, and whether to assume that an existing record means we shouldn't touch it (it might be a replication in progress or in the Sidekiq retry queue). By default, this task does not prune any recent ZippedMoabVersions, to avoid deleting records whose jobs may still be in flight. To override this, pass "false" as the last argument:
$ RAILS_ENV=production bin/rake prescat:prune_failed_replication[bn435ff2092,2,false]
pruned zipped moab version 2 on aws_s3_west_2
pruned zipped moab version 2 on ibm_us_south
pruned zipped moab version 2 on aws_s3_east_1
Note that pruning provides a message for each record pruned.
Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the pruning worked.
Re-run the replication audit for the druid with rake. We have backfill on by default, so once the database records are removed, the audit will do the replication.
$ RAILS_ENV=production bin/rake prescat:audit:replication_single[dn073hg2651]
This will report errors to Honeybadger.
NOTE: If your object has many versions, or the files are large (there is a size column in the diagnose_replication output), you may want to run just the needed backfill with this rake task:
RAILS_ENV=production bin/rake prescat:backfill[bb001zc5754]
Replication jobs may take some time to complete.
Re-run RAILS_ENV=production bin/rake prescat:diagnose_replication[yer-druid] to confirm that the replication worked.