Fixing a stuck Moab
Problem: a new druid-version has been created by upstream processes, but prescat is not replicating it to archive endpoints.
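A quick first check is to see which versions prescat has replication records for (a sketch using the same console helpers that appear later on this page; run it in the prescat Rails console):
> ZippedMoabVersion.by_druid('fm813sn1247').pluck(:version, :zip_endpoint_id)
# a missing row for the new version on an endpoint would confirm that replication never happened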
Note that the version is 3, though we know it's version 4 on disk. Note also the status, "invalid_checksum"; for this druid, I think this was a holdover from the checksum mismatch bug.
[1] pry(main)> cm = CompleteMoab.by_druid('fm813sn1247')
=> [#<CompleteMoab:0x0000000005c70e48
id: 405619,
version: 3,
preserved_object_id: 405619,
moab_storage_root_id: 4,
created_at: Sun, 21 Jan 2018 00:02:23 UTC +00:00,
updated_at: Wed, 17 Oct 2018 09:22:35 UTC +00:00,
last_moab_validation: Sun, 21 Jan 2018 00:02:23 UTC +00:00,
last_checksum_validation: Tue, 07 Aug 2018 15:49:05 UTC +00:00,
size: 84231379935,
status: "invalid_checksum",
last_version_audit: Mon, 06 Aug 2018 09:57:50 UTC +00:00,
last_archive_audit: Wed, 17 Oct 2018 09:22:35 UTC +00:00>]
Clear the invalid checksum error by validating again.
[6] pry(main)> CompleteMoab.by_druid('fm813sn1247').each(&:validate_checksums!)
Check cm again to verify that the checksums now validate (output not shown).
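For example (a minimal sketch; after a successful validation the status should read "ok"):
> CompleteMoab.by_druid('fm813sn1247').map(&:status)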
We know it's really at version 4; force prescat to look again and update the record.
First, get the storage_location (we know it's on root 4 from the cm object, above).
[6] pry(main)> storageroot = MoabStorageRoot.find_by(id: 4)
=> #<MoabStorageRoot:0x0000000006131bd0
id: 4,
name: "services-disk05",
created_at: Thu, 18 Jan 2018 18:55:35 UTC +00:00,
updated_at: Thu, 18 Jan 2018 18:55:35 UTC +00:00,
storage_location: "/services-disk05/sdr2objects">
Now do a synchronous catalog check (you could also use perform_later for an async run). Note how the third argument to MoabToCatalogJob is constructed from the storage root's storage_location plus the druid-tree path for the druid.
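For example, the path argument could be built rather than typed by hand (a sketch: DruidTools is the gem used elsewhere in the SDR stack for druid-tree paths, so verify the API against your checkout):
> DruidTools::Druid.new('fm813sn1247', storageroot.storage_location).path
# => "/services-disk05/sdr2objects/fm/813/sn/1247/fm813sn1247"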
[7] pry(main)> MoabToCatalogJob.perform_now( storageroot, "fm813sn1247", "/services-disk05/sdr2objects/fm/813/sn/1247/fm813sn1247")
Performing MoabToCatalogJob (Job ID: a0abea21-e495-465f-8e2a-536ca5b929fc) from Resque(m2c) with arguments: #<GlobalID:0x0000000006180140 @uri=#<URI::GID gid://preservation-catalog/MoabStorageRoot/4>>, "fm813sn1247", "/services-disk05/sdr2objects/fm/813/sn/1247/fm813sn1247"
check_existence fm813sn1247 called
Enqueued ChecksumValidationJob (Job ID: 335ded51-6121-4987-a1df-01b6dde63697) to Resque(checksum_validation) with arguments: #<GlobalID:0x000000000655bbe8 @uri=#<URI::GID gid://preservation-catalog/CompleteMoab/405619>>
check_existence(fm813sn1247, services-disk05) CompleteMoab status changed from unexpected_version_on_storage to validity_unknown
check_existence(fm813sn1247, services-disk05) actual version (4) greater than CompleteMoab db version (3)
Performed MoabToCatalogJob (Job ID: a0abea21-e495-465f-8e2a-536ca5b929fc) from Resque(m2c) in 653.11ms
=> [{:cm_status_changed=>"CompleteMoab status changed from unexpected_version_on_storage to validity_unknown"},
{:actual_vers_gt_db_obj=>"actual version (4) greater than CompleteMoab db version (3)"}]
Note that M2C picked up the new version number.
Note that the version is now correct and the status is "ok". A new version automatically invokes zipmaker, so the object should have been replicated to the endpoints.
[9] pry(main)> cm = CompleteMoab.by_druid('fm813sn1247')
=> #<CompleteMoab:0x000000000312f568
id: 405619,
version: 4,
preserved_object_id: 405619,
moab_storage_root_id: 4,
created_at: Sun, 21 Jan 2018 00:02:23 UTC +00:00,
updated_at: Tue, 06 Nov 2018 22:37:38 UTC +00:00,
last_moab_validation: Tue, 06 Nov 2018 22:37:38 UTC +00:00,
last_checksum_validation: Tue, 06 Nov 2018 22:37:38 UTC +00:00,
size: 84231431008,
status: "ok",
last_version_audit: Tue, 06 Nov 2018 22:37:38 UTC +00:00,
last_archive_audit: Wed, 17 Oct 2018 09:22:35 UTC +00:00>
Note that version 4 was created today, shortly before the last_version_audit timestamp from above.
[11] pry(main)> zmv = ZippedMoabVersion.by_druid("fm813sn1247")
=> [#<ZippedMoabVersion:0x00000000065160c0
id: 8300158,
version: 4,
complete_moab_id: 405619,
zip_endpoint_id: 1,
created_at: Tue, 06 Nov 2018 22:27:54 UTC +00:00,
updated_at: Tue, 06 Nov 2018 22:27:54 UTC +00:00>,
#<ZippedMoabVersion:0x0000000006515f58
id: 7851133,
version: 3,
complete_moab_id: 405619,
zip_endpoint_id: 1,
created_at: Sun, 02 Sep 2018 23:07:11 UTC +00:00,
updated_at: Sun, 02 Sep 2018 23:07:11 UTC +00:00>,
#<ZippedMoabVersion:0x0000000006515d78
id: 7851131,
version: 2,
complete_moab_id: 405619,
zip_endpoint_id: 1,
created_at: Sun, 02 Sep 2018 23:07:11 UTC +00:00,
updated_at: Sun, 02 Sep 2018 23:07:11 UTC +00:00>,
#<ZippedMoabVersion:0x0000000006515c38
id: 7851129,
version: 1,
complete_moab_id: 405619,
zip_endpoint_id: 1,
created_at: Sun, 02 Sep 2018 23:07:11 UTC +00:00,
updated_at: Sun, 02 Sep 2018 23:07:11 UTC +00:00>]
This particular druid had another issue: an existing druid-version-zip in our transfers area, left over from a prior failed attempt to debug the problem. As a result, the ZippedMoabVersion did not successfully replicate to the endpoints. Firing off zipmaker manually fixed this.
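Before re-enqueueing, it may help to locate (and remove) the stale zip parts in the transfers area. A hedged sketch, assuming Settings.zip_storage points at the transfers directory root (verify on your deployment before deleting anything):
> Dir.glob(File.join(Settings.zip_storage, 'fm/813/sn/1247/fm813sn1247.v0004*')) # assumes Settings.zip_storage is the transfers root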
[12] pry(main)> ZipmakerJob.perform_later('fm813sn1247', 4)
Enqueued ZipmakerJob (Job ID: 91594477-b182-4f47-8566-36c6b6a465fe) to Resque(zipmaker) with arguments: "fm813sn1247", 4
=> #<ZipmakerJob:0x00000000064075a8
@arguments=["fm813sn1247", 4],
@executions=0,
@job_id="91594477-b182-4f47-8566-36c6b6a465fe",
@priority=nil,
@queue_name="zipmaker">
Alternatively, if a specific ZippedMoabVersion is stuck as unreplicated, call replicate! on that existing version directly (this example is from a different druid):
> zmv = ZippedMoabVersion.by_druid('hq932bt8082').find_by(version: 3)
=> #<ZippedMoabVersion:0x000000000461bf58 id: 6404, version: 3, last_existence_check: nil, complete_moab_id: 1401082, zip_endpoint_id: 2, created_at: Mon, 30 Jul 2018 16:59:51 UTC +00:00, updated_at: Mon, 30 Jul 2018 16:59:51 UTC +00:00, status: "unreplicated">
> zmv.replicate!
Enqueued ZipmakerJob (Job ID: 53eafb1e-e5fc-43b9-9ec0-feabb5e330a9) to Resque(zipmaker) with arguments: "hq932bt8082", 1
=> #<ZipmakerJob:0x0000000005df7aa0 @arguments=["hq932bt8082", 1], @executions=0, @job_id="53eafb1e-e5fc-43b9-9ec0-feabb5e330a9", @priority=nil, @queue_name="zipmaker">
Finally, the AWS CLI was used to verify that all expected zips were present on the endpoint:
fm/813/sn/1247/fm813sn1247.v0001.z01
fm/813/sn/1247/fm813sn1247.v0001.z02
fm/813/sn/1247/fm813sn1247.v0001.z03
fm/813/sn/1247/fm813sn1247.v0001.z04
fm/813/sn/1247/fm813sn1247.v0001.z05
fm/813/sn/1247/fm813sn1247.v0001.z06
fm/813/sn/1247/fm813sn1247.v0001.z07
fm/813/sn/1247/fm813sn1247.v0001.zip
fm/813/sn/1247/fm813sn1247.v0002.zip
fm/813/sn/1247/fm813sn1247.v0003.zip
fm/813/sn/1247/fm813sn1247.v0004.zip
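A listing like this can be produced with something along the lines of: aws s3 ls s3://<bucket>/fm/813/sn/1247/ --recursive (the bucket name is deployment-specific and elided here).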
Here's an interesting one: the PreservedObject (PO) for a druid says it's version 3, the CompleteMoab (CM) says version 5, and disk says version 6.
[13] pry(main)> po = PreservedObject.find_by(druid: 'kf921gd3855')
=> #<PreservedObject:0x000000000428e750
id: 1415209,
druid: "kf921gd3855",
current_version: 3,
created_at: Wed, 05 Sep 2018 02:39:49 UTC +00:00,
updated_at: Tue, 18 Sep 2018 20:04:16 UTC +00:00,
preservation_policy_id: 1>
[17] pry(main)> cm = CompleteMoab.by_druid('kf921gd3855')
=> [#<CompleteMoab:0x0000000006429ba8
id: 1415217,
version: 5,
preserved_object_id: 1415209,
moab_storage_root_id: 14,
created_at: Wed, 05 Sep 2018 02:39:49 UTC +00:00,
updated_at: Mon, 10 Dec 2018 07:33:02 UTC +00:00,
last_moab_validation: Mon, 10 Dec 2018 07:33:02 UTC +00:00,
last_checksum_validation: Mon, 10 Dec 2018 07:33:01 UTC +00:00,
size: 108335,
status: "ok",
last_version_audit: Mon, 10 Dec 2018 07:33:02 UTC +00:00,
last_archive_audit: Wed, 17 Oct 2018 14:43:46 UTC +00:00>]
The fix: tell the PO it's version 5 (matching the CM), then run M2C.
[19] pry(main)> po.current_version = 5
=> 5
[20] pry(main)> po.save!
=> true
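Note: storageroot was re-fetched between these steps for this object's root (moab_storage_root_id: 14, i.e. services-disk15); the elided lookup mirrors the earlier one:
> storageroot = MoabStorageRoot.find_by(id: 14)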
[21] pry(main)> MoabToCatalogJob.perform_now( storageroot, "kf921gd3855", "/services-disk15/sdr2objects/kf/921/gd/3855/kf921gd3855")
Performing MoabToCatalogJob (Job ID: 3e5e6732-cc8c-4c0f-8070-6f78b65933cc) from Resque(m2c) with arguments: #<GlobalID:0x0000000005b29be8 @uri=#<URI::GID gid://preservation-catalog/MoabStorageRoot/14>>, "kf921gd3855", "/services-disk15/sdr2objects/kf/921/gd/3855/kf921gd3855"
check_existence kf921gd3855 called
Enqueued ZipmakerJob (Job ID: 0bff41fa-d578-4ab9-8b7e-420b4124acb7) to Resque(zipmaker) with arguments: "kf921gd3855", 6
check_existence(kf921gd3855, services-disk15) actual version (6) greater than CompleteMoab db version (5)
Performed MoabToCatalogJob (Job ID: 3e5e6732-cc8c-4c0f-8070-6f78b65933cc) from Resque(m2c) in 674.09ms
=> [{:actual_vers_gt_db_obj=>"actual version (6) greater than CompleteMoab db version (5)"}]
Note that this automatically fired off ZipmakerJob for the new version.
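To confirm the new version's replication records afterwards, check for the expected rows (a minimal sketch, same console session):
> ZippedMoabVersion.by_druid('kf921gd3855').where(version: 6)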