Home
Welcome to the Preservation Catalog wiki! Preservation Catalog, or "PresCat," is a Rails application that tracks, audits, and replicates archival artifacts associated with objects deposited into the Stanford Digital Repository.
See the sidebar to drill down and find the documentation you seek.
Note to wiki editors: if you add a page, please remember to add a link to it in the sidebar for easier discovery and navigation! Please also update the troubleshooting guide below if the added page is especially relevant to triaging production problems.
Note to readers: if you don't see what you're looking for in the sidebar's table of contents, browse or search GitHub's auto-generated ToC in the "Pages" section above the handmade ToC (it links to all pages in the wiki, and is collapsed by default).
Troubleshooting guide

- If your problem is listed below: look at the Troubleshooting section in the sidebar
- If not: sorry, you have to keep reading (or search the wiki pages)
A job throws an error while executing.

Symptoms:

- Honeybadger alerts about an uncaught error in code from `app/jobs/`
- entries in a Resque web console failure queue, e.g. `zipmaker_failed`, `*_delivery_failed` (e.g. `s3_us_west_2_delivery_failed`), `zip_endpoint_events_failed`, etc.
Relevant docs:

- Investigating failed Resque Jobs -- programmatically examining failures in general, and cleaning up the `zipmaker_failed` queue in particular
- Error reading file content or metadata from Ceph backed storage roots -- the cause of many a `zipmaker_failed` entry
- Cleaning up delivery failures
- Fixing an Incomplete Moab Upload -- for large (>10 GB) Moab versions split into multiple `ZipPart`s
- `zip_endpoint_events_failed` -- if you can't turn up any issues with replication using this debugging code, and the error was on a failed call to the event service (e.g. due to a blip in our network), it's safe to disregard the error and drop it from the failure queue (tolerable rough edge: the dor-services-app event log will be missing an entry for the successful replication).
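When a failure queue fills up, it helps to see which exception dominates before cleaning anything up. The sketch below is self-contained (the sample entries are invented, shaped like the hashes Resque stores for failed jobs); in a real Rails console you would pull entries with Resque's failure API instead:

```ruby
# Invented sample failure entries, shaped like Resque failure hashes.
failures = [
  { 'queue' => 'zipmaker', 'exception' => 'Errno::EIO',
    'payload' => { 'class' => 'ZipmakerJob', 'args' => ['bc123df4567', 1] } },
  { 'queue' => 'zipmaker', 'exception' => 'Errno::EIO',
    'payload' => { 'class' => 'ZipmakerJob', 'args' => ['bd456fg7890', 2] } },
  { 'queue' => 's3_us_west_2_delivery', 'exception' => 'Aws::S3::Errors::ServiceError',
    'payload' => { 'class' => 'S3WestDeliveryJob', 'args' => ['bc123df4567', 1] } }
]

# Tally failures by (queue, exception class) so the dominant cause stands out.
by_cause = failures.group_by { |f| [f['queue'], f['exception']] }
                   .transform_values(&:count)

by_cause.sort_by { |_cause, n| -n }.each do |(queue, exc), n|
  puts format('%4d  %-28s %s', n, queue, exc)
end
```

The same grouping works on whatever slice of entries you fetch, which is usually enough to decide whether a queue can be bulk-cleared or needs per-entry attention.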
An audit job (scheduled or manually triggered) examines an on-prem or cloud copy, and detects possibly missing or corrupt data.

Note: here we mean detection of a problem with the content itself, where the job doing the audit work completes execution successfully from an ActiveJob perspective.

Symptoms:

- Honeybadger sends an alert stating that an audit of a particular druid (e.g. checksum validation, part replication audit) determined that there is a problem with the Moab and/or its expected cloud copies.
- An audit is run manually, and when the job completes, the status of the relevant `CompleteMoab`s or `ZipPart`s is queried and found not to be `ok`.
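In the real app the manual check is an ActiveRecord query from a Rails console (along the lines of `CompleteMoab.where.not(status: 'ok')`; see "Useful ActiveRecord queries" in the sidebar). As a self-contained illustration of the same filtering, with invented druids and statuses:

```ruby
# Invented sample rows standing in for CompleteMoab/ZipPart records.
records = [
  { druid: 'bc123df4567', status: 'ok' },
  { druid: 'bd456fg7890', status: 'invalid_checksum' },
  { druid: 'bf789hj0123', status: 'unreplicated' }
]

# Equivalent in spirit to `where.not(status: 'ok')`.
not_ok = records.reject { |r| r[:status] == 'ok' }
not_ok.each { |r| puts "#{r[:druid]}: #{r[:status]}" }
```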
Relevant docs:

- Audits (high level)
- Audit Failures (Checksum Validation, etc.)
- Investigating a druid with replication errors -- e.g. an alert like `PartReplicationAuditJob(druid:bc123df4567, ibm_us_south) 1 on ibm_us_south: not all ZippedMoabVersion parts are replicated yet`
- Fixing an Incomplete Moab Upload -- for large (>10 GB) Moab versions split into multiple `ZipPart`s
- Cleaning up delivery failures
- Fixing a stuck Moab -- a bit old/specific, but a useful example
- A Moab Has Moved
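For incomplete uploads of large Moab versions, a quick sanity check is whether the number of `ZipPart`s on record matches what the zip's total size implies. A sketch, assuming a 10 GB segment size (verify against the app's actual zip-splitting configuration before relying on it):

```ruby
# Assumed split size for zipped Moab versions; confirm against app config.
SEGMENT_BYTES = 10 * 1024**3

# Number of ZipParts expected for a zip of the given total size in bytes.
def expected_part_count(zip_size_bytes)
  [1, (zip_size_bytes.to_f / SEGMENT_BYTES).ceil].max
end

puts expected_part_count(5 * 1024**3)  # small version: one part
puts expected_part_count(25 * 1024**3) # large version: several parts
```

If the count of `ZipPart` rows for a `ZippedMoabVersion` is lower than this, some parts likely never made it to the cloud endpoint.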
Something is wrong with the job system (unexpected worker count, resque-pool stability problems, etc.).

Symptoms:

- Nagios alert that the worker count is too high ("feature-worker-count: FAILED TOO MANY WORKERS")
- You notice that the worker count is wrong when manually visiting the Resque web console or the okcomputer status page (`/status/all`)
- resque-pool is not currently running on all of the worker boxes
- Only a fraction (e.g. 2/3) of the expected workers are up
- Worker counts fluctuate every few minutes or hours by the number of workers expected to be running on one VM
- resque-pool hotswap is invoked (either directly, or by way of a deployment) and the worker count is initially fine, but some minutes or hours later you notice that one or more worker VMs no longer have resque-pool running

Note: at times we've under-resourced QA and stage relative to the number of workers running, causing worker processes and/or resque-pool to crash intermittently (especially when restarting while workers have jobs in progress). Consider allocating fewer workers or more computing resources, depending on whether all workers are actually being utilized.
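When worker counts drop by a per-VM amount, the usual suspect is resque-pool having died on one box. Below is a self-contained sketch of the comparison you'd otherwise do by eye in the Resque web console; the hostnames and expected counts are invented, and in a Rails console the live hostname list would come from Resque's worker registry:

```ruby
# Invented expected worker count per VM, for illustration only.
EXPECTED = {
  'pres-worker-01' => 12,
  'pres-worker-02' => 12,
  'pres-worker-03' => 12
}.freeze

# Given the hostnames of currently registered workers, report VMs running
# fewer workers than expected. A VM at 0 usually means resque-pool died there.
def short_hosts(worker_hostnames, expected = EXPECTED)
  actual = worker_hostnames.tally
  expected.each_with_object({}) do |(host, want), short|
    got = actual.fetch(host, 0)
    short[host] = got if got < want
  end
end
```

Any host this reports at zero is the first place to check whether resque-pool is still running.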
Relevant docs:

- https://github.com/sul-dlss/DevOpsDocs/blob/master/projects/preservation/preservation_catalog/pres_catalog-ops-concerns.md#to-restart-the-workers-via-capistrano
- https://github.com/sul-dlss/DevOpsDocs/blob/master/projects/preservation/preservation_catalog/pres_catalog-ops-concerns.md#to-restart-the-worker-pool-on-prod-020304--stage-02--qa-02
Something is wrong with IO against the Ceph-backed preservation storage.

Symptoms:

- reads or writes against the preservation storage roots hang indefinitely, even for small Moabs
- many file read attempts against preservation content fail immediately, e.g. with unexpected permission or file metadata errors
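A quick way to tell "hanging indefinitely" apart from "failing fast" is to probe a small read with a hard timeout. A sketch (the path you pass is a placeholder; note also that a truly hung kernel-level mount can leave the reading thread in uninterruptible IO, so this catches slow-but-responsive mounts rather than every possible hang):

```ruby
require 'timeout'

# Attempt a small read from a file under a preservation storage root,
# failing fast instead of blocking forever.
def probe_read(path, timeout_secs: 10)
  Timeout.timeout(timeout_secs) { File.binread(path, 1024) }
  :ok
rescue Timeout::Error
  :timed_out
rescue SystemCallError => e
  e.class.name # e.g. "Errno::EACCES" for a permission error
end
```

For example, `probe_read('/path/under/a/storage_root')` (a placeholder path): `:ok` means the mount responded, `:timed_out` suggests hanging IO, and an `Errno::*` name points at the fail-fast permission/metadata case.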
Relevant docs:

- Replication errors
- Validate moab step fails during preservationIngestWF
- ZipmakerJob failures
- Moab Audit Failures
- Ceph Errors

Other wiki pages:
- Job queues
- Deposit bag was missing
- ActiveRecord and Replication intro
- 2018 Work Cycle Documentation
- Fixing a stuck Moab
- Adding a new cloud provider
- Audits (how to run as needed)
- Extracting segmented zipfiles
- AWS credentials, S3 configuration
- Zip Creation
- Storage Migration Additional Information
- Useful ActiveRecord queries
- IO against Ceph backed preservation storage is hanging indefinitely (steps to address IO problems, and follow on cleanup)