Home

Preservation Catalog

Welcome to the Preservation Catalog wiki! Preservation Catalog, or "PresCat," is a Rails application that tracks, audits, and replicates archival artifacts associated with objects deposited into the Stanford Digital Repository.

See the sidebar to drill down and find the documentation you seek.

Note to wiki editors: if you add a page, please remember to add a link into the sidebar for easier discovery and navigation! Please also update the troubleshooting guide below if an added page is especially relevant to triaging production problems.

Note to readers: if you don't see what you're looking for in the sidebar's Table of Contents, browse or search GitHub's auto-generated ToC in the "Pages" section above the handmade ToC (it links to all pages in the wiki).

Help! I don't know what I'm looking for, but I have to troubleshoot something.

Asynchronous Job Failures

How you might notice this problem

Honeybadger alerts on code in app/jobs/
entries in a Resque web console failure queue -- e.g. zipmaker_failed, *_delivery_failed (e.g. s3_us_west_2_delivery_failed), zip_endpoint_events_failed, etc.

Useful troubleshooting links

Investigating failed Resque Jobs -- programmatically examining failures in general, cleaning up zipmaker_failed queue in particular
Error reading file content or metadata from Ceph backed storage roots -- the cause of many a zipmaker_failed entry
Cleaning up delivery failures
Fixing an Incomplete Moab Upload -- for large (>10 GB) Moab versions split into multiple ZipParts
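
As a rough illustration of the kind of programmatic examination mentioned above, the sketch below tallies failure records by exception class so the most common cause stands out. It assumes each record is a Hash with the `'exception'`, `'error'`, and `'queue'` keys that Resque stores for failed jobs. The helper name `tally_failures_by_exception` and the sample records are hypothetical. In a Rails console the records would come from Resque itself (for example via `Resque::Failure.all(0, 100, 'zipmaker_failed')`, assuming per-queue failure queues as described above); the linked page remains the authoritative procedure.

```ruby
# Tally Resque failure records by exception class, most frequent first.
# Each record is assumed to be a Hash shaped like the entries Resque stores
# for failed jobs (with 'exception', 'error', and 'queue' keys).
def tally_failures_by_exception(failures)
  failures
    .group_by { |failure| failure['exception'] || 'unknown' }
    .transform_values(&:count)
    .sort_by { |_exception, count| -count }
    .to_h
end

# Hypothetical sample data, for illustration only.
sample_failures = [
  { 'queue' => 'zipmaker', 'exception' => 'Errno::EIO', 'error' => 'Input/output error' },
  { 'queue' => 'zipmaker', 'exception' => 'RuntimeError', 'error' => 'zip creation failed' },
  { 'queue' => 'zipmaker', 'exception' => 'Errno::EIO', 'error' => 'Input/output error' }
]

failure_summary = tally_failures_by_exception(sample_failures)
failure_summary.each { |exception, count| puts "#{exception}: #{count}" }
```

A summary like this only hints at where to look first; whether to retry or clear the failed jobs depends on the cause.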

Audit Failures

An audit job (scheduled or manually triggered) examines an on-prem or cloud copy and detects data that may be missing or corrupt.

How you might notice this problem

Honeybadger sends an alert stating that an audit of a particular druid (e.g. checksum validation, part replication audit) determined that there is a problem with the Moab and/or its expected cloud copies.
An audit is run manually, and when the job completes, the status of the relevant CompleteMoabs or ZipParts is queried and found not to be ok.

Useful troubleshooting links

Validations for Moabs
Audit Failures (Checksum Validation, etc)
Investigating a druid with replication errors
Fixing an Incomplete Moab Upload -- for large (>10 GB) Moab versions split into multiple ZipParts
Cleaning up delivery failures
Fixing a stuck Moab -- a bit old/specific, but a useful example
A Moab Has Moved

Something is wrong with the job system (unexpected worker count, resque-pool stability problems, etc)

Too many workers

How you might notice this problem

Nagios alert that worker count is too high ("feature-worker-count: FAILED TOO MANY WORKERS")
notice that worker count is too high when manually visiting the Resque web console or the okcomputer status page (/status/all)

Useful troubleshooting links

More than the expected number of Resque workers are running

Too few workers

is resque-pool currently running on all of the worker boxes?
at times we've under-resourced QA and stage relative to the number of workers running, causing worker processes and/or resque-pool to crash intermittently (especially when restarting while workers have jobs in progress). consider allocating fewer workers or more computing resources, depending on whether all workers are actually being utilized.

resque-pool seems to be crashing periodically, not restarting correctly, etc

How you might notice this problem

only a fraction (e.g. 2/3) of the expected workers are up
worker counts fluctuate every few minutes or hours by the number of workers expected to be running on one VM
resque-pool hotswap is invoked (either directly or by way of deployment) and the worker count looks fine, but some minutes or hours later, one or more worker VMs no longer have resque-pool running.
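
To narrow down which VM has lost its workers, it can help to group the running workers by host. The sketch below assumes worker IDs in Resque's `hostname:pid:queues` form (what `Resque.workers.map(&:to_s)` returns in a Rails console). The hostnames, expected counts, and helper names are all hypothetical.

```ruby
# Count running workers per host, given Resque worker IDs of the form
# "hostname:pid:queue1,queue2".
def worker_counts_by_host(worker_ids)
  worker_ids.map { |id| id.split(':', 2).first }.tally
end

# Return the hosts running fewer workers than expected, with both counts.
def hosts_missing_workers(worker_ids, expected_per_host)
  counts = worker_counts_by_host(worker_ids)
  expected_per_host.each_with_object({}) do |(host, expected), short|
    actual = counts.fetch(host, 0)
    short[host] = { expected: expected, actual: actual } if actual < expected
  end
end

# Hypothetical sample data, for illustration only.
sample_worker_ids = [
  'worker-a.example.org:1201:zipmaker',
  'worker-a.example.org:1202:zip_endpoint_events',
  'worker-b.example.org:3301:zipmaker',
  'worker-b.example.org:3302:zip_endpoint_events'
]
expected_workers = {
  'worker-a.example.org' => 2,
  'worker-b.example.org' => 2,
  'worker-c.example.org' => 2
}

missing = hosts_missing_workers(sample_worker_ids, expected_workers)
missing.each do |host, info|
  puts "#{host}: #{info[:actual]} of #{info[:expected]} workers running"
end
```

A host that reports zero workers is a likely candidate for a resque-pool process that has exited, which matches the "2/3 of expected workers" symptom above.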
