Ceph Errors
Ceph problems can manifest in several ways:

- A HoneyBadger alert like "Errno::EACCES: Permission denied @ rb_sysopen - /pres-0n/sdr2objects/ ..." (which could happen in DorServicesApp as part of the "ShelveJob" step for accessioning, or in PreservationCatalog as part of the `update-moab` step).
- "Preservation::Client.http_response got 500" errors in other parts of SDR.
- Someone who does accessioning complains that it's taking an inordinately long time for a newly accessioned version to finish going through the pipeline. Most objects of 100 MB or less should take no more than a minute or two to be processed by any given `accessionWF` or `preservationIngestWF` step. Similarly, an object of 10+ GB may take tens of minutes at any given step involving preservation storage IO, and it would be typical for a very large object (e.g. a 1 TB media object) to take many hours to a day or more.
- Preservation storage related jobs (e.g. `UpdateMoab` on preservation robots, or `validate_moab` on preservation catalog) are taking an inordinately long time (e.g. an `UpdateMoab` or `validate_moab` job for a 100 MB Moab has been running for 2 hours).
- Ops notices Ceph slow metadata service (MDS) request alerts, urging us to look for one of the above manifestations of the issue.
When Ceph is misbehaving, listing files under an affected Moab may show permission errors and unreadable entries, e.g.:

```
./bq499mh5981/v0010/data/metadata:
ls: cannot access ./bq499mh5981/v0010/data/metadata/workflows.xml: Permission denied
ls: cannot access ./bq499mh5981/v0010/data/metadata/events.xml: Permission denied
ls: cannot access ./bq499mh5981/v0010/data/metadata/versionMetadata.xml: Permission denied
total 3
-rw-r--r-- 1 pres pres 2138 May 28 09:37 descMetadata.xml
-????????? ? ?    ?       ?            ? events.xml
-rw-r--r-- 1 pres pres  263 May 28 09:37 provenanceMetadata.xml
-????????? ? ?    ?       ?            ? versionMetadata.xml
-????????? ? ?    ?       ?            ? workflows.xml
```
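A minimal sketch of how a spot check like the one above might be run, assuming a hypothetical storage root and druid-tree path (adjust to the actual preservation storage mount and the druid in question):

```sh
# Hypothetical storage root and druid-tree directory for druid bq499mh5981.
cd /pres-01/sdr2objects/bq/499/mh/5981
# Recursively list the Moab version's metadata directory; a hung or degraded
# Ceph MDS often shows up as "Permission denied" errors and "?" entries.
ls -lR ./bq499mh5981/v0010/data/metadata
```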
In addition, you can work with ops or Andrew to determine whether any file IO is actually happening, by using Linux CLI tools to examine disk and CPU activity of the relevant worker processes; in the case of operations like `TransferObject` and `UpdateMoab`, you can also use `du` to see whether bytes are still being written to the target storage (see the sketch below).
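A rough sketch of that kind of check, assuming the process names and Moab path shown here (both are illustrative, not prescribed):

```sh
# Are the relevant worker processes actually using CPU?
ps -eo pid,etime,%cpu,args | grep -E 'resque|sidekiq' | grep -v grep

# Is the in-progress Moab version directory still growing? (Path is a
# hypothetical example.) If the reported size stops changing for several
# minutes, the write is probably hung.
watch -n 30 du -sh /pres-01/sdr2objects/bq/499/mh/5981/bq499mh5981/v0011
```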
If IO does appear to be hung, some measures to try:

- Wait for Ceph to recover and/or the jam to resolve on its own (at least 15 minutes).
- Ask Ops to check whether the cluster looks healthy overall and/or to restart the Ceph metadata service. After the restart, watch the queues and determine, per the above guidance, whether the blockage has been resolved.
- Kill the worker processes for stuck jobs. (For the robots, these will be Resque workers; for PresCat, these will be Sidekiq workers.) See the sketch after this list.
- After killing workers, check for corruption if the process killed was a pres robots worker performing `UpdateMoab` or `TransferObject`. The corrupted Moabs will fail checksum validation in pres cat, and will likely also have an error at one of the `preservationIngestWF` steps. The workflow grid (especially failures at `transfer-object`, `update-moab`, or `validate-moab` in `preservationIngestWF`, but also errors in `preservationAuditWF`) is a good indicator of which objects need attention and possibly manual remediation. There may also be complaints about a mismatch between expected and actual versions.
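A minimal sketch of identifying and terminating a stuck worker (the grep patterns are assumptions; confirm the PID really belongs to the hung job before killing it):

```sh
# On the affected VM, list worker processes and how long they've been running.
ps -eo pid,etime,args | grep -E 'resque|sidekiq' | grep -v grep

# Terminate the stuck worker; escalate to SIGKILL only if it ignores SIGTERM.
kill <pid>
kill -9 <pid>   # last resort
```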
Sometimes an MDS failover or the targeted termination of a worker process will clear up the jam entirely. Sometimes it will clear things for a few minutes, only for things to get stuck again. You'll have to watch the Resque consoles for a few minutes after either of the above measures and determine which is the case.
If the jam does not clear and VM reboots are needed, the procedure is roughly:

- Warn `#dlss-aaas` that infra and ops are aware that things are stuck and are working to resolve the issue. Ask that people please refrain from further accessioning until the all-clear is given.
- Turn off Google Books retrieval.
- Do a graceful shutdown of the resque-pool master (likely pres robots, maybe also pres cat). You can do this using `bundle exec cap resque:pool:stop` from the directory of the applicable project on your laptop.
- Terminate any stuck worker processes that remain after stopping resque-pool (maybe pres robots, maybe pres cat; either way, note the stuck druids and the age of the hung jobs, even if only in Slack discussion about the issue, as this may be useful for later follow-up, both when auditing for data corruption and when looking at logs to try to home in on the underlying Ceph issue that we've not yet figured out).
- Give the all clear to ops to reboot (likely just pres robots VM, maybe also pres cat VMs, but not the pres cat redis VM).
- Wait for ops to indicate completion of VM reboot(s).
- Re-enable the pres cat workers and let them work off any backlog (`bundle exec cap resque:pool:hot_swap` from the pres cat project directory on your laptop).
- Re-enable the pres robots workers (same `hot_swap` command, but from the pres robots directory) and let them work off any backlog (or at least a significant portion of it, if the backlog is very large, e.g. if `preservationIngestWF` was stuck for a whole day).
- Keep an eye on the Argo workflow grid for errors.
- If it appears that things are flowing normally again for the moment (e.g. if they've run without sticking for 30 minutes or so), give the all clear (for now) to `#dlss-aaas` to start accessioning again.
- Run checksum validation audits and replication audits for objects accessioned from a day before the first reboot through a day after the last reboot (if there were multiple reboots within a few days before auditing was done). You can see an example of how to do this en masse from a text file generated from an appropriate Argo facet by looking at this comment and this comment. Note that you may have to lightly hand-edit the Argo facet URL to get exactly the date range you want for your results (note also that Argo searches in UTC, but you're probably thinking about this in terms of Pacific time, which is probably also what many other systems use for their logs). A rough sketch of the en masse approach follows this list.
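The linked comments show the actual approach; what follows is only a hypothetical sketch of its shape, assuming a `druids.txt` export from the Argo facet and placeholder rake task names (not the real preservation_catalog entry points):

```sh
# druids.txt: one druid per line, exported from an Argo facet covering the
# affected date range (remember that Argo searches in UTC).
#
# The task names below are placeholders -- substitute whatever audit entry
# points the linked comments describe.
while read -r druid; do
  bundle exec rake "prescat:audit:checksum_validation[$druid]"   # hypothetical
  bundle exec rake "prescat:audit:replication[$druid]"           # hypothetical
done < druids.txt
```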