If the engine being used is miniwdl, then it doesn't use EBS auto-expansion. That feature is used mainly by the engines that don't use EFS and instead read data directly into the container (Cromwell and Nextflow).

The location of the Batch worker logs is defined by the launch template, in the CloudWatch agent config section: https://github.com/aws/amazon-genomics-cli/blob/main/packages/cdk/lib/constructs/launch-template-data.ts#L28-L72

The provisioning (or not) of EBS auto-expansion is controlled in the provision script: https://github.com/aws/amazon-genomics-cli/blob/main/packages/cdk/lib/artifacts/batch-artifacts/ecs-additions/provision.sh#L100-L110

In short, we don't provision auto-expansion for engines that use EFS, because it attaches an additional EBS volume per worker EC2 instance that typically goes unused.
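For illustration, the gating amounts to something like the following. This is a simplified sketch of the behavior described above, not the actual provision.sh contents; the variable names and install path are hypothetical:

```bash
#!/bin/bash
# Sketch of per-engine gating of EBS auto-expansion.
# Engine split (local-disk vs. EFS) follows the explanation above;
# variable names and the install path are hypothetical.

ENGINE="$1"   # e.g. "miniwdl", "cromwell", "nextflow"

case "$ENGINE" in
  cromwell|nextflow)
    # These engines localize inputs onto the worker's local filesystem,
    # so amazon-ebs-autoscale is installed to grow an attached EBS
    # volume as utilization rises.
    sh /opt/ecs-additions/ebs-autoscale-install.sh
    ;;
  *)
    # EFS-backed engines (e.g. miniwdl) share a network filesystem
    # across workers, so per-instance EBS auto-expansion is skipped.
    echo "skipping ebs-autoscale provisioning for engine: $ENGINE"
    ;;
esac
```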
We have some WDL tasks that rapidly extract tar archives to local storage, and we're seeing them sporadically fail with out-of-disk-space errors. We saw the blurb in the docs alluding to the use of amazon-ebs-autoscale on the one hand, but also to how tasks might be able to "outrun" it on the other; it doesn't go into solutions or workarounds, though.

Where should we start looking (in logs, etc.) to troubleshoot this? For example, where might we find evidence to distinguish whether the EBS autoscaling logic wasn't working at all versus whether it merely didn't act quickly enough? If it's the latter, are there tuning options (as an AGC user) to make it expand earlier, or at least to increase the initial volume size?
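For concreteness, this is the kind of first-pass triage we have in mind on a worker instance, assuming the defaults of awslabs/amazon-ebs-autoscale; the paths and service name below are our assumptions, not verified against AGC's AMIs:

```bash
# Triage sketch for a single Batch worker (e.g. over an SSM session).
# Paths and service name are the awslabs/amazon-ebs-autoscale defaults
# and may differ in an AGC deployment.

# 1. Was autoscaling provisioned at all? The daemon logs here by default.
sudo tail -n 100 /var/log/ebs-autoscale.log

# 2. Is the monitoring service actually running?
sudo systemctl status ebs-autoscale

# 3. What thresholds and limits is it configured with?
sudo cat /etc/ebs-autoscale.json

# 4. Snapshot current utilization; comparing timestamps of task
#    disk-full errors against volume-extend entries in the log above
#    should tell "never ran" apart from "ran too slowly".
df -h
```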