This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve error, if ECS task can not be submitted #282

Closed
1 task done
brodul opened this issue Jun 16, 2023 · 5 comments · Fixed by #406

@brodul

brodul commented Jun 16, 2023

Improve the error trace when the ECS worker cannot create ECS tasks. One example is when the Fargate vCPU quota is reached.

Traceback / Example

[three screenshots, not preserved in this copy]

Currently an IndexError is raised in the stack, and in the Cloud UI the state message is: Submission failed. IndexError: list index out of range

Expectation / Proposal

Pass more details to the error. Maybe wrap the error in another error like SubmissionError.
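A minimal sketch of the wrapping idea, not the worker's actual code: `SubmissionError` is only the name suggested in this issue, and `first_task` is a hypothetical helper. It chains the original exception with `raise ... from` so the root cause stays visible in the traceback.

```python
class SubmissionError(RuntimeError):
    """Hypothetical error type, named after the proposal in this issue."""


def first_task(response: dict) -> dict:
    """Return the first task from a run_task-style response dict.

    `first_task` is an illustrative helper, not part of prefect_aws.
    """
    try:
        return response["tasks"][0]
    except (KeyError, IndexError) as exc:
        # Keep the original exception as __cause__ and show the response,
        # so the message says more than "list index out of range".
        raise SubmissionError(
            f"ECS returned no tasks; full response: {response!r}"
        ) from exc
```

With an empty `tasks` list, the caller would see a `SubmissionError` whose message includes the whole response instead of a bare IndexError.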

@coffeeandcloud

Hi @brodul, I'm experiencing the same issue. We are using an ECS cluster with an EC2 capacity provider, but Prefect tries to add an endless number of flow runs (and therefore ECS tasks) without checking the available capacity first. The team stated this should have been solved in this issue, but we are still experiencing the problem.

Looking at the code, the issue comes from just firing boto's run_task method without checking the result. From my understanding, Prefect should be able to use the queues to handle the maximum number of tasks the cluster is able to run.

@brodul

brodul commented Oct 24, 2023

@coffeeandcloud Hey, I have figured out that it's not just hitting the quota that fails with this error. We have increased the quota and cannot hit it currently.

Thanks for the retry fix.
I think the correct approach would be to deserialize the response more defensively and raise the exception with a more meaningful message.
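One way the defensive deserialization could look, as a sketch rather than the fix that was merged: it assumes the response follows the shape documented for boto3's `run_task`, with a `tasks` list and a `failures` list whose entries carry `reason` and `detail`. The function name `parse_run_task_response` is hypothetical.

```python
def parse_run_task_response(response: dict) -> dict:
    """Return the created task, or raise with the reasons ECS reported.

    Hypothetical helper; assumes a run_task-style response dict.
    """
    tasks = response.get("tasks") or []
    if tasks:
        return tasks[0]

    # No task was created: collect whatever ECS said about why.
    failures = response.get("failures") or []
    reasons = "; ".join(
        f"{f.get('reason', 'unknown reason')} ({f.get('detail') or 'no detail'})"
        for f in failures
    ) or "no failure details were returned"
    raise RuntimeError(f"ECS run_task created no task: {reasons}")
```

Using `.get` with fallbacks means a missing or empty key produces a readable message rather than a second, unrelated exception.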

@rmnvncnt

rmnvncnt commented Feb 16, 2024

In my case, the retry just adds a layer of traceback but does not help pinpoint the cause of the issue:

Failed to submit flow run '8d726272-1482-4a6e-94fb-7d160779d2de' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 1557, in _create_task_run
    return ecs_client.run_task(**task_run_request)["tasks"][0]
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/workers/base.py", line 896, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 598, in run
    ) = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 95, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 761, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.10/site-packages/prefect_aws/workers/ecs_worker.py", line 757, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0xffff4eb82440 state=finished raised IndexError>]

I get this error semi-randomly using Fargate, so it might not be a quota issue.

EDIT: it was a quota issue. I was asking for more vCPUs than were available. It would be nice to get a more explicit error. I might have a look at that in the future.

jeanluciano self-assigned this Mar 14, 2024
@rmnvncnt

FYI: this error also happens if you set EC2 as the launch type but the EC2 capacity provider does not allow enough resources for the task to be started. For instance, you need 2 vCPUs but your capacity provider can only launch 1-vCPU instances. This problem is different from a quota issue with Fargate, yet Prefect returns the same traceback.

@MohammedSiddiqui

Facing a similar issue where my agent logs the following message:

IndexError: list index out of range

I can see that boto's ECS client returns a failures key in the response instead of throwing an exception. If we simply log that, it should help solve most of our problems. Right now I'm clueless as to why my tasks randomly fail to submit to the ECS cluster.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs/client/run_task.html
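A rough sketch of the logging suggestion, assuming the `failures` entries carry the `arn`, `reason`, and `detail` fields described in the linked boto3 documentation. The logger name and `log_run_task_failures` are made up for illustration.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("ecs_submission_sketch")


def log_run_task_failures(response: dict) -> int:
    """Log each entry under `failures` and return how many were logged."""
    failures = response.get("failures") or []
    for failure in failures:
        logger.error(
            "ECS run_task failure: arn=%s reason=%s detail=%s",
            failure.get("arn"),
            failure.get("reason"),
            failure.get("detail"),
        )
    return len(failures)
```

Calling this on the raw response before indexing into `tasks` would put the reason reported by ECS in the logs even if the IndexError still follows.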
