This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

ECSWorker fails to submit tasks to cluster #301

Closed
m-steinhauer opened this issue Aug 1, 2023 · 0 comments · Fixed by #303


m-steinhauer commented Aug 1, 2023

I'm facing the following issue: we are using an EC2-backed ECS cluster on AWS and a Prefect flow that triggers a number of subflows using the run_deployment method. Let's say we have capacity for 20 tasks in our cluster and trigger 200; most of the time, the first 20 submitted flow runs fail immediately (and are not retried). The other flow runs are queued and processed correctly. Sometimes submission also fails at random points while the flows are running. We have also limited the concurrency on the queue and the Prefect worker according to our available capacity.

Expectation

Flows are submitted successfully or queued if no capacity in the cluster is available.

It looks like the ECS client is sometimes unable to place the task on the cluster, as it fails with the IndexError ("list index out of range") shown below. Unfortunately, I cannot see any further details from the AWS response, so it is hard to analyze why the task cannot be placed.
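For context, the ECS RunTask API can report placement problems in a `failures` list in the response instead of raising an exception, which leaves `tasks` empty and makes `["tasks"][0]` fail. Below is a minimal sketch of surfacing those reasons, assuming a boto3-style response dict. The helper name `first_task_or_raise` is hypothetical and not part of prefect-aws.

```python
from typing import Any, Dict


def first_task_or_raise(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first task of a run_task response, or raise with the failure reasons.

    Hypothetical helper for illustration; it only assumes the response
    carries the "tasks" and "failures" keys of the RunTask API.
    """
    tasks = response.get("tasks") or []
    if tasks:
        return tasks[0]

    # No task was started: collect whatever reasons ECS reported.
    failures = response.get("failures") or []
    reasons = ", ".join(
        f"{f.get('arn', '<no arn>')}: {f.get('reason', '<no reason>')}"
        for f in failures
    ) or "no failure details returned"
    raise RuntimeError(f"ECS did not start a task ({reasons})")
```

With this in place, a capacity problem would show up as an error message containing the reported reason (for example a resource-related reason string) rather than a bare IndexError.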

Environment

prefect 2.11.1
prefect-aws 0.3.6
python 3.10
ECS cluster with EC2 instances

Traceback

Failed to submit flow run '1ab332c5-d7e5-43bb-a9de-089eb115b0ec' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 538, in run
    ) = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 695, in _create_task_and_wait_for_start
    self._report_task_run_creation_failure(configuration, task_run_request, exc)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 691, in _create_task_and_wait_for_start
    task = self._create_task_run(ecs_client, task_run_request)
  File "/usr/local/lib/python3.10/dist-packages/prefect_aws/workers/ecs_worker.py", line 1424, in _create_task_run
    return ecs_client.run_task(**task_run_request)["tasks"][0]
IndexError: list index out of range

Improvement

Make the _create_task_run method more robust for cases where the ECS client cannot submit the task to the underlying cluster on the first try.
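One possible shape for such an improvement is sketched below: retry the `run_task` call with exponential backoff while the response contains no tasks, and raise a descriptive error once the attempts are exhausted. This is only an illustration under stated assumptions, not necessarily how the linked fix works. The function name `run_task_with_retry` and the parameters `max_attempts`, `base_delay`, and `sleep` are hypothetical.

```python
import time
from typing import Any, Callable, Dict, List


def run_task_with_retry(
    run_task: Callable[..., Dict[str, Any]],
    task_run_request: Dict[str, Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call run_task until it returns a task or the attempts are exhausted.

    Hypothetical sketch: `run_task` is expected to behave like the
    boto3 ECS client's run_task method.
    """
    last_failures: List[Dict[str, Any]] = []
    for attempt in range(1, max_attempts + 1):
        response = run_task(**task_run_request)
        tasks = response.get("tasks") or []
        if tasks:
            return tasks[0]

        # Remember why ECS could not place the task on this attempt.
        last_failures = response.get("failures") or []
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

    raise RuntimeError(
        f"run_task returned no tasks after {max_attempts} attempts; "
        f"failures: {last_failures}"
    )
```

It could be called as `run_task_with_retry(ecs_client.run_task, task_run_request)` in place of indexing the response directly. Whether retrying is appropriate depends on the failure reason; a permanent misconfiguration would still fail after the last attempt, but with the reported reasons attached.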

@m-steinhauer m-steinhauer changed the title ECSWorker fails to submit tasks cluster ECSWorker fails to submit tasks to cluster Aug 1, 2023