Replies: 2 comments
-
Questions arising from team discussion around this feature that we need to resolve. Initial assumption: autoscaled jobs are not preempt-able by definition, yet...
- Once autoscaled jobs have reached the max configured limit, what happens? Do they then become preempt-able again?
- What else happens when autoscale limits have been reached?
- How do we decide what workload is autoscalable? Via metadata? Algorithmically by priority?
- We need to configure limits: by multiplier with a max?
- Priority: how do we infer it? From the queue? From the podspec?
- How do we downscale when autoscaled jobs have completed?
-
I think this goes well with our idea of a toggleable scheduler algorithm. In general I think this could be useful for autoscaling and for scheduling to clusters that have a queueing solution enabled (YuniKorn/Kueue).
-
Problem Statement
Currently, the Armada Kubernetes Batch Scheduler assumes a fixed pool of resources (nodes, CPU, memory, and GPU) for its scheduling algorithms. This approach poses challenges when working with Kubernetes autoscalers such as the Cluster Autoscaler or cloud vendor-specific autoscalers: because the Armada Server denies submission requests for workloads that cannot be scheduled with the currently available resources, pending pods never appear on the cluster and the autoscaler is never triggered to scale up.
Proposed Solution: Implement Overscheduling
To address this issue, we propose implementing the concept of overscheduling in Armada. Overscheduling means submitting workloads to Executor clusters even when there are not enough resources initially available. The resulting pending pods trigger an autoscaler to add more nodes to the Executor cluster, enabling the pods to be scheduled successfully.
Implementation Details
The implementation of overscheduling in Armada Kubernetes Batch Scheduler would involve the following steps:
1. Configuration: Introduce a new configuration parameter in Armada Server to define the overscheduling percentage. This parameter determines how much additional workload may be submitted even when resources are temporarily insufficient.
2. Submission to Executor Cluster: When Armada Server receives a workload submission request, it checks the available resources in the Executor cluster. If the workload fits within the available resources plus the configured overscheduling allowance, Armada Server submits it to the Executor Kubernetes cluster (a minimal sketch of this check follows the list).
3. Pending Pods: Upon submission, the pods enter the pending state in the Executor Kubernetes cluster due to resource scarcity. This is expected behavior under overscheduling.
4. Autoscaling Trigger: The presence of pending pods triggers the autoscaler (e.g., Cluster Autoscaler) associated with the Executor Kubernetes cluster. The autoscaler evaluates the pending workload and initiates scaling by adding more nodes to the Executor cluster.
5. Resource Availability and Scheduling: As the autoscaler provisions additional nodes, resources become available in the Executor cluster. The Kubernetes scheduler then schedules the pending pods onto the newly added nodes.
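To make steps 1-2 concrete, here is a minimal Go sketch of the server-side check, assuming a percentage-based configuration. All names here (`SchedulingConfig`, `OverschedulingPercent`, `canSubmit`, `ResourceList`) are hypothetical illustrations, not existing Armada APIs:

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for cluster/job resources.
type ResourceList struct {
	CPUMillis int64
	MemoryMiB int64
}

// SchedulingConfig carries the proposed overscheduling parameter.
type SchedulingConfig struct {
	// OverschedulingPercent is how much extra workload, as a percentage of
	// currently available capacity, may be submitted while resources are
	// temporarily insufficient. 0 disables overscheduling.
	OverschedulingPercent int64
}

// canSubmit allows submission while the total requested work (including
// already overscheduled, still-pending work) stays within available
// capacity plus the configured overscheduling allowance.
func canSubmit(cfg SchedulingConfig, available, requested, pending ResourceList) bool {
	limitCPU := available.CPUMillis * (100 + cfg.OverschedulingPercent) / 100
	limitMem := available.MemoryMiB * (100 + cfg.OverschedulingPercent) / 100
	return pending.CPUMillis+requested.CPUMillis <= limitCPU &&
		pending.MemoryMiB+requested.MemoryMiB <= limitMem
}

func main() {
	cfg := SchedulingConfig{OverschedulingPercent: 50}
	available := ResourceList{CPUMillis: 4000, MemoryMiB: 4096}
	job := ResourceList{CPUMillis: 5000, MemoryMiB: 4096}       // does not fit today
	fmt.Println(canSubmit(cfg, available, job, ResourceList{})) // true: within the 50% allowance
}
```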
Benefits
By implementing overscheduling in Armada Kubernetes Batch Scheduler, the following benefits can be realized:
- Improved Resource Utilization: Overscheduling lets Armada drive resources to their maximum capacity when demand is high, while avoiding paying for idle capacity during periods of lower workload.
- Seamless Integration with Autoscalers: Armada can work seamlessly with the Kubernetes Cluster Autoscaler and cloud vendor-specific autoscalers, which detect pending pods and automatically scale the Executor cluster to meet the additional resource requirements.
- Enhanced Scalability: Overscheduling ensures that Armada can handle spikes in workload demand by dynamically scaling the Executor cluster, allowing better responsiveness to workload fluctuations.
- Flexibility and Customization: The overscheduling percentage can be configured based on specific requirements and workload characteristics; users can fine-tune it to balance overscheduling against resource availability.
Technical Details
- Introduce a new parameter in the Executor called `overscheduleMultiplier`, which allows overscheduling up to the largest node's resources multiplied by the overscheduling multiplier.
- Edit the Executor to report back whether it supports overscheduling and by how much.
- Edit the scheduler logic to allow overscheduling if an Executor cluster supports it.
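As a rough sketch (not existing Armada code), the proposed condition could look like this in Go, with `ResourceList`, `ExecutorReport`, and `canOverschedule` as illustrative names and the Executor reporting its largest node alongside the multiplier:

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for node/job resources.
type ResourceList struct {
	CPUMillis int64
	MemoryMiB int64
}

// ExecutorReport is what an Executor could report back to the scheduler:
// whether it supports overscheduling and by how much.
type ExecutorReport struct {
	OverschedulingEnabled  bool
	OverscheduleMultiplier int64
	LargestNode            ResourceList
}

// canOverschedule implements the proposed condition
// jobResources <= largest(nodeResources) * overscheduleMultiplier.
func canOverschedule(job ResourceList, rep ExecutorReport) bool {
	if !rep.OverschedulingEnabled {
		return false
	}
	return job.CPUMillis <= rep.LargestNode.CPUMillis*rep.OverscheduleMultiplier &&
		job.MemoryMiB <= rep.LargestNode.MemoryMiB*rep.OverscheduleMultiplier
}

func main() {
	rep := ExecutorReport{
		OverschedulingEnabled:  true,
		OverscheduleMultiplier: 4,
		LargestNode:            ResourceList{CPUMillis: 2000, MemoryMiB: 2048},
	}
	job := ResourceList{CPUMillis: 2000, MemoryMiB: 6144} // bigger than any current node
	fmt.Println(canOverschedule(job, rep))                // true: 6144 <= 2048*4
}
```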
Example
Let's assume the following 30-day analysis of workload distribution.
On average, 350 workloads per day were scheduled using a cluster of 5 nodes, which roughly translates to one node servicing 70 workloads per day. Let's further assume that for 20 of those days we had 20 workloads/day, and for the remaining 10 days we had 1000 workloads/day.
Without autoscaling, for 20 days the cluster remained underutilized, with only 20 workloads per day and 4 of the 5 nodes effectively idle, while for the remaining 10 days the workload surged to 1000 workloads per day, exceeding the cluster's capacity by roughly 10 additional nodes' worth of demand.
For 20 days we paid for more resources than needed, and for 10 days our cluster could not handle the workload volume.
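A quick back-of-the-envelope check of those numbers (in Go, purely illustrative):

```go
package main

import "fmt"

func main() {
	const perNode = 70 // workloads one node can service per day

	quiet, busy := 20, 1000 // workloads/day on quiet vs. busy days
	fmt.Println((20*quiet + 10*busy) / 30)       // 346: the ~350/day average
	fmt.Println((quiet + perNode - 1) / perNode) // 1: nodes needed on quiet days (4 of 5 idle)
	fmt.Println((busy+perNode-1)/perNode - 5)    // 10: extra nodes needed on busy days
}
```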
Scenarios
Let's assume our Kubernetes cluster consists of a couple of `t3.small` (2 vCPU, 2 GB memory) instances, and that the autoscaler is configured to provision the instance types `t3.small`, `t3.medium`, and `t3.large`.
Job is larger than current Node instance types
The cluster is limited in terms of resources, and an Armada Job requires more resources than a single `t3.small` node can provide. To determine whether the job can be overscheduled, the condition `jobResources <= largest(nodeResources) * overscheduleMultiplier` must be satisfied. If the condition holds, Armada will trigger overscheduling by submitting the job to the Executor cluster, even though there are not enough resources initially available. This prompts the autoscaler to provision additional (and, if needed, larger) nodes for the Executor cluster, allowing the job to be scheduled and executed successfully.
Job is smaller than current Node instance types
In this scenario, the Armada Job's resource requirements are less than or equal to the resources of the largest node instance type, but the cluster lacks enough nodes to handle additional workloads. In that case, Armada should allow the submission of the job even though current node capacity is insufficient. This triggers the autoscaler to provision additional nodes for the Executor cluster. As a result, the required resources become available, and the job can be scheduled and executed successfully.
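A hypothetical walk-through of both scenarios, reusing the `canOverschedule` sketch from Technical Details and treating `largest(nodeResources)` as the biggest node currently in the cluster (a `t3.small`), so the multiplier covers the larger instance types the autoscaler may add:

```go
package main

import "fmt"

type ResourceList struct {
	CPUMillis int64
	MemoryMiB int64
}

// canOverschedule: jobResources <= largest(nodeResources) * multiplier.
func canOverschedule(job, largestNode ResourceList, multiplier int64) bool {
	return job.CPUMillis <= largestNode.CPUMillis*multiplier &&
		job.MemoryMiB <= largestNode.MemoryMiB*multiplier
}

func main() {
	t3small := ResourceList{CPUMillis: 2000, MemoryMiB: 2048} // 2 vCPU, 2 GB

	// Scenario 1: the job needs 6 GB, more than any current t3.small node.
	// A multiplier of 4 (2 GB x 4 = 8 GB, roughly a t3.large) still admits it.
	big := ResourceList{CPUMillis: 2000, MemoryMiB: 6144}
	fmt.Println(canOverschedule(big, t3small, 4)) // true

	// Scenario 2: the job fits a t3.small; the cluster just needs more nodes.
	// The condition holds trivially, so the job is submitted and the pending
	// pod prompts the autoscaler to add nodes.
	small := ResourceList{CPUMillis: 500, MemoryMiB: 512}
	fmt.Println(canOverschedule(small, t3small, 4)) // true
}
```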
Open Questions
Conclusion
Implementing overscheduling in Armada will enable seamless integration with Autoscalers and improve resource utilization. By introducing the concept of overscheduling, Armada can dynamically adapt to workload demands and effectively scale its Executor clusters. This enhancement will provide users with more flexibility and customization options, ensuring optimal performance and resource allocation in Kubernetes batch scheduling scenarios.