Integration with Kueue #68

kannon92 opened this issue Apr 13, 2025 · 1 comment
Open


@kannon92

We have discussed some comparisons of other schedulers (#29).

I think it would be worth describing how a Kueue integration would work.

With Kueue, KAI could support Jobs, JobSets, and PyTorch jobs without much effort.

For KAI support of services I think #63 is needed.

To expand on batch jobs, I think one needs to investigate whether it is possible to use Kueue's ClusterQueues/LocalQueues in place of KAI Queues. Put simply, a Kueue integration (sans Topology Aware Scheduling) could mean that Kueue handles queueing, holding workloads and resuming them once there is capacity in the cluster, while KAI handles scheduling.
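As a rough sketch of that split for a single batch Job, the Python helper below builds a manifest that Kueue could gate and KAI could place. The `kueue.x-k8s.io/queue-name` label and the `spec.suspend` field are the usual way Kueue admits Jobs. The scheduler name `kai-scheduler`, the `kai.scheduler/queue` pod label, and the helper, queue, and image names are assumptions for illustration and are not confirmed in this thread.

```python
import json

KUEUE_QUEUE_LABEL = "kueue.x-k8s.io/queue-name"  # Kueue's queue label on the Job
KAI_SCHEDULER_NAME = "kai-scheduler"             # assumption: KAI's scheduler name
KAI_QUEUE_LABEL = "kai.scheduler/queue"          # assumption: KAI's pod queue label


def build_job(name, local_queue, kai_queue, image, gpus=1):
    """Return a batch/v1 Job dict: queued by Kueue, scheduled by KAI (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Kueue looks at this label to pick the LocalQueue.
            "labels": {KUEUE_QUEUE_LABEL: local_queue},
        },
        "spec": {
            # Created suspended; Kueue unsuspends the Job once it is admitted.
            "suspend": True,
            "template": {
                "metadata": {"labels": {KAI_QUEUE_LABEL: kai_queue}},
                "spec": {
                    # Pods are handed to KAI instead of the default scheduler.
                    "schedulerName": KAI_SCHEDULER_NAME,
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Queue names and image are placeholders.
    job = build_job("train", "team-a-local", "team-a", "example.com/train:latest")
    print(json.dumps(job, indent=2))
```

The printed JSON can be applied with `kubectl apply -f -`. Whether KAI would still need its own queue label in an integrated setup is exactly the open question raised here.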

For KAI maintainers, the main request would be to figure out what would be lost if KAI's queueing logic were folded into Kueue. Is there anything missing in Kueue that would prevent KAI from using Kueue for queueing while leaving scheduling to KAI?

@enoodle
Collaborator

enoodle commented Apr 14, 2025

Hi @kannon92

On the feature side, I think Kueue is missing the concept of over-quota weight and the distinction between deserved quota and quota limits. Instead, it has other ways to describe quota:
https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/#flavors-and-resources
https://github.com/NVIDIA/KAI-Scheduler/blob/main/docs/queues/README.md#queue-resources
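To make the three terms concrete, here is a simplified illustration of how deserved quota, a limit, and an over-quota weight could interact. It is not KAI's actual algorithm. The function name, the field names (`quota`, `limit`, `weight`, `request`), and the two-step policy are assumptions chosen only to show the idea: each queue first gets up to its deserved quota, then spare capacity is split in proportion to weight, capped by each queue's limit.

```python
def fair_share(capacity, queues):
    """Toy allocation. queues maps name -> dict with quota, limit (None = no limit),
    weight, and request. Assumes the deserved quotas sum to at most capacity."""
    # Step 1: every queue receives its request up to its deserved quota.
    alloc = {n: min(q["request"], q["quota"]) for n, q in queues.items()}
    remaining = capacity - sum(alloc.values())

    def want(name):
        # How much more this queue could still take, honoring its limit.
        q = queues[name]
        cap = q["request"] if q["limit"] is None else min(q["request"], q["limit"])
        return max(0.0, cap - alloc[name])

    # Step 2: hand out the surplus by over-quota weight until it is gone
    # or nobody can take more.
    active = {n for n in queues if queues[n]["weight"] > 0 and want(n) > 1e-9}
    while remaining > 1e-9 and active:
        total_weight = sum(queues[n]["weight"] for n in active)
        granted = 0.0
        for n in sorted(active):
            share = remaining * queues[n]["weight"] / total_weight
            give = min(share, want(n))
            alloc[n] += give
            granted += give
        remaining -= granted
        active = {n for n in active if want(n) > 1e-9}
        if granted <= 1e-9:
            break
    return alloc


if __name__ == "__main__":
    queues = {
        "a": {"quota": 2, "limit": None, "weight": 3, "request": 10},
        "b": {"quota": 2, "limit": None, "weight": 1, "request": 10},
    }
    print(fair_share(10, queues))
```

With 10 units of capacity, both queues first receive their deserved 2 units, and the remaining 6 are split 3:1. Setting a `limit` on queue `a` would cap its total and push the rest of the surplus to `b`.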

Another feature in KAI is that jobs can be defined as non-preemptible, and we will want to refine the way this is used in the future.

KAI is pretty modular, so you could run it with many actions/plugins turned off or configured differently.
For example, if you want to integrate them today, some of the actions (reclaim, preempt) and plugins (proportion) can be turned off, and you could duplicate all the Kueue queues into KAI queues with infinite quota. I think this would let Kueue control pod creation while KAI tries to schedule everything that is created.
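A minimal sketch of the queue duplication step, assuming the Kueue queue names are already known. The `scheduling.run.ai/v2` API version, the `quota`/`limit`/`overQuotaWeight` fields, and the use of `-1` for "unlimited" follow my reading of the KAI queue docs linked above and should be treated as assumptions. The function names are hypothetical, and turning off the reclaim/preempt actions and the proportion plugin is not shown.

```python
import json

UNLIMITED = -1  # assumption: KAI interprets -1 as "no quota / no limit"


def kai_queue_for(name):
    """Build a KAI Queue dict with unlimited quota for one Kueue queue name."""
    unlimited = {"quota": UNLIMITED, "limit": UNLIMITED, "overQuotaWeight": 1}
    return {
        "apiVersion": "scheduling.run.ai/v2",  # assumption, see lead-in
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {
            # Copy the dict per resource so later edits do not alias each other.
            "resources": {r: dict(unlimited) for r in ("cpu", "gpu", "memory")}
        },
    }


def mirror_queues(kueue_queue_names):
    """Return one KAI Queue per distinct Kueue queue name, in sorted order."""
    return [kai_queue_for(n) for n in sorted(set(kueue_queue_names))]


if __name__ == "__main__":
    # Placeholder names; in practice these would be listed from the cluster.
    for queue in mirror_queues(["team-b", "team-a", "team-a"]):
        print(json.dumps(queue))
```

With quotas unlimited on the KAI side, admission decisions would come only from Kueue, and KAI would simply try to place whatever pods appear.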

Status: Todo