
[Sandbox] KAI Scheduler #372

Open

@EkinKarabulut

Description

Application contact emails

[email protected], [email protected], [email protected], [email protected], [email protected]

Project summary

KAI Scheduler is a robust, efficient, and scalable Kubernetes scheduler that optimizes GPU resource allocation for AI workloads in large-scale clusters.

Project description

KAI Scheduler is a Kubernetes-native scheduler designed to optimize resource allocation for AI/ML workloads running on large-scale GPU clusters. It introduces support for gang scheduling, workload prioritization, resource-aware placement, and dynamic resource management tailored for environments running thousands or tens of thousands of GPU nodes with high workload throughput.

KAI Scheduler supports advanced scheduling policies such as bin-packing to reduce fragmentation, spread scheduling for load balancing and resiliency, and queue-based prioritization and quota management. It enables hierarchical queue structures, allowing organizations to assign quotas, priorities, and fairness constraints per queue (e.g., per team, project, or department).

Through integration with Kubernetes ResourceClaims (Dynamic Resource Allocation), it enables GPU sharing, helping to increase overall cluster utilization. Its ability to preempt and consolidate workloads ensures efficient use of scarce accelerator resources by dynamically reallocating underutilized capacity.

Some key features:

  • Batch Scheduling: Ensure “all-or-nothing” scheduling for pod groups, preventing suboptimal scheduling of distributed workloads.
  • Bin Packing & Spread Scheduling: Optimize node usage either by minimizing fragmentation (bin-packing) or increasing resiliency and load balancing (spread scheduling).
  • Workload Priority: Prioritize workloads effectively within queues.
  • Hierarchical Queues: Manage workloads with two-level queue hierarchies for flexible organizational control.
  • Resource distribution: Customize quotas, over-quota weights, limits, and priorities per queue.
  • Fairness Policies: Ensure equitable resource distribution using Dominant Resource Fairness (DRF) and resource reclamation across queues.
  • Workload Consolidation: Reallocate running workloads intelligently to reduce fragmentation and increase cluster utilization.
  • Elastic Workloads: Dynamically scale workloads within defined minimum and maximum pod counts.
  • Dynamic Resource Allocation (DRA): Support vendor-specific hardware resources through Kubernetes ResourceClaims.
  • GPU Sharing: Allow multiple workloads to efficiently share single or multiple GPUs, maximizing resource utilization.
  • Cloud & On-premise Support: Fully compatible with dynamic cloud infrastructures (including auto-scalers like Karpenter) as well as static on-premise deployments.
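
To make the submission model concrete, here is a minimal sketch (in Go, using client-go) of a pod that targets KAI Scheduler and is tagged with a scheduling queue. The scheduler name (kai-scheduler) and the queue label key (kai.scheduler/queue) are assumptions for illustration; refer to the project documentation for the exact values.

```go
// Illustrative only: submit a GPU pod to a non-default scheduler and tag it
// with a queue label. The scheduler name and label key below are assumptions,
// not taken verbatim from the project docs.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "train-worker-0",
			Namespace: "default",
			Labels:    map[string]string{"kai.scheduler/queue": "team-a"}, // assumed queue label key
		},
		Spec: corev1.PodSpec{
			SchedulerName: "kai-scheduler", // assumed scheduler name
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "nvcr.io/nvidia/pytorch:24.01-py3", // example image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("1"), // request one GPU
					},
				},
			}},
		},
	}

	created, err := clientset.CoreV1().Pods(pod.Namespace).Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created pod:", created.Name)
}
```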

Org repo URL (provide if all repos under the org are in scope of the application)

N/A

Project repo URL in scope of application

https://github.com/NVIDIA/KAI-Scheduler

Additional repos in scope of the application

No response

Website URL

https://github.com/NVIDIA/KAI-Scheduler

Roadmap

https://github.com/NVIDIA/KAI-Scheduler?tab=readme-ov-file#roadmap

Roadmap context

Our roadmap translates the feedback and ideas we have received from users and the broader community into a focused set of priorities. Our overarching goal is to support the full spectrum of AI workload scheduling use cases - from basic batch workloads to complex, multi-cluster AI pipelines. Key areas of focus include:

  • Scheduling Gates for conditional admission, enabling workloads to wait on specific prerequisites before consuming resources (the upstream mechanism is sketched below).
  • Integration with Kueue to support multi-cluster job dispatching within KAI’s scheduling framework.
  • Topology-aware PodGroup placement, ensuring that tightly coupled workloads are scheduled on nodes optimized for locality and performance.
  • Refactoring the codebase to enhance vendor neutrality.
  • AI-specific enhancements, such as support for minimum and maximum workload runtimes, additional default PriorityClasses, and more.

We are committed to building these features and more in close collaboration with the community, defining requirements together, co-authoring design proposals, and encouraging contributions via open discussion and pull requests. Our aim is to ensure each roadmap item reflects real-world needs and evolves under a vendor-neutral, community-first development model.
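
As background for the Scheduling Gates roadmap item above, the following minimal sketch (illustrative only) shows the upstream Kubernetes mechanism involved: a pod created with a scheduling gate is held back from scheduling until an admitting controller removes the gate. The gate name and scheduler name are hypothetical placeholders.

```go
// Illustrative sketch of Kubernetes scheduling gates (the upstream mechanism
// the roadmap item builds on). The gate name and scheduler name below are
// hypothetical placeholders, not the project's actual identifiers.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gatedPod builds a pod that no scheduler will place while its scheduling
// gate is present; an admitting controller removes the gate once the
// prerequisite (e.g. a dataset being ready) is satisfied.
func gatedPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gated-job-0", Namespace: "default"},
		Spec: corev1.PodSpec{
			SchedulerName: "kai-scheduler", // assumed scheduler name
			SchedulingGates: []corev1.PodSchedulingGate{
				{Name: "kai.scheduler/wait-for-dataset"}, // hypothetical gate name
			},
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "job",
				Image:   "busybox",
				Command: []string{"sleep", "3600"},
			}},
		},
	}
}
```

Once the prerequisite is met, the admitting controller clears spec.schedulingGates and the pod becomes eligible for scheduling.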

Contributing guide

https://github.com/NVIDIA/KAI-Scheduler/blob/main/CONTRIBUTING.md

Code of Conduct (CoC)

https://github.com/NVIDIA/KAI-Scheduler/blob/main/code_of_conduct.md

Adopters

No response

Contributing or sponsoring org

https://github.com/NVIDIA

Maintainers file

https://github.com/NVIDIA/KAI-Scheduler/blob/main/MAINTAINERS.md

IP policy

  • If the project is accepted, I agree the project will follow the CNCF IP Policy

Will the project require a license exception?

N/A

Trademark and accounts

  • If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF

Standard or specification?

N/A

Why CNCF?

Donating the KAI Scheduler to the CNCF would provide a solid foundation for neutral, community-driven evolution of the project, while allowing it to more fully align with the broader cloud-native ecosystem.

  • Neutral Governance and Trust: CNCF’s vendor-neutral governance model ensures that the project’s direction is shaped by community needs rather than a single stakeholder. This fosters broader trust, especially among organizations deploying KAI Scheduler in production, and helps avoid concerns about vendor lock-in.
  • Ecosystem Alignment and Adoption: KAI Scheduler is designed to integrate cleanly with existing Kubernetes-native components and has initial support for projects like Kubeflow’s Training Operator. As part of the CNCF, we can accelerate its adoption by aligning more deeply with projects such as Kueue and Karmada, enabling seamless orchestration of complex, distributed ML workloads.
  • Contributor and Community Growth: Joining CNCF offers exposure to a global base of contributors and users, which is essential for sustaining the long-term health of the project. A broader contributor base can help drive innovation, identify edge cases, and ensure responsiveness to the needs of varied environments.
  • Collaboration Across Projects: Many CNCF-hosted projects tackle complementary concerns, such as multi-cluster management, topology-aware scheduling, and observability. Being part of the foundation will enable direct collaboration with these efforts, ensuring KAI evolves in harmony with the rest of the ecosystem, while contributing to other projects’ work.

Benefit to the landscape

Adding KAI Scheduler to the CNCF sandbox enriches the ecosystem of Kubernetes schedulers by filling several gaps:

  • The PodGrouper: Eases integration with any type of workload - especially important for gang scheduling. It automatically groups related pods - e.g. PyTorchJob, MPIJob, etc. - ensuring that they are scheduled together.
  • Pre-scheduling simulations: Before taking any action, the scheduler runs simulations to determine the best set of actions. A workload is evicted only if the simulation shows that a higher-priority workload can then be scheduled, preventing unnecessary evictions (see the sketch after this list).
  • Plugin-based scheduler logic: Allows users to extend the scheduler logic by implementing plugins.
  • Focus on the entire AI lifecycle: Accommodates all types of workloads in the AI lifecycle in a cluster: inference, training, and interactive workloads.
  • Asynchronous Binder: Binds pods to resources asynchronously, helping the scheduler scale to clusters with 10k+ GPUs.
  • Cloud auto-scale support: Scheduling works seamlessly with different auto-scalers, such as HPA, the Cluster Autoscaler, and Karpenter - supporting both regular workloads and workloads that leverage GPU sharing.
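
The pre-scheduling simulation idea can be illustrated with a small, self-contained Go sketch (not the project's actual code): an eviction is proposed only if simulating it on a copy of the node's state shows that the pending, higher-priority workload would then fit.

```go
package example

// Simplified illustration of pre-scheduling simulation: work on copies of the
// cluster state, and only pick a victim whose eviction actually lets the
// pending, higher-priority workload schedule.

type workload struct {
	name     string
	priority int
	gpus     int
}

type node struct {
	freeGPUs int
	running  []workload
}

// simulateEviction answers: if victim were evicted from n, would pending fit?
// It operates on copies only, so real cluster state is never mutated.
func simulateEviction(n node, victim, pending workload) bool {
	free := n.freeGPUs + victim.gpus
	return free >= pending.gpus
}

// pickVictim returns a lower-priority victim whose eviction would let pending
// schedule, or false if no eviction helps (avoiding a needless eviction).
func pickVictim(n node, pending workload) (workload, bool) {
	for _, w := range n.running {
		if w.priority < pending.priority && simulateEviction(n, w, pending) {
			return w, true
		}
	}
	return workload{}, false
}
```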

Cloud native 'fit'

KAI aligns with core CNCF principles:

  • Declarative API: all scheduler primitives, like scheduling queues, podgroups, and bindRequest, are implemented as CRDs, allowing Kubernetes-native integration with other frameworks running on Kubernetes. Status is published natively on these CRDs.
  • Pluggability & modularity: scheduler logic and grouping strategies are implemented as plugins and can be extended by users.
  • Asynchronous controllers: Binder and PodGrouper use Kubernetes controller patterns for resilience and scalability (see the sketch after this list). KAI Scheduler is built using a microservice paradigm in which every service is responsible for a single task.
  • Cloud-native integrations: leverages existing CNCF projects (Kubeflow Training Operator, HPA, Cluster autoscaler, Karpenter, CSI, Device Plugins, Dynamic Resource Allocation) without proprietary extensions.
  • Support for Kubernetes-native workloads: KAI Scheduler is designed to work with standard Kubernetes objects - Pods, Deployments, StatefulSets, etc. This ensures that users can adopt KAI Scheduler without altering their existing workflows or application definitions.
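
As an illustration of the controller pattern mentioned above, here is a minimal controller-runtime reconciler skeleton of the kind a binding or grouping service might use. This is a generic sketch, not the project's actual Binder or PodGrouper code.

```go
package example

// Minimal controller-runtime reconciler skeleton, illustrating the
// asynchronous controller pattern: each pod event triggers an idempotent
// Reconcile call that is re-queued on error.

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type PodReconciler struct {
	client.Client
}

// Reconcile is invoked asynchronously for each pod event; grouping or binding
// logic would live here in a real service.
func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Inspect the pod and act (group it, bind it, etc.).
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler to watch Pod events.
func (r *PodReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(r)
}
```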

Cloud native 'integration'

Dependency:

  • Kubernetes: KAI Scheduler is implemented as a set of Kubernetes controllers and CRDs. It provides features like gang scheduling, hierarchical queues, and plugin-based extensions, while preserving the native declarative and dynamically extensible model of Kubernetes.

Complementary Projects:

  • Knative: KAI can schedule GPU-powered Knative services, providing bursty inference workloads with rapid scale-to-zero and scale-to-one strategies.
  • Kubeflow Training Operator: KAI’s PodGrouper automatically groups TFJob, MPIJob, and PyTorchJob pods, ensuring consistent gang scheduling for distributed training.
  • Cluster Autoscaler & Karpenter: By watching autoscaler events and injecting scale-in/scale-out hooks, KAI Scheduler ensures that GPU-driven workloads trigger right-sized node provisioning.
  • Prometheus & Grafana: KAI can export scheduler and GPU metrics such as scheduler performance, resource division between queues, number of evictions, number of workloads scheduled in a single scheduling cycle, scheduling-cycle latency, and scheduler action latency. These make it easy for users to monitor cluster state (a minimal export sketch follows this list).
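
A minimal sketch of how such metrics can be exported for Prometheus to scrape; the metric names below are hypothetical, not the project's actual metric names.

```go
package example

// Illustrative Prometheus metrics export for a scheduler component.
// Metric names are hypothetical placeholders.

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	cycleLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "scheduler_cycle_duration_seconds", // hypothetical metric name
		Help: "Latency of a single scheduling cycle.",
	})
	evictions = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "scheduler_evictions_total", // hypothetical metric name
		Help: "Total number of workload evictions.",
	})
)

// serveMetrics registers the collectors and exposes them on /metrics,
// where Prometheus can scrape them and Grafana can visualize the series.
func serveMetrics() {
	prometheus.MustRegister(cycleLatency, evictions)
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```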

Cloud native overlap

  • Volcano: Volcano provides batch scheduling, gang scheduling, hierarchical queues, workload preemption, Dominant Resource Fairness (DRF), elastic workloads, bin-packing, and GPU sharing.
  • Kueue: Kueue focuses on job dispatching, batch scheduling, gang scheduling, multiple queues, hierarchical queues, cross-queue fairness, guaranteed quotas, and over-quota weights.

While all these projects implement core gang-scheduling and fair-share scheduling capabilities, KAI Scheduler was architected from day one for AI at scale and across the entire AI lifecycle. The PodGrouper automatically groups workloads for gang scheduling without manual intervention. Before acting on running pods, KAI Scheduler runs a pre-scheduling simulation, which ensures that higher-priority workloads can actually be scheduled after a lower-priority workload is evicted, drastically reducing needless evictions. Finally, with its asynchronous Binder, KAI maintains high scheduler throughput at 10k+ GPU scale and integrates seamlessly with cloud autoscalers.

Similar projects

Volcano, Kueue

Landscape

No

Business Product or Service to Project separation

KAI Scheduler will remain the upstream open-source engine powering the NVIDIA Run:ai platform. All development related to the scheduler occurs in the community repository under Apache 2.0. NVIDIA Run:ai will continue to use the same code as the community, ensuring no divergence and maximum benefit to all adopters.

Project "Domain Technical Review"

No response

CNCF contacts

Kevin Klues, Ricardo Rocha

Additional information

No response
