
[Feature] [scheduler-plugins] Support second scheduler mode #3852


Merged
merged 4 commits from feat/second-schedule into ray-project:master
Jul 21, 2025

Conversation

Contributor

@CheyuWu CheyuWu commented Jul 9, 2025

Why are these changes needed?

Currently, KubeRay only supports scheduler-plugins when it is deployed as a single scheduler that replaces the default kube-scheduler.
This change adds support for running scheduler-plugins as a second scheduler alongside the default one.
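
In second-scheduler mode, scheduler-plugins runs as its own scheduler Deployment (scheduler-plugins-scheduler) next to the default kube-scheduler. When a RayCluster carries the ray.io/scheduler-name: scheduler-plugins label, the KubeRay operator creates a PodGroup for the cluster (when gang scheduling is enabled) and sets spec.schedulerName on the head and worker Pods so that scheduler-plugins-scheduler picks them up.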

Manual Testing

Common Portion

Ray operator setup

Set batchScheduler.name in helm-chart/kuberay-operator/values.yaml to scheduler-plugins

batchScheduler:
  enabled: false
  name: "scheduler-plugins"

Testing YAML file

  • Create a YAML file - deploy.yaml
# deploy.yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-kuberay
  labels:
    ray.io/gang-scheduling-enabled: "true"
    ray.io/scheduler-name: scheduler-plugins
spec:
  rayVersion: '2.46.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 1
              memory: 2G
            requests:
              cpu: 1
              memory: 2G
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  - replicas: 3
    minReplicas: 1
    maxReplicas: 5
    groupName: workergroup
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 1
              memory: 1G
            requests:
              cpu: 1
              memory: 1G
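
These two metadata labels drive the integration: ray.io/gang-scheduling-enabled: "true" tells the operator to create a PodGroup for the cluster, and ray.io/scheduler-name: scheduler-plugins routes the head and worker Pods to the scheduler-plugins scheduler.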

Single scheduler

CoScheduler setup

Follow the instructions - Reference

  • Log into the control plane node

    $ sudo docker exec -it $(sudo docker ps | grep control-plane | awk '{print $1}') bash
  • Backup kube-scheduler.yaml

    $ cp /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/kube-scheduler.yaml
  • Install vim in kube-scheduler-kind-control-plane

    $ apt update
    $ apt install vim
  • Fix the permission problem in kube-scheduler-kind-control-plane

    $ chmod 644 /etc/kubernetes/scheduler.conf
  • Create /etc/kubernetes/sched-cc.yaml

    Keep both the default-scheduler and scheduler-plugins-scheduler profiles so that the KubeRay operator (scheduled by the default scheduler) can still be deployed.

    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      # (Optional) Change true to false if you are not running a HA control-plane.
      leaderElect: true
    clientConnection:
      kubeconfig: /etc/kubernetes/scheduler.conf
    profiles:
    - schedulerName: default-scheduler
      plugins:
        queueSort:
          enabled:
            - name: Coscheduling
          disabled:
            - name: PrioritySort
        multiPoint:
          enabled:
            - name: Coscheduling
    - schedulerName: scheduler-plugins-scheduler
      plugins:
        queueSort:
          enabled:
            - name: Coscheduling
          disabled:
            - name: PrioritySort
        multiPoint:
          enabled:
          - name: Coscheduling
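
    Alternatively, instead of editing the file with vim inside the node, you can write sched-cc.yaml locally and copy it into the control-plane container. A sketch that reuses the same container lookup as the login step above:

    $ sudo docker cp sched-cc.yaml $(sudo docker ps | grep control-plane | awk '{print $1}'):/etc/kubernetes/sched-cc.yaml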
  • Install all-in-one.yaml (outside the pod)

    $ kubectl apply -f manifests/install/all-in-one.yaml
    
  • Apply the missing ElasticQuota CRD YAML (outside the pod)

    $ k apply -f manifests/crds/scheduling.x-k8s.io_elasticquotas.yaml
  • Check the deployment (outside the pod)

    $ kubectl get deploy -n scheduler-plugins
    NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
    scheduler-plugins-controller   1/1     1            1           116s
  • Install podgroup crds (outside the pod)

    $ kubectl apply -f manifests/crds/scheduling.x-k8s.io_podgroups.yaml
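
    As a quick sanity check, the installed CRDs can be listed from outside the pod (the grep pattern matches the scheduling.x-k8s.io API group used by the manifests above):

    $ kubectl get crd | grep scheduling.x-k8s.io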
  • Modify /etc/kubernetes/manifests/kube-scheduler.yaml
    See the instructions for more details

    apiVersion: v1
    kind: Pod
    metadata:
      creationTimestamp: null
      labels:
        component: kube-scheduler
        tier: control-plane
      name: kube-scheduler
      namespace: kube-system
    spec:
      containers:
      - command:
        - kube-scheduler
        - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
        - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
        - --bind-address=127.0.0.1
        - --config=/etc/kubernetes/sched-cc.yaml
        image: registry.k8s.io/scheduler-plugins/kube-scheduler:v0.31.8
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 8
          httpGet:
            host: 127.0.0.1
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 15
        name: kube-scheduler
        resources:
          requests:
            cpu: 100m
        startupProbe:
          failureThreshold: 24
          httpGet:
            host: 127.0.0.1
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 15
        volumeMounts:
        - mountPath: /etc/kubernetes/scheduler.conf
          name: kubeconfig
          readOnly: true
        - mountPath: /etc/kubernetes/sched-cc.yaml
          name: sched-cc
          readOnly: true
      hostNetwork: true
      priority: 2000001000
      priorityClassName: system-node-critical
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      volumes:
      - hostPath:
          path: /etc/kubernetes/scheduler.conf
          type: FileOrCreate
        name: kubeconfig
      - hostPath:
          path: /etc/kubernetes/sched-cc.yaml
          type: FileOrCreate
        name: sched-cc
    status: {}
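
    No restart command is needed: the kubelet watches /etc/kubernetes/manifests, so saving the modified static Pod manifest is enough for it to recreate kube-scheduler with the scheduler-plugins image and the sched-cc.yaml configuration.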
  • Verify that kube-scheduler pod is running properly

    $ kubectl get pod -n kube-system | grep kube-scheduler
    
    kube-scheduler-kind-control-plane            1/1     Running   0          77s
    
    $ kubectl get pods -l component=kube-scheduler -n kube-system -o=jsonpath="{.items[0].spec.containers[0].image}{'\n'}"
    
    registry.k8s.io/scheduler-plugins/kube-scheduler:v0.31.8

Apply deploy.yaml

Install the KubeRay operator first.

Run the command to deploy the RayCluster with the scheduler-plugins-scheduler and gang scheduling enabled:

$ k apply -f deploy.yaml

Result

Get Status

$ k get raycluster

NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-kuberay   3                 3                   4      5G       0      ready    84s
$ k get podgroup raycluster-kuberay -o yaml

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  creationTimestamp: "2025-07-15T18:09:58Z"
  generation: 1
  name: raycluster-kuberay
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    kind: RayCluster
    name: raycluster-kuberay
    uid: 7e4b3012-0592-4116-82c1-cb59467c1a38
  resourceVersion: "2238"
  uid: e1bef4b0-1cc7-407f-be61-eeba94ff3442
spec:
  minMember: 4
  minResources:
    cpu: "4"
    memory: 5G
status:
  occupiedBy: default/raycluster-kuberay
  phase: Running
  running: 4

Get the scheduler name of the Ray operator, Ray head, and Ray worker Pods

$ k get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.schedulerName}{"\n"}{end}'

kuberay-operator-77596879bc-m7csr       default-scheduler
raycluster-kuberay-head scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-68ht4     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-m8sv9     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-x2ghh     scheduler-plugins-scheduler

Delete the RayCluster

$ k delete -f deploy.yaml

Modify the deploy.yaml and apply it

  workerGroupSpecs:
  - replicas: 100
    minReplicas: 1
    maxReplicas: 200
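
Because gang scheduling is enabled, the Coscheduling plugin should not bind any Pod of the group until the entire PodGroup can be placed. The kind cluster cannot fit 100 workers plus the head, so every Ray Pod is expected to stay Pending rather than being partially scheduled, as the next command confirms.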

Run the command to check that all of the Ray Pods are in Pending status:

$ kubectl get pods -A

NAMESPACE            NAME                                            READY   STATUS    RESTARTS      AGE
default              kuberay-operator-77596879bc-m7csr               1/1     Running   0             3m32s
default              raycluster-kuberay-head                         0/1     Pending   0             10s
default              raycluster-kuberay-workergroup-worker-2sl5s     0/1     Pending   0             9s
default              raycluster-kuberay-workergroup-worker-2svs6     0/1     Pending   0             7s
default              raycluster-kuberay-workergroup-worker-2xh6d     0/1     Pending   0             7s
default              raycluster-kuberay-workergroup-worker-4575m     0/1     Pending   0             10s
... -> skip lots of pending worker pods
default              raycluster-kuberay-workergroup-worker-znhp5     0/1     Pending   0             7s
default              raycluster-kuberay-workergroup-worker-zntkk     0/1     Pending   0             6s
default              raycluster-kuberay-workergroup-worker-zvvsh     0/1     Pending   0             6s
default              raycluster-kuberay-workergroup-worker-zz6tw     0/1     Pending   0             7s
kube-system          coredns-6f6b679f8f-cngxt                        1/1     Running   0             22m
kube-system          coredns-6f6b679f8f-nprq6                        1/1     Running   0             22m
kube-system          etcd-kind-control-plane                         1/1     Running   0             22m
kube-system          kindnet-hnsvz                                   1/1     Running   0             22m
kube-system          kube-apiserver-kind-control-plane               1/1     Running   0             22m
kube-system          kube-controller-manager-kind-control-plane      1/1     Running   0             22m
kube-system          kube-proxy-jbzqc                                1/1     Running   0             22m
kube-system          kube-scheduler-kind-control-plane               1/1     Running   0             6m59s
local-path-storage   local-path-provisioner-57c5987fd4-8k26v         1/1     Running   0             22m
scheduler-plugins    scheduler-plugins-controller-845cfd89c6-886bv   1/1     Running   1 (11m ago)   13m

Delete the kind cluster before the next test

$ kind delete cluster

Second scheduler

Follow the instructions - Reference

Install the scheduler-plugins

$ helm install --repo https://scheduler-plugins.sigs.k8s.io scheduler-plugins scheduler-plugins
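
Unlike the single-scheduler setup above, this installs scheduler-plugins-scheduler as its own Deployment running alongside the default kube-scheduler, which is exactly the "second scheduler" mode this PR targets.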

Check that scheduler-plugins is running

$ kubectl get deploy

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
kuberay-operator               1/1     1            1           14m
scheduler-plugins-controller   1/1     1            1           5m37s
scheduler-plugins-scheduler    1/1     1            1           5m37s

Ray operator setup and config

Set batchScheduler.name in helm-chart/kuberay-operator/values.yaml to scheduler-plugins

batchScheduler:
  enabled: false
  name: "scheduler-plugins"

Apply deploy.yaml

Run the command to deploy the RayCluster with the scheduler-plugins-scheduler and gang scheduling enabled:

$ k apply -f deploy.yaml

Result

Get Status

$ k get raycluster

NAME                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster-kuberay   3                 3                   4      5G       0      ready    107s
$ k get podgroup raycluster-kuberay -o yaml

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  creationTimestamp: "2025-07-15T17:43:23Z"
  generation: 1
  name: raycluster-kuberay
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    kind: RayCluster
    name: raycluster-kuberay
    uid: 0598303a-d711-4307-b4ef-ada38731dce7
  resourceVersion: "2047"
  uid: eb1c7c09-dcdd-4664-b23c-af5e2584a25c
spec:
  minMember: 4
  minResources:
    cpu: "4"
    memory: 5G
status:
  occupiedBy: default/raycluster-kuberay
  phase: Running
  running: 4

Get the scheduler name of the Ray operator, Ray head, and Ray worker Pods

$ k get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.schedulerName}{"\n"}{end}'

kuberay-operator-77596879bc-g5zvx       default-scheduler
raycluster-kuberay-head scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-fjgjw     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-fxnvr     scheduler-plugins-scheduler
raycluster-kuberay-workergroup-worker-t69t2     scheduler-plugins-scheduler
scheduler-plugins-controller-845cfd89c6-j4h6b   default-scheduler
scheduler-plugins-scheduler-5dd667cb77-lr6tg    default-scheduler

Delete the RayCluster

$ k delete -f deploy.yaml

Modify the deploy.yaml and apply it

  workerGroupSpecs:
  - replicas: 100
    minReplicas: 1
    maxReplicas: 200

Run the command to check that all of the Ray Pods are in Pending status:

$ kubectl get pods -A

NAMESPACE            NAME                                            READY   STATUS    RESTARTS   AGE
default              kuberay-operator-77596879bc-g5zvx               1/1     Running   0          12m
default              raycluster-kuberay-head                         0/1     Pending   0          13s
default              raycluster-kuberay-workergroup-worker-2gqpl     0/1     Pending   0          12s
default              raycluster-kuberay-workergroup-worker-2qdjs     0/1     Pending   0          12s
default              raycluster-kuberay-workergroup-worker-2t6d4     0/1     Pending   0          11s
default              raycluster-kuberay-workergroup-worker-4jwgj     0/1     Pending   0          10s
default              raycluster-kuberay-workergroup-worker-52vw7     0/1     Pending   0          12s
... -> skip lots of pending worker pods
default              raycluster-kuberay-workergroup-worker-zckrr     0/1     Pending   0          12s
default              raycluster-kuberay-workergroup-worker-zfpzb     0/1     Pending   0          12s
default              raycluster-kuberay-workergroup-worker-zvzhm     0/1     Pending   0          10s
default              scheduler-plugins-controller-845cfd89c6-j4h6b   1/1     Running   0          4m16s
default              scheduler-plugins-scheduler-5dd667cb77-lr6tg    1/1     Running   0          4m16s
kube-system          coredns-6f6b679f8f-p62kv                        1/1     Running   0          16m
kube-system          coredns-6f6b679f8f-xzn42                        1/1     Running   0          16m
kube-system          etcd-kind-control-plane                         1/1     Running   0          16m
kube-system          kindnet-mwhsd                                   1/1     Running   0          16m
kube-system          kube-apiserver-kind-control-plane               1/1     Running   0          16m
kube-system          kube-controller-manager-kind-control-plane      1/1     Running   0          16m
kube-system          kube-proxy-mt8q4                                1/1     Running   0          16m
kube-system          kube-scheduler-kind-control-plane               1/1     Running   0          16m
local-path-storage   local-path-provisioner-57c5987fd4-t5dmn         1/1     Running   0          16m

Related issue number

Closes #3769

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@CheyuWu CheyuWu force-pushed the feat/second-schedule branch from 09c3853 to ea71807 on July 9, 2025 14:52
@CheyuWu
Contributor Author

CheyuWu commented Jul 9, 2025

Hi @kevin85421, PTAL

@kevin85421
Member

Why do you use a single scheduler for the manual test?

@kevin85421
Member

cc @troychiu for review

@CheyuWu
Contributor Author

CheyuWu commented Jul 10, 2025

Why do you use a single scheduler for the manual test?

Hi @kevin85421
Although both the default-scheduler and scheduler-plugins-scheduler profiles are configured in /etc/kubernetes/sched-cc.yaml,
the Ray pods (head and workers) are explicitly assigned to the scheduler-plugins scheduler, as shown in:

labels:
  ray.io/scheduler-name: scheduler-plugins

and verified via:

$ kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.schedulerName}{"\n"}{end}'

This setup follows a multi-scheduler configuration, where the KubeRay operator itself is scheduled by the default-scheduler and the RayCluster pods are scheduled by scheduler-plugins.

I’ll revise the wording in the PR description to avoid confusion around the "single scheduler" statement.

Member

@kevin85421 kevin85421 left a comment

As I understand it, you deploy scheduler-plugins in "single scheduler" mode to replace the default scheduler. For "second scheduler" mode, you need to use the Helm chart to install scheduler-plugins in a separate Pod.

https://github.com/kubernetes-sigs/scheduler-plugins/blob/93126eabdf526010bf697d5963d849eab7e8e898/doc/install.md#as-a-second-scheduler

@CheyuWu
Contributor Author

CheyuWu commented Jul 10, 2025

As I understand it, you deploy scheduler-plugins in "single scheduler" mode to replace the default scheduler. For "second scheduler" mode, you need to use the Helm chart to install scheduler-plugins in a separate Pod.

https://github.com/kubernetes-sigs/scheduler-plugins/blob/93126eabdf526010bf697d5963d849eab7e8e898/doc/install.md#as-a-second-scheduler

Oops, I misunderstood. I will use the second scheduler mode instead.

@CheyuWu
Contributor Author

CheyuWu commented Jul 11, 2025

Hi @kevin85421 @troychiu, I have updated the manual testing procedure, PTAL

@CheyuWu
Contributor Author

CheyuWu commented Jul 12, 2025

I have also updated the 100-Pod manual test, and all of the Pods are in Pending status.

@kevin85421
Member

I have also updated the 100-Pod manual test, and all of the Pods are in Pending status.

Have you tested both the single scheduler and the second scheduler with this 100-Pod RayCluster CR?

@@ -90,8 +90,7 @@ func (k *KubeScheduler) AddMetadataToPod(_ context.Context, app *rayv1.RayCluste
if k.isGangSchedulingEnabled(app) {
pod.Labels[kubeSchedulerPodGroupLabelKey] = app.Name
}
// TODO(kevin85421): Currently, we only support "single scheduler" mode. If we want to support
// "second scheduler" mode, we need to add `schedulerName` to the pod spec.
pod.Spec.SchedulerName = k.Name()

Comment on lines 136 to 143
if cluster.Labels == nil {
    cluster.Labels = make(map[string]string)
}
if tt.enableGang {
    cluster.Labels["ray.io/gang-scheduling-enabled"] = "true"
} else {
    delete(cluster.Labels, "ray.io/gang-scheduling-enabled")
}
Contributor

Will this be cleaner?

Suggested change
-if cluster.Labels == nil {
-    cluster.Labels = make(map[string]string)
-}
-if tt.enableGang {
-    cluster.Labels["ray.io/gang-scheduling-enabled"] = "true"
-} else {
-    delete(cluster.Labels, "ray.io/gang-scheduling-enabled")
-}
+cluster.Labels = make(map[string]string)
+if tt.enableGang {
+    cluster.Labels["ray.io/gang-scheduling-enabled"] = "true"
+}

scheduler := &KubeScheduler{}
scheduler.AddMetadataToPod(context.TODO(), &cluster, "worker", pod)

if tt.expectedPodGroup {
Contributor

Can we simply use enableGang instead of having another parameter? I think they have the same intent.

@troychiu
Contributor

troychiu commented Jul 13, 2025

As @kevin85421 mentioned, can you also double check if both modes work fine?

@CheyuWu
Contributor Author

CheyuWu commented Jul 13, 2025

@kevin85421 @troychiu ,

  • I have updated the Manual Testing portion for both the single scheduler and the second scheduler.
  • Use scheduler-plugins-scheduler instead.
  • Fix the redundant parameter in the test.

@@ -21,7 +21,7 @@ import (
)

const (
-	schedulerName string = "scheduler-plugins"
+	schedulerName string = "scheduler-plugins-scheduler"
Contributor

Contributor Author

@CheyuWu CheyuWu Jul 14, 2025

Yes, this is important. I will add the comment.

@@ -69,13 +69,13 @@ logging:
#
# 4. Use PodGroup
# batchScheduler:
-#   name: scheduler-plugins
+#   name: scheduler-plugins-scheduler
Contributor

For user-facing config, I am not sure if we should use "scheduler-plugins" or "scheduler-plugins-scheduler". Wdyt?

Contributor Author

You are right, and it's easier to understand.

Contributor Author

@CheyuWu CheyuWu Jul 14, 2025

But I think this is a little awkward; we cannot directly change GetPluginName, because

case schedulerplugins.GetPluginName():

If we need to change batchScheduler back to scheduler-plugins, the code will probably be:

const (
	schedulerName                 string = "scheduler-plugins"
+	defaultSchedulerName          string = "scheduler-plugins-scheduler"
	kubeSchedulerPodGroupLabelKey string = "scheduling.x-k8s.io/pod-group"
)

func GetPluginName() string {
	return schedulerName
}

func (k *KubeScheduler) Name() string {
	return defaultSchedulerName // -> Is it fine to change it to something like this?
}

I am not sure if there is a better approach.

Contributor

IMO, user experience is more important, so this is fine to me. However, we'll need good variable naming and comments explaining why there are two names and their corresponding responsibilities.

@CheyuWu
Contributor Author

CheyuWu commented Jul 15, 2025

Hi @troychiu, PTAL

  • I have updated the batchScheduler value to scheduler-plugins.
  • Pods are assigned the default scheduler name scheduler-plugins-scheduler.
  • Added comments.
  • Updated the manual testing.

@troychiu
Contributor

@kevin85421 do you mind taking a look? Thank you!

Member

@kevin85421 kevin85421 left a comment

LGTM

@kevin85421 kevin85421 merged commit b6bcf10 into ray-project:master Jul 21, 2025
25 checks passed
laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request Jul 25, 2025