Skip to content

Feature/cron scheduling rayjob 2426 #3836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

DW-Han
Copy link
Contributor

@DW-Han DW-Han commented Jun 26, 2025

Why are these changes needed?

Adding cron scheduling for RayJob

Related issue number

Resolves #2426 [Feature] Support cron scheduling for RayJob

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Explanation

I thought about using the default kubernetes CronJob Resource and have it run a RayJob given a cron string, which might have been fine, but this would change a lot of the structure of the current RayJob controller or maybe would require something like a CronRayJob CRD. I decided to go with a lighter weight solution of added JobDeploymentStatusScheduling, JobDeploymentStatusScheduled, and JobStatusScheduled (JobStatusScheduled is not entirely needed in the logic, but I wanted to be explicit).

Whenever the job in complete instead of not requeueing, we check if the Spec.Schedule is empty, its its not then we set the job as JobDeploymentStatusScheduling. This goes through the same logic as the existing JobDeploymentStatusSuspending and JobDeploymentStatusRetrying, deleting the cluster, job, etc. to free up resources until the next scheduled job.

JobDeploymentStatusScheduling transitions to JobDeploymentStatusScheduled which checks whether we are at time or not by seeing if we are within a ScheduleDelta time of a cron tick. If so we run a job, setting JobStatusNew and JobDeploymentStatusNew again. If not we requeue after NextScheduleTimeDuration.

Untitled Diagram drawio (1)

The reason i used a ScheduleDelta was to be more robust when we schedule since there could be drifts from the requeue by processing speed maybe, also as the only way to know when to transition out of scheduled.
I thought about just queuing and transitioning state and realized that it would actually be difficult in the current logic since all the status updates are done in a since "updateRayJobStatus" after the large switch statement which means that I have to requeue with the NextScheduleTimeDuration at the end of the reconcile function. This approach seems fine but I was facing some issues with reconciling errors, but I want to hear your thoughts. I don't know if this approach is optimal. Any feedback would be great

Updates

  • Made it so that the job doesn't instantly run when created, instead it goes straight to scheduled for the next cron tick
  • Added unit tests
  • Cleaned up code
  • I realized we only need to check if the current time is within a buffer of the previous tick since we will reconcile to the future tick anyways
  • Now has the expected behavior of deleting or reusing clusters with the shutdownAfterJobFinishes spec
  • Added unit tests for the schedule.go util file

@DW-Han DW-Han marked this pull request as ready for review June 26, 2025 09:08
@DW-Han DW-Han marked this pull request as draft June 26, 2025 09:10
@DW-Han DW-Han marked this pull request as ready for review June 26, 2025 20:15
@DW-Han DW-Han marked this pull request as draft June 26, 2025 20:19
@andrewsykim andrewsykim self-requested a review June 26, 2025 20:42
@andrewsykim andrewsykim self-assigned this Jun 26, 2025
@ryanaoleary ryanaoleary self-requested a review June 26, 2025 22:23
Copy link
Contributor

@chiayi chiayi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some initial comments and questions. pls add unit tests as well.

Copy link
Contributor

@chiayi chiayi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments, PTAL

Copy link
Contributor

@chiayi chiayi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DW-Han DW-Han marked this pull request as ready for review July 14, 2025 20:55
@andrewsykim
Copy link
Member

Please add unit tests in https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/rayjob_controller_test.go as well to cover the cron scheduling case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Support cron scheduling for RayJob
3 participants