Feature/cron scheduling rayjob 2426 #3836
Conversation
Signed-off-by: Kenny Han <[email protected]>
Have some initial comments and questions. Please add unit tests as well.
Few more comments, PTAL
LGTM
Please add unit tests in https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/rayjob_controller_test.go as well to cover the cron scheduling case
Why are these changes needed?
Adding cron scheduling for RayJob
Related issue number
Resolves #2426 [Feature] Support cron scheduling for RayJob
Checks
Explanation
I considered using the default Kubernetes CronJob resource and having it run a RayJob given a cron string, which might have been fine, but that would change a lot of the structure of the current RayJob controller, or might require something like a CronRayJob CRD. Instead, I went with the lighter-weight solution of adding JobDeploymentStatusScheduling, JobDeploymentStatusScheduled, and JobStatusScheduled (JobStatusScheduled is not strictly needed by the logic, but I wanted to be explicit).
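For illustration, a minimal sketch of what the new status values could look like. The new names come from this description; the type declarations and the existing values shown here are simplified assumptions about the KubeRay API types, not the actual diff:

```go
package v1

// Simplified stand-ins for the existing KubeRay status types.
type JobDeploymentStatus string
type JobStatus string

const (
	// Existing deployment statuses (abbreviated).
	JobDeploymentStatusNew      JobDeploymentStatus = "New"
	JobDeploymentStatusComplete JobDeploymentStatus = "Complete"

	// Added for cron scheduling: the finished job is tearing down its
	// cluster/submitter resources before waiting for the next tick.
	JobDeploymentStatusScheduling JobDeploymentStatus = "Scheduling"
	// Added for cron scheduling: resources are freed and the controller is
	// waiting until the next scheduled run time.
	JobDeploymentStatusScheduled JobDeploymentStatus = "Scheduled"
)

// JobStatusScheduled mirrors the deployment-level state for explicitness.
const JobStatusScheduled JobStatus = "SCHEDULED"
```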
Whenever the job is complete, instead of simply not requeueing, we check whether Spec.Schedule is empty; if it is not, we set the job to JobDeploymentStatusScheduling. This goes through the same logic as the existing JobDeploymentStatusSuspending and JobDeploymentStatusRetrying states, deleting the cluster, job, etc. to free up resources until the next scheduled run.
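Roughly, the decision at completion could be expressed like this. This is an illustrative helper, not the PR's actual code; it assumes the new Spec.Schedule field and the status constants described above:

```go
package ray

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// nextDeploymentStatusAfterCompletion captures the decision above: when a
// RayJob finishes and Spec.Schedule is set, move to Scheduling so resources
// are torn down like the Suspending/Retrying paths; otherwise stay in the
// terminal Complete state. Names and placement are illustrative only.
func nextDeploymentStatusAfterCompletion(rayJob *rayv1.RayJob) rayv1.JobDeploymentStatus {
	if rayJob.Spec.Schedule != "" {
		return rayv1.JobDeploymentStatusScheduling
	}
	return rayv1.JobDeploymentStatusComplete
}
```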
JobDeploymentStatusScheduling then transitions to JobDeploymentStatusScheduled, which checks whether it is time to run by seeing if we are within a ScheduleDelta window of a cron tick. If so, we run a job, setting JobStatusNew and JobDeploymentStatusNew again. If not, we requeue after NextScheduleTimeDuration.
The reason I used a ScheduleDelta is to make the schedule check more robust, since requeues can drift from the exact tick due to processing speed; it is also the only way to know when to transition out of the Scheduled state.
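A sketch of that check, using robfig/cron for parsing; the constant value, helper names, and exact comparison are assumptions, only the ScheduleDelta/NextScheduleTimeDuration ideas come from this description:

```go
package ray

import (
	"time"

	"github.com/robfig/cron/v3"
)

// ScheduleDelta is the tolerance window around a cron tick within which the
// job is considered due. The value here is illustrative.
const ScheduleDelta = 100 * time.Millisecond

// isWithinScheduleDelta reports whether now is within ScheduleDelta of the
// cron tick that follows lastRun.
func isWithinScheduleDelta(schedule string, lastRun, now time.Time) (bool, error) {
	sched, err := cron.ParseStandard(schedule)
	if err != nil {
		return false, err
	}
	diff := now.Sub(sched.Next(lastRun))
	if diff < 0 {
		diff = -diff
	}
	return diff <= ScheduleDelta, nil
}

// nextScheduleTimeDuration returns how long to wait until the next cron tick.
func nextScheduleTimeDuration(schedule string, now time.Time) (time.Duration, error) {
	sched, err := cron.ParseStandard(schedule)
	if err != nil {
		return 0, err
	}
	return sched.Next(now).Sub(now), nil
}
```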
I thought about just queueing and transitioning state directly, but realized that would be difficult with the current logic, since all status updates happen in a single "updateRayJobStatus" call after the large switch statement, which means I have to requeue with NextScheduleTimeDuration at the end of the reconcile function. This approach seems fine, but I was facing some issues with reconcile errors, so I'd like to hear your thoughts. I don't know if this approach is optimal; any feedback would be great.
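The tail of Reconcile could then look roughly like this. Again an illustrative sketch rather than the PR's code; it reuses nextScheduleTimeDuration from the sketch above and assumes the caller passes in the controller's usual default requeue interval:

```go
package ray

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// scheduleRequeueResult shows the idea described above: after the single
// status update at the end of Reconcile, a job sitting in Scheduled is
// requeued at its next cron tick, while everything else keeps the default
// requeue interval. Helper and parameter names are illustrative.
func scheduleRequeueResult(rayJob *rayv1.RayJob, defaultRequeue time.Duration, now time.Time) (ctrl.Result, error) {
	if rayJob.Spec.Schedule != "" &&
		rayJob.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusScheduled {
		d, err := nextScheduleTimeDuration(rayJob.Spec.Schedule, now)
		if err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{RequeueAfter: d}, nil
	}
	return ctrl.Result{RequeueAfter: defaultRequeue}, nil
}
```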
Updates