
[RayCluster] Toggle usage of deterministic/non-deterministic head pod name with feature flag #3873


Merged

Conversation

Contributor

@machichima machichima commented Jul 16, 2025

Why are these changes needed?

KubeRay 1.4.0 uses a deterministic name for the head pod. This, unfortunately, breaks the Autoscaler v2 for Ray < 2.48 after a head node restart under GCS FT configuration.

Based on #3868 (comment), we shouldn't change the operator's behavior based on the Ray version. Therefore, we decided to add a feature flag that toggles between a deterministic and a non-deterministic head pod name (the default is the non-deterministic name).

  • Introduce the feature flag ENABLE_DETERMINISTIC_HEAD_POD_NAME, which controls whether the head pod uses a deterministic name
  • Add a test that toggles the head pod naming mode through the feature flag

Related issue number

Closes #3868

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@machichima machichima marked this pull request as ready for review July 16, 2025 13:16
@machichima
Contributor Author

Tested the feature flag locally:

  • Set ENABLE_DETERMINISTIC_HEAD_POD_NAME to true (screenshot)
  • Set ENABLE_DETERMINISTIC_HEAD_POD_NAME to false (screenshot)

Signed-off-by: machichima <[email protected]>
@machichima
Contributor Author

@rueian PTAL, thank you!

@machichima
Contributor Author

I added the test in ray-operator/controllers/ray/utils/util_test.go and tested it manually. Is an e2e test also needed in this PR?

Comment on lines 718 to 722
enableDeterministicHeadPodName := false
if s := os.Getenv(ENABLE_DETERMINISTIC_HEAD_POD_NAME); strings.ToLower(s) == "true" {
	enableDeterministicHeadPodName = true
}
return enableDeterministicHeadPodName
Contributor

Suggested change
- enableDeterministicHeadPodName := false
- if s := os.Getenv(ENABLE_DETERMINISTIC_HEAD_POD_NAME); strings.ToLower(s) == "true" {
- 	enableDeterministicHeadPodName = true
- }
- return enableDeterministicHeadPodName
+ return strings.ToLower(os.Getenv(ENABLE_DETERMINISTIC_HEAD_POD_NAME)) == "true"

Contributor Author

Fixed! Thanks!

@rueian
Contributor

rueian commented Jul 16, 2025

We can have an e2e test for the autoscaler restart with GCS FT in another PR, since it has nothing to do with the feature flag.

@machichima
Contributor Author

Hi @rueian,
Now that PR #3872 has been merged, should we close this one?

@rueian
Contributor

rueian commented Jul 16, 2025

Hi @rueian, now that PR #3872 has been merged, should we close this one?

We still need this. Please resolve the conflicts.

@machichima
Contributor Author

@rueian I've resolved the conflicts. PTAL, thank you!

@rueian rueian requested a review from kevin85421 July 19, 2025 02:47
Comment on lines +179 to +181
# If set to true, we will use deterministic name for head pod. Otherwise, the non-deterministic name is used.
# - name: ENABLE_DETERMINISTIC_HEAD_POD_NAME
# value: "false"
Contributor

We should enable this feature by default to fix this issue:
#3013

WDYT?

Contributor

Oh sorry, I missed this PR:
#3872

@rueian rueian merged commit 04f5b71 into ray-project:master Jul 19, 2025
25 checks passed
laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request Jul 25, 2025
… name with feature flag (ray-project#3873)

* feat: use gen name for ray version > 2.48

Signed-off-by: machichima <[email protected]>

* feat: head pod name to non deterministic

Signed-off-by: machichima <[email protected]>

* test: compare head pod name between old and new

Signed-off-by: machichima <[email protected]>

* feat: add feature flag for deterministic head pod name

Signed-off-by: machichima <[email protected]>

* feat: use deterministic if feature flag set to true

Signed-off-by: machichima <[email protected]>

* test: head pod name with/without feature flag == true

Signed-off-by: machichima <[email protected]>

* test: sep test for head/worker pod name

Signed-off-by: machichima <[email protected]>

* docs: add feature flag desc to helm chart

Signed-off-by: machichima <[email protected]>

* fix: feature flag name

Signed-off-by: machichima <[email protected]>

* refactor: simplify check if the feature flag is set

Signed-off-by: machichima <[email protected]>

---------

Signed-off-by: machichima <[email protected]>
DW-Han pushed a commit to DW-Han/kuberay that referenced this pull request Jul 30, 2025
… name with feature flag (ray-project#3873)

Development

Successfully merging this pull request may close these issues.

[Bug] Restore random head pod name when Ray < 2.48 and Autoscaler v2 and GCS FT are enabled
3 participants