[9.0] [ML] Ensure that anomaly detection job state update retries if master node is temporarily unavailable (#129391) #129402

Merged: 3 commits merged into 9.0 on Jun 16, 2025.

valeriy42 (Contributor):

Backports the following commits to 9.0:

… node is temporarily unavailable (elastic#129391)

During a cluster upgrade, anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, each job transitions through several states, including "opening" and "opened". If the master node becomes temporarily unavailable during this transition (for example, due to reassignment), the new job state is not committed to the cluster state. As a result, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were actually open, but their state was stuck at "opening".

This PR performs the job state update through a retryable action, so the update is retried while the master node is temporarily unavailable. This ensures that the new job state is committed and the cluster state remains consistent during the upgrade.
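The retry idea can be illustrated with a minimal, self-contained Java sketch. It is not the Elasticsearch implementation and does not use Elasticsearch's own retry classes: `RetryingStateUpdate`, `MasterUnavailableException`, `updateWithRetry`, the attempt limit, and the backoff values are all hypothetical. The sketch retries the state update with capped exponential backoff only while the failure indicates a missing master, and lets every other failure propagate immediately.

```java
import java.util.function.Supplier;

/** Hypothetical sketch (not Elasticsearch code): retry a job-state update while no master is available. */
public final class RetryingStateUpdate {

    /** Hypothetical stand-in for a transient "no master node available" failure. */
    static final class MasterUnavailableException extends RuntimeException {
        MasterUnavailableException(String message) {
            super(message);
        }
    }

    // Hypothetical upper bound on the wait between attempts.
    private static final long MAX_DELAY_MS = 5_000L;

    /**
     * Runs {@code update}, retrying with capped exponential backoff only while it fails
     * with MasterUnavailableException. Any other exception propagates at once; after
     * maxAttempts transient failures, the last one is rethrown.
     */
    static <T> T updateWithRetry(Supplier<T> update, int maxAttempts, long initialDelayMs) {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return update.get();
            } catch (MasterUnavailableException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted: surface the failure to the caller
                }
                sleep(delayMs);
                delayMs = Math.min(delayMs * 2, MAX_DELAY_MS); // double the wait, up to the cap
            }
        }
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new RuntimeException(ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated update: the master is unavailable twice, then the "opened" state commits.
        String state = updateWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new MasterUnavailableException("no master node");
            }
            return "opened";
        }, 5, 10L);
        System.out.println(state + " after " + calls[0] + " attempts");
    }
}
```

Running `main` prints `opened after 3 attempts`. Restricting retries to the transient failure type is the key design choice: retrying an update that fails for a non-transient reason would only delay the error.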

Fixes elastic#126148
@valeriy42 valeriy42 added :ml Machine learning >bug auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport cloud-deploy Publish cloud docker image for Cloud-First-Testing Team:ML Meta label for the ML team labels Jun 13, 2025
@valeriy42 valeriy42 removed the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jun 13, 2025
@elasticsearchmachine merged commit d4692d2 into elastic:9.0 on Jun 16, 2025 (17 checks passed).
@valeriy42 deleted the backport/9.0/pr-129391 branch on June 16, 2025, 09:37.
Labels: auto-merge-without-approval, backport, >bug, :ml, Team:ML, v9.0.3