[FLINK-32033][Kubernetes-Operator] Fix Lifecycle Status of FlinkDeployment Resource in case of MISSING/ERROR JM status #997

Status: Open · wants to merge 6 commits into main
Conversation

@nishita-09 commented on Jul 11, 2025

What is the purpose of the change

Currently the lifecycle state shows STABLE even after an application deployment has been deleted and the resource stays in a MISSING / reconciling state. This change fixes the lifecycle status of the deployment so that it reports FAILED when the JobManager is in MISSING/ERROR state with an error such as "configmaps have been deleted", which indicates a terminal failure.

Brief change log

  • Added logic to detect unrecoverable FlinkDeployment scenarios
  • Inserted after the existing job status FAILED check
  • Checks the JobManager status (a minimal sketch of the decision follows this list):
    • MISSING JobManager Deployment with a terminal error ({"type":"org.apache.flink.kubernetes.operator.exception.UpgradeFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. ","additionalMetadata":{},"throwableList":[]}) → Return FAILED lifecycle state
    • MISSING JobManager Deployment (no error or a recoverable error) → Return DEPLOYED lifecycle state
    • ERROR JobManager Deployment without a terminal error → Return DEPLOYED lifecycle state
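
For illustration, a minimal, self-contained sketch of the decision described above. The enums and helper below are simplified stand-ins, not the operator's real JobManagerDeploymentStatus / ResourceLifecycleState types and not this PR's exact code; the error strings mirror the ones quoted in this description.

// Hedged sketch: stand-in types, not the actual PR diff.
public class LifecycleStateSketch {

    enum JmStatus { READY, DEPLOYING, MISSING, ERROR }   // stand-in for the JM deployment status
    enum LifecycleState { STABLE, DEPLOYED, FAILED }     // stand-in for the lifecycle state

    static LifecycleState lifecycleStateFor(JmStatus jmStatus, String error) {
        String err = error == null ? "" : error.toLowerCase();
        boolean terminal =
                err.contains("ha metadata not available")
                        || err.contains("manual restore required")
                        || err.contains(
                                "it is possible that the job has finished or terminally failed,"
                                        + " or the configmaps have been deleted");

        if ((jmStatus == JmStatus.MISSING || jmStatus == JmStatus.ERROR) && terminal) {
            return LifecycleState.FAILED;   // unrecoverable: a manual restore is required
        }
        if (jmStatus == JmStatus.MISSING || jmStatus == JmStatus.ERROR) {
            return LifecycleState.DEPLOYED; // recoverable or unknown error
        }
        return LifecycleState.STABLE;       // defer to the existing lifecycle logic otherwise
    }
}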

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests in ResourceLifecycleMetricsTest to validate the lifecycle status of a FlinkDeployment (a test sketch follows this list)
  • Manually verified the change by running a cluster with 3 application Flink clusters:
    • Application Deployment 1 → invalid image → ERROR JobManager deployment status → DEPLOYED lifecycle status of the FlinkDeployment
    • Application Deployment 2 → valid image, JobManager deployment deleted → MISSING JobManager status → FAILED lifecycle status of the FlinkDeployment
    • Application Deployment 3 → valid image, invalid node selector → JobManager deployment status DEPLOYING → hence the lifecycle status is STABLE
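
For illustration only, a hedged sketch of the kind of assertions the new unit tests could make; it is written against the stand-in LifecycleStateSketch from the earlier sketch, not against the real ResourceLifecycleMetricsTest fixtures.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Assumes LifecycleStateSketch from the sketch above is in the same package.
class LifecycleStateSketchTest {

    private static final String TERMINAL_ERROR =
            "HA metadata not available to restore from last state. It is possible that the job"
                    + " has finished or terminally failed, or the configmaps have been deleted.";

    @Test
    void missingJmWithTerminalErrorIsFailed() {
        assertEquals(
                LifecycleStateSketch.LifecycleState.FAILED,
                LifecycleStateSketch.lifecycleStateFor(
                        LifecycleStateSketch.JmStatus.MISSING, TERMINAL_ERROR));
    }

    @Test
    void errorJmWithoutTerminalErrorStaysDeployed() {
        assertEquals(
                LifecycleStateSketch.LifecycleState.DEPLOYED,
                LifecycleStateSketch.lifecycleStateFor(
                        LifecycleStateSketch.JmStatus.ERROR, "ImagePullBackOff on the JM pod"));
    }

    @Test
    void deployingJmStaysStable() {
        assertEquals(
                LifecycleStateSketch.LifecycleState.STABLE,
                LifecycleStateSketch.lifecycleStateFor(
                        LifecycleStateSketch.JmStatus.DEPLOYING, null));
    }
}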
(Screenshots from 2025-07-13 showing the resulting lifecycle status of each test deployment were attached.)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@nishita-09 changed the title [FLINK-32033][Kubernetes-Operator] Fix Lifecycle status in case of MISSING/ERROR JM status [FLINK-32033][Kubernetes-Operator] Fix Lifecycle Status of FlinkDeployment Resource in case of MISSING/ERROR JM status Jul 11, 2025
nishita-pattanayak added 2 commits July 13, 2025 12:57
…SSING/ERROR JM status with unrecoverable error
…SSING/ERROR JM status with unrecoverable error
Comment on lines 103 to 110:

        && (error.toLowerCase()
                        .contains(
                                "it is possible that the job has finished or terminally failed, or the configmaps have been deleted")
                || error.toLowerCase().contains("manual restore required")
                || error.toLowerCase().contains("ha metadata not available")
                || error.toLowerCase()
                        .contains("ha data is not available to make stateful upgrades"))) {
Contributor:
Why are we checking for this specific error?
In any case, we are the ones triggering this error, so please create a constant in the AbstractJobReconciler and use that here.

Author:
@gyfora this seems to be the only case where we know that the cluster cannot recover on its own and needs a manual restore, hence I used this. I will set this as a constant instead for cleaner code.

Author:
@gyfora

  1. There are multiple places where "HA metadata not available" is written in slightly different forms, e.g. "HA metadata not available" and "HA data is not available". Should we make these uniform by changing the exception messages to use a constant (now that one is available)?

  2. Also, flink-operator-api currently does not have flink-operator as a dependency → to use the constants in AbstractJobReconciler we would have to add it as a dependency, since the status change logic resides in flink-operator-api. Should I still go ahead with this?

Contributor:
If possible let's use a single constant, and we can keep that constant in the operator api module so the reconciler can use it

Author:
@gyfora
I have added 3 constants for frequently used error messages that indicate a terminal state, and referenced them in the reconcilers to maintain uniformity. I have also tried to keep the net changes minimal (although a few error messages will differ slightly). Let me know if this looks good.
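
For illustration, a hedged sketch of what such shared error-message constants might look like in the operator api module; the class name and exact strings below are assumptions based on the messages quoted earlier in this thread, not the actual constants added in the PR.

// Hypothetical holder class; the real PR may place these constants elsewhere
// (e.g. an existing exception or utility class in the operator api module).
public final class TerminalErrorMessages {

    // Messages quoted in this thread that indicate the cluster cannot recover on its own.
    public static final String HA_METADATA_NOT_AVAILABLE =
            "HA metadata not available to restore from last state";

    public static final String JOB_FINISHED_OR_CONFIGMAPS_DELETED =
            "It is possible that the job has finished or terminally failed, or the configmaps have been deleted";

    public static final String MANUAL_RESTORE_REQUIRED = "Manual restore required";

    private TerminalErrorMessages() {}
}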

@nishita-09 requested a review from gyfora on July 14, 2025 14:38
nishita-pattanayak added 2 commits July 14, 2025 20:09