[FLINK-32033][Kubernetes-Operator] Fix Lifecycle Status of FlinkDeployment Resource in case of MISSING/ERROR JM status #997

Status: Open · wants to merge 6 commits into main
Conversation

@nishita-09 commented on Jul 11, 2025

What is the purpose of the change

Currently the lifecycle state shows STABLE even after an application deployment has been deleted and the resource stays in a MISSING / reconciling state. This change fixes the lifecycle status of the deployment so that it reports FAILED when the JobManager is in MISSING/ERROR state with an error such as "configmaps have been deleted", which indicates a terminal failure.

Brief change log

  • Added logic to detect unrecoverable FlinkDeployment scenarios
  • Inserted after the existing job status FAILED check
  • Checks the JobManager status (a minimal sketch of the decision follows this list):
    • MISSING JobManager Deployment with a terminal error ({"type":"org.apache.flink.kubernetes.operator.exception.UpgradeFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. ","additionalMetadata":{},"throwableList":[]}) → Return FAILED lifecycle state
    • MISSING JobManager Deployment (no error or a recoverable error) → Return DEPLOYED lifecycle state
    • ERROR JobManager Deployment without a terminal error → Return DEPLOYED lifecycle state
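
For illustration, a minimal, self-contained sketch of the decision described above. The enums and helper below are simplified stand-ins, not the operator's real JobManagerDeploymentStatus / ResourceLifecycleState types and not this PR's exact code; the error strings mirror the ones quoted in this description.

// Hedged sketch: stand-in types, not the actual PR diff.
public class LifecycleStateSketch {

    enum JmStatus { READY, DEPLOYING, MISSING, ERROR }   // stand-in for the JM deployment status
    enum LifecycleState { STABLE, DEPLOYED, FAILED }     // stand-in for the lifecycle state

    static LifecycleState lifecycleStateFor(JmStatus jmStatus, String error) {
        String err = error == null ? "" : error.toLowerCase();
        boolean terminal =
                err.contains("ha metadata not available")
                        || err.contains("manual restore required")
                        || err.contains(
                                "it is possible that the job has finished or terminally failed,"
                                        + " or the configmaps have been deleted");

        if ((jmStatus == JmStatus.MISSING || jmStatus == JmStatus.ERROR) && terminal) {
            return LifecycleState.FAILED;   // unrecoverable: a manual restore is required
        }
        if (jmStatus == JmStatus.MISSING || jmStatus == JmStatus.ERROR) {
            return LifecycleState.DEPLOYED; // recoverable or unknown error
        }
        return LifecycleState.STABLE;       // defer to the existing lifecycle logic otherwise
    }
}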

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests in ResourceLifecycleMetricsTest to validate the lifecycle status of a FlinkDeployment (a test sketch follows this list)
  • Manually verified the change by running a cluster with 3 application Flink clusters:
    • Application Deployment 1 → invalid image → ERROR JobManager deployment status → DEPLOYED lifecycle status of the FlinkDeployment
    • Application Deployment 2 → valid image, JobManager deployment deleted → MISSING JobManager status → FAILED lifecycle status of the FlinkDeployment
    • Application Deployment 3 → valid image, invalid node selector → JobManager deployment status DEPLOYING → hence the lifecycle status is STABLE
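
For illustration only, a hedged sketch of the kind of assertions the new unit tests could make; it is written against the stand-in LifecycleStateSketch from the earlier sketch, not against the real ResourceLifecycleMetricsTest fixtures.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Assumes LifecycleStateSketch from the sketch above is in the same package.
class LifecycleStateSketchTest {

    private static final String TERMINAL_ERROR =
            "HA metadata not available to restore from last state. It is possible that the job"
                    + " has finished or terminally failed, or the configmaps have been deleted.";

    @Test
    void missingJmWithTerminalErrorIsFailed() {
        assertEquals(
                LifecycleStateSketch.LifecycleState.FAILED,
                LifecycleStateSketch.lifecycleStateFor(
                        LifecycleStateSketch.JmStatus.MISSING, TERMINAL_ERROR));
    }

    @Test
    void errorJmWithoutTerminalErrorStaysDeployed() {
        assertEquals(
                LifecycleStateSketch.LifecycleState.DEPLOYED,
                LifecycleStateSketch.lifecycleStateFor(
                        LifecycleStateSketch.JmStatus.ERROR, "ImagePullBackOff on the JM pod"));
    }

    @Test
    void deployingJmStaysStable() {
        assertEquals(
                LifecycleStateSketch.LifecycleState.STABLE,
                LifecycleStateSketch.lifecycleStateFor(
                        LifecycleStateSketch.JmStatus.DEPLOYING, null));
    }
}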
(Screenshots from 2025-07-13 showing the resulting lifecycle status of each test deployment were attached.)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@nishita-09 changed the title [FLINK-32033][Kubernetes-Operator] Fix Lifecycle status in case of MISSING/ERROR JM status [FLINK-32033][Kubernetes-Operator] Fix Lifecycle Status of FlinkDeployment Resource in case of MISSING/ERROR JM status Jul 11, 2025
nishita-pattanayak added 2 commits July 13, 2025 12:57
…SSING/ERROR JM status with unrecoverable error
…SSING/ERROR JM status with unrecoverable error
Comment on lines 103 to 110:

        && (error.toLowerCase()
                        .contains(
                                "it is possible that the job has finished or terminally failed, or the configmaps have been deleted")
                || error.toLowerCase().contains("manual restore required")
                || error.toLowerCase().contains("ha metadata not available")
                || error.toLowerCase()
                        .contains("ha data is not available to make stateful upgrades"))) {
Contributor:
Why are we checking for this specific error?
In any case, we are the ones triggering this error, so please create a constant in the AbstractJobReconciler and use that here.

Author:
@gyfora this seems to be the only case where we know that the cluster cannot recover on its own and needs a manual restore, hence I used this. I will set this as a constant instead for cleaner code.

Author:
@gyfora

  1. There are multiple places where "HA metadata not available" is written in slightly different forms, e.g. "HA metadata not available" and "HA data is not available". Should we make these uniform by changing the exception messages to use a constant (now that one is available)?

  2. Also, flink-operator-api currently does not have flink-operator as a dependency → to use the constants in AbstractJobReconciler we would have to add it as a dependency, since the status change logic resides in flink-operator-api. Should I still go ahead with this?

Contributor:
If possible let's use a single constant, and we can keep that constant in the operator api module so the reconciler can use it

Author:
@gyfora
I have added 3 constants for frequently used error messages that indicate a terminal state, and referenced them in the reconcilers to maintain uniformity. I have also tried to keep the net changes minimal (although a few error messages will differ slightly). Let me know if this looks good.
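
For illustration, a hedged sketch of what such shared error-message constants might look like in the operator api module; the class name and exact strings below are assumptions based on the messages quoted earlier in this thread, not the actual constants added in the PR.

// Hypothetical holder class; the real PR may place these constants elsewhere
// (e.g. an existing exception or utility class in the operator api module).
public final class TerminalErrorMessages {

    // Messages quoted in this thread that indicate the cluster cannot recover on its own.
    public static final String HA_METADATA_NOT_AVAILABLE =
            "HA metadata not available to restore from last state";

    public static final String JOB_FINISHED_OR_CONFIGMAPS_DELETED =
            "It is possible that the job has finished or terminally failed, or the configmaps have been deleted";

    public static final String MANUAL_RESTORE_REQUIRED = "Manual restore required";

    private TerminalErrorMessages() {}
}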

@nishita-09 requested a review from gyfora on July 14, 2025 14:38
nishita-pattanayak added 2 commits July 14, 2025 20:09