[FLINK-32033][Kubernetes-Operator] Fix Lifecycle Status of FlinkDeployment Resource in case of MISSING/ERROR JM status #997
base: main
Conversation
…SSING/ERROR JM status
…SSING/ERROR JM status with unrecoverable error
```java
&& (error.toLowerCase()
                .contains(
                        "it is possible that the job has finished or terminally failed, or the configmaps have been deleted")
        || error.toLowerCase().contains("manual restore required")
        || error.toLowerCase().contains("ha metadata not available")
        || error.toLowerCase()
                .contains(
                        "ha data is not available to make stateful upgrades"))) {
```
Why are we checking this specific error?
In any case we are the ones triggering this error, so please create a constant in the AbstractJobReconciler and use that here.
@gyfora this seems to be the only case where we know that the cluster cannot recover on its own and needs a manual restore, hence I used this. I will set this as a constant instead for cleaner code.
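As a rough illustration of that suggestion, the hard-coded strings could be hoisted into shared constants and reused in the check. The class and constant names below are assumptions for this sketch, not the operator's actual code:

```java
// Hypothetical sketch (not the merged code): hoisting the matched strings into constants
// as suggested in the review. Class and constant names are assumptions.
public final class TerminalErrorMessages {

    public static final String JOB_FINISHED_OR_CONFIGMAPS_DELETED =
            "It is possible that the job has finished or terminally failed, or the configmaps have been deleted";
    public static final String MANUAL_RESTORE_REQUIRED = "Manual restore required";
    public static final String HA_METADATA_NOT_AVAILABLE = "HA metadata not available";

    private TerminalErrorMessages() {}

    /** True if the error text contains one of the known terminal messages (case-insensitive). */
    public static boolean isTerminalError(String error) {
        String lower = error.toLowerCase();
        return lower.contains(JOB_FINISHED_OR_CONFIGMAPS_DELETED.toLowerCase())
                || lower.contains(MANUAL_RESTORE_REQUIRED.toLowerCase())
                || lower.contains(HA_METADATA_NOT_AVAILABLE.toLowerCase());
    }
}
```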
- There are multiple instances where "HA metadata not available" is written in different forms, like `HA metadata not available` and `HA data is not available`. Should we make these uniform by changing the exception messages to use a constant (now that one is available)?
- Also, `flink-operator-api` currently does not have `flink-operator` as a dependency, so to use the constants in `AbstractJobReconciler` we would have to add it as a dependency, since the status change logic resides in `flink-operator-api`.

Should I still go ahead with this?
If possible let's use a single constant, and we can keep that constant in the operator api module so the reconciler can use it
@gyfora I have added 3 constants for error messages that are frequently used and indicate a terminal state, and referenced them in the reconcilers to maintain uniformity. I have also tried to keep the net changes to a minimum (although a few error messages will differ slightly). Let me know if this looks good.
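A minimal sketch of what referencing such shared constants from a reconciler could look like, reusing the hypothetical `TerminalErrorMessages` holder from the earlier sketch (again an assumption, not the merged code):

```java
// Hypothetical sketch, not the merged change: the reconciler side composes its failure
// message from the same shared constants that the status/lifecycle check matches against.
public class StatefulUpgradeSketch {

    void restoreFromLastState(boolean haMetadataAvailable) {
        if (!haMetadataAvailable) {
            // A plain RuntimeException keeps the sketch self-contained; the operator uses
            // its own exception type (UpgradeFailureException) for this situation.
            throw new RuntimeException(
                    TerminalErrorMessages.HA_METADATA_NOT_AVAILABLE
                            + " to restore from last state. "
                            + TerminalErrorMessages.MANUAL_RESTORE_REQUIRED + ".");
        }
    }
}
```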
…ing constants to maintain uniformity.
What is the purpose of the change
Currently the lifecycle state shows STABLE even if an application deployment was deleted and stays in a MISSING / reconciling state. This change fixes the lifecycle status of the deployment so that it becomes FAILED when the JM is in a MISSING/ERROR state with an error such as "configmaps have been deleted", which indicates a TERMINAL error.
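As a rough sketch of the intended behaviour, assuming a simplified stand-in for the operator's status types (the enum values mirror real states, but the class and method below are illustrative only, not the actual status code):

```java
// Illustrative sketch of the intended mapping only; not the operator's actual implementation.
enum JobManagerDeploymentStatus { READY, DEPLOYING, MISSING, ERROR }
enum ResourceLifecycleState { STABLE, FAILED }

class LifecycleSketch {

    /**
     * A MISSING/ERROR JobManager combined with a terminal (non-recoverable) error should
     * surface as FAILED instead of staying STABLE.
     */
    static ResourceLifecycleState lifecycleState(
            JobManagerDeploymentStatus jmStatus, boolean terminalError) {
        boolean jmGone =
                jmStatus == JobManagerDeploymentStatus.MISSING
                        || jmStatus == JobManagerDeploymentStatus.ERROR;
        return (jmGone && terminalError)
                ? ResourceLifecycleState.FAILED
                : ResourceLifecycleState.STABLE;
    }
}
```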
Brief change log
- For `FlinkDeployment` scenarios with a `JobManagerStatus` error of `({"type":"org.apache.flink.kubernetes.operator.exception.UpgradeFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. ","additionalMetadata":{},"throwableList":[]})` → return the FAILED lifecycle state.

Verifying this change
This change added tests and can be verified as follows:
- Added tests in `ResourceLifecycleMetricsTest` to validate the Lifecycle Status of a `flinkDeployment`:
  - Missing `JobManager Deployment` for a `flinkDeployment` -> caused a MISSING JM Status -> FAILED Lifecycle Status of the `flinkDeployment`
  - JobManager Deployment Status: `DEPLOYING` -> hence the Lifecycle Status is `STABLE`
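A minimal sketch of how those two scenarios could be asserted, using the illustrative `LifecycleSketch` helper from the earlier sketch rather than the actual `ResourceLifecycleMetricsTest`:

```java
// Illustrative assertions against the LifecycleSketch helper above; this is not the
// actual ResourceLifecycleMetricsTest added in the PR.
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class LifecycleSketchTest {

    @Test
    void missingJobManagerWithTerminalErrorIsFailed() {
        assertEquals(
                ResourceLifecycleState.FAILED,
                LifecycleSketch.lifecycleState(JobManagerDeploymentStatus.MISSING, true));
    }

    @Test
    void deployingJobManagerStaysStable() {
        assertEquals(
                ResourceLifecycleState.STABLE,
                LifecycleSketch.lifecycleState(JobManagerDeploymentStatus.DEPLOYING, false));
    }
}
```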
Does this pull request potentially affect one of the following parts:
- `CustomResourceDescriptors`: no

Documentation