[WIP] Fix issue 251 peer control nodes #255
Conversation
Hi @mark-scott-jr-dell. Thanks for your PR. I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Sorry for the late first review, we are pretty busy...
pkg/apicheck/check.go
Outdated
//canOtherControlPlanesBeReached := c.canOtherControlPlanesBeReached()
peersResponse = c.getPeersResponse(peers.ControlPlane)

// MES: This does not appear to have any actual relevance. To me, it appears that all the necessary
IsControlPlaneHealthy() not being relevant is a bold statement ;)
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
cc @mshitrit
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly changes current logic.
Couple of things I've noticed:
- In the new code for CP nodes we completely ignore feedback of worker nodes. For most use cases, worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
- diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases
IsControlPlaneHealthy() not being relevant is a bold statement ;)
Haha, I did say "does not APPEAR to have any actual relevance", to be fair, based on my observations. I definitely left room in there for me to be wrong 😂.
However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
This would help a lot. My actual code changes were based on how I understood the expected flow; I attempted to interpret it from the intention I saw in the code. My goal was to avoid changing too much and to keep today's behaviors the same, since I personally don't know all the intentions, nor did I find them documented in detail anywhere (correct me if I'm wrong, to be sure!).
I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.
+1
I think this change significantly changes current logic. Couple of things I've noticed:
- In the new code for CP nodes we completely ignore feedback of worker nodes. For most use cases, worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
- diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases
I'll look back at this later today to respond, especially w.r.t. isDiagnosticsPassed, but I did spend some time walking through the flows and found multiple checks that basically referenced the same data multiple times, so I was attempting to simplify it so it was clear what the code was doing. I felt it was visually unclear what was actually going on.
Ultimately, if we use the updated unit test just to prove out the core issue, I'm still good - our goal was to prove it so that it could be fixed, since it's a PITA to get logs in that case due to the nature of the cluster status at that point, so I personally picked the stretch goal of creating the unit test which would be better for the long term (in theory).
@mshitrit Not sure if you've had a chance to think about the logic flows here, but, if I'm reading things correctly, an update like this?
- Attempt to get control plane responses
- Attempt to get worker responses
- Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed
I'm willing to implement it and push a new PR, just want to be sure that I use the flow that you have in mind.
We are looking to pull the latest build whenever this is merged to main, and get some needed CVE fixes and other things, so I'd love to drive this to a close ASAP. I thought I had posted this message a week and a half ago but I guess it went into the ether.
Hi, I think that's pretty close. Writing down both the current flow and what I understand to be the desired flow:
- Attempt to get worker responses
- If a worker, return that response
- else (assuming it's a control plane)
- get a control plane response
- Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed
IIUC the fix is aiming for this flow:
- Attempt to get control plane responses
- Attempt to get worker responses
- If a worker: some combination of these should say that the node is healthy (the CP response is only relevant for some use cases of a worker node's healthy response, otherwise it can be ignored)
- Else: some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed (a rough sketch of this flow follows below)
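To make sure I'm reading this flow the same way, here is a rough Go-style sketch (illustration only, not the implementation: getPeersResponse, IsControlPlane and IsControlPlaneHealthy are names from this PR, while peers.Worker and the exact way the two responses get combined are my assumptions):

// Sketch of the flow described above; the combination logic is an assumption.
func (c *ApiConnectivityCheck) isConsideredHealthySketch() bool {
	controlPlanePeersResponse := c.getPeersResponse(peers.ControlPlane)
	workerPeersResponse := c.getPeersResponse(peers.Worker) // peers.Worker assumed to exist

	if !c.controlPlaneManager.IsControlPlane() {
		// Worker node: worker peers are the primary signal; the control plane
		// response would only matter for specific unhealthy reason codes.
		return workerPeersResponse.IsHealthy || controlPlanePeersResponse.IsHealthy
	}

	// Control plane node: combine peer feedback with the local diagnostics
	// (IsControlPlaneHealthy calls isDiagnosticsPassed internally).
	peersThinkImHealthy := workerPeersResponse.IsHealthy || controlPlanePeersResponse.IsHealthy
	return peersThinkImHealthy &&
		c.controlPlaneManager.IsControlPlaneHealthy(workerPeersResponse, controlPlanePeersResponse.IsHealthy)
}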
@mshitrit I pushed a set of changes to IsConsideredHealthy which I believe matches what you have described as the desired flow. Let me know your thoughts.
}

func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {
func (c *ApiConnectivityCheck) getPeersResponse(role peers.Role) peers.Response {
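(For readers skimming the thread, this is the shape of the change: one role-parameterized helper instead of a per-role one. A minimal, hypothetical call-site illustration; peers.Worker is assumed to exist alongside peers.ControlPlane shown above.)

// Before: a dedicated helper per role.
//   workerResponse := c.getWorkerPeersResponse()
// After: one helper, parameterized by role.
workerResponse := c.getPeersResponse(peers.Worker)             // peers.Worker assumed
controlPlaneResponse := c.getPeersResponse(peers.ControlPlane) // role shown in this diff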
I like this refactoring 👍
eb2c8c8 to 6397c3e (Compare)
controlPlanePeersResponse := c.getPeersResponse(peers.ControlPlane)

c.config.Log.Info("isConsideredHealthy: since peers think I'm unhealthy, double checking "+
	"by returning what the control plane nodes think of my state",
	"controlPlanePeersResponse.IsHealthy", controlPlanePeersResponse.IsHealthy)
return controlPlanePeersResponse.IsHealthy
IIUC basically this means that for a worker node the CP peers response will override the Worker peers response (unless worker peers response is healthy).
It completely ignores why the worker response was unhealthy.
TBH I'm having a hard time thinking of an example where this would not work as expected, but ignoring the reason still feels dangerous.
@slintes maybe you have a stronger opinion either way ?
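For anyone following along, a hedged sketch of the worker-node behavior being questioned here, reconstructed from the diff above (not the final code):

// Assumption based on the diff above: an unhealthy worker-peer verdict is
// overridden by the control plane peers, regardless of *why* the workers
// said unhealthy.
if workerPeersResponse.IsHealthy {
	return true
}
controlPlanePeersResponse := c.getPeersResponse(peers.ControlPlane)
// workerPeersResponse.ReasonCode is never inspected past this point.
return controlPlanePeersResponse.IsHealthy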
@mshitrit I hear you. I still don't necessarily have a full picture of the fully desired behavior, so I'm trying to translate what I'm hearing from you and what I've seen overall.
Definitely will change it to match what the medik8s team thinks is appropriate. My main goal was to prove the issue with the unit test, and attempt to come in with a solution rather than just toss the problem in your collective laps. Intention is not to change intended behaviors, especially since it could have impacts on existing installations.
We are still discussing details on the PR without knowing what we're aiming at.
Again, can we write down the expected flow first?
My apologies, @slintes. I'm honestly looking for y'all's feedback on what the expected flow is. I identified a problem case wherein a control plane node can go down. I didn't fully know what the expected flow was supposed to be, but I did know that in this case it was wrong.
So, the unit test proves the issue, which I believe could affect others utilizing SNR in different ways. I tried to provide some solution, but am totally ok if it's not the right solution.
If you are asking me what the expected flow is, I can sit down and draft out what I think it should be for the entire system, for CP nodes as well as worker nodes, and am happy to do it, but I was hoping to start with a baseline of what the medik8s team believed the workflow was theoretically supposed to be.
So, if this reply is directed at me, I'd ask that you be more specific: are you asking me to write down the expected flow for the entire system? (Which is what I seem to have possibly gotten incorrect in my proposed solution.)
We are still discussing details on the PR without knowing what we're aiming at. Again, can we write down the expected flow first?
If so, I'm up for taking a stab at it, but I don't have the background of why existing decisions were made.
I identified a problem case
the unit test proves the issue
And that's great, thank you 👍🏼
But I think the code changes significantly change the existing flow of things, which deserves some general discussion of what the expected flow should be.
If you are asking me what the expected flow is
No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...
Our docs contain some diagrams. As a first step we can verify if they are up to date and aligned with the current code. And then if they still make sense. Is the issue even visible there?
https://www.medik8s.io/remediation/self-node-remediation/how-it-works/
I identified a problem case
the unit test proves the issue
And that's great, thank you 👍🏼 But I think the code changes significantly change the existing flow of things, which deserves some general discussion of what the expected flow should be.
If you are asking me what the expected flow is
No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...
Ok, understood.
Our docs contain some diagrams. As a first step we can verify if they are up to date and aligned with the current code. And then if they still make sense. Is the issue even visible there?
https://www.medik8s.io/remediation/self-node-remediation/how-it-works/
It had been so long since our initial implementation that I forgot these diagrams exist. Let me review them today and see if I can offer a proposal, given your time constraints, and update the diagrams. Perhaps I can shorten the time required.
Walkthrough
The changes introduce extensive logging and refactoring across the remediation, peer, and API connectivity check logic, as well as major enhancements to the test suite. The peer health check mechanism is now more flexible, supporting injected health-check functions and improved role-based peer querying. Tests are refactored for modularity, richer scenarios, and improved observability.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Node
participant ApiConnectivityCheck
participant Peers
participant ControlPlaneManager
Node->>ApiConnectivityCheck: isConsideredHealthy()
ApiConnectivityCheck->>Peers: getPeersResponse(role)
Peers-->>ApiConnectivityCheck: Peer addresses
ApiConnectivityCheck->>ApiConnectivityCheck: getHealthStatusFromPeer (via injected func)
ApiConnectivityCheck-->>Node: Health status (aggregated)
Node->>ControlPlaneManager: IsControlPlane()
ControlPlaneManager-->>Node: Boolean (role)
Actionable comments posted: 4
🧹 Nitpick comments (6)
pkg/apicheck/check.go (3)
209-213: Log message prints the wrong variable
isControlPlaneHealthy is computed a few lines above, but the log prints controlPlanePeersResponse.IsHealthy, duplicating earlier output and hiding the final decision.
- c.config.Log.Info("isConsideredHealthy: we have checkd the control plane peer responses and cross "+
-     "checked it against the control plane diagnostics ",
-     "isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy)
+ c.config.Log.Info("isConsideredHealthy: evaluated peer responses & diagnostics",
+     "isControlPlaneHealthy", isControlPlaneHealthy)
223-225: Misleading log text claims "I consider myself a WORKER" irrespective of actual role
The hard-coded message talks about WORKER even when role == peers.ControlPlane.
Replace the literal with role.String() (or similar) to avoid confusion during incident triage.
386-388: Setter lacks concurrency protection
SetHealthStatusFunc may be called from tests while ApiConnectivityCheck is running in a goroutine.
If that ever happens, the write is unsynchronised with reads in getHealthStatusFromPeer, leading to a data race.
Wrap the field access with the existing mutex or document it as "write-once before Start()".
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
87-91: Panic message could be clearer
The panic mentions "function that takes one argument and returns (bool, error)", but omits the possibility of a typed first parameter which you explicitly support.
Consider:
- panic("MakeMatcher must be passed a function that takes one argument and returns (bool, error)")
+ panic("MakeMatcher expects func(<any single param>) (bool, error)")
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
1044-1049: Deep-equality on Node.Status is brittle and frequently fails
reflect.DeepEqual on the whole Status block compares timestamps, resource versions, conditions, etc. that can legitimately change between the expected skeleton returned by getNode and the actual cluster object – even when the node is perfectly "equal" for test purposes.
This can introduce non-deterministic test failures.
Recommendation
• Compare only deterministic fields (e.g., labels, taints) or use a semantic helper such as equality.Semantic.DeepEqual with a well-scoped struct.
• Alternatively, omit Status from the comparison altogether unless a specific field is being asserted.
790-791: Unnecessary time.Sleep slows the suite
A fixed time.Sleep(1 * time.Second) immediately before an Eventually poll adds a full second to every invocation of createGenericSelfNodeRemediationPod.
Eventually already waits until the pod is observed; the explicit sleep can be safely removed to cut runtime.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (2)
pkg/peerhealth/peerhealth.pb.go is excluded by !**/*.pb.go
pkg/peerhealth/peerhealth_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
controllers/selfnoderemediation_controller.go (3 hunks)
controllers/tests/config/suite_test.go (1 hunks)
controllers/tests/controller/selfnoderemediation_controller_test.go (11 hunks)
controllers/tests/controller/suite_test.go (3 hunks)
controllers/tests/shared/shared.go (3 hunks)
go.mod (1 hunks)
pkg/apicheck/check.go (5 hunks)
pkg/controlplane/manager.go (2 hunks)
pkg/peers/peers.go (4 hunks)
pkg/utils/pods.go (3 hunks)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1 hunks)
vendor/modules.txt (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
controllers/tests/config/suite_test.go (1)
  controllers/tests/shared/shared.go (1)
    MinPeersForRemediationConfigDefaultValue (47-47)
controllers/tests/controller/suite_test.go (1)
  controllers/tests/shared/shared.go (3)
    ApiConnectivityCheckWrapper (59-65)
    MinPeersForRemediationConfigDefaultValue (47-47)
    NewApiConnectivityCheckWrapper (109-131)
controllers/tests/shared/shared.go (4)
  pkg/apicheck/check.go (3)
    ApiConnectivityCheck (28-37)
    New (59-70)
    ApiConnectivityCheckConfig (41-57)
  api/HealthCheckResponse.go (1)
    HealthCheckResponseCode (3-3)
  pkg/controlplane/manager.go (1)
    Manager (29-36)
  vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
    MakeMatcher (87-128)
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go
110-110: S1023: redundant return statement (gosimple)
🔇 Additional comments (29)
go.mod (1)
25-25: Explicit dependency declaration for UUID library
Moving the UUID package from indirect to direct dependencies correctly reflects its usage in the codebase, specifically for generating random IPv6 addresses in test utilities.
vendor/modules.txt (1)
144-144: Added Gomega custom matcher package
The addition of github.com/onsi/gomega/gcustom in the vendor modules reflects the use of custom matchers in tests, which provides more flexible failure message formatting.
pkg/controlplane/manager.go (2)
57-60: Enhanced logging for control plane role verification
Adding detailed logging about node roles improves observability when checking if a node is a control plane node, which will help with debugging role-based issues.
138-140: Added node labels logging during role assignment
This logging enhancement provides visibility into node labels during role determination, which is valuable for troubleshooting control plane node identification issues, especially when dealing with peer control nodes.
controllers/tests/config/suite_test.go (1)
129-129: ✅ Verification successful
Updated minimum peers configuration constant
Changed from using shared.MinPeersForRemediation to shared.MinPeersForRemediationConfigDefaultValue (value of 1) to align with the new API connectivity check wrapper's default configuration values.
🏁 Script executed:
#!/bin/bash
# Verify the default value in the shared package
grep -r "MinPeersForRemediationConfigDefaultValue" --include="*.go" controllers/tests/shared/
Length of output: 174
Change Approved: Verified default peers configuration constant
The shared.MinPeersForRemediationConfigDefaultValue constant is defined as 1 in controllers/tests/shared/shared.go, matching the intended default. No further changes required.
- File updated: controllers/tests/config/suite_test.go (line 129)
pkg/utils/pods.go (3)
5-5: Good addition of the fmt package for improved error handling.
The import of the fmt package is required for the enhanced error wrapping in the error handling sections.
23-24: Excellent improvement to error handling with context.
Using fmt.Errorf with the %w verb provides better error context by wrapping the original error instead of losing the root cause. This change makes debugging much easier by preserving the full error chain.
33-33: Good enhancement to error message clarity.
The improved error message now explicitly includes the node name in the formatted string, making it easier to identify which node is affected when troubleshooting.
controllers/tests/controller/suite_test.go (3)
controllers/tests/controller/suite_test.go (3)
63-63
: Good update to variable type for enhanced testing capabilities.Changing the type to
*shared.ApiConnectivityCheckWrapper
allows for more flexible testing by enabling simulation of peer health check responses.
166-166
: Updated constant reference for better code maintenance.Changing from
shared.MinPeersForRemediation
toshared.MinPeersForRemediationConfigDefaultValue
improves code clarity by using a more descriptive constant name and aligns with updated constants in the shared test package.
168-170
: Good refactoring to use wrapper for API connectivity checks.The change from directly using
apicheck.New
to usingshared.NewApiConnectivityCheckWrapper
enhances testing capabilities by allowing simulation of peer responses, which is essential for thorough testing of peer control node scenarios.controllers/selfnoderemediation_controller.go (7)
453-453
: Good addition of informative logging at phase start.This logging statement improves observability by explicitly marking entry into the fencing start phase, making it easier to track the remediation workflow.
456-456
: Enhanced logging for pre-reboot phase entry.This logging statement improves traceability by clearly marking the transition to the pre-reboot completed phase.
459-459
: Improved phase transition visibility with logging.Adding explicit logging for entering the reboot completed phase enhances observability of the remediation workflow.
462-462
: Clear logging for fencing completion phase.This logging statement provides clear indication of reaching the final fencing complete phase in the remediation workflow.
466-466
: Enhanced error message with phase value inclusion.Including the actual phase value in the error message provides more context for troubleshooting unknown phase errors.
500-500
: Good indication of pre-reboot completion.Adding a log statement that clearly indicates when the pre-reboot phase is completed improves workflow visibility.
631-632
: Improved error context for node reboot capability.The expanded error message clearly explains the consequence of the error - that failure to get the agent pod makes the node not reboot capable, which is important context for troubleshooting.
pkg/peers/peers.go (9)
104-116
: Good addition of detailed logging and reset functionality.The refactored
updateWorkerPeers
method now includes comprehensive logging and a dedicated reset function, improving observability and state management. The use of closures for setter, getter, and reset functions makes the code more modular and testable.
119-134
: Well-structured implementation ofUpdateControlPlanePeers
.The new implementation follows the same pattern as
updateWorkerPeers
, providing consistent logging, clear state management, and a dedicated reset function. This parallelism in implementation makes the code more maintainable.
136-137
: Good enhancement toupdatePeers
with reset functionality.Adding the
resetPeers
parameter allows for more flexible peer state management, particularly in error recovery scenarios.
150-150
: Improved state management with reset function.Using the dedicated reset function instead of directly manipulating state improves encapsulation and ensures consistent state management across the codebase.
156-156
: Enhanced logging of node information.Adding detailed logging of the nodes found during peer updates improves observability and debugging capabilities, especially for distributed node issues.
179-183
: Informative logging for skipped node/pod combinations.Adding logging when skipping node/pod combinations provides visibility into the peer selection process, which is helpful for understanding which peers are considered during updates.
194-195
: Comprehensive peer address logging.The detailed logging of both worker and control plane peer addresses when retrieving peer addresses significantly improves observability, particularly for debugging peer connectivity issues.
200-200
: Improved role-specific peer logging.Adding specific log messages for worker and control plane peer retrieval requests enhances the clarity of the peer discovery process, making it easier to track role-based peer operations.
Also applies to: 203-203
86-87
:✅ Verification successful
Method name change from
updateControlPlanePeers
toUpdateControlPlanePeers
.This change from private to public method enhances API accessibility, allowing external components to trigger control plane peer updates. Ensure this change is intentional and that all callers are updated to use the new name.
Also applies to: 92-93
🏁 Script executed:
#!/bin/bash # Check for all callers of the renamed method rg -A 2 -B 2 -p "updateControlPlanePeers|UpdateControlPlanePeers" --type goLength of output: 1869
All callers updated: Public method rename is safe
Confirmed that the private
updateControlPlanePeers
identifier has been fully replaced by the publicUpdateControlPlanePeers
across the codebase. No remaining references to the old name were found.Key call sites:
- pkg/peers/peers.go: lines 85–87 and 91–93
- controllers/tests/controller/selfnoderemediation_controller_test.go: lines 1142 & 1147
No further changes required.
controllers/tests/shared/shared.go (1)
101-107
: UUID-based IPv6 test addresses can collide across processes
While unlikely, deriving an IPv6 address from the raw UUID bytes does not set the locally-administered bit (RFC 4193) and may occasionally overlap with real cluster traffic if the test runs on the host network.
A safer pattern is to prepend fd00: (ULA range) or use net.ParseIP("::1") for pure unit tests.
Would you like a patch that generates deterministic ULA addresses (fdxx:) instead?
controllers/tests/controller/selfnoderemediation_controller_test.go (1)
1237-1244
: Global mutable test state – verify isolation
configureRemediationStrategy mutates the package-level remediationStrategy variable that is also toggled directly in other test contexts.
While Ginkgo executes tests serially by default, parallel runs or future refactors can introduce surprising cross-test coupling.
Please confirm that:
- No test suite is executed with -parallel > 1.
- No goroutines read this variable concurrently.
If parallelism is anticipated, consider passing the strategy explicitly to helper functions instead of relying on a shared global.
if kcw.ShouldReturnRandomPodIPs {
	logf.Log.Info("Returning random IP addresses for all the pods because ShouldReturnRandomPodIPs is true")

	if podList, ok := list.(*corev1.PodList); ok {
		assignRandomIpAddressesPods(podList)
	}
}
return kcw.Client.List(ctx, list, opts...)
🛠️ Refactor suggestion
Random IP helper forgets to update .Status.PodIP
assignRandomIpAddressesPods() populates PodIPs, but callers relying on the legacy singular field (pod.Status.PodIP) will still observe an empty value.
Several components (including controller-runtime caches and some older network add-ons) still look at PodIP when only one address is expected.
Consider mirroring the first generated address into both fields:
func assignRandomIpAddressesPods(pods *corev1.PodList) {
for i := range pods.Items {
- pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}}
+ randIP := GetRandomIpAddress()
+ pods.Items[i].Status.PodIP = randIP // keep legacy field in sync
+ pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
}
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if kcw.ShouldReturnRandomPodIPs {
	logf.Log.Info("Returning random IP addresses for all the pods because ShouldReturnRandomPodIPs is true")
	if podList, ok := list.(*corev1.PodList); ok {
		assignRandomIpAddressesPods(podList)
	}
}
return kcw.Client.List(ctx, list, opts...)

func assignRandomIpAddressesPods(pods *corev1.PodList) {
	for i := range pods.Items {
		randIP := GetRandomIpAddress()
		pods.Items[i].Status.PodIP = randIP // keep legacy field in sync
		pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
	}
}
🤖 Prompt for AI Agents
In controllers/tests/shared/shared.go around lines 82 to 89, the function
assignRandomIpAddressesPods updates PodIPs but does not set the legacy singular
field pod.Status.PodIP, which some components still rely on. To fix this, modify
assignRandomIpAddressesPods to also set pod.Status.PodIP to the first IP address
assigned in PodIPs for each pod, ensuring compatibility with components
expecting a single PodIP.
ckw.ApiConnectivityCheck.SetHealthStatusFunc(func(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
	switch {
	case ckw.ShouldSimulatePeerResponses:
		for _, code := range ckw.SimulatePeerResponses {
			results <- code
		}

		return
	default:
		ckw.ApiConnectivityCheck.GetDefaultPeerHealthCheckFunc()(endpointIp, results)
		break
	}
})
Deadlock risk: simulated peer sends > 1 message to a bounded channel
getHealthStatusFromPeers() allocates resultsChan with capacity = len(addresses) and waits for exactly one message per peer.
The current override loops over ckw.SimulatePeerResponses and may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.
- case ckw.ShouldSimulatePeerResponses:
- for _, code := range ckw.SimulatePeerResponses {
- results <- code
- }
- return
+ case ckw.ShouldSimulatePeerResponses:
+ // Provide exactly one deterministic response per call
+ if len(ckw.SimulatePeerResponses) > 0 {
+ results <- ckw.SimulatePeerResponses[0]
+ } else {
+ results <- selfNodeRemediation.RequestFailed
+ }
+ return
If you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.
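A hedged sketch of that mutex-guarded counter variant; simulateMu and simulateIdx are hypothetical fields on the wrapper, named here only for illustration:

ckw.ApiConnectivityCheck.SetHealthStatusFunc(func(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
	if ckw.ShouldSimulatePeerResponses {
		// Hypothetical fields: a sync.Mutex and an int index on the wrapper.
		ckw.simulateMu.Lock()
		idx := ckw.simulateIdx
		ckw.simulateIdx++
		ckw.simulateMu.Unlock()

		// Exactly one send per peer, so the bounded results channel never blocks.
		if idx < len(ckw.SimulatePeerResponses) {
			results <- ckw.SimulatePeerResponses[idx]
		} else {
			results <- selfNodeRemediation.RequestFailed
		}
		return
	}
	ckw.ApiConnectivityCheck.GetDefaultPeerHealthCheckFunc()(endpointIp, results)
})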
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
ckw.ApiConnectivityCheck.SetHealthStatusFunc(func(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
	switch {
	case ckw.ShouldSimulatePeerResponses:
		for _, code := range ckw.SimulatePeerResponses {
			results <- code
		}
		return
	default:
		ckw.ApiConnectivityCheck.GetDefaultPeerHealthCheckFunc()(endpointIp, results)
		break
	}
})

ckw.ApiConnectivityCheck.SetHealthStatusFunc(func(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
	switch {
	case ckw.ShouldSimulatePeerResponses:
		// Provide exactly one deterministic response per call
		if len(ckw.SimulatePeerResponses) > 0 {
			results <- ckw.SimulatePeerResponses[0]
		} else {
			results <- selfNodeRemediation.RequestFailed
		}
		return
	default:
		ckw.ApiConnectivityCheck.GetDefaultPeerHealthCheckFunc()(endpointIp, results)
		break
	}
})
🤖 Prompt for AI Agents
In controllers/tests/shared/shared.go around lines 116 to 128, the override
function sends multiple messages to a bounded channel that expects exactly one
message per peer, causing a deadlock. To fix this, modify the function to send
only one message per peer by indexing into ckw.SimulatePeerResponses with a
counter protected by a mutex, ensuring only one code is sent per call to avoid
blocking.
By("Set the existing unhealthy node as a control node", func() { | ||
previousRole := unhealthyNode.Labels[labels2.MasterRole] | ||
unhealthyNode.Labels[labels2.MasterRole] = "true" | ||
Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(), "failed to update unhealthy node") | ||
|
Cleanup doesn’t persist the label rollback – subsequent tests may inherit the control-plane label
Inside configureUnhealthyNodeAsControlNode, the DeferCleanup callback only changes the in-memory unhealthyNode object but never updates the cluster.
If another test starts before this process ends (or if the global state is inspected in verifyCleanState), the node will still carry the node-role.kubernetes.io/master=true label, leading to flaky or misleading results.
- unhealthyNode.Labels[labels2.MasterRole] = previousRole
+ unhealthyNode.Labels[labels2.MasterRole] = previousRole
+ // Persist the rollback so follow-up tests start from
+ // a clean state
+ Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+ "failed to restore the unhealthy node label after test cleanup")
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around
lines 1197 to 1201, the cleanup function modifies the unhealthyNode's labels
only in memory without updating the cluster, causing the label rollback to not
persist. To fix this, update the cluster state by calling k8sClient.Update with
the modified unhealthyNode inside the DeferCleanup callback to ensure the label
removal is persisted and does not affect subsequent tests.
By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() { | ||
if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil { | ||
if apierrors.IsNotFound(err) && !throwErrorIfNotFound { | ||
logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName) | ||
err = nil | ||
return | ||
} | ||
|
||
var grace client.GracePeriodSeconds = 0 | ||
ExpectWithOffset(1, k8sClient.Client.Delete(context.Background(), pod, grace)).To(Succeed()) | ||
err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err) | ||
return | ||
} |
Variable shadowing hides the named return value and loses context
The inner err := … shadows the function's named err return value, and the assignment err = … inside the block then modifies the shadowed variable, not the function return.
As a result, the caller may receive a nil error even though the Get failed.
- if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
- if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
+ getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+ if getErr != nil {
+ if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
- err = nil
+ err = nil
return
}
-
- err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
+ err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr)
return
}
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around
lines 875 to 885, the variable err is redeclared inside the if block, shadowing
the named return err and causing the function to return nil even on failure. To
fix this, remove the short variable declaration (:=) and use assignment (=) for
err inside the if block so that the named return variable is correctly updated
and returned.
…#251 exists and is a real problem.
…y and getWorkerPeersResponse for issue medik8s#251
6397c3e to befbcbb (Compare)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mark-dell-usa
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Actionable comments posted: 1
♻️ Duplicate comments (4)
controllers/tests/shared/shared.go (2)
82-89
: Random IP helper forgets to update.Status.PodIP
assignRandomIpAddressesPods()
populatesPodIPs
, but callers relying on the legacy singular field (pod.Status.PodIP
) will still observe an empty value.
Several components (including controller-runtime caches and some older network add-ons) still look atPodIP
when only one address is expected.
Consider mirroring the first generated address into both fields:func assignRandomIpAddressesPods(pods *corev1.PodList) { for i := range pods.Items { - pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}} + randIP := GetRandomIpAddress() + pods.Items[i].Status.PodIP = randIP // keep legacy field in sync + pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}} } }
116-128
: Deadlock risk: simulated peer sends > 1 message to a bounded channel
getHealthStatusFromPeers()
allocatesresultsChan
withcapacity = len(addresses)
and waits for exactly one message per peer.
The current override loops overckw.SimulatePeerResponses
and may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.- case ckw.ShouldSimulatePeerResponses: - for _, code := range ckw.SimulatePeerResponses { - results <- code - } - return + case ckw.ShouldSimulatePeerResponses: + // Provide exactly one deterministic response per call + if len(ckw.SimulatePeerResponses) > 0 { + results <- ckw.SimulatePeerResponses[0] + } else { + results <- selfNodeRemediation.RequestFailed + } + returnIf you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.
controllers/tests/controller/selfnoderemediation_controller_test.go (2)
875-885
: Variable shadowing hides the named return value and loses contextThe inner
err := …
shadows the function's namederr
return value, then the assignmenterr = …
inside the block modifies the shadowed variable, not the function return.
As a result, the caller may receive anil
error even though the Get failed.- if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil { - if apierrors.IsNotFound(err) && !throwErrorIfNotFound { + getErr := k8sClient.Client.Get(context.Background(), podKey, pod) + if getErr != nil { + if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound { logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName) err = nil return } - - err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err) + err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr) return }
1197-1201
: Cleanup doesn't persist the label rollback – subsequent tests may inherit the control-plane labelInside
configureUnhealthyNodeAsControlNode
, theDeferCleanup
callback only changes the in-memoryunhealthyNode
object but never updates the cluster.
If another test starts before this process ends (or if the global state is inspected inverifyCleanState
), the node will still carry thenode-role.kubernetes.io/master=true
label, leading to flaky or misleading results.- unhealthyNode.Labels[labels2.MasterRole] = previousRole + unhealthyNode.Labels[labels2.MasterRole] = previousRole + // Persist the rollback so follow-up tests start from + // a clean state + Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(), + "failed to restore the unhealthy node label after test cleanup")
🧹 Nitpick comments (2)
pkg/apicheck/check.go (2)
72-114: Clean up redundant return and outdated TODO
Two minor improvements:
- Remove the redundant return statement at line 110 (flagged by static analysis)
- Remove the outdated TODO comment since IPv6 compatibility is confirmed
- // TODO does this work with IPv6?
- // MES: Yes it does, we've tested this
  phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds)
  ...
  results <- selfNodeRemediation.HealthCheckResponseCode(resp.Status)
- return
  }
173-215: Well-structured health check logic with improved observability
The refactored isConsideredHealthy method provides clear separation between worker and control plane logic with excellent logging for debugging. The implementation aligns with the discussed flow from past reviews.
Minor suggestion: Consider extracting the complex boolean expression on line 177 into a named variable for clarity:
- isWorkerNode := isControlPlaneManagerNil || !c.controlPlaneManager.IsControlPlane()
+ isControlPlane := !isControlPlaneManagerNil && c.controlPlaneManager.IsControlPlane()
+ isWorkerNode := !isControlPlane
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
pkg/peerhealth/peerhealth.pb.go is excluded by !**/*.pb.go
pkg/peerhealth/peerhealth_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
controllers/selfnoderemediation_controller.go (3 hunks)
controllers/tests/config/suite_test.go (1 hunks)
controllers/tests/controller/selfnoderemediation_controller_test.go (11 hunks)
controllers/tests/controller/suite_test.go (3 hunks)
controllers/tests/shared/shared.go (3 hunks)
go.mod (1 hunks)
pkg/apicheck/check.go (5 hunks)
pkg/controlplane/manager.go (2 hunks)
pkg/peers/peers.go (4 hunks)
pkg/utils/pods.go (3 hunks)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1 hunks)
vendor/modules.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- pkg/controlplane/manager.go
- go.mod
- pkg/utils/pods.go
- controllers/tests/config/suite_test.go
- vendor/modules.txt
- controllers/tests/controller/suite_test.go
- controllers/selfnoderemediation_controller.go
- pkg/peers/peers.go
🧰 Additional context used
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go
110-110: S1023: redundant return statement (gosimple)
🔇 Additional comments (8)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
1-271
: Vendor dependency addition looks goodThis vendor file adds the
gcustom
package from Gomega v1.27.0+ which provides custom matcher functionality. The integration with the test suite is appropriate for creating the new error matchers inshared.go
.Note: Static analysis detected a redundant
return
statement at line 110, but since this is vendor code, it should be addressed upstream rather than modified locally.controllers/tests/shared/shared.go (1)
167-206
: Well-implemented custom error matchersThe custom Gomega matchers for Kubernetes errors are cleanly implemented with proper nil handling. This improves test readability and reusability.
pkg/apicheck/check.go (2)
28-40
: Excellent refactoring for testabilityThe introduction of
GetHealthStatusFromRemoteFunc
and the function injection pattern improves testability while maintaining backward compatibility with the default implementation. This follows the dependency injection principle effectively.Also applies to: 59-70
217-322
: Good refactoring to support role-based peer queryingThe generalization from worker-specific to role-based peer querying improves code reusability. The additional logging statements will be valuable for troubleshooting peer communication issues in production.
controllers/tests/controller/selfnoderemediation_controller_test.go (4)
39-82
: Excellent test setup refactoringThe restructured setup with descriptive
By
blocks and comprehensive cleanup improves test maintainability and debugging. The separation of concerns betweenBeforeEach
andJustBeforeEach
is well-organized.
448-495
: Well-structured test case for issue #251This test effectively reproduces the control plane peer health check scenario described in issue #251. The setup clearly configures the environment with control plane nodes and simulated unhealthy responses.
1134-1337
: Excellent test helper functionsThe new helper functions provide clean abstractions for test configuration with proper cleanup. The consistent use of
DeferCleanup
ensures test isolation, and the descriptive function names make test scenarios self-documenting.
1003-1041
: Comprehensive cluster state verificationThe enhanced
verifyCleanState
function thoroughly validates the cluster state after tests, including checking for specific SNR pods by name. This will help catch cleanup issues early.
func GetRandomIpAddress() (randomIP string) {
	u := uuid.New()
	ip := net.IP(u[:net.IPv6len])
	randomIP = ip.String()

	return
}
Use proper IPv6 address generation instead of UUID bytes
Converting UUID bytes directly to IPv6 addresses may generate invalid or reserved addresses. Consider using proper IPv6 address generation that respects network standards:
func GetRandomIpAddress() (randomIP string) {
- u := uuid.New()
- ip := net.IP(u[:net.IPv6len])
- randomIP = ip.String()
+ // Generate a random IPv6 address in the fd00::/8 range (Unique Local Address)
+ bytes := make([]byte, net.IPv6len)
+ bytes[0] = 0xfd // ULA prefix
+ _, _ = rand.Read(bytes[1:])
+ ip := net.IP(bytes)
+ randomIP = ip.String()
return
}
This ensures the generated addresses are valid and in a proper test range.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In controllers/tests/shared/shared.go around lines 101 to 107, the current
method generates an IPv6 address by directly converting UUID bytes, which can
produce invalid or reserved addresses. Replace this approach with a proper IPv6
address generation method that respects network standards, such as generating
addresses within a designated test range or using a library function designed
for valid IPv6 creation. This will ensure the generated IPs are valid and
suitable for testing.
Self Node Remediation - Control Plane Edge Case Analysis
Overview
This document analyzes a critical edge case in the Self Node Remediation (SNR) operator where a control plane node becomes partially degraded but never remediates itself. This occurs when:
The Edge Case Scenario
Initial State
Workflow Diagrams
1. Normal SNR Health Check Flow
flowchart TD
Start([SNR Pod Health Check Timer]) --> APICheck[Check API Server /readyz]
APICheck -->|Success| Healthy[Mark as Healthy]
APICheck -->|Failure| ErrorCount{Error Count > Threshold?}
ErrorCount -->|No| IncrementError[Increment Error Counter]
IncrementError --> Wait[Wait for Next Check]
ErrorCount -->|Yes| PeerCheck[Query Peer Nodes]
PeerCheck --> PeerResponse{Peer Response Analysis}
PeerResponse -->|Majority say Unhealthy| CreateSNR[Trigger Remediation]
PeerResponse -->|Majority say Healthy| ResetCounter[Reset Error Counter]
PeerResponse -->|Most can't reach API| ControlPlaneCheck{Is Control Plane?}
ControlPlaneCheck -->|Worker Node| ConsiderHealthy[Consider Healthy]
ControlPlaneCheck -->|Control Plane| RunDiagnostics[Run Diagnostics]
RunDiagnostics --> DiagResult{Diagnostics Pass?}
DiagResult -->|Yes| ConsiderHealthy
DiagResult -->|No| CreateSNR
ResetCounter --> Wait
ConsiderHealthy --> Wait
CreateSNR --> Remediate[Begin Remediation Process]
2. Control Plane Edge Case Flow (The Bug)
flowchart TD
Start([Control Plane Node:<br/>API Server DOWN]) --> APIFails[API Check Fails Repeatedly]
APIFails --> QueryPeers[Query Worker Peers]
QueryPeers --> PeerStatus{Worker Peer Responses}
PeerStatus -->|">50% also can't<br/>reach API Server"| MostCantAccess[Status: HealthyBecauseMostPeersCantAccessAPIServer]
MostCantAccess --> CPDiag[Run Control Plane Diagnostics]
CPDiag --> EndpointCheck{Check Endpoint<br/>Health URL}
EndpointCheck -->|Not Configured or<br/>Was Never Accessible| EndpointPass[Endpoint Check: PASS]
EndpointPass --> KubeletCheck{Is Kubelet<br/>Running?}
KubeletCheck -->|Port 10250<br/>Responds| KubeletPass[Kubelet Check: PASS]
KubeletPass --> DiagPass[Diagnostics: PASSED ✓]
DiagPass --> MarkHealthy[Node Marked as HEALTHY]
MarkHealthy --> NoRemediation[❌ NO REMEDIATION TRIGGERED]
NoRemediation --> Impact[Control Plane Remains Broken:<br/>- No API Server<br/>- No Scheduling<br/>- No Controllers<br/>- Cluster Partially Down]
3. Peer Health Check Details
sequenceDiagram
participant CP as Control Plane<br/>(Broken API)
participant W1 as Worker 1
participant W2 as Worker 2
participant W3 as Worker 3
Note over CP: API Check Fails
CP->>W1: Is my SNR CR present?
CP->>W2: Is my SNR CR present?
CP->>W3: Is my SNR CR present?
W1--xCP: Error: Can't reach API
W2--xCP: Error: Can't reach API
W3--xCP: Error: Can't reach API
Note over CP: >50% peers have API errors
Note over CP: Status = HealthyBecauseMostPeersCantAccessAPIServer
Note over CP: Run Diagnostics:
Note over CP: ✓ Kubelet Running
Note over CP: ✓ No Endpoint URL
Note over CP: = HEALTHY (Bug!)
4. The Diagnostic Gap
flowchart LR
subgraph "Current Diagnostics"
D1[Endpoint Health Check]
D2[Kubelet Service Check]
end
subgraph "Missing Checks"
M1[API Server Process]
M2[Controller Manager]
M3[Scheduler]
M4[Etcd Connectivity]
end
subgraph "Result"
R1[False Positive:<br/>Node Considered Healthy<br/>Despite Being Non-Functional]
end
D1 --> R1
D2 --> R1
M1 -.->|Should Check| R1
M2 -.->|Should Check| R1
M3 -.->|Should Check| R1
M4 -.->|Should Check| R1
Code Analysis
The Bug Location
File:
func (manager *Manager) IsControlPlaneHealthy(workerPeersResponse peers.Response,
canOtherControlPlanesBeReached bool) bool {
switch workerPeersResponse.ReasonCode {
// ...
case peers.HealthyBecauseMostPeersCantAccessAPIServer:
didDiagnosticsPass := manager.isDiagnosticsPassed()
manager.log.Info("The peers couldn't access the API server, so we are returning whether "+
"diagnostics passed", "didDiagnosticsPass", didDiagnosticsPass)
return didDiagnosticsPass // <-- BUG: Returns true if kubelet is running
// ...
}
}
func (manager *Manager) isDiagnosticsPassed() bool {
manager.log.Info("Starting control-plane node diagnostics")
if manager.isEndpointAccessLost() {
return false
} else if !manager.isKubeletServiceRunning() { // <-- Only checks kubelet!
return false
}
manager.log.Info("Control-plane node diagnostics passed successfully")
return true
}
The Problem
The diagnostics only check:
They DO NOT check:
Impact
This edge case creates a situation where:
Recommended Solutions
Solution 1: Enhanced Diagnostics
func (manager *Manager) isDiagnosticsPassed() bool {
// Existing checks...
// Add: Check if API server is running locally
if !manager.isAPIServerRunningLocally() {
manager.log.Info("API server is not running locally")
return false
}
// Add: Check local API server connectivity
if !manager.canReachLocalAPIServer() {
manager.log.Info("Cannot reach local API server")
return false
}
return true
}
Solution 2: Control Plane Specific Logic
When
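The gist text for this solution is truncated above, so the following is only a hedged sketch of what control-plane-specific logic might look like, reusing names from elsewhere in this document (canReachLocalAPIServer comes from Solution 1; the default branch is an assumption):

// Sketch only: when most peers cannot reach the API server, a control plane
// node does not trust kubelet-only diagnostics; it also requires its own
// API server to answer before declaring itself healthy.
func (manager *Manager) isControlPlaneHealthySketch(workerPeersResponse peers.Response) bool {
	switch workerPeersResponse.ReasonCode {
	case peers.HealthyBecauseMostPeersCantAccessAPIServer:
		return manager.isDiagnosticsPassed() && manager.canReachLocalAPIServer()
	default:
		// Assumption: other reason codes keep their current handling.
		return workerPeersResponse.IsHealthy
	}
}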
Solution 3: Timeout-Based Remediation
If a control plane node remains in this state for a configurable duration:
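The list items for this solution are also truncated, so this is only a hedged sketch of the timeout idea; stateEnteredAt and apiOutageGracePeriod are hypothetical fields used purely for illustration:

// Sketch only: stop treating "most peers can't reach the API server" as
// healthy once a configurable grace period has elapsed.
if workerPeersResponse.ReasonCode == peers.HealthyBecauseMostPeersCantAccessAPIServer {
	if manager.stateEnteredAt.IsZero() {
		manager.stateEnteredAt = time.Now() // hypothetical field
	}
	if time.Since(manager.stateEnteredAt) > manager.apiOutageGracePeriod { // hypothetical config
		return false // let remediation proceed after the grace period
	}
	return manager.isDiagnosticsPassed()
}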
Conclusion
The current SNR implementation has a critical gap in control plane node health assessment. When the API server fails but kubelet remains running, the node incorrectly considers itself healthy, preventing automatic remediation. This edge case requires either enhanced diagnostics that specifically check control plane components or a fundamental change in how control plane node health is evaluated during cluster-wide API server outages.
Find an updated analysis of what I did here; perhaps it will help. I attempted to pull it together with diagrams to make it clear. https://gist.github.com/mark-dell-usa/576901e5cc420114bcf1076ff5d57f52
Why we need this PR
Unit test to show problem with Issue #251 as well as a potential fix.
Please feel free to disregard as much or as little of this as you want; hopefully the unit test refactoring is seen as useful, but I believe it sufficiently highlights the core issue. I separated it out into multiple commits so you can look at the branch history and see the individual updates prior to the squash commit that will happen to main. This is intended so you can see a delta of the actual meat & potatoes of the code fix itself, which was relatively small and could be pulled in separately in the worst-case scenario.
Note: I created an official account for my Dell contributions; I'm still the same Mark who originally opened issue #251, though.
Changes made
Which issue(s) this PR fixes
Fixes #251
Test plan