
[WIP] Fix issue 251 peer control nodes #255


Open · wants to merge 4 commits into main

Conversation


@mark-dell-usa mark-dell-usa commented Mar 26, 2025

Why we need this PR

A unit test to show the problem in issue #251, as well as a potential fix.

Please feel free to disregard as much or as little of this as you want; hopefully the unit test refactoring is seen as useful, and I believe it sufficiently highlights the core issue. I separated the work into multiple commits so you can look at the branch history and see the individual updates prior to the squash commit that will happen on main. This is intended so you can see a delta of the actual meat and potatoes of the code fix itself, which was relatively small and could be pulled in separately in the worst-case scenario.

Note: I created an official account for my Dell contributions; it's still the same Mark who originally opened issue #251.

Changes made

  • Update selfnoderemediation_controller_test.go to add control plane node configurability. To make this work right I refactored it based on my understanding of ginkgo/gomega best practices, which fixed some setup and teardown issues that I would occasionally see as unit test instabilities.
  • Add a new unit test that reproduces the #251 scenario we discovered in the lab ("Remediation doesn't occur when node can contact peer control plane nodes even if they consider it unhealthy") and currently fails.
  • Implement a possible fix (if I'm understanding the intent right), and show that all new and existing unit tests still pass.

Which issue(s) this PR fixes

Fixes #251

Test plan

  • Added new unit tests.

Summary by CodeRabbit

  • New Features

    • Enhanced test utilities and scenarios for self-node remediation, including support for simulating peer responses, randomized pod IPs, and richer peer communication tests.
    • Introduced custom matchers for improved error handling in tests.
  • Bug Fixes

    • Improved error messages for pod retrieval failures, providing clearer context and preserving error chains.
  • Refactor

    • Modularized and clarified test setup, teardown, and peer simulation logic for better maintainability and extensibility.
    • Refactored peer health check logic to allow for pluggable health status functions, enabling more flexible testing and diagnostics.
  • Style

    • Added detailed logging throughout remediation, peer management, and control plane workflows to improve traceability and observability.
  • Chores

    • Updated and clarified dependency management for required libraries.

Contributor

openshift-ci bot commented Mar 26, 2025

Hi @mark-scott-jr-dell. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Member

@slintes slintes left a comment


Sorry for the late first review, we are pretty busy...

//canOtherControlPlanesBeReached := c.canOtherControlPlanesBeReached()
peersResponse = c.getPeersResponse(peers.ControlPlane)

// MES: This does not appear to have any actual relevance. To me, it appears that all the necessary
Member


IsControlPlaneHealthy() not being relevant is a bold statement ;)

However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

Member


I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

+1

I think this change significantly changes the current logic.
A couple of things I've noticed:

  • In the new code for CP nodes, we completely ignore feedback from worker nodes. For most use cases, worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
  • Diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases.

Author


IsControlPlaneHealthy() not being relevant is a bold statement ;)

Haha, to be fair I did say it "does not APPEAR to have any actual relevance", based on my observations. I definitely left room in there for me to be wrong 😂.

However, before going on with a more detailed review, I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

This would help a lot. My actual code changes were based on how I understood the expected flow to go; I attempted to interpret this based on the intent I saw in the code. My goal was not to change too much and to keep behaviors the same as today, since I personally don't know all the intentions, nor did I find them documented in detail anywhere (correct me if I'm wrong, to be sure!).

Author


I think it makes sense to first write down the expected flow, for worker nodes, for control plane nodes, when API server is available and when not, when we have peers or not, which peers to ask, etc.

+1

I think this change significantly changes the current logic. A couple of things I've noticed:

  • In the new code for CP nodes, we completely ignore feedback from worker nodes. For most use cases, worker nodes can accurately report the status of the CP nodes, and even though I expect the CP peers to report the same, I'm not sure that ignoring the worker peers would be the best option.
  • Diagnostic logic (i.e. isDiagnosticsPassed()) is removed, which means the node can be falsely considered healthy for some use cases.

I'll look back at this later today to respond, especially w.r.t. isDiagnosticsPassed, but I did spend some time walking through the flows and found multiple checks that basically referenced the same data multiple times, so I was attempting to simplify and make it clear what the code was doing. I felt it was visually unclear what was actually going on.

Ultimately, if we use the updated unit test just to prove out the core issue, I'm still good. Our goal was to prove it so that it could be fixed, since it's a PITA to get logs in that case due to the nature of the cluster status at that point, so I personally picked the stretch goal of creating the unit test, which would be better for the long term (in theory).

Author


@mshitrit Not sure if you've had a chance to think about the logic flows here, but if I'm reading things correctly, would an update like this work?

  1. Attempt to get control plane responses
  2. Attempt to get worker responses
  3. Some combination of these should say that the node is healthy. Also add in isDiagnosticPassed

I'm willing to implement it and push a new PR, just want to be sure that I use the flow that you have in mind.

We are looking to pull the latest build whenever this is merged to main, and get some needed CVE fixes and other things, so I'd love to drive this to a close ASAP. I thought I had posted this message a week and a half ago but I guess it went into the ether.

Member

@mshitrit mshitrit Apr 16, 2025


Hi, I think that's pretty close. Writing down both the current flow and what I understand to be the desired flow:

  1. Attempt to get worker responses
  2. If a worker, return that response
  3. Else (assuming it's a control plane)
    1. Get a control plane response
    2. Some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed

IIUC the fix is aiming for this flow (a rough sketch follows below):

  1. Attempt to get control plane responses
  2. Attempt to get worker responses
  3. If a worker: some combination of these should say that the node is healthy (the CP response is only relevant for some use cases of a worker node's healthy response; otherwise it can be ignored)
  4. Else: some combination of these should say that the node is healthy. Also add in isDiagnosticsPassed
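
To make that ordering concrete, here is a minimal, self-contained Go sketch of the desired flow. It is only an illustration under stated assumptions: the Response type, the way the responses are combined, and the boolean diagnostics check are stand-ins, not the PR's actual ApiConnectivityCheck implementation.

package main

import "fmt"

// Response is a simplified stand-in for peers.Response; only the field used
// in this sketch is modeled.
type Response struct{ IsHealthy bool }

// isConsideredHealthySketch mirrors the desired flow described above: gather
// control plane and worker peer responses, let worker peers drive the decision
// for worker nodes, and fold diagnostics in for control plane nodes. How the
// responses are "combined" here is an assumption for illustration only.
func isConsideredHealthySketch(isControlPlane bool, controlPlanePeers, workerPeers Response, diagnosticsPassed bool) bool {
	if !isControlPlane {
		// Worker node: worker peers are authoritative; the control plane
		// response is only consulted as a secondary signal.
		return workerPeers.IsHealthy || controlPlanePeers.IsHealthy
	}
	// Control plane node: the peer opinion must also pass local diagnostics.
	return (workerPeers.IsHealthy || controlPlanePeers.IsHealthy) && diagnosticsPassed
}

func main() {
	// A control plane node whose CP peers say healthy but whose diagnostics fail.
	fmt.Println(isConsideredHealthySketch(true, Response{IsHealthy: true}, Response{IsHealthy: false}, false)) // false
}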

Author


@mshitrit I pushed a set of changes to IsConsideredHealthy which I believe matches what you have described as the desired flow. Let me know your thoughts.

}

func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {
func (c *ApiConnectivityCheck) getPeersResponse(role peers.Role) peers.Response {
Member


I like this refactoring 👍

@mark-dell-usa mark-dell-usa force-pushed the fix_issue_251_peer_control_nodes branch from eb2c8c8 to 6397c3e on April 30, 2025 14:33
Comment on lines +192 to +197
controlPlanePeersResponse := c.getPeersResponse(peers.ControlPlane)

c.config.Log.Info("isConsideredHealthy: since peers think I'm unhealthy, double checking "+
"by returning what the control plane nodes think of my state",
"controlPlanePeersResponse.IsHealthy", controlPlanePeersResponse.IsHealthy)
return controlPlanePeersResponse.IsHealthy
Member


IIUC this basically means that for a worker node the CP peers' response will override the worker peers' response (unless the worker peers' response is healthy).
It completely ignores why the worker response was unhealthy.
TBH I'm having a hard time thinking of an example where this would not work as expected, but ignoring the reason still feels dangerous.

@slintes maybe you have a stronger opinion either way ?

Author


@mshitrit I hear you; I still don't necessarily have a full picture of what the desired behavior is, so I'm trying to translate what I'm hearing from you and what I've seen overall.

I will definitely change it to match what the medik8s team thinks is appropriate. My main goal was to prove the issue with the unit test, and attempt to come in with a solution rather than just toss the problem in your collective laps. The intention is not to change intended behaviors, especially since that could have impacts on existing installations.

Member


We are still discussing details on the PR without knowing what we're aiming at.
Again, can we write down the expected flow first?

#255 (comment)

Author


My apologies, @slintes. I'm honestly looking for y'all's feedback on what the expected flow is. I identified a problem case wherein a control plane node can go down. I didn't fully know what the expected flow was supposed to be, but I did know that in this case it was wrong.

So, the unit test proves the issue, which I believe could affect others utilizing SNR in different ways. I tried to provide a solution, but am totally ok if it's not the right one.

If you are asking me what the expected flow is, I can sit down and draft what I think it should be for the entire system, for CP nodes as well as worker nodes, and am happy to do it, but I was hoping to start with a baseline of what the medik8s team believed the workflow was theoretically supposed to be.

So, if this reply is directed at me, I'd ask that you be more specific: are you asking me to write down the expected flow for the entire system? (Which is what I seem to have gotten incorrect in my proposed solution.)

We are still discussing details on the PR without knowing what we're aiming at. Again, can we write down the expected flow first?

#255 (comment)

If so, I'm up for taking a stab at it, but I don't have the background on why existing decisions were made.

Member


I identified a problem case
the unit test proves the issue

And that's great, thank you 👍🏼
But I think the code changes significantly change the existing flow of things, which warrants some general discussion of what the expected flow should be.

If you are asking me what the expected flow is

No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...

Member

@slintes slintes May 13, 2025


Our docs contain some diagrams. As a first step we can verify whether they are up to date and aligned with the current code, and then whether they still make sense. Is the issue even visible there?

https://www.medik8s.io/remediation/self-node-remediation/how-it-works/

Author


I identified a problem case
the unit test proves the issue

And that's great, thank you 👍🏼 But I think the code changes significantly change the existing flow of things, which warrants some general discussion of what the expected flow should be.

If you are asking me what the expected flow is

No no, it was a general ask, sorry if it sounded inappropriate. I would do it myself if I had more time for this...

Ok, understood.

Our docs contain some diagrams. As a first step we can verify whether they are up to date and aligned with the current code, and then whether they still make sense. Is the issue even visible there?

https://www.medik8s.io/remediation/self-node-remediation/how-it-works/

It had been so long since our initial implementation that I forgot these diagrams exist. Let me review them today and see if I can offer a proposal, given your time constraints, and update the diagrams. Perhaps I can shorten the time required.


coderabbitai bot commented May 15, 2025

Walkthrough

The changes introduce extensive logging and refactoring across the remediation, peer, and API connectivity check logic, as well as major enhancements to the test suite. The peer health check mechanism is now more flexible, supporting injected health-check functions and improved role-based peer querying. Tests are refactored for modularity, richer scenarios, and improved observability.

Changes

  • controllers/selfnoderemediation_controller.go, pkg/controlplane/manager.go, pkg/peers/peers.go: Added detailed logging at key workflow points for remediation phases, node role determination, and peer updates/retrieval. Improved error messages for unknown phases and pod lookup failures.
  • pkg/apicheck/check.go: Refactored peer health check logic: introduced injectable health check function, unified peer querying by role, enhanced logging, and improved health evaluation for both worker and control plane nodes. Added new exported types and methods.
  • controllers/tests/controller/selfnoderemediation_controller_test.go: Major test refactor: modularized setup/teardown, added helper/util functions, expanded unhealthy node scenarios, improved cleanup, and enhanced peer/API server failure simulation. Introduced new types and global variables.
  • controllers/tests/shared/shared.go: Enhanced test utilities: refactored K8s client wrapper for random pod IPs, added peer health check simulation wrapper, new constants, and custom Gomega matchers for Kubernetes errors.
  • controllers/tests/config/suite_test.go, controllers/tests/controller/suite_test.go: Updated API connectivity check configuration to use new default values. Changed instantiation to use a wrapper. Added global variable for API check wrapper.
  • pkg/utils/pods.go: Improved error wrapping and clarity in pod retrieval errors using fmt.Errorf for better diagnostics.
  • go.mod, vendor/modules.txt: Added/updated direct dependency on github.com/google/uuid and included new Gomega gcustom matcher package in vendored modules.
  • vendor/github.com/onsi/gomega/gcustom/make_matcher.go: Introduced new utility for creating custom Gomega matchers with flexible failure message formatting using templates.

Sequence Diagram(s)

sequenceDiagram
    participant Node
    participant ApiConnectivityCheck
    participant Peers
    participant ControlPlaneManager

    Node->>ApiConnectivityCheck: isConsideredHealthy()
    ApiConnectivityCheck->>Peers: getPeersResponse(role)
    Peers-->>ApiConnectivityCheck: Peer addresses
    ApiConnectivityCheck->>ApiConnectivityCheck: getHealthStatusFromPeer (via injected func)
    ApiConnectivityCheck-->>Node: Health status (aggregated)
    Node->>ControlPlaneManager: IsControlPlane()
    ControlPlaneManager-->>Node: Boolean (role)

Assessment against linked issues

Objectives from the linked issue (#251):

  • Ensure remediation occurs when a control plane node loses API server access and peer control plane nodes report it as unhealthy
  • Improve logging and diagnostics to clarify peer health evaluation and remediation decisions
  • Refactor and test peer health check logic to allow for better simulation and testing of peer responses

Poem

A rabbit hops through logs anew,
With peers and nodes in clearer view.
Health checks leap from role to role,
While tests now bound with greater control.
In fields of code, the bugs retreat—
Remediation’s logic, now complete!
🐇✨


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (6)
pkg/apicheck/check.go (3)

209-213: Log message prints the wrong variable

isControlPlaneHealthy is computed a few lines above, but the log prints controlPlanePeersResponse.IsHealthy, duplicating earlier output and hiding the final decision.

- c.config.Log.Info("isConsideredHealthy: we have checkd the control plane peer responses and cross "+
-   "checked it against the control plane diagnostics ",
-   "isControlPlaneHealthy", controlPlanePeersResponse.IsHealthy)
+ c.config.Log.Info("isConsideredHealthy: evaluated peer responses & diagnostics",
+   "isControlPlaneHealthy", isControlPlaneHealthy)

223-225: Misleading log text claims “I consider myself a WORKER” irrespective of actual role

The hard-coded message talks about WORKER even when role == peers.ControlPlane.
Replace the literal with role.String() (or similar) to avoid confusion during incident triage.


386-388: Setter lacks concurrency protection

SetHealthStatusFunc may be called from tests while ApiConnectivityCheck is running in a goroutine.
If that ever happens, the write is unsynchronised with reads in getHealthStatusFromPeer, leading to a data race.
Wrap the field access with the existing mutex or document it as “write-once before Start()”.
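
If synchronisation is preferred over the "write-once before Start()" contract, a mutex-guarded setter could look roughly like the sketch below. The type and field names are simplified assumptions for illustration, not the PR's actual ApiConnectivityCheck code.

package main

import (
	"fmt"
	"sync"
)

// check is a simplified stand-in for ApiConnectivityCheck with an injectable
// health-status function guarded by a read/write mutex.
type check struct {
	mu               sync.RWMutex
	healthStatusFunc func(endpointIP string) int
}

// SetHealthStatusFunc replaces the injected function under the write lock.
func (c *check) SetHealthStatusFunc(f func(endpointIP string) int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.healthStatusFunc = f
}

// getHealthStatusFromPeer snapshots the function under the read lock before calling it.
func (c *check) getHealthStatusFromPeer(endpointIP string) int {
	c.mu.RLock()
	f := c.healthStatusFunc
	c.mu.RUnlock()
	return f(endpointIP)
}

func main() {
	c := &check{}
	c.SetHealthStatusFunc(func(string) int { return 0 }) // 0 standing in for a "healthy" code
	fmt.Println(c.getHealthStatusFromPeer("::1"))
}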

vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)

87-91: Panic message could be clearer

The panic mentions “function that takes one argument and returns (bool, error)”, but omits the possibility of a typed first parameter which you explicitly support.
Consider:

- panic("MakeMatcher must be passed a function that takes one argument and returns (bool, error)")
+ panic("MakeMatcher expects func(<any single param>) (bool, error)")
controllers/tests/controller/selfnoderemediation_controller_test.go (2)

1044-1049: Deep-equality on Node.Status is brittle and frequently fails

reflect.DeepEqual on the whole Status block compares timestamps, resource versions, conditions, etc. that can legitimately change between the expected skeleton returned by getNode and the actual cluster object – even when the node is perfectly “equal” for test purposes.
This can introduce non-deterministic test failures.

Recommendation
• Compare only deterministic fields (e.g., labels, taints) or use a semantic helper such as equality.Semantic.DeepEqual with a well-scoped struct.
• Alternatively, omit Status from the comparison altogether unless a specific field is being asserted.
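
A small sketch of the field-scoped comparison, assuming the standard k8s.io/apimachinery equality helper; the fields compared here are only an example of "deterministic fields", not a prescription for this test.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	expected := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"node-role.kubernetes.io/worker": ""}}}

	actual := expected.DeepCopy()
	// Status drift (conditions, timestamps, etc.) is ignored by the comparison below.
	actual.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady, Status: corev1.ConditionTrue}}

	// Compare only the deterministic fields the test actually cares about.
	sameLabels := equality.Semantic.DeepEqual(expected.Labels, actual.Labels)
	sameTaints := equality.Semantic.DeepEqual(expected.Spec.Taints, actual.Spec.Taints)
	fmt.Println(sameLabels && sameTaints) // true
}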


790-791: Unnecessary time.Sleep slows the suite

A fixed time.Sleep(1 * time.Second) immediately before an Eventually poll adds a full second to every invocation of createGenericSelfNodeRemediationPod.
Eventually already waits until the pod is observed; the explicit sleep can be safely removed to cut runtime.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between ee8985c and 6397c3e.

⛔ Files ignored due to path filters (2)
  • pkg/peerhealth/peerhealth.pb.go is excluded by !**/*.pb.go
  • pkg/peerhealth/peerhealth_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
  • controllers/selfnoderemediation_controller.go (3 hunks)
  • controllers/tests/config/suite_test.go (1 hunks)
  • controllers/tests/controller/selfnoderemediation_controller_test.go (11 hunks)
  • controllers/tests/controller/suite_test.go (3 hunks)
  • controllers/tests/shared/shared.go (3 hunks)
  • go.mod (1 hunks)
  • pkg/apicheck/check.go (5 hunks)
  • pkg/controlplane/manager.go (2 hunks)
  • pkg/peers/peers.go (4 hunks)
  • pkg/utils/pods.go (3 hunks)
  • vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1 hunks)
  • vendor/modules.txt (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
controllers/tests/config/suite_test.go (1)
controllers/tests/shared/shared.go (1)
  • MinPeersForRemediationConfigDefaultValue (47-47)
controllers/tests/controller/suite_test.go (1)
controllers/tests/shared/shared.go (3)
  • ApiConnectivityCheckWrapper (59-65)
  • MinPeersForRemediationConfigDefaultValue (47-47)
  • NewApiConnectivityCheckWrapper (109-131)
controllers/tests/shared/shared.go (4)
pkg/apicheck/check.go (3)
  • ApiConnectivityCheck (28-37)
  • New (59-70)
  • ApiConnectivityCheckConfig (41-57)
api/HealthCheckResponse.go (1)
  • HealthCheckResponseCode (3-3)
pkg/controlplane/manager.go (1)
  • Manager (29-36)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)
  • MakeMatcher (87-128)
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go

110-110: S1023: redundant return statement

(gosimple)

🔇 Additional comments (29)
go.mod (1)

25-25: Explicit dependency declaration for UUID library

Moving the UUID package from indirect to direct dependencies correctly reflects its usage in the codebase, specifically for generating random IPv6 addresses in test utilities.

vendor/modules.txt (1)

144-144: Added Gomega custom matcher package

The addition of github.com/onsi/gomega/gcustom in the vendor modules reflects the use of custom matchers in tests, which provides more flexible failure message formatting.

pkg/controlplane/manager.go (2)

57-60: Enhanced logging for control plane role verification

Adding detailed logging about node roles improves observability when checking if a node is a control plane node, which will help with debugging role-based issues.


138-140: Added node labels logging during role assignment

This logging enhancement provides visibility into node labels during role determination, which is valuable for troubleshooting control plane node identification issues, especially when dealing with peer control nodes.

controllers/tests/config/suite_test.go (1)

129-129:

✅ Verification successful

Updated minimum peers configuration constant

Changed from using shared.MinPeersForRemediation to shared.MinPeersForRemediationConfigDefaultValue (value of 1) to align with the new API connectivity check wrapper's default configuration values.


🏁 Script executed:

#!/bin/bash
# Verify the default value in the shared package
grep -r "MinPeersForRemediationConfigDefaultValue" --include="*.go" controllers/tests/shared/

Length of output: 174


Change Approved: Verified default peers configuration constant
The shared.MinPeersForRemediationConfigDefaultValue constant is defined as 1 in controllers/tests/shared/shared.go, matching the intended default. No further changes required.

  • File updated: controllers/tests/config/suite_test.go (line 129)
pkg/utils/pods.go (3)

5-5: Good addition of the fmt package for improved error handling.

The import of the fmt package is required for the enhanced error wrapping in the error handling sections.


23-24: Excellent improvement to error handling with context.

Using fmt.Errorf with the %w verb provides better error context by wrapping the original error instead of losing the root cause. This change makes debugging much easier by preserving the full error chain.
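
For illustration, a tiny standalone example of the %w pattern; the message wording and node name below are hypothetical, not the exact strings in pkg/utils/pods.go.

package main

import (
	"errors"
	"fmt"
	"io/fs"
)

func main() {
	cause := fs.ErrNotExist // stand-in for the underlying client error

	// Wrapping with %w keeps the original error in the chain.
	err := fmt.Errorf("failed to get self node remediation pod for node %q: %w", "worker-1", cause)

	fmt.Println(err)
	fmt.Println(errors.Is(err, fs.ErrNotExist)) // true: the root cause is still detectable
}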


33-33: Good enhancement to error message clarity.

The improved error message now explicitly includes the node name in the formatted string, making it easier to identify which node is affected when troubleshooting.

controllers/tests/controller/suite_test.go (3)

63-63: Good update to variable type for enhanced testing capabilities.

Changing the type to *shared.ApiConnectivityCheckWrapper allows for more flexible testing by enabling simulation of peer health check responses.


166-166: Updated constant reference for better code maintenance.

Changing from shared.MinPeersForRemediation to shared.MinPeersForRemediationConfigDefaultValue improves code clarity by using a more descriptive constant name and aligns with updated constants in the shared test package.


168-170: Good refactoring to use wrapper for API connectivity checks.

The change from directly using apicheck.New to using shared.NewApiConnectivityCheckWrapper enhances testing capabilities by allowing simulation of peer responses, which is essential for thorough testing of peer control node scenarios.

controllers/selfnoderemediation_controller.go (7)

453-453: Good addition of informative logging at phase start.

This logging statement improves observability by explicitly marking entry into the fencing start phase, making it easier to track the remediation workflow.


456-456: Enhanced logging for pre-reboot phase entry.

This logging statement improves traceability by clearly marking the transition to the pre-reboot completed phase.


459-459: Improved phase transition visibility with logging.

Adding explicit logging for entering the reboot completed phase enhances observability of the remediation workflow.


462-462: Clear logging for fencing completion phase.

This logging statement provides clear indication of reaching the final fencing complete phase in the remediation workflow.


466-466: Enhanced error message with phase value inclusion.

Including the actual phase value in the error message provides more context for troubleshooting unknown phase errors.


500-500: Good indication of pre-reboot completion.

Adding a log statement that clearly indicates when the pre-reboot phase is completed improves workflow visibility.


631-632: Improved error context for node reboot capability.

The expanded error message clearly explains the consequence of the error - that failure to get the agent pod makes the node not reboot capable, which is important context for troubleshooting.

pkg/peers/peers.go (9)

104-116: Good addition of detailed logging and reset functionality.

The refactored updateWorkerPeers method now includes comprehensive logging and a dedicated reset function, improving observability and state management. The use of closures for setter, getter, and reset functions makes the code more modular and testable.
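
As a rough sketch of that closure-based shape (not the actual peers.go code; the signature and logging are simplified assumptions):

package main

import "fmt"

// updatePeersSketch shows one generic update routine that worker and control
// plane peer lists can share by passing in getter, setter, and reset closures.
func updatePeersSketch(get func() []string, set func([]string), reset func(), discovered []string, discoveryFailed bool) {
	if discoveryFailed {
		reset() // clear stale peer state instead of keeping outdated addresses
		return
	}
	old := get()
	set(discovered)
	fmt.Printf("peers updated: %v -> %v\n", old, discovered)
}

func main() {
	var workerPeers []string
	updatePeersSketch(
		func() []string { return workerPeers },
		func(p []string) { workerPeers = p },
		func() { workerPeers = nil },
		[]string{"10.0.0.1", "10.0.0.2"},
		false,
	)
}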


119-134: Well-structured implementation of UpdateControlPlanePeers.

The new implementation follows the same pattern as updateWorkerPeers, providing consistent logging, clear state management, and a dedicated reset function. This parallelism in implementation makes the code more maintainable.


136-137: Good enhancement to updatePeers with reset functionality.

Adding the resetPeers parameter allows for more flexible peer state management, particularly in error recovery scenarios.


150-150: Improved state management with reset function.

Using the dedicated reset function instead of directly manipulating state improves encapsulation and ensures consistent state management across the codebase.


156-156: Enhanced logging of node information.

Adding detailed logging of the nodes found during peer updates improves observability and debugging capabilities, especially for distributed node issues.


179-183: Informative logging for skipped node/pod combinations.

Adding logging when skipping node/pod combinations provides visibility into the peer selection process, which is helpful for understanding which peers are considered during updates.


194-195: Comprehensive peer address logging.

The detailed logging of both worker and control plane peer addresses when retrieving peer addresses significantly improves observability, particularly for debugging peer connectivity issues.


200-200: Improved role-specific peer logging.

Adding specific log messages for worker and control plane peer retrieval requests enhances the clarity of the peer discovery process, making it easier to track role-based peer operations.

Also applies to: 203-203


86-87:

✅ Verification successful

Method name change from updateControlPlanePeers to UpdateControlPlanePeers.

This change from private to public method enhances API accessibility, allowing external components to trigger control plane peer updates. Ensure this change is intentional and that all callers are updated to use the new name.

Also applies to: 92-93


🏁 Script executed:

#!/bin/bash
# Check for all callers of the renamed method
rg -A 2 -B 2 -p "updateControlPlanePeers|UpdateControlPlanePeers" --type go

Length of output: 1869


All callers updated: Public method rename is safe

Confirmed that the private updateControlPlanePeers identifier has been fully replaced by the public UpdateControlPlanePeers across the codebase. No remaining references to the old name were found.

Key call sites:

  • pkg/peers/peers.go: lines 85–87 and 91–93
  • controllers/tests/controller/selfnoderemediation_controller_test.go: lines 1142 & 1147

No further changes required.

controllers/tests/shared/shared.go (1)

101-107: UUID-based IPv6 test addresses can collide across processes

While unlikely, deriving an IPv6 address from the raw UUID bytes does not set the locally-administered bit (RFC 4193) and may occasionally overlap with real cluster traffic if the test runs on the host network.
A safer pattern is to prepend fd00: (ULA range) or use net.ParseIP("::1") for pure unit tests.

Would you like a patch that generates deterministic ULA addresses (fdxx:) instead?

controllers/tests/controller/selfnoderemediation_controller_test.go (1)

1237-1244: Global mutable test state – verify isolation

configureRemediationStrategy mutates the package-level remediationStrategy variable that is also toggled directly in other test contexts.
While Ginkgo executes tests serially by default, parallel runs or future refactors can introduce surprising cross-test coupling.

Please confirm that:

  1. No test suite is executed with -parallel > 1.
  2. No goroutines read this variable concurrently.

If parallelism is anticipated, consider passing the strategy explicitly to helper functions instead of relying on a shared global.
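
A minimal sketch of passing the strategy explicitly instead of mutating the package-level variable; the strategy values and helper name are illustrative assumptions, not the test suite's actual API.

package main

import "fmt"

type remediationStrategy string

const (
	resourceDeletionStrategy  remediationStrategy = "ResourceDeletion"
	outOfServiceTaintStrategy remediationStrategy = "OutOfServiceTaint"
)

// describeSNRWithStrategy builds a description of the remediation object for a
// given strategy, so each test owns its strategy rather than sharing a global.
func describeSNRWithStrategy(nodeName string, strategy remediationStrategy) string {
	return fmt.Sprintf("SelfNodeRemediation{node: %s, strategy: %s}", nodeName, strategy)
}

func main() {
	fmt.Println(describeSNRWithStrategy("worker-1", resourceDeletionStrategy))
	fmt.Println(describeSNRWithStrategy("worker-2", outOfServiceTaintStrategy))
}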

Comment on lines +82 to +89
if kcw.ShouldReturnRandomPodIPs {
logf.Log.Info("Returning random IP addresses for all the pods because ShouldReturnRandomPodIPs is true")

if podList, ok := list.(*corev1.PodList); ok {
assignRandomIpAddressesPods(podList)
}
}
return kcw.Client.List(ctx, list, opts...)


🛠️ Refactor suggestion

Random IP helper forgets to update .Status.PodIP

assignRandomIpAddressesPods() populates PodIPs, but callers relying on the legacy singular field (pod.Status.PodIP) will still observe an empty value.
Several components (including controller-runtime caches and some older network add-ons) still look at PodIP when only one address is expected.
Consider mirroring the first generated address into both fields:

 func assignRandomIpAddressesPods(pods *corev1.PodList) {
   for i := range pods.Items {
-    pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}}
+    randIP := GetRandomIpAddress()
+    pods.Items[i].Status.PodIP = randIP        // keep legacy field in sync
+    pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
   }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if kcw.ShouldReturnRandomPodIPs {
logf.Log.Info("Returning random IP addresses for all the pods because ShouldReturnRandomPodIPs is true")
if podList, ok := list.(*corev1.PodList); ok {
assignRandomIpAddressesPods(podList)
}
}
return kcw.Client.List(ctx, list, opts...)
func assignRandomIpAddressesPods(pods *corev1.PodList) {
for i := range pods.Items {
randIP := GetRandomIpAddress()
pods.Items[i].Status.PodIP = randIP // keep legacy field in sync
pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
}
}
🤖 Prompt for AI Agents
In controllers/tests/shared/shared.go around lines 82 to 89, the function
assignRandomIpAddressesPods updates PodIPs but does not set the legacy singular
field pod.Status.PodIP, which some components still rely on. To fix this, modify
assignRandomIpAddressesPods to also set pod.Status.PodIP to the first IP address
assigned in PodIPs for each pod, ensuring compatibility with components
expecting a single PodIP.

Comment on lines +116 to +128
ckw.ApiConnectivityCheck.SetHealthStatusFunc(func(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
switch {
case ckw.ShouldSimulatePeerResponses:
for _, code := range ckw.SimulatePeerResponses {
results <- code
}

return
default:
ckw.ApiConnectivityCheck.GetDefaultPeerHealthCheckFunc()(endpointIp, results)
break
}
})

⚠️ Potential issue

Deadlock risk: simulated peer sends > 1 message to a bounded channel

getHealthStatusFromPeers() allocates resultsChan with capacity = len(addresses) and waits for exactly one message per peer.
The current override loops over ckw.SimulatePeerResponses and may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.

-       case ckw.ShouldSimulatePeerResponses:
-           for _, code := range ckw.SimulatePeerResponses {
-               results <- code
-           }
-           return
+       case ckw.ShouldSimulatePeerResponses:
+           // Provide exactly one deterministic response per call
+           if len(ckw.SimulatePeerResponses) > 0 {
+               results <- ckw.SimulatePeerResponses[0]
+           } else {
+               results <- selfNodeRemediation.RequestFailed
+           }
+           return

If you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.
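
A sketch of that counter-guarded approach (simplified types, not the wrapper's actual fields): each call hands out exactly one canned response, so the bounded results channel never receives more than one message per peer.

package main

import (
	"fmt"
	"sync"
)

// responseSequence hands out one simulated response code per call, guarded by
// a mutex so concurrent per-peer goroutines don't race on the index.
type responseSequence struct {
	mu        sync.Mutex
	responses []int
	next      int
}

// nextResponse returns the next canned code, repeating the last one once the
// sequence is exhausted (and a default of 0 if none were configured).
func (s *responseSequence) nextResponse() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.responses) == 0 {
		return 0
	}
	if s.next >= len(s.responses) {
		return s.responses[len(s.responses)-1]
	}
	code := s.responses[s.next]
	s.next++
	return code
}

func main() {
	seq := &responseSequence{responses: []int{1, 1, 2}}
	for i := 0; i < 4; i++ {
		fmt.Println(seq.nextResponse()) // 1 1 2 2
	}
}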

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ckw.ApiConnectivityCheck.SetHealthStatusFunc(func(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
switch {
case ckw.ShouldSimulatePeerResponses:
for _, code := range ckw.SimulatePeerResponses {
results <- code
}
return
default:
ckw.ApiConnectivityCheck.GetDefaultPeerHealthCheckFunc()(endpointIp, results)
break
}
})
ckw.ApiConnectivityCheck.SetHealthStatusFunc(func(endpointIp corev1.PodIP, results chan<- selfNodeRemediation.HealthCheckResponseCode) {
switch {
case ckw.ShouldSimulatePeerResponses:
// Provide exactly one deterministic response per call
if len(ckw.SimulatePeerResponses) > 0 {
results <- ckw.SimulatePeerResponses[0]
} else {
results <- selfNodeRemediation.RequestFailed
}
return
default:
ckw.ApiConnectivityCheck.GetDefaultPeerHealthCheckFunc()(endpointIp, results)
break
}
})
🤖 Prompt for AI Agents
In controllers/tests/shared/shared.go around lines 116 to 128, the override
function sends multiple messages to a bounded channel that expects exactly one
message per peer, causing a deadlock. To fix this, modify the function to send
only one message per peer by indexing into ckw.SimulatePeerResponses with a
counter protected by a mutex, ensuring only one code is sent per call to avoid
blocking.

Comment on lines +1197 to +1201
By("Set the existing unhealthy node as a control node", func() {
previousRole := unhealthyNode.Labels[labels2.MasterRole]
unhealthyNode.Labels[labels2.MasterRole] = "true"
Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(), "failed to update unhealthy node")


⚠️ Potential issue

Cleanup doesn’t persist the label rollback – subsequent tests may inherit the control-plane label

Inside configureUnhealthyNodeAsControlNode, the DeferCleanup callback only changes the in-memory unhealthyNode object but never updates the cluster.
If another test starts before this process ends (or if the global state is inspected in verifyCleanState), the node will still carry the node-role.kubernetes.io/master=true label, leading to flaky or misleading results.

-				unhealthyNode.Labels[labels2.MasterRole] = previousRole
+				unhealthyNode.Labels[labels2.MasterRole] = previousRole
+				// Persist the rollback so follow-up tests start from
+				// a clean state
+				Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+					"failed to restore the unhealthy node label after test cleanup")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around
lines 1197 to 1201, the cleanup function modifies the unhealthyNode's labels
only in memory without updating the cluster, causing the label rollback to not
persist. To fix this, update the cluster state by calling k8sClient.Update with
the modified unhealthyNode inside the DeferCleanup callback to ensure the label
removal is persisted and does not affect subsequent tests.

Comment on lines +875 to +885
By(fmt.Sprintf("Attempting to get pod '%s' before deleting it", podName), func() {
if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
err = nil
return
}

var grace client.GracePeriodSeconds = 0
ExpectWithOffset(1, k8sClient.Client.Delete(context.Background(), pod, grace)).To(Succeed())
err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
return
}

⚠️ Potential issue

Variable shadowing hides the named return value and loses context

The inner err := … shadows the function’s named err return value, then the assignment err = … inside the block modifies the shadowed variable, not the function return.
As a result, the caller may receive a nil error even though the Get failed.

-		if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-			if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
+		getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+		if getErr != nil {
+			if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
 				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
-				err = nil
+				err = nil
 				return
 			}
-
-			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
+			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr)
 			return
 		}
🤖 Prompt for AI Agents
In controllers/tests/controller/selfnoderemediation_controller_test.go around
lines 875 to 885, the variable err is redeclared inside the if block, shadowing
the named return err and causing the function to return nil even on failure. To
fix this, remove the short variable declaration (:=) and use assignment (=) for
err inside the if block so that the named return variable is correctly updated
and returned.

@mark-dell-usa mark-dell-usa force-pushed the fix_issue_251_peer_control_nodes branch from 6397c3e to befbcbb on June 23, 2025 14:05
Contributor

openshift-ci bot commented Jun 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mark-dell-usa
Once this PR has been reviewed and has the lgtm label, please assign razo7 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
controllers/tests/shared/shared.go (2)

82-89: Random IP helper forgets to update .Status.PodIP

assignRandomIpAddressesPods() populates PodIPs, but callers relying on the legacy singular field (pod.Status.PodIP) will still observe an empty value.
Several components (including controller-runtime caches and some older network add-ons) still look at PodIP when only one address is expected.
Consider mirroring the first generated address into both fields:

 func assignRandomIpAddressesPods(pods *corev1.PodList) {
   for i := range pods.Items {
-    pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: GetRandomIpAddress()}}
+    randIP := GetRandomIpAddress()
+    pods.Items[i].Status.PodIP = randIP        // keep legacy field in sync
+    pods.Items[i].Status.PodIPs = []corev1.PodIP{{IP: randIP}}
   }
 }

116-128: Deadlock risk: simulated peer sends > 1 message to a bounded channel

getHealthStatusFromPeers() allocates resultsChan with capacity = len(addresses) and waits for exactly one message per peer.
The current override loops over ckw.SimulatePeerResponses and may write multiple codes, causing the goroutine to block on the second send and the whole test to hang.

-       case ckw.ShouldSimulatePeerResponses:
-           for _, code := range ckw.SimulatePeerResponses {
-               results <- code
-           }
-           return
+       case ckw.ShouldSimulatePeerResponses:
+           // Provide exactly one deterministic response per call
+           if len(ckw.SimulatePeerResponses) > 0 {
+               results <- ckw.SimulatePeerResponses[0]
+           } else {
+               results <- selfNodeRemediation.RequestFailed
+           }
+           return

If you need per-peer variance, consider indexing into the slice with a counter guarded by a mutex instead of pushing multiple values at once.

controllers/tests/controller/selfnoderemediation_controller_test.go (2)

875-885: Variable shadowing hides the named return value and loses context

The inner err := … shadows the function's named err return value, then the assignment err = … inside the block modifies the shadowed variable, not the function return.
As a result, the caller may receive a nil error even though the Get failed.

-		if err := k8sClient.Client.Get(context.Background(), podKey, pod); err != nil {
-			if apierrors.IsNotFound(err) && !throwErrorIfNotFound {
+		getErr := k8sClient.Client.Get(context.Background(), podKey, pod)
+		if getErr != nil {
+			if apierrors.IsNotFound(getErr) && !throwErrorIfNotFound {
 				logf.Log.Info("pod with name '%s' not found, we're not going to do anything", podName)
 				err = nil
 				return
 			}
-
-			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it", err)
+			err = fmt.Errorf("unable to get pod with name '%s' in order to delete it: %w", podName, getErr)
 			return
 		}

1197-1201: Cleanup doesn't persist the label rollback – subsequent tests may inherit the control-plane label

Inside configureUnhealthyNodeAsControlNode, the DeferCleanup callback only changes the in-memory unhealthyNode object but never updates the cluster.
If another test starts before this process ends (or if the global state is inspected in verifyCleanState), the node will still carry the node-role.kubernetes.io/master=true label, leading to flaky or misleading results.

-				unhealthyNode.Labels[labels2.MasterRole] = previousRole
+				unhealthyNode.Labels[labels2.MasterRole] = previousRole
+				// Persist the rollback so follow-up tests start from
+				// a clean state
+				Expect(k8sClient.Update(context.TODO(), unhealthyNode)).To(Succeed(),
+					"failed to restore the unhealthy node label after test cleanup")
🧹 Nitpick comments (2)
pkg/apicheck/check.go (2)

72-114: Clean up redundant return and outdated TODO

Two minor improvements:

  1. Remove the redundant return statement at line 110 (flagged by static analysis)
  2. Remove the outdated TODO comment since IPv6 compatibility is confirmed
-		// TODO does this work with IPv6?
-		// MES: Yes it does, we've tested this
 		phClient, err := peerhealth.NewClient(fmt.Sprintf("%v:%v", endpointIp.IP, c.config.PeerHealthPort), c.config.PeerDialTimeout, c.config.Log.WithName("peerhealth client"), c.clientCreds)
 
 		...
 
 		results <- selfNodeRemediation.HealthCheckResponseCode(resp.Status)
-		return
 	}

173-215: Well-structured health check logic with improved observability

The refactored isConsideredHealthy method provides clear separation between worker and control plane logic with excellent logging for debugging. The implementation aligns with the discussed flow from past reviews.

Minor suggestion: Consider extracting the complex boolean expression on line 177 into a named variable for clarity:

-	isWorkerNode := isControlPlaneManagerNil || !c.controlPlaneManager.IsControlPlane()
+	isControlPlane := !isControlPlaneManagerNil && c.controlPlaneManager.IsControlPlane()
+	isWorkerNode := !isControlPlane
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6397c3e and befbcbb.

⛔ Files ignored due to path filters (2)
  • pkg/peerhealth/peerhealth.pb.go is excluded by !**/*.pb.go
  • pkg/peerhealth/peerhealth_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
  • controllers/selfnoderemediation_controller.go (3 hunks)
  • controllers/tests/config/suite_test.go (1 hunks)
  • controllers/tests/controller/selfnoderemediation_controller_test.go (11 hunks)
  • controllers/tests/controller/suite_test.go (3 hunks)
  • controllers/tests/shared/shared.go (3 hunks)
  • go.mod (1 hunks)
  • pkg/apicheck/check.go (5 hunks)
  • pkg/controlplane/manager.go (2 hunks)
  • pkg/peers/peers.go (4 hunks)
  • pkg/utils/pods.go (3 hunks)
  • vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1 hunks)
  • vendor/modules.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • pkg/controlplane/manager.go
  • go.mod
  • pkg/utils/pods.go
  • controllers/tests/config/suite_test.go
  • vendor/modules.txt
  • controllers/tests/controller/suite_test.go
  • controllers/selfnoderemediation_controller.go
  • pkg/peers/peers.go
🧰 Additional context used
🪛 golangci-lint (1.64.8)
pkg/apicheck/check.go

110-110: S1023: redundant return statement

(gosimple)

🔇 Additional comments (8)
vendor/github.com/onsi/gomega/gcustom/make_matcher.go (1)

1-271: Vendor dependency addition looks good

This vendor file adds the gcustom package from Gomega v1.27.0+ which provides custom matcher functionality. The integration with the test suite is appropriate for creating the new error matchers in shared.go.

Note: Static analysis detected a redundant return statement at line 110, but since this is vendor code, it should be addressed upstream rather than modified locally.

controllers/tests/shared/shared.go (1)

167-206: Well-implemented custom error matchers

The custom Gomega matchers for Kubernetes errors are cleanly implemented with proper nil handling. This improves test readability and reusability.
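
For reference, a self-contained example of the gcustom pattern those matchers use; the matcher name and failure message here are illustrative, not the exact ones in shared.go.

package main

import (
	"errors"
	"fmt"

	"github.com/onsi/gomega"
	"github.com/onsi/gomega/gcustom"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func main() {
	g := gomega.NewGomega(func(message string, callerSkip ...int) { panic(message) })

	// A custom matcher built with gcustom.MakeMatcher: a match function plus a
	// human-readable failure message.
	beNotFound := gcustom.MakeMatcher(func(err error) (bool, error) {
		return err != nil && apierrors.IsNotFound(err), nil
	}).WithMessage("be a Kubernetes NotFound error")

	g.Expect(errors.New("boom")).NotTo(beNotFound)
	fmt.Println("matcher evaluated without failure")
}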

pkg/apicheck/check.go (2)

28-40: Excellent refactoring for testability

The introduction of GetHealthStatusFromRemoteFunc and the function injection pattern improves testability while maintaining backward compatibility with the default implementation. This follows the dependency injection principle effectively.

Also applies to: 59-70


217-322: Good refactoring to support role-based peer querying

The generalization from worker-specific to role-based peer querying improves code reusability. The additional logging statements will be valuable for troubleshooting peer communication issues in production.

controllers/tests/controller/selfnoderemediation_controller_test.go (4)

39-82: Excellent test setup refactoring

The restructured setup with descriptive By blocks and comprehensive cleanup improves test maintainability and debugging. The separation of concerns between BeforeEach and JustBeforeEach is well-organized.


448-495: Well-structured test case for issue #251

This test effectively reproduces the control plane peer health check scenario described in issue #251. The setup clearly configures the environment with control plane nodes and simulated unhealthy responses.


1134-1337: Excellent test helper functions

The new helper functions provide clean abstractions for test configuration with proper cleanup. The consistent use of DeferCleanup ensures test isolation, and the descriptive function names make test scenarios self-documenting.


1003-1041: Comprehensive cluster state verification

The enhanced verifyCleanState function thoroughly validates the cluster state after tests, including checking for specific SNR pods by name. This will help catch cleanup issues early.

Comment on lines +101 to +107
func GetRandomIpAddress() (randomIP string) {
u := uuid.New()
ip := net.IP(u[:net.IPv6len])
randomIP = ip.String()

return
}

⚠️ Potential issue

Use proper IPv6 address generation instead of UUID bytes

Converting UUID bytes directly to IPv6 addresses may generate invalid or reserved addresses. Consider using proper IPv6 address generation that respects network standards:

 func GetRandomIpAddress() (randomIP string) {
-    u := uuid.New()
-    ip := net.IP(u[:net.IPv6len])
-    randomIP = ip.String()
+    // Generate a random IPv6 address in the fd00::/8 range (Unique Local Address)
+    bytes := make([]byte, net.IPv6len)
+    bytes[0] = 0xfd // ULA prefix
+    _, _ = rand.Read(bytes[1:])
+    ip := net.IP(bytes)
+    randomIP = ip.String()
     return
 }

This ensures the generated addresses are valid and in a proper test range.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In controllers/tests/shared/shared.go around lines 101 to 107, the current
method generates an IPv6 address by directly converting UUID bytes, which can
produce invalid or reserved addresses. Replace this approach with a proper IPv6
address generation method that respects network standards, such as generating
addresses within a designated test range or using a library function designed
for valid IPv6 creation. This will ensure the generated IPs are valid and
suitable for testing.
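For reference, a self-contained version of the suggested replacement, assuming crypto/rand for the random bytes (the import is not shown in the diff above); only the ULA prefix byte is fixed, the rest is random.

package shared

import (
	"crypto/rand"
	"net"
)

// GetRandomIpAddress returns a random IPv6 address in the fd00::/8 (Unique
// Local Address) range so that generated test addresses stay out of routable space.
func GetRandomIpAddress() (randomIP string) {
	bytes := make([]byte, net.IPv6len)
	bytes[0] = 0xfd // ULA prefix
	_, _ = rand.Read(bytes[1:])
	ip := net.IP(bytes)
	randomIP = ip.String()
	return
}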

@mark-dell-usa
Author

Self Node Remediation - Control Plane Edge Case Analysis

Overview

This document analyzes a critical edge case in the Self Node Remediation (SNR) operator where a control plane node becomes partially degraded but never remediates itself. This occurs when:

  • The control plane's Kubernetes API server is down
  • Most other Kubernetes services (controller-manager, scheduler) are offline
  • The SNR daemonset pod remains alive and functional
  • The kubelet service continues running

The Edge Case Scenario

Initial State

Control Plane Node
├── API Server: DOWN ❌
├── Controller Manager: DOWN ❌
├── Scheduler: DOWN ❌
├── Kubelet: RUNNING ✓
└── SNR Pod: RUNNING ✓

Workflow Diagrams

1. Normal SNR Health Check Flow

flowchart TD
    Start([SNR Pod Health Check Timer]) --> APICheck[Check API Server /readyz]
    APICheck -->|Success| Healthy[Mark as Healthy]
    APICheck -->|Failure| ErrorCount{Error Count > Threshold?}
    ErrorCount -->|No| IncrementError[Increment Error Counter]
    IncrementError --> Wait[Wait for Next Check]
    ErrorCount -->|Yes| PeerCheck[Query Peer Nodes]
    
    PeerCheck --> PeerResponse{Peer Response Analysis}
    PeerResponse -->|Majority say Unhealthy| CreateSNR[Trigger Remediation]
    PeerResponse -->|Majority say Healthy| ResetCounter[Reset Error Counter]
    PeerResponse -->|Most can't reach API| ControlPlaneCheck{Is Control Plane?}
    
    ControlPlaneCheck -->|Worker Node| ConsiderHealthy[Consider Healthy]
    ControlPlaneCheck -->|Control Plane| RunDiagnostics[Run Diagnostics]
    
    RunDiagnostics --> DiagResult{Diagnostics Pass?}
    DiagResult -->|Yes| ConsiderHealthy
    DiagResult -->|No| CreateSNR
    
    ResetCounter --> Wait
    ConsiderHealthy --> Wait
    CreateSNR --> Remediate[Begin Remediation Process]

2. Control Plane Edge Case Flow (The Bug)

flowchart TD
    Start([Control Plane Node:<br/>API Server DOWN]) --> APIFails[API Check Fails Repeatedly]
    APIFails --> QueryPeers[Query Worker Peers]
    
    QueryPeers --> PeerStatus{Worker Peer Responses}
    PeerStatus -->|">50% also can't<br/>reach API Server"| MostCantAccess[Status: HealthyBecauseMostPeersCantAccessAPIServer]
    
    MostCantAccess --> CPDiag[Run Control Plane Diagnostics]
    
    CPDiag --> EndpointCheck{Check Endpoint<br/>Health URL}
    EndpointCheck -->|Not Configured or<br/>Was Never Accessible| EndpointPass[Endpoint Check: PASS]
    
    EndpointPass --> KubeletCheck{Is Kubelet<br/>Running?}
    KubeletCheck -->|Port 10250<br/>Responds| KubeletPass[Kubelet Check: PASS]
    
    KubeletPass --> DiagPass[Diagnostics: PASSED ✓]
    DiagPass --> MarkHealthy[Node Marked as HEALTHY]
    MarkHealthy --> NoRemediation[❌ NO REMEDIATION TRIGGERED]
    
    NoRemediation --> Impact[Control Plane Remains Broken:<br/>- No API Server<br/>- No Scheduling<br/>- No Controllers<br/>- Cluster Partially Down]

3. Peer Health Check Details

sequenceDiagram
    participant CP as Control Plane<br/>(Broken API)
    participant W1 as Worker 1
    participant W2 as Worker 2
    participant W3 as Worker 3
    
    Note over CP: API Check Fails
    CP->>W1: Is my SNR CR present?
    CP->>W2: Is my SNR CR present?
    CP->>W3: Is my SNR CR present?
    
    W1--xCP: Error: Can't reach API
    W2--xCP: Error: Can't reach API
    W3--xCP: Error: Can't reach API
    
    Note over CP: >50% peers have API errors
    Note over CP: Status = HealthyBecauseMostPeersCantAccessAPIServer
    Note over CP: Run Diagnostics:
    Note over CP: ✓ Kubelet Running
    Note over CP: ✓ No Endpoint URL
    Note over CP: = HEALTHY (Bug!)

4. The Diagnostic Gap

flowchart LR
    subgraph "Current Diagnostics"
        D1[Endpoint Health Check]
        D2[Kubelet Service Check]
    end
    
    subgraph "Missing Checks"
        M1[API Server Process]
        M2[Controller Manager]
        M3[Scheduler]
        M4[Etcd Connectivity]
    end
    
    subgraph "Result"
        R1[False Positive:<br/>Node Considered Healthy<br/>Despite Being Non-Functional]
    end
    
    D1 --> R1
    D2 --> R1
    M1 -.->|Should Check| R1
    M2 -.->|Should Check| R1
    M3 -.->|Should Check| R1
    M4 -.->|Should Check| R1

Code Analysis

The Bug Location

File: pkg/controlplane/manager.go

func (manager *Manager) IsControlPlaneHealthy(workerPeersResponse peers.Response, 
    canOtherControlPlanesBeReached bool) bool {
    
    switch workerPeersResponse.ReasonCode {
    // ...
    case peers.HealthyBecauseMostPeersCantAccessAPIServer:
        didDiagnosticsPass := manager.isDiagnosticsPassed()
        manager.log.Info("The peers couldn't access the API server, so we are returning whether "+
            "diagnostics passed", "didDiagnosticsPass", didDiagnosticsPass)
        return didDiagnosticsPass  // <-- BUG: Returns true if kubelet is running
    // ...
    }
}

func (manager *Manager) isDiagnosticsPassed() bool {
    manager.log.Info("Starting control-plane node diagnostics")
    if manager.isEndpointAccessLost() {
        return false
    } else if !manager.isKubeletServiceRunning() {  // <-- Only checks kubelet!
        return false
    }
    manager.log.Info("Control-plane node diagnostics passed successfully")
    return true
}

The Problem

The diagnostics only check:

  1. Endpoint Health: Often not configured or was never accessible
  2. Kubelet Service: Can be running even when API server is dead

They DO NOT check:

  • Is the API server process running?
  • Is the API server responding locally?
  • Are other control plane components functional?

Impact

This edge case creates a situation where:

  1. Cluster Degradation: The control plane node is effectively non-functional
  2. No Self-Healing: SNR never triggers remediation
  3. Manual Intervention Required: Operators must manually detect and fix
  4. Cascading Failures: Other nodes may start failing due to missing control plane

Recommended Solutions

Solution 1: Enhanced Diagnostics

func (manager *Manager) isDiagnosticsPassed() bool {
    // Existing checks...
    
    // Add: Check if API server is running locally
    if !manager.isAPIServerRunningLocally() {
        manager.log.Info("API server is not running locally")
        return false
    }
    
    // Add: Check local API server connectivity
    if !manager.canReachLocalAPIServer() {
        manager.log.Info("Cannot reach local API server")
        return false
    }
    
    return true
}
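One hedged way the second check could look, written as a free function standing in for manager.canReachLocalAPIServer; the https://127.0.0.1:6443/readyz endpoint, the timeout, and the skipped certificate verification are assumptions for this sketch, not code from the PR.

package controlplane

import (
	"crypto/tls"
	"net/http"
	"time"
)

// canReachLocalAPIServer probes the API server on the local host. A control
// plane node whose own API server does not answer /readyz should not pass
// diagnostics merely because kubelet is still running.
func canReachLocalAPIServer() bool {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// only reachability matters for this probe, so certificate
			// verification is skipped in the sketch
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get("https://127.0.0.1:6443/readyz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}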

Solution 2: Control Plane Specific Logic

When HealthyBecauseMostPeersCantAccessAPIServer occurs on a control plane node:

  • Always consider it unhealthy if local API server is not responding
  • Control plane nodes should have stricter health criteria (see the sketch below)
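A minimal sketch of that stricter decision; the function and parameter names are assumptions, and the real logic would live alongside IsControlPlaneHealthy.

package controlplane

// isControlPlaneHealthyStrict refuses to treat "most peers can't reach the API
// server" as healthy on a control plane node unless its own API server responds.
func isControlPlaneHealthyStrict(mostPeersCantAccessAPIServer, localAPIServerResponding, diagnosticsPassed bool) bool {
	if mostPeersCantAccessAPIServer && !localAPIServerResponding {
		// this node hosts the API server; losing it locally is a failure even
		// if kubelet and the SNR pod are still alive
		return false
	}
	return diagnosticsPassed
}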

Solution 3: Timeout-Based Remediation

If a control plane node remains in this state for a configurable duration:

  • Force remediation regardless of diagnostics
  • Prevents indefinite stuck states (see the sketch below)
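A hedged sketch of such a timeout guard; the type, the field names, and the idea of wiring the limit to the operator's configuration are assumptions.

package controlplane

import "time"

// degradedStateTracker remembers when the node entered the ambiguous
// "most peers can't access the API server" state.
type degradedStateTracker struct {
	degradedSince *time.Time
	maxDegraded   time.Duration // configurable limit, e.g. from the operator's config
}

// shouldForceRemediation returns true once the ambiguous state has persisted
// longer than the configured limit, even if diagnostics keep passing.
func (t *degradedStateTracker) shouldForceRemediation(now time.Time, degraded bool) bool {
	if !degraded {
		t.degradedSince = nil
		return false
	}
	if t.degradedSince == nil {
		t.degradedSince = &now
		return false
	}
	return now.Sub(*t.degradedSince) >= t.maxDegraded
}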

Conclusion

The current SNR implementation has a critical gap in control plane node health assessment. When the API server fails but kubelet remains running, the node incorrectly considers itself healthy, preventing automatic remediation. This edge case requires either enhanced diagnostics that specifically check control plane components or a fundamental change in how control plane node health is evaluated during cluster-wide API server outages.

@mark-dell-usa
Author

Here's an updated analysis of what I did; perhaps it will help. I pulled it together with diagrams to make it clear.

https://gist.github.com/mark-dell-usa/576901e5cc420114bcf1076ff5d57f52
