Remediation doesn't occur when a node can contact peer control plane nodes, even if they consider it unhealthy #251

@novasbc

Issue

In some configurations, a control plane node that has lost API server access but can still reach its peer control plane nodes will not remediate itself, even though it has been marked unhealthy.

Scenario:

  • 3 control plane nodes
  • 3 worker nodes
  • control plane node storage split across two disks: one physical, the other served by removable RAID-based storage
  • (not sure it matters, but for full context) /var, as well as the root, is mapped onto the removable storage; /etc, /usr, and other critical paths are on the OS disk
    • In this case we happen to be running rke2 services; with the configuration on our nodes, that puts a number of Kubernetes and related binaries onto the removable storage, such as:
      • containerd
      • crictl
      • kubectl
      • runc

Steps to reproduce

  • Provision environment with working cluster
  • Configure NHC & SNR; in this case the default configuration should be fine. Our goal was to reboot the control plane node to ensure all workloads were moved off of it
  • On one of the control plane nodes (we'll call it control1), remove or disable the removable storage
  • After this occurs, a significant number of key services go offline due to the removed storage: rke2-server, plus a variety of pods that rely on that storage
  • We did find that the self-node-remediation pod stayed alive in this case (it has to be inspected by looking up the host PID for the process and reading its process info in /proc/n/*; see the sketch after this list)
  • We observe that SNR continues to operate and never remediates (log snippets reproduced below).
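
For reference, the host-side lookup we used to inspect that surviving process looks roughly like the sketch below (assumed names; this is not tooling that ships with the project):

```go
// findproc.go - minimal sketch: locate the surviving self-node-remediation
// process on the host by scanning /proc, since kubectl/crictl live on the
// removed storage and are unavailable.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // skip non-PID entries such as /proc/meminfo
		}
		cmdline, err := os.ReadFile(filepath.Join("/proc", e.Name(), "cmdline"))
		if err != nil {
			continue // process may have exited between ReadDir and ReadFile
		}
		// Arguments in /proc/<pid>/cmdline are NUL-separated.
		args := strings.ReplaceAll(string(cmdline), "\x00", " ")
		if strings.Contains(args, "self-node-remediation") {
			fmt.Printf("pid %d: %s\n", pid, args)
		}
	}
}
```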

Detail

We believe that SNR should have remediated, but there is likely a bug in the logic flow. What we are seeing is that SNR is:

  1. Eventually timing out on API server access
  2. Getting health statuses from peers, all indicating unhealthy
  3. Deciding on "peers did not confirm that we are unhealthy, ignoring error"
    • I have no clue how it got from step 2 above to step 3 without additional logging statements being emitted
    • It lives in this state forever

It seems that, to get to step 3, it must go through manager.go:66, which returns true if the control plane was reachable. But if that were the case, it would also have had to go through check.go:178, which I should have seen in the log output.
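
For illustration, the flow we suspect looks roughly like the sketch below. The names are hypothetical and this is a paraphrase, not the actual SNR code; the point is the suspected short-circuit where reachability of peer control plane nodes is treated as "healthy" before the peers' actual verdicts are counted:

```go
// Hypothetical reconstruction of the suspected flow; names are illustrative,
// not taken from the SNR source.
package snrsketch

type peerVerdict int

const (
	verdictHealthy peerVerdict = iota
	verdictUnhealthy
)

// isConsideredHealthy sketches the path we think is being taken: once the
// API server times out, the mere reachability of peer control plane nodes
// wins, so the "peers did not confirm that we are unhealthy, ignoring error"
// branch is reached even though every peer answered unhealthy.
func isConsideredHealthy(apiServerErr error, verdicts []peerVerdict, isControlPlane, controlPlanePeersReachable bool) bool {
	if apiServerErr == nil {
		return true // API server reachable, nothing to remediate
	}

	// Suspected short-circuit: peer control plane nodes can be contacted at
	// all, so the node is treated as healthy regardless of what they reported.
	if isControlPlane && controlPlanePeersReachable {
		return true
	}

	// The peer verdicts would only be counted on the path below.
	unhealthy := 0
	for _, v := range verdicts {
		if v == verdictUnhealthy {
			unhealthy++
		}
	}
	return unhealthy == 0
}
```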

Note: I admit we are using an outdated version of SNR because we are still attempting to get issue #238 merged, and as such we haven't recompiled the SNR binaries for about a year. I did compare the relevant functions between our version and current mainline and didn't see any smoking gun having been fixed since then, but I admit that's not a guarantee.

I've added some additional logging calls in another commit for #238 that would have provided some additional diagnostics and should highlight this better in the future.
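
As a rough idea of what those calls capture (hypothetical names and log keys; the real change is in the commit referenced above, and the standard-library logger is used here only to keep the sketch self-contained):

```go
// Illustration only: log the raw inputs to the health verdict so the jump
// from "all peers report unhealthy" to "peers did not confirm that we are
// unhealthy" is traceable in snr.log.
package main

import (
	"errors"
	"log/slog"
)

func logVerdictInputs(apiServerErr error, isControlPlane, controlPlanePeersReachable bool, peersUnhealthy, peersTotal int) {
	slog.Info("evaluating health verdict",
		"apiServerError", apiServerErr,
		"isControlPlane", isControlPlane,
		"controlPlanePeersReachable", controlPlanePeersReachable,
		"peersReportingUnhealthy", peersUnhealthy,
		"peersTotal", peersTotal)
}

func main() {
	// The state we observed: API server timed out, all three peers report
	// unhealthy, yet peer control plane nodes are still reachable.
	logVerdictInputs(errors.New("api server access timed out"), true, true, 3, 3)
}
```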

snr.log
