Issue
In some configurations, a control plane node that has lost API server access but can still reach its peer control plane nodes will not remediate, even though the node has been marked as unhealthy.
Scenario:
- 3 control plane nodes
- 3 worker nodes
- control plane node storage partitioned across two disks: one physical, the other served by removable RAID-based storage
- (not sure it matters, but for full context) /var is mapped onto the removable storage, as is the root; /etc, /usr, and other critical paths are on the OS disk
- In this case, we happen to be running rke2 services. With the configuration on our nodes, that puts a number of Kubernetes and related binaries onto the removable storage, such as:
- containerd
- crictl
- kubectl
- runc
Steps to reproduce
- Provision environment with working cluster
- Configure NHC & SNR - in this case the default configuration should be fine. Our goal was to reboot the control plane node to ensure all workloads were moved off of it
- On one of the control plane nodes (we'll call it control1), remove or disable the removable storage
- After this occurs, a significant number of key services go offline due to the removed storage: rke2-server, plus a variety of pods that rely on that storage
- We did find that the self-node-remediation pod stayed alive in this case (we had to inspect it by looking up the host PID for the process and reading its process info under /proc/<pid>/*; a rough sketch of that follows this list)
- We observe that SNR continues to operate, and never remediates (log snippets reproduced below).
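
For reference, the inspection mentioned above looks roughly like the sketch below. It is only illustrative: the PID is made up, and in practice we read the same /proc entries by hand from a host shell, since crictl and kubectl were unavailable.

```go
// Illustrative sketch only: poking at the surviving self-node-remediation
// process via /proc once crictl/kubectl were gone. The PID is hypothetical.
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	pid := "12345" // hypothetical host PID, found with `ps` on the host

	// Resolve the binary the process was started from.
	if exe, err := os.Readlink(filepath.Join("/proc", pid, "exe")); err == nil {
		fmt.Println("exe:", exe)
	} else {
		fmt.Fprintln(os.Stderr, "readlink exe:", err)
	}

	// cmdline is NUL-separated; status carries state, namespaces, etc.
	for _, name := range []string{"cmdline", "status"} {
		data, err := os.ReadFile(filepath.Join("/proc", pid, name))
		if err != nil {
			fmt.Fprintln(os.Stderr, name+":", err)
			continue
		}
		fmt.Printf("--- %s ---\n%s\n", name, bytes.ReplaceAll(data, []byte{0}, []byte{' '}))
	}
}
```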
Detail
We believe that SNR should have remediated, but likely there is a bug in the logic flow. What we are seeing is that SNR is:
- Eventually timing out on API server access
- Getting health statuses from its peers, all indicating unhealthy
- Deciding on "peers did not confirm that we are unhealthy, ignoring error"
- I have no clue how it got from number 2 above to number 3 without additional logging statements being emitted
- It lives in this state forever
It seems that, to get to number 3, it must go through manager.go:66, which returns true if the control plane was reachable. But if that were the case, it would also have had to go through check.go:178, which I should have seen in the log output.
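
To make the hypothesis concrete, here is a heavily simplified sketch of the decision flow we think we are hitting. Every name in it is made up for illustration; this is only our reading of the code around manager.go:66 and check.go:178, not the actual implementation.

```go
// Simplified, hypothetical sketch of the flow we believe we are hitting.
// All identifiers are invented for illustration.
package main

import "fmt"

type peerResponse int

const (
	peerSaysHealthy peerResponse = iota
	peerSaysUnhealthy
	peerUnreachable
)

// isConsideredHealthy models our reading: after the API server check times
// out, the node collects peer verdicts, but a separate "can I reach the other
// control plane nodes?" check appears to short-circuit the decision.
func isConsideredHealthy(apiServerOK bool, peerAnswers []peerResponse, controlPlanePeersReachable bool) bool {
	if apiServerOK {
		return true
	}

	unhealthyVotes := 0
	for _, a := range peerAnswers {
		if a == peerSaysUnhealthy {
			unhealthyVotes++
		}
	}

	// Suspected bug: if the other control plane nodes still answer on the
	// network, the node logs "peers did not confirm that we are unhealthy,
	// ignoring error" and stays up, even though every peer voted unhealthy.
	if controlPlanePeersReachable {
		fmt.Println("peers did not confirm that we are unhealthy, ignoring error")
		return true
	}

	return unhealthyVotes == 0
}

func main() {
	// Our scenario: API server lost, all peers say unhealthy, but the other
	// control plane nodes are reachable -> never remediates.
	healthy := isConsideredHealthy(
		false,
		[]peerResponse{peerSaysUnhealthy, peerSaysUnhealthy, peerSaysUnhealthy},
		true,
	)
	fmt.Println("considered healthy:", healthy)
}
```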
Note: I admit we are using an outdated version of SNR because we are still attempting to get issue #238 merged, and as such we haven't recompiled the SNR binaries for about a year. I did look at changes to the relevant functions between our version and current mainline and didn't see any smoking guns having been fixed since then, but I admit that's not a guarantee.
I've added some additional logging calls in another commit for #238 that would have provided some additional diagnostics and should highlight it better in the future.