Remediation doesn't occur when a node can contact peer control plane nodes, even if they consider it unhealthy #251

@novasbc

Issue

In some configurations, a control plane node that has lost API server access but can still reach its peer control plane nodes will not remediate itself, even though it has been marked unhealthy.

Scenario:

  • 3 control plane nodes
  • 3 worker nodes
  • control plane node storage split across two disks: one physical, the other served by removable RAID-based storage
  • (not sure it matters, but for full context) /var, as well as the root, is mapped onto the removable storage; /etc, /usr, and other critical paths are on the OS disk
    • In this case we happen to be running rke2 services; with the configuration on our nodes, that puts a number of Kubernetes and related binaries onto the removable storage, such as:
      • containerd
      • crictl
      • kubectl
      • runc

Steps to reproduce

  • Provision environment with working cluster
  • Configure NHC & SNR; in this case the default configuration should be fine. Our goal was to reboot the control plane node to ensure all workloads were moved off of it
  • On one of the control plane nodes (we'll call it control1), remove or disable the removable storage
  • After this occurs, a significant number of key services go offline due to the removed storage: rke2-server, plus a variety of pods that rely on that storage
  • We did find that the self-node-remediation pod stayed alive in this case (it has to be inspected by looking up the host PID for the process and reading its process info in /proc/n/*; see the sketch after this list)
  • We observe that SNR continues to operate and never remediates (log snippets reproduced below).
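
For reference, the host-side lookup we used to inspect that surviving process looks roughly like the sketch below (assumed names; this is not tooling that ships with the project):

```go
// findproc.go - minimal sketch: locate the surviving self-node-remediation
// process on the host by scanning /proc, since kubectl/crictl live on the
// removed storage and are unavailable.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // skip non-PID entries such as /proc/meminfo
		}
		cmdline, err := os.ReadFile(filepath.Join("/proc", e.Name(), "cmdline"))
		if err != nil {
			continue // process may have exited between ReadDir and ReadFile
		}
		// Arguments in /proc/<pid>/cmdline are NUL-separated.
		args := strings.ReplaceAll(string(cmdline), "\x00", " ")
		if strings.Contains(args, "self-node-remediation") {
			fmt.Printf("pid %d: %s\n", pid, args)
		}
	}
}
```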

Detail

We believe that SNR should have remediated, but there is likely a bug in the logic flow. What we are seeing is that SNR is:

  1. Eventually timing out on API server access
  2. Getting health statuses from peers, all indicating unhealthy
  3. Deciding on "peers did not confirm that we are unhealthy, ignoring error"
    • I have no clue how it got from step 2 above to step 3 without additional logging statements being emitted
    • It lives in this state forever

It seems that, to get to step 3, it must go through manager.go:66, which returns true if the control plane was reachable. But if that were the case, it would also have had to go through check.go:178, which I should have seen in the log output.
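
For illustration, the flow we suspect looks roughly like the sketch below. The names are hypothetical and this is a paraphrase, not the actual SNR code; the point is the suspected short-circuit where reachability of peer control plane nodes is treated as "healthy" before the peers' actual verdicts are counted:

```go
// Hypothetical reconstruction of the suspected flow; names are illustrative,
// not taken from the SNR source.
package snrsketch

type peerVerdict int

const (
	verdictHealthy peerVerdict = iota
	verdictUnhealthy
)

// isConsideredHealthy sketches the path we think is being taken: once the
// API server times out, the mere reachability of peer control plane nodes
// wins, so the "peers did not confirm that we are unhealthy, ignoring error"
// branch is reached even though every peer answered unhealthy.
func isConsideredHealthy(apiServerErr error, verdicts []peerVerdict, isControlPlane, controlPlanePeersReachable bool) bool {
	if apiServerErr == nil {
		return true // API server reachable, nothing to remediate
	}

	// Suspected short-circuit: peer control plane nodes can be contacted at
	// all, so the node is treated as healthy regardless of what they reported.
	if isControlPlane && controlPlanePeersReachable {
		return true
	}

	// The peer verdicts would only be counted on the path below.
	unhealthy := 0
	for _, v := range verdicts {
		if v == verdictUnhealthy {
			unhealthy++
		}
	}
	return unhealthy == 0
}
```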

Note: I admit we are using an outdated version of SNR because we are still attempting to get issue #238 merged, and as such we haven't recompiled the SNR binaries for about a year. I did compare the relevant functions between our version and current mainline and didn't see any smoking gun having been fixed since then, but I admit that's not a guarantee.

I've added some additional logging calls in another commit for #238 that would have provided some additional diagnostics and should highlight this better in the future.
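
As a rough idea of what those calls capture (hypothetical names and log keys; the real change is in the commit referenced above, and the standard-library logger is used here only to keep the sketch self-contained):

```go
// Illustration only: log the raw inputs to the health verdict so the jump
// from "all peers report unhealthy" to "peers did not confirm that we are
// unhealthy" is traceable in snr.log.
package main

import (
	"errors"
	"log/slog"
)

func logVerdictInputs(apiServerErr error, isControlPlane, controlPlanePeersReachable bool, peersUnhealthy, peersTotal int) {
	slog.Info("evaluating health verdict",
		"apiServerError", apiServerErr,
		"isControlPlane", isControlPlane,
		"controlPlanePeersReachable", controlPlanePeersReachable,
		"peersReportingUnhealthy", peersUnhealthy,
		"peersTotal", peersTotal)
}

func main() {
	// The state we observed: API server timed out, all three peers report
	// unhealthy, yet peer control plane nodes are still reachable.
	logVerdictInputs(errors.New("api server access timed out"), true, true, 3, 3)
}
```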

snr.log
