-
Notifications
You must be signed in to change notification settings - Fork 20
WIP: pkg/agent: wait for all volumes to be detached before rebooting #169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
We should also consider compatibility with k8s versions before merging. |
Just hit some issue with this code:
Perhaps we should also have some timeout while waiting for volumes to be detached. |
This commit provides PoC version of implementing agent waiting for all volumtes attached to the node to be detached as a step after draining the node, as shutting down the Pod does not mean the volume has been detached, as usually CSI agent will be running as a DaemonSet on the node and will take care of detaching the volume from the node when the pod shuts down. This commit improves rebooting experience, as right now if there is not enough time for CSI agent to detach the volumes from the node, node gets rebooted and pods using attached volumes have no way to be attached to other nodes, which effectively increases the downtime caused for stateful workloads. This commit still requires tests and better interface for the users. If someone wants to try this feature on their own cluster, I've published the following image I've been testing with: quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e Closes #30 Signed-off-by: Mateusz Gozdek <[email protected]>
2affae3
to
88957e7
Compare
I also found a bug with RBAC which is now fixed. |
|
||
klog.Info("Node drained, rebooting") | ||
|
||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this optional behind a opt-in flag.
klog.Errorf("Ignoring node drain error and proceeding with reboot: %v", err) | ||
} | ||
|
||
klog.Info("Node drained, rebooting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message needs to be adjusted (or perhaps moved below the volumes detachment).
klog.Errorf("Listing volume attachments: %v", err) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add some mechanism here to give up, for example a timeout.
break | ||
} | ||
|
||
time.Sleep(5 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should also be adjustable.
This commit provides PoC version of implementing agent waiting for all
volumtes attached to the node to be detached as a step after draining
the node, as shutting down the Pod does not mean the volume has been
detached, as usually CSI agent will be running as a DaemonSet on the
node and will take care of detaching the volume from the node when the
pod shuts down.
This commit improves rebooting experience, as right now if there is not
enough time for CSI agent to detach the volumes from the node, node gets
rebooted and pods using attached volumes have no way to be attached to
other nodes, which effectively increases the downtime caused for
stateful workloads.
This commit still requires tests and better interface for the users.
If someone wants to try this feature on their own cluster, I've
published the following image I've been testing with:
quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e
Closes #30
Signed-off-by: Mateusz Gozdek [email protected]