
Conversation

invidian (Member) commented:

This commit provides a PoC version of the agent waiting for all volumes
attached to the node to be detached as a step after draining the node.
Shutting down a Pod does not mean its volume has been detached: usually
the CSI agent runs as a DaemonSet on the node and takes care of
detaching the volume from the node once the Pod shuts down.

This commit improves the rebooting experience. Right now, if the CSI
agent does not get enough time to detach the volumes from the node, the
node is rebooted with the volumes still attached, and Pods using those
volumes cannot have them attached to other nodes, which effectively
increases the downtime for stateful workloads.
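
For illustration, here is a minimal sketch of how such a wait can look with client-go. The function name, package name, and polling interval are placeholders, not the exact code in this branch:

```go
package agent // hypothetical package name, for illustration only

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// waitForVolumeDetach blocks until no VolumeAttachment object references the
// given node anymore, polling the API server at a fixed interval. This is a
// sketch of the idea only, not the exact code in this PR.
func waitForVolumeDetach(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	for {
		attachments, err := client.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
		if err != nil {
			return fmt.Errorf("listing volume attachments: %w", err)
		}

		remaining := 0
		for _, attachment := range attachments.Items {
			if attachment.Spec.NodeName == nodeName {
				remaining++
			}
		}

		if remaining == 0 {
			return nil
		}

		klog.Infof("Waiting for %d volume attachment(s) on node %q to be detached", remaining, nodeName)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}
```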

This commit still requires tests and a better interface for users.

If someone wants to try this feature on their own cluster, I've
published the following image I've been testing with:

quay.io/invidian/flatcar-linux-update-operator:97c0dee50c807dbba7d2debc59b369f84002797e

Closes #30

Signed-off-by: Mateusz Gozdek <[email protected]>

invidian (Member, Author) commented:

We should also consider compatibility with k8s versions before merging.

invidian (Member, Author) commented on Jun 26, 2022:

Just hit an issue with this code:

  1. Draining failed because one workload couldn't satisfy the PDB.
  2. Waiting for volume detachment never finished.

Perhaps we should also have some timeout while waiting for volumes to be detached.
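
For example (a sketch building on the wait-loop idea above; maxWait, clientset, and nodeName are placeholders), the wait could be bounded with a context deadline so a blocked drain or a misbehaving CSI driver cannot postpone the reboot forever:

```go
// Sketch: give up waiting for detachment after a configurable deadline and
// proceed with the reboot anyway. maxWait would need to be exposed to users,
// for example as a flag or environment variable.
ctx, cancel := context.WithTimeout(context.Background(), maxWait)
defer cancel()

if err := waitForVolumeDetach(ctx, clientset, nodeName); err != nil {
	klog.Errorf("Giving up waiting for volume detachment: %v", err)
}
```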

invidian force-pushed the invidian/wait-for-volumes-detach branch from 2affae3 to 88957e7 on January 11, 2023 at 17:47.
invidian (Member, Author) commented:

I also found a bug with RBAC, which is now fixed.
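
For reference, the extra permission this feature needs is read access to VolumeAttachment objects in the storage.k8s.io API group. Expressed as a client-go rbacv1 rule it looks roughly like the sketch below; the actual RBAC change in this PR may be written differently (for example as a YAML manifest):

```go
import rbacv1 "k8s.io/api/rbac/v1"

// Rough sketch of the additional ClusterRole rule the update-agent needs in
// order to list and watch VolumeAttachments for its node.
var volumeAttachmentsRule = rbacv1.PolicyRule{
	APIGroups: []string{"storage.k8s.io"},
	Resources: []string{"volumeattachments"},
	Verbs:     []string{"get", "list", "watch"},
}
```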


Review comment from invidian (Member, Author) on:

    klog.Info("Node drained, rebooting")

    for {

We should make this optional behind an opt-in flag.
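
A possible shape for such an opt-in, as a sketch only (the flag name is made up, and the agent may prefer an environment variable instead):

```go
import "flag"

// Hypothetical opt-in flag; disabled by default to preserve current behaviour.
var waitForVolumeDetachFlag = flag.Bool(
	"wait-for-volume-detach",
	false,
	"After draining, wait until all VolumeAttachments for this node are removed before rebooting.",
)

// Later, in the reboot path, after draining the node:
//
//	if *waitForVolumeDetachFlag {
//		// run the volume detachment wait loop here
//	}
```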

klog.Errorf("Ignoring node drain error and proceeding with reboot: %v", err)
}

klog.Info("Node drained, rebooting")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message needs to be adjusted (or perhaps moved below the volumes detachment).

Review comment from invidian (Member, Author) on lines +296 to +297:

    klog.Errorf("Listing volume attachments: %v", err)
    continue

We should probably add some mechanism here to give up, for example a timeout.
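
One way to do that (a sketch, assuming a reasonably recent k8s.io/apimachinery release) is to replace the hand-rolled loop with wait.PollUntilContextTimeout, which combines the retry interval with a give-up deadline; checkDetached is a hypothetical helper returning true once no VolumeAttachment for the node remains:

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"
)

// Sketch: retry every 5 seconds, give up after 10 minutes. Both values are
// placeholders and would likely need to be configurable.
func waitWithDeadline() {
	err := wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			return checkDetached(ctx)
		},
	)
	if err != nil {
		klog.Errorf("Giving up waiting for volume detachment: %v", err)
	}
}
```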

Review comment from invidian (Member, Author) on:

        break
    }

    time.Sleep(5 * time.Second)

Perhaps this interval should also be adjustable.
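
A minimal way to make the interval adjustable, sketched with a hypothetical environment variable name and the current 5-second value as the fallback:

```go
import (
	"os"
	"time"
)

// detachPollInterval returns the interval between checks for remaining
// VolumeAttachments. The environment variable name is hypothetical; values
// use Go duration syntax, e.g. "10s" or "1m".
func detachPollInterval() time.Duration {
	if value := os.Getenv("VOLUME_DETACH_POLL_INTERVAL"); value != "" {
		if parsed, err := time.ParseDuration(value); err == nil {
			return parsed
		}
	}

	return 5 * time.Second // current default
}
```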

Successfully merging this pull request may close these issues.

Ensure update-agent waits for all volumes to be detached before rebooting