Description
We're getting close to shipping the initial control-plane-driven update, and we've run across several kinds of edge cases (almost all, or entirely all, races of one kind or another) that we think need help from the SP to definitively address. We don't think any of these are ship blockers for update, or even particularly likely in practice, but it's uncomfortable to know that they exist. This issue is something of a wish list or list of issues, depending on your perspective.
I'm opening this as a single issue because these may warrant a more holistic approach in terms of extending the MGS <-> control plane interface than we've previously done. We could probably address all of these by adding new message types that provide similar mechanics with additional safeties, but maybe it makes sense to do something different? (E.g., for the first issue below, maybe every message from MGS to the SP should include either "I don't know or care who you are - please tell me as part of your response" or "I believe you are $SO_AND_SO; if you're not, please reject this request and tell me who you actually are".)
If we want to do nontrivial rework of the SP <-> MGS protocol, it almost certainly warrants an RFD. But I wanted to at least jot down these things we've realized.
If we make a decision to perform an operation on a target sled in a particular cubby, how do we know when we actually perform that operation that the sled in the cubby is still the same one we think it is? (@davepacheco originally filed this as oxidecomputer/management-gateway-service#141.)
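One way to picture the "tell me who you are" vs. "I believe you are $SO_AND_SO" idea is the sketch below. It is not the real MGS <-> SP protocol: `SpIdentity`, `ExpectedIdentity`, `Response`, and `handle_request` are all hypothetical names, and a serial-number string stands in for whatever identity the SP would actually report.

```rust
/// Hypothetical identity of an SP; the real protocol may identify SPs differently.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpIdentity {
    pub serial: String,
}

/// What the caller believes about the target (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExpectedIdentity {
    /// "I don't know or care who you are; tell me in your response."
    Any,
    /// "I believe you are this SP; reject the request if you are not."
    Exactly(SpIdentity),
}

/// Every response carries the SP's actual identity, whether or not the
/// operation ran.
#[derive(Debug, PartialEq, Eq)]
pub enum Response<T> {
    Done { identity: SpIdentity, value: T },
    Rejected { actual: SpIdentity },
}

/// SP-side check: run `op` only if the caller's expectation matches `me`.
pub fn handle_request<T>(
    me: &SpIdentity,
    expected: &ExpectedIdentity,
    op: impl FnOnce() -> T,
) -> Response<T> {
    match expected {
        ExpectedIdentity::Exactly(who) if who != me => {
            Response::Rejected { actual: me.clone() }
        }
        _ => Response::Done { identity: me.clone(), value: op() },
    }
}
```

Because the check happens on the SP at the moment the operation runs, a sled swapped into the cubby after the decision was made would reject the request rather than silently execute it.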
When we start an update, the caller provides a unique ID associated with that update, and while that update proceeds, all requests must come with the same ID or they're rejected. This is an example of the SP doing exactly what we want: this feature allows multiple Nexuses to race in attempting to start updates, knowing that the SP mediates and exactly one of them will win, and the others get back an error that they can understand. However, once an update completes we lose these guarantees, and at the moment we lose them inconsistently. Consider this sequence:
- Nexus A and Nexus B both want to send an update to a target SP.
- A's request to update gets to the SP first, and it proceeds.
- If B attempts to start an update while A's update is still running, it will fail (update ID mismatch).
- If B attempts to start an update just after A's update completes, it may succeed or fail depending on the target component. (Host phase 1 will succeed; SP and RoT will fail due to "Allow ovewrite of updated images without requiring a Reset" (#1022), although this failure is actually a good thing in this context.)
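The sequence above can be modeled with a deliberately simplified sketch. This is not the SP's actual update code: `UpdateId`, `UpdateSlot`, and `UpdateError` are hypothetical names, a `u64` stands in for the real update ID, and the model only tracks whether an update is in progress (which roughly matches the host phase 1 behavior described above).

```rust
/// Stand-in for the caller-provided unique update ID (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UpdateId(pub u64);

#[derive(Debug, PartialEq, Eq)]
pub enum UpdateError {
    /// A different update is already running.
    IdMismatch { in_progress: UpdateId },
    /// There is no update to complete.
    NoUpdateInProgress,
}

/// Simplified model of one update target on the SP.
#[derive(Debug, Default)]
pub struct UpdateSlot {
    in_progress: Option<UpdateId>,
}

impl UpdateSlot {
    /// Start an update, or accept a retry of the one already running.
    pub fn start(&mut self, id: UpdateId) -> Result<(), UpdateError> {
        match self.in_progress {
            Some(current) if current != id => {
                Err(UpdateError::IdMismatch { in_progress: current })
            }
            _ => {
                self.in_progress = Some(id);
                Ok(())
            }
        }
    }

    /// Finish the running update; after this the slot forgets the ID.
    pub fn complete(&mut self, id: UpdateId) -> Result<(), UpdateError> {
        match self.in_progress {
            Some(current) if current == id => {
                self.in_progress = None;
                Ok(())
            }
            Some(current) => Err(UpdateError::IdMismatch { in_progress: current }),
            None => Err(UpdateError::NoUpdateInProgress),
        }
    }
}
```

In this model, B's `start` fails while A's update is running but succeeds immediately after A's `complete`, because nothing remembers that A's image was just written. That forgotten state is the gap this issue is about.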
Closely related to the previous issue: when performing an operation after an update (e.g., resetting the RoT or SP, changing the host phase 1 mux, rebooting the host), how do we know some other Nexus hasn't snuck in and started or completed a different update in the meantime? Currently the SP / RoT block this due to #1022, but the host flash has no such protection.
I think for both of these cases, it may make sense to extend the update ID to other operations? E.g., if I've just sent a new host phase 1 to slot 1, I think I want to be able to say "switch the active slot to 1 if and only if the contents are still what I provided in update $ID", and then say the same thing for reset? (Unclear whether "switch mux and reset" could or should be a single operation exposed by the SP to MGS.)
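A rough sketch of what "switch the active slot if and only if the contents are still what I provided in update $ID" might look like, assuming the SP remembers which update last wrote each slot. Everything here is hypothetical (`HostFlash`, `record_update`, `switch_active_if_written_by`, the two-slot layout, and a `u64` update ID); it only illustrates the compare-then-act shape of the operation.

```rust
/// Stand-in for the caller-provided unique update ID (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UpdateId(pub u64);

#[derive(Debug, PartialEq, Eq)]
pub enum SwitchError {
    NoSuchSlot,
    /// The slot was last written by a different update (or never written).
    ContentsChanged { last_writer: Option<UpdateId> },
}

/// Simplified model of host phase 1 flash with two slots.
#[derive(Debug, Default)]
pub struct HostFlash {
    last_writer: [Option<UpdateId>; 2],
    active_slot: usize,
}

impl HostFlash {
    /// Record that update `id` finished writing `slot`.
    pub fn record_update(&mut self, slot: usize, id: UpdateId) -> Result<(), SwitchError> {
        let entry = self.last_writer.get_mut(slot).ok_or(SwitchError::NoSuchSlot)?;
        *entry = Some(id);
        Ok(())
    }

    /// Switch the active slot only if `expected` was the last update to write it.
    pub fn switch_active_if_written_by(
        &mut self,
        slot: usize,
        expected: UpdateId,
    ) -> Result<(), SwitchError> {
        let last_writer = *self.last_writer.get(slot).ok_or(SwitchError::NoSuchSlot)?;
        if last_writer != Some(expected) {
            return Err(SwitchError::ContentsChanged { last_writer });
        }
        self.active_slot = slot;
        Ok(())
    }

    pub fn active_slot(&self) -> usize {
        self.active_slot
    }
}
```

The same guard could presumably wrap a reset request, and a combined "switch mux and reset" operation would perform both steps under a single check.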
Today to power cycle a host, Nexus sends a "go to A2", sleeps briefly, then sends a "go to A0". (This is exactly what pilot sp cycle does, because we copied it.) When the host itself wants to reboot, it sends a "reboot" request over IPCC, and the SP handles the A2 -> sleep -> A0 cycle internally. We should support a "reset the host" operation similarly via MGS.
Without this, any client (i.e., Nexus) wanting to power cycle a host has to store the intended power state elsewhere to ensure the "power it back on" happens in the event of the client crashing in the middle of the sequence of operations.
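As a sketch of the difference, here is the A2 -> sleep -> A0 sequence expressed as one SP-side operation. The names (`HostPower`, `reset_host`, `RecordingPower`) are hypothetical and the trait is a stand-in for the SP's real power sequencing; the point is only that the client issues one request and the SP owns every step.

```rust
/// The two host power states mentioned above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerState {
    A0,
    A2,
}

/// Hypothetical abstraction over the SP's power sequencing.
pub trait HostPower {
    fn set_power_state(&mut self, state: PowerState);
    /// Wait briefly between powering off and powering back on.
    fn settle(&mut self);
}

/// A single "reset the host" operation: the SP runs the whole cycle itself,
/// so a client crash mid-sequence cannot leave the host stuck in A2.
pub fn reset_host(power: &mut impl HostPower) {
    power.set_power_state(PowerState::A2);
    power.settle();
    power.set_power_state(PowerState::A0);
}

/// Test double that records the steps it was asked to perform.
#[derive(Debug, Default)]
pub struct RecordingPower {
    pub log: Vec<&'static str>,
}

impl HostPower for RecordingPower {
    fn set_power_state(&mut self, state: PowerState) {
        self.log.push(match state {
            PowerState::A0 => "A0",
            PowerState::A2 => "A2",
        });
    }

    fn settle(&mut self) {
        self.log.push("settle");
    }
}
```

With the client-driven version, each of those three steps is a separate request, and the client has to persist its intent between them.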
MGS will retry operations if the SP doesn't respond; if we're going to do some rework or rethinking here, we should make sure we account for retries (probably by making operations idempotent if possible? that's how we've handled this to date, although I doubt we've actually ensured every MGS -> SP operation is truly idempotent).
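For operations that cannot be made naturally idempotent, one common alternative is to deduplicate retries by request ID. The sketch below illustrates that technique only; it is an assumption for discussion, not how MGS or the SP handle retries today, and `RetrySafeHandler` and its integer request IDs and responses are hypothetical.

```rust
/// Hypothetical SP-side handler that makes retries of the same request safe
/// by caching the most recent response, keyed by request ID.
#[derive(Debug, Default)]
pub struct RetrySafeHandler {
    /// (request ID, cached response) for the most recent request.
    last: Option<(u32, u32)>,
    /// How many times the underlying operation has actually run.
    executions: u32,
}

impl RetrySafeHandler {
    /// Handle a request. A retry carrying the same ID as the previous request
    /// gets the cached response without re-running the operation.
    pub fn handle(&mut self, request_id: u32) -> u32 {
        if let Some((id, response)) = self.last {
            if id == request_id {
                return response;
            }
        }
        // Stand-in for a non-idempotent operation: its result depends on
        // how many times it has been executed.
        self.executions += 1;
        let response = self.executions;
        self.last = Some((request_id, response));
        response
    }

    pub fn executions(&self) -> u32 {
        self.executions
    }
}
```

This only protects against a retry of the immediately preceding request, which matches the "SP didn't respond, so MGS sends it again" case described above. It does not help with requests that are reordered or interleaved across clients.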