[40/n] blueprint planner logic + sled agent code to honor mupdate overrides #8456
Conversation
```rust
let mut sleds_with_override = BTreeSet::new();
for sled_id in self.input.all_sled_ids(SledFilter::InService) {
```
Thinking about this, I'm wondering what would happen here if a sled goes away in the middle of this process, disappearing from inventory. In that case the remove_mupdate_override field never gets cleared from the blueprint.
We would do:
- expunge the sled in the first blueprint
- when executed, the sled policy will be updated to Expunged in the planning input
- then next planning cycle it'll no longer be in the InService set
So I think we'll eventually converge -- it'll just take a couple cycles. (A TODO is to add a test for this.)
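A toy model of that convergence argument (all types here are hypothetical stand-ins; the real planner filters the planning input with `SledFilter::InService`):

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Copy, PartialEq)]
enum Policy {
    InService,
    Expunged,
}

// Once the expunge round-trips through the planning input, the sled's
// policy flips to Expunged, so its stale override entry is never
// visited again -- the filter below simply skips it.
fn sleds_with_override(
    policies: &BTreeMap<u32, Policy>,
    overrides: &BTreeSet<u32>,
) -> BTreeSet<u32> {
    overrides
        .iter()
        .copied()
        .filter(|id| policies.get(id) == Some(&Policy::InService))
        .collect()
}
```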
Tested this with the new `inventory-hidden` and `inventory-visible` subcommands in reconfigurator-cli.
```rust
let old_image_source = self.zones.set_zone_image_source(
    &zone_id,
    BlueprintZoneImageSource::InstallDataset,
)?;
```
RFD 556 says:

> Wherever the planner uses the target release, it is instead ignored if its generation number is not greater than `min_release_generation` (if set).
As discussed in Tuesday's watercooler it's a bit more complex than that -- what we want to do is to only use the install dataset on sleds that have been mupdated, since on other sleds the install dataset may be woefully out of date.
I think I want to make the claim that this code may actually be sufficient as it stands. I don't think we need to try any redirects other than this one (which is admittedly edge-triggered), as long as we prevent new zones from being set up at all while the system is recovering from the mupdate.
We decided to not proceed with adding new zones until the mupdate override has been completely cleared.
```rust
// override that was set in the above branch. We can remove the
// override from the blueprint.
self.set_remove_mupdate_override(None);
// TODO: change zone sources from InstallDataset to Artifact
```
I'm not sure we'll need to do this here; the normal upgrade path should change zone sources already, right? (It just needs to not do that while a mupdate override is in place.)
Although maybe sled-agent should do something like "if I'm changing from install dataset with hash X to artifact with hash X, don't actually bounce the zone".
> I'm not sure we'll need to do this here; the normal upgrade path should change zone sources already, right? (It just needs to not do that while a mupdate override is in place.)
Good question -- we do this one zone at a time currently, and I guess this would be an opportunity to do a bulk replace. (But why not always do a level-triggered bulk replace?)
> Although maybe sled-agent should do something like "if I'm changing from install dataset with hash X to artifact with hash X, don't actually bounce the zone".
Yeah, this is reasonable.
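A minimal sketch of that bounce check, under assumed types (in the real code `InstallDataset` carries no hash; it would have to come from the zone manifest):

```rust
#[derive(PartialEq)]
enum ZoneImage {
    // Hash of the image actually present in the install dataset,
    // as reported by the zone manifest.
    InstallDataset { manifest_hash: [u8; 32] },
    // Hash of a specific artifact from the TUF repo depot.
    Artifact { hash: [u8; 32] },
}

// A relabel from install dataset to an artifact with identical
// contents does not require restarting the zone.
fn must_bounce_zone(old: &ZoneImage, new: &ZoneImage) -> bool {
    match (old, new) {
        (
            ZoneImage::InstallDataset { manifest_hash },
            ZoneImage::Artifact { hash },
        ) => manifest_hash != hash,
        _ => old != new,
    }
}
```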
Resolved in #8486, and TODO removed.
```rust
);
}

// TODO: Do the same for RoT/SP/host OS.
```
I think this will be as simple as:

- clear any `PendingMgsUpdate`s for this sled
- change the host phase 2 in the `OmicronSledConfig` to the equivalent of `InstallDataset` for zones (this doesn't exist yet but will be coming soon)
What I'm less sure about is what happens if there are `PendingMgsUpdate`s in the current target blueprint concurrently with a mupdate happening to that sled. Maybe wicket and Nexus end up dueling? If the mupdate completes and changes the contents of any of the target slots, Nexus's prechecks should start failing, but if the mupdate happens to not change the target slots, maybe the prechecks still pass and Nexus starts trying to update it again as soon as it comes online?
I haven't done this yet -- worth discussing in the watercooler tomorrow?
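A sketch of the first bullet from the comment above (types are hypothetical stand-ins for the real blueprint structures):

```rust
use std::collections::BTreeMap;

type SledId = u64;
struct PendingMgsUpdate;

// Dropping the sled's pending update keeps Nexus from dueling with a
// concurrent mupdate; a fresh update can be planned after recovery.
fn clear_pending_mgs_update(
    pending: &mut BTreeMap<SledId, PendingMgsUpdate>,
    mupdated_sled: SledId,
) -> Option<PendingMgsUpdate> {
    pending.remove(&mupdated_sled)
}
```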
```rust
// If do_plan_mupdate_override returns Waiting, we don't plan *any*
// additional steps until the system has recovered.
self.do_plan_add()?;
self.do_plan_decommission()?;
```
I think we could still decommission things if there's a mupdate override in place? This only acts on sleds or disks that an operator has explicitly told us are gone, and is basically a followup to `do_plan_expunge()`. (Maybe this step should be ordered before `do_sled_add()` anyway? I don't think there are any dependencies between them...)
Yep -- done.
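A sketch of the resulting step ordering (method names follow the snippets in this thread; the `Waiting` status and stub bodies are assumptions):

```rust
#[derive(PartialEq)]
enum MupdateOverrideStatus {
    Proceed,
    Waiting,
}

#[derive(Debug)]
struct PlanError;

struct Planner;

impl Planner {
    fn do_plan_mupdate_override(
        &mut self,
    ) -> Result<MupdateOverrideStatus, PlanError> {
        Ok(MupdateOverrideStatus::Proceed) // stub
    }
    fn do_plan_expunge(&mut self) -> Result<(), PlanError> {
        Ok(()) // stub
    }
    fn do_plan_decommission(&mut self) -> Result<(), PlanError> {
        Ok(()) // stub
    }
    fn do_plan_add(&mut self) -> Result<(), PlanError> {
        Ok(()) // stub
    }

    fn plan(&mut self) -> Result<(), PlanError> {
        let status = self.do_plan_mupdate_override()?;
        self.do_plan_expunge()?;
        // Decommission only acts on sleds/disks an operator has
        // already declared gone, so it need not wait out a mupdate
        // override.
        self.do_plan_decommission()?;
        if status != MupdateOverrideStatus::Waiting {
            // Adding zones stays blocked until recovery.
            self.do_plan_add()?;
        }
        Ok(())
    }
}
```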
```rust
// generation table -- one of the invariants of the target
// release generation is that it only moves forward.
//
// In this case we warn but set the value.
```
This doesn't seem right; we should probably bail out of planning entirely in this case, right? This seems like an "I don't know what's going on in the world" kind of thing that in a simpler system we'd assert on?
Yeah -- done.
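A sketch of the bail-out, assuming a hypothetical error variant:

```rust
#[derive(Debug)]
enum PlanError {
    // "I don't know what's going on in the world": a value that is
    // supposed to be monotonic moved backwards.
    TargetReleaseGenerationWentBackwards { current: u64, new: u64 },
}

fn check_target_release_generation(
    current: u64,
    new: u64,
) -> Result<(), PlanError> {
    if new < current {
        // Fail planning loudly instead of warning and proceeding.
        return Err(PlanError::TargetReleaseGenerationWentBackwards {
            current,
            new,
        });
    }
    Ok(())
}
```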
In #8456 we'll block the `do_plan_add` step on whether the system is currently recovering from a mupdate override. But there's no reason to block the `do_plan_decommission` step on that. This is most easily expressed by moving decommission before add.
`HostPhase2DesiredContents` is analogous to `OmicronZoneImageSource`, but for OS images: either keep the current contents of the boot disk or set it to a specific artifact from the TUF repo depot. "Keep the current contents" should show up in three cases, just like `OmicronZoneImageSource::InstallDataset`:

1. It's the default value for deserializing, so we can load old configs that didn't have this value.
2. RSS uses it (no TUF repo depot involved at this point).
3. The planner will use this variant as part of removing a mupdate override (this work is still in the PR itself: #8456 (comment)).
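A sketch of the shape described above (variant and field names are assumptions based on this description):

```rust
enum HostPhase2DesiredContents {
    // Keep whatever is on the boot disk now: used by RSS and by the
    // planner while removing a mupdate override.
    CurrentContents,
    // Write a specific artifact from the TUF repo depot.
    Artifact { hash: String },
}

impl Default for HostPhase2DesiredContents {
    fn default() -> Self {
        // Case 1 above: old configs without this field deserialize
        // to "keep the current contents".
        Self::CurrentContents
    }
}
```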
@jgallagher this is ready for you to look at again -- I've added clearing the pending MGS update. I'm going to try to land this simultaneously with the sled-agent changes to clear the mupdate override, though, because by itself it will cause the planner to not do anything after the mupdate occurs.
```rust
Entry::Occupied(entry) => Some(Box::new(entry.remove())),
};

// TODO: Do the same for host OS.
```
You were right -- #8570 has now landed, so we can fix up the host OS phase 2 contents here.
Added host phase 2 logic.
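A sketch of that fixup, reusing the shape sketched earlier (names assumed): clearing a mupdate override resets both boot-disk slots to "keep current contents", the host-OS analogue of sending zones back to the install dataset.

```rust
#[derive(Clone, Debug, PartialEq)]
enum HostPhase2DesiredContents {
    CurrentContents,
    Artifact { hash: String },
}

#[derive(Clone, Debug, PartialEq)]
struct HostPhase2DesiredSlots {
    slot_a: HostPhase2DesiredContents,
    slot_b: HostPhase2DesiredContents,
}

// On clearing a mupdate override, stop steering either slot toward a
// TUF artifact.
fn host_phase_2_after_mupdate() -> HostPhase2DesiredSlots {
    HostPhase2DesiredSlots {
        slot_a: HostPhase2DesiredContents::CurrentContents,
        slot_b: HostPhase2DesiredContents::CurrentContents,
    }
}
```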
```rust
}

// Now we need to determine whether to also perform other actions like
// updating or adding zones. We have to be careful here:
```
This is an excellent comment; thanks!
This PR implements logic within sled-agent to clear mupdate overrides. Includes tests, database storage, and displayers. This logic by itself does not introduce behavior changes, since the code to actually set this field is in #8456.
```rust
OmicronZoneImageSource::Artifact { hash } => {
    // TODO: implement mupdate override here.
    //
    // Search both artifact datasets. This iterator starts with the
    // dataset for the boot disk (if it exists), and then is followed
    // by all other disks.
    let search_paths =
        internal_disks.all_artifact_datasets().collect();
    OmicronZoneFileSource {
        // TODO: with mupdate overrides, return InstallDataset here
        location: OmicronZoneImageLocation::Artifact {
            hash: Ok(*hash),
        },
        file_source: ZoneImageFileSource {
            file_name: hash.to_string(),
            search_paths,
        },
    match self.mupdate_override.boot_disk_override.as_ref() {
        Ok(Some(_)) => {
```
This match block implements the logic to honor mupdate overrides.
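A simplified, self-contained sketch of that redirect (all types here are stand-ins for the real resolver types):

```rust
#[derive(Debug, PartialEq)]
enum ImageSource {
    InstallDataset,
    Artifact { hash: String },
}

#[derive(Debug, PartialEq)]
enum ResolvedLocation {
    InstallDataset,
    Artifact { hash: String },
}

// While the boot disk reports a mupdate override, an Artifact source
// is redirected to the install dataset, which the mupdate just
// rewrote.
fn resolve(
    source: &ImageSource,
    boot_disk_override: &Result<Option<String>, String>,
) -> ResolvedLocation {
    match source {
        ImageSource::Artifact { hash } => match boot_disk_override {
            Ok(Some(_)) => ResolvedLocation::InstallDataset,
            // No override, or an error surfaced elsewhere: honor the
            // requested artifact.
            _ => ResolvedLocation::Artifact { hash: hash.clone() },
        },
        ImageSource::InstallDataset => ResolvedLocation::InstallDataset,
    }
}
```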
Testing notes (tl;dr: it works!). Ran this test on berlin:

- Did a mupdate on berlin sled 14 (…). Updated chicken switches:
- Ran the blueprint planner, saw the expected results:
- After enabling the blueprint, the mupdate_override.json disappeared from sled 14:
- But the mupdate override was not cleared, as expected:
- Uploaded the same TUF repo as the target release:
- A new blueprint was generated:
- And then a number of blueprints were generated afterwards. (Along the way we found that Sled Agent had a stale zone manifest cache due to the way rkadm works -- not an issue in production, but filed as https://github.com/oxidecomputer/rackletteadm/issues/70.) Disabled the planner:
- Next, mupdated sled 14 to a different commit, then when sled-agent came up, ensured that the mupdate override was honored:
- Two blueprints were generated, first:
- Then, after sled-agent cleared the mupdate override, saw another blueprint:
- Uploaded this repo, and then saw noop zone image switches:
```diff
@@ -68,6 +70,32 @@ impl ResolverStatusExt for ResolverStatus {
     ) -> OmicronZoneFileSource {
         match image_source {
             OmicronZoneImageSource::InstallDataset => {
+                match &self.mupdate_override.boot_disk_override {
```
Probably worth one more look at the new code here, particularly at this file.
This PR (#8688) prepares the sled-agent config reconciler to honor mupdate overrides. In particular, the logic that requires bouncing zones needs to be updated to consider zone image locations after mupdate overrides have been considered, not before. Doing this required some refactoring. The biggest one is that looking up zone image sources is no longer done through the image resolver and/or within sled-agent's `services.rs`, but rather by gathering and querying the resolver status within the config reconciler. This is a nice improvement overall, particularly because it means we grab the lock once per reconciler run rather than each time we need to look up a zone's image source. We do not actually honor the mupdate override yet, though all the pieces are now in place. We'll combine the PRs to honor the override and update the blueprint logic into one, within #8456.
```rust
// matching up the zone manifest with the target release to compute
// the number of versions running at a given time), but that's a
// non-trivial optimization that we should probably defer until we
// see its necessity.
```
I'm not proposing we put this in this PR, but just want to confirm: I think the thing we've been talking about, where the planner shouldn't proceed with an upgrade until all zones have been converted back to non-`InstallDataset` sources, could be a third condition here, right? In particular, step 2 of this comment uses the phrase "recovering from a MUPdate", and any zone still configured to run out of an install dataset means we're still in that state.

Do you think there's more involved (other than testing ofc) than adding a third check at this point?
Yeah, I think you're right -- the paragraph starting at "There's some potential to relax this in the future" actually hints at this third condition, and we should make this explicit in the PR.
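A sketch of that third condition (hypothetical types): any in-service zone still running out of an install dataset means the system is still recovering, so the planner holds off.

```rust
#[derive(Debug)]
enum ImageSource {
    InstallDataset,
    Artifact { hash: String },
}

struct ZoneConfig {
    image_source: ImageSource,
}

// Third "can we proceed" check: true while any zone is still
// configured to run out of an install dataset.
fn still_recovering_from_mupdate(zones: &[ZoneConfig]) -> bool {
    zones
        .iter()
        .any(|z| matches!(z.image_source, ImageSource::InstallDataset))
}
```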
```rust
let previous = BlueprintHostPhase2DesiredSlots {
    slot_a: self.slot_a.value().clone(),
    slot_b: self.slot_b.value().clone(),
};
self.slot_a.set_value(slot_a);
```
Should `ScalarEditor::set_value()` return the old value? (If it did, would that clean up this method a bit?)
Yeah, I'd considered that; I think the issue I ran into was that the value could be owned or borrowed, so we'd have to return `Cow<'a, T>`. But maybe that's okay.
Ehh, I don't know if it's worth it. Up to you.
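For reference, a sketch of what the `Cow`-returning variant could look like (`ScalarEditor` simplified to two fields here; the real type differs):

```rust
use std::borrow::Cow;

struct ScalarEditor<T: Clone> {
    original: T,
    edited: Option<T>,
}

impl<T: Clone> ScalarEditor<T> {
    // Returns the previous value: borrowed if the editor was never
    // written to, owned if a prior edit is being replaced.
    fn set_value(&mut self, value: T) -> Cow<'_, T> {
        match self.edited.replace(value) {
            Some(old) => Cow::Owned(old),
            None => Cow::Borrowed(&self.original),
        }
    }
}
```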
```rust
// Some other reason -- sled remains ineligible.
}
NoopConvertSledStatus::Eligible(eligible) => {
    // Transition to Eligible with the new override.
```
Suggested change:

```diff
-    // Transition to Eligible with the new override.
+    // Transition to Ineligible with the new override.
```
Thanks, fixed.
```rust
}
NoopConvertSledStatus::Eligible(eligible) => {
    // Transition to Eligible with the new override.
    let zones = mem::replace(
```
Thoughts on `mem::take()` instead of `mem::replace()`? I can never decide whether it feels too magical.
In this case, yeah, `mem::take` makes sense -- we're replacing it immediately anyway.
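For the record, the two are equivalent here since `Vec` implements `Default`:

```rust
use std::mem;

fn main() {
    let mut zones = vec!["zone-a", "zone-b"];

    // mem::take(x) is mem::replace(x, Default::default()); both leave
    // an empty Vec behind, which is immediately overwritten anyway.
    let taken = mem::take(&mut zones);
    assert_eq!(taken, ["zone-a", "zone-b"]);
    assert!(zones.is_empty());

    let replaced = mem::replace(&mut zones, vec!["zone-c"]);
    assert!(replaced.is_empty());
    assert_eq!(zones, ["zone-c"]);
}
```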
```rust
// However, the blueprint's remove_mupdate_override remains in
// place until all zones' image sources can be noop-converted to
// Artifact. We do this to minimize the number of different
// versions of software that exist.
```
Is this bit of the comment right? I think the `TODO` down on line 918 means it's not true yet.

I'm not sure this is where I'd implement this bit (see my earlier comment about adding a third check to whether it's okay to proceed to subsequent steps).
Removed this comment. Also agreed that a third top-level check makes more sense.
```
@@ -64,14 +85,18 @@ set sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6 mupdate override: unset -> 6123eac
set sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c mupdate override: unset -> error

> # Also set its SP update, which will not be cleared.
```
Should this be cleared if there's a mupdate override error? IIRC a mupdate override error means sled-agent won't launch any zones or write new host OS images; should Nexus keep trying to update its SP?
I've thought of Sled Agent and SP as relatively independent domains, so I'm not sure. Discuss in today's update watercooler?
Sure. I think in practice this doesn't really matter: a mupdate and a Nexus-driven MGS update are going to interfere with each other in some way without support involvement anyway, and after performing a mupdate all of the prechecks that a pending MGS update performs will fail, so Nexus won't try to do anything.
However, since we wouldn't add a new pending MGS update until we recovered from the sled mupdate, it seems a little cleaner to clear any existing ones?
```
> # remove_mupdate_override to be unset. But no further planning steps will
> # happen because the target release generation is not new enough.
> #
> # TODO: we want to block remove_mupdate_override unsets until the
```
Same note as above; wondering if it's fine to keep this as-is and only change the planner's "can we proceed" check.
Yeah, reworded the TODO.
Start implementing the blueprint planner logic for mupdate overrides, as well as Sled Agent logic for honoring them.
There's still a lot of work to do:
- Move zone image sources back to artifacts once we know they've been distributed to a sled. (How do we know they've been distributed to a sled?) Done in [24/n] [reconfigurator-planning] support no-op image source updates #8486.