
Add host phase 1 flash hashes to inventory #8624


Open · wants to merge 15 commits into main

Conversation

jgallagher
Contributor

Reconfigurator needs to know the current contents of the host phase 1 hash slots for OS upgrades; this change adds those hashes to inventory. It uses the new MGS endpoints landed in #8593.

This has the side effect of making inventory collections that use sp-sim noticeably slower (by a handful of seconds), because sp-sim tries to act like a real SP and takes a moment to actually compute the hashes. This slowdown identified a handful of flaky tests that appear to assume #[nexus_test] guarantees an inventory collection has happened, but it doesn't - inventory collection just used to be fast enough that this happened to be true. To fix these, I added a helper method that waits for an inventory collection, and call it from the tests that shook out as flaky. I'm a little worried there are others that CI hasn't caught yet, but if so, hopefully they'll be an easy fix when they pop up.
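The "wait for an inventory collection" helper described above boils down to a poll-with-deadline loop. A minimal, generic sketch of that shape (the names `wait_for`, `timeout`, and `period` are hypothetical; the actual helper in the PR queries Nexus for a collection):

```rust
use std::time::{Duration, Instant};

/// Poll `check` until it returns Some(T) or `timeout` elapses.
/// Hypothetical sketch of a wait-for-condition test helper, not
/// the actual omicron implementation.
fn wait_for<T>(
    timeout: Duration,
    period: Duration,
    mut check: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Check first, so an already-satisfied condition returns
        // immediately without sleeping.
        if let Some(value) = check() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::sleep(period);
    }
}
```

A test would call something like `wait_for(timeout, period, || latest_collection())` instead of assuming the collection already exists when the test starts.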

As written this will conflict with #8613; I'd prefer to let that (and #8617 and #8618) land first, then I can update this PR to show the new fields in the inventory output. But I think that change will be small, and this can be reviewed as-is.

Copy link
Contributor

@andrewjstone andrewjstone left a comment


Looks great. Just some minor comments/questions.

hash: *p.hash,
},
);
bail_unless!(
Contributor


Does this mean that if there is ever a duplicate we don't ever return inventory for this collection? Is this essentially an assert, in that we don't expect it to ever happen, but we also don't want to crash nexus if it does?

Contributor Author


Yes and yes; we have a bunch of these throughout inventory insertion that are effectively "assert-but-don't-crash" checks that some internal invariant is satisfied.
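The "assert-but-don't-crash" pattern mentioned above turns an invariant violation into an error return instead of a panic, so one bad collection can't take down Nexus. A self-contained sketch of the idea (the macro body and the `insert_rows` function are illustrative; omicron's real `bail_unless!` produces a structured error type, not a `String`):

```rust
use std::collections::HashSet;

/// Sketch of an assert-but-don't-crash check: if the condition is
/// false, return an error from the enclosing function rather than
/// panicking the process.
macro_rules! bail_unless {
    ($cond:expr, $($msg:tt)*) => {
        if !$cond {
            return Err(format!($($msg)*));
        }
    };
}

/// Hypothetical insertion step that enforces a "no duplicates"
/// invariant the way the inventory code does.
fn insert_rows(rows: &[u32]) -> Result<usize, String> {
    let unique: HashSet<_> = rows.iter().collect();
    bail_unless!(
        unique.len() == rows.len(),
        "duplicate row in inventory collection"
    );
    Ok(rows.len())
}
```

On violation the whole insertion fails with an error the caller can log, which matches the behavior discussed above: the collection isn't stored, but Nexus keeps running.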


for (baseboard_id, phase1) in phase1_hashes {
let selection = nexus_db_schema::schema::hw_baseboard_id::table
.select((
Contributor


Does this return the fields that we pass in as part of the selection, but only filter on the hw_baseboard_id table for the given columns? I ask because most of these columns don't exist on hw_baseboard_id, making this a touch confusing to me as a relative SQL noob. It's not making me a bigger fan of SQL ;)

Contributor Author

@jgallagher Jul 18, 2025


Yeah; FWIW this is also basically "copy and paste and change the columns" from where @davepacheco originally did this work; I think it'd be very straightforward in SQL but is pretty obscured by diesel. This turns into something like the following, where we fill in all the $column values directly except the foreign key into hw_baseboard_id, which we look up via SELECT:

INSERT INTO inv_host_phase_1_flash_hash
SELECT $collection_id, hw_baseboard_id.id, $time_collected, $source, $slot, $hash
FROM hw_baseboard_id
WHERE part_number = $part_number AND serial_number = $serial_number

Contributor

@karencfv karencfv left a comment


Looks good! Just a couple of minor nits

Comment on lines 162 to 166
impl_enum_type!(
HwHostPhase1SlotEnum:

#[derive(Copy, Clone, Debug, AsExpression, FromSqlRow, PartialEq)]
pub enum HwHostPhase1Slot;
Contributor

@sunshowers Jul 18, 2025


Thoughts on just calling it something like M2Slot so the enum can be used everywhere we need something like this?

Contributor Author


Sorry if this is rambly - thinking out loud. I definitely want to use something like this in other places (even on this PR!). I worry slightly about conflating two strongly-related-but-slightly-different things:

  1. The SP exposes exactly 2 host phase 1 slots.
  2. gimlet and cosmo have two M.2 drives, each of which is directly associated with one of the two host phase 1 slots.

Things I have vague worry might change in the future:

  1. Will future sleds keep using the M.2 form factor? (This is just a naming concern, but it's why in sled-agent I'd started to use "internal disk" instead of "M.2". Maybe this is overthinking it.)
  2. The M.2 drives have multiple partitions, and even multiple boot partitions. sled-agent currently treats partition 0 as the boot partition and partitions 1, 2, and 3 as reserved, but that doesn't match reality: partition 1 is also a boot partition, and pilot knows how to write to it. (We use this as a backup in manufacturing sometimes; if something gets screwed up in partition 0, we can still boot if there's a valid OS image in partition 1.) sled-agent eventually needs to learn what's going on here (or we need to change the OS behavior); at the moment we could theoretically boot out of partition 1 instead of 0 and be very confused.

Point 2 is a really long-winded way of saying "I'm not sure M2Slot is sufficient to represent the sled-agent side of this", and we might want to expand that to something that holds both the M2Slot and which partition it was. But maybe even still we keep using this enum for both phase 1 and "which disk is it"?

I think I'm coming around to "just call this M2Slot" since that's what we use everywhere else. If we need to change it in the future, we'll have a better idea then of what to rename it?
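For concreteness, the two-slot enum under discussion amounts to something like the sketch below (the name `M2Slot`, the variant names, and the `from_index` helper are illustrative; the PR ultimately names the database enum `HwM2Slot`):

```rust
/// Hypothetical sketch of the two-slot enum discussed above. The SP
/// exposes exactly two host phase 1 slots, and gimlet/cosmo have two
/// M.2 drives, one per slot.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum M2Slot {
    A, // slot 0
    B, // slot 1
}

impl M2Slot {
    /// Map a raw slot index onto the enum; anything other than 0 or 1
    /// is rejected, since there are exactly two slots.
    pub fn from_index(index: u8) -> Option<Self> {
        match index {
            0 => Some(M2Slot::A),
            1 => Some(M2Slot::B),
            _ => None,
        }
    }
}
```

If the partition concern above ever materializes, this could grow into a struct pairing an `M2Slot` with a partition number rather than being renamed again.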

Contributor Author


Renamed to HwM2Slot in bdbd14d
