-
Notifications
You must be signed in to change notification settings - Fork 203
Pass measurement token between RoT and SP #2138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
/// | ||
/// # Changelog | ||
/// Version 10 requires Humility to be aware of the `handoff` kernel feature, | ||
/// which lets the RoT inform the SP when measurements have been taken. If | ||
/// Humility is unaware of this feature, the SP will reset itself repeatedly, | ||
/// which interferes with subsequent programming of auxiliary flash. | ||
const HUBRIS_ARCHIVE_VERSION: u32 = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a draft of the humility changes necessary ? Manufacturing relies on humility flash --check
, humility flash --verify
and humility tasks idle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's over in mkeeter/measurement-token-handoff
.
I'm wondering whether we should move the address + magic numbers into a common repo, e.g. lpc55-areas
, to avoid completely magic values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of sharing the values between Hubris and Humility softwarily rather than by copy-paste is certainly appealing to me in theory...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well you're in luck: oxidecomputer/lpc55_support#94
if self.dp_write_bitflags::<Dhcsr>(Dhcsr::resume()).is_err() { | ||
// If the write fails, then attempt to undo it | ||
ringbuf_entry!(Trace::DhcsrWriteError); | ||
self.disable_halting_debug(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm lightly concerned about an influx of errors here and elsewhere. If we get an error for the debug port once I think there's a pretty good chance we won't be able to do much of anything until we reset it so our attempts to clean up would probably just generate errors too. Maybe we're okay with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I'm not sure if there are more errors here than before. The main refactoring to look at is f862b8c, which actually reduced potentially logged errors – it removes the Undo
object, so Trace::Never
can't happen any more, because it represented a state where the Undo
had invalid bits set.
drv/stm32h7-startup/src/lib.rs
Outdated
measurement_token::check(20, || { | ||
cortex_m::asm::delay(12860000); // about 200 ms | ||
cortex_m::peripheral::SCB::sys_reset() | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure these programmings worth for both the selfsigned
RoT images as well as the officially signed images (testing on staging should be fine for this).
lib/measurement-token/src/lib.rs
Outdated
let token = core::ptr::read_volatile(ptr); | ||
let tag = core::ptr::read_volatile(ptr.wrapping_add(1)); | ||
let counter = core::ptr::read_volatile(ptr.wrapping_add(2)); | ||
let check = core::ptr::read_volatile(ptr.wrapping_add(3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a very [everyone disliked that]
moment reading this but based on our discussion I know why we need it.
Is there any way to turn these into a struct for use here and then write our changes back when we're done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I experimented with a HandoffStruct
that contains the four words, and got confusing results. The code seems to work, but the startup delay is 50 seconds instead of 2.5 seconds, which doesn't make sense.
Even weirder, if I tried to isolate the function with #[inline(never)]
on measurement_token::check
, we went back to ~2.5 seconds.
I'm wondering if there's shenanigans going on with read_volatile
on a struct
somehow interfering with the delay loop...
Here's the diff that I tested (without the inlining change)
Commit ID: 2fcbccab7086327c3b15c1f684670dad4b3f6dfd
Change ID: ulpzryyqzytnkmtqptvqsprvksltnsys
Author : Matt Keeter <[email protected]> (2025-07-10 09:53:13)
Committer: Matt Keeter <[email protected]> (2025-07-10 10:11:09)
Better token writing
diff --git a/lib/measurement-token/src/lib.rs b/lib/measurement-token/src/lib.rs
index b3f9c846b8..936700fbfb 100644
--- a/lib/measurement-token/src/lib.rs
+++ b/lib/measurement-token/src/lib.rs
@@ -47,6 +47,49 @@
static mut __REGION_HANDOFF_END: [u8; 0];
}
+#[derive(Copy, Clone)]
+#[repr(C)]
+struct HandoffData {
+ token: u32,
+ tag: u32,
+ raw_counter: u32,
+ check: u32,
+}
+
+impl HandoffData {
+ /// Builds a new handoff object (with no token) from a counter value
+ fn from_counter(c: u32) -> Self {
+ HandoffData {
+ token: 0,
+ tag: COUNTER_TAG,
+ raw_counter: c,
+ check: COUNTER_TAG ^ c,
+ }
+ }
+
+ /// Returns a handoff object with the token and counter tag fields erased
+ ///
+ /// This will not be treated as a measurement token, and has an invalid
+ /// counter, but the actual raw count is preserved in RAM for debug
+ fn destroy(self) -> Self {
+ HandoffData {
+ token: 0,
+ tag: 0,
+ ..self
+ }
+ }
+
+ /// Returns the counter value if it is valid
+ fn counter(self) -> Option<u32> {
+ if self.tag == COUNTER_TAG && self.tag ^ self.raw_counter == self.check
+ {
+ Some(self.raw_counter)
+ } else {
+ None
+ }
+ }
+}
+
/// Check the measurement token, calling `reset_fn` to reset if needed
///
/// Calls `delay_and_reset` (which diverges) if no measurement is present and we
@@ -55,41 +98,37 @@
///
/// `delay_and_reset` should include a delay, to give the RoT time to boot.
pub unsafe fn check(retry_count: u32, delay_and_reset: fn() -> !) -> bool {
- let ptr: *mut u32 = &raw mut __REGION_HANDOFF_BASE as *mut _;
- let end: *mut u32 = &raw mut __REGION_HANDOFF_END as *mut _;
+ let ptr: *mut HandoffData = &raw mut __REGION_HANDOFF_BASE as *mut _;
+ let end: *mut HandoffData = &raw mut __REGION_HANDOFF_END as *mut _;
assert!(ptr == MEASUREMENT_BASE as *mut _);
assert!(end.offset_from(ptr) >= 4 * core::mem::size_of::<u32>() as isize);
- let token = core::ptr::read_volatile(ptr);
- let tag = core::ptr::read_volatile(ptr.wrapping_add(1));
- let counter = core::ptr::read_volatile(ptr.wrapping_add(2));
- let check = core::ptr::read_volatile(ptr.wrapping_add(3));
+ let t = core::ptr::read_volatile(ptr);
- let out = if token == MEASUREMENT_TOKEN_VALID {
+ let out = if t.token == MEASUREMENT_TOKEN_VALID {
Ok(true) // told that measurement was completed
- } else if token == MEASUREMENT_TOKEN_SKIP {
+ } else if t.token == MEASUREMENT_TOKEN_SKIP {
Ok(false) // told to skip measuring
- } else if tag != COUNTER_TAG || tag ^ counter != check {
+ } else if let Some(c) = t.counter() {
+ if c >= retry_count {
+ Ok(false) // exceeded retry count, so boot
+ } else {
+ Err(c) // we should reset the processor
+ }
+ } else {
Err(0) // no counter, so initialize it
- } else if counter >= retry_count {
- Ok(false) // exceeded retry count, so boot
- } else {
- Err(counter) // we should reset the processor
};
match out {
Ok(v) => {
- // Destroy the existing token
- core::ptr::write_volatile(ptr, 0);
- core::ptr::write_volatile(ptr.wrapping_add(1), 0);
+ core::ptr::write_volatile(ptr, t.destroy());
v
}
Err(counter) => {
- // Increment the counter, then reset
- let next = counter + 1;
- core::ptr::write_volatile(ptr.wrapping_add(1), COUNTER_TAG);
- core::ptr::write_volatile(ptr.wrapping_add(2), next);
- core::ptr::write_volatile(ptr.wrapping_add(3), next ^ COUNTER_TAG);
+ core::ptr::write_volatile(
+ ptr,
+ HandoffData::from_counter(counter + 1),
+ );
delay_and_reset();
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea that this diff changes anything is pretty weird..perhaps we could construct something in godbolt to simulate this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm kinda baffled. I did some disassembling, and the delay loop looks identical.
Here's the normal code, which increments the counter, then delays and resets:
08004328 <measurement_token::check>:
.. elided ..
8004366: 1c59 adds r1, r3, #1
8004368: 6042 str r2, [r0, #4]
800436a: 6081 str r1, [r0, #8]
800436c: 4051 eors r1, r2
800436e: 60c1 str r1, [r0, #12]
8004370: f7fd f990 bl 8001694 <core::ops::function::FnOnce::call_once>
08001694 <core::ops::function::FnOnce::call_once>:
8001694: b580 push {r7, lr}
8001696: 466f mov r7, sp
8001698: f000 f800 bl 800169c <drv_stm32h7_startup::system_init_custom::{{closure}}>
0800169c <drv_stm32h7_startup::system_init_custom::{{closure}}>:
800169c: b580 push {r7, lr}
800169e: 466f mov r7, sp
80016a0: 4802 ldr r0, [pc, #8] ; (80016ac <drv_stm32h7_startup::system_init_custom::{{closure}}+0x10>)
80016a2: 3801 subs r0, #1
80016a4: d1fd bne.n 80016a2 <drv_stm32h7_startup::system_init_custom::{{closure}}+0x6>
80016a6: f000 f803 bl 80016b0 <cortex_m::peripheral::scb::<impl cortex_m::peripheral::SCB>::sys_reset>
80016aa: bf00 nop
80016ac: 00621d31 .word 0x00621d31
... and here's the problematic version
080004f0 <grapefruit::system_init>:
.. many things elided, this is all inlined ..
8000816: 9803 ldr r0, [sp, #12]
8000818: 6008 str r0, [r1, #0]
800081a: 920b str r2, [sp, #44] ; 0x2c
800081c: 980b ldr r0, [sp, #44] ; 0x2c
800081e: 6048 str r0, [r1, #4]
8000820: 1c70 adds r0, r6, #1
8000822: 900a str r0, [sp, #40] ; 0x28
8000824: 4050 eors r0, r2
8000826: 9b0a ldr r3, [sp, #40] ; 0x28
8000828: 608b str r3, [r1, #8]
800082a: 9009 str r0, [sp, #36] ; 0x24
800082c: 9809 ldr r0, [sp, #36] ; 0x24
800082e: 60c8 str r0, [r1, #12]
8000830: f000 ff8e bl 8001750 <core::ops::function::FnOnce::call_once>
08001750 <core::ops::function::FnOnce::call_once>:
8001750: b580 push {r7, lr}
8001752: 466f mov r7, sp
8001754: f000 f800 bl 8001758 <drv_stm32h7_startup::system_init_custom::{{closure}}>
08001758 <drv_stm32h7_startup::system_init_custom::{{closure}}>:
8001758: b580 push {r7, lr}
800175a: 466f mov r7, sp
800175c: 4802 ldr r0, [pc, #8] ; (8001768 <drv_stm32h7_startup::system_init_custom::{{closure}}+0x10>)
800175e: 3801 subs r0, #1
8001760: d1fd bne.n 800175e <drv_stm32h7_startup::system_init_custom::{{closure}}+0x6>
8001762: f000 f803 bl 800176c <cortex_m::peripheral::scb::<impl cortex_m::peripheral::SCB>::sys_reset>
8001766: bf00 nop
8001768: 00621d31 .word 0x00621d31
Both cases include the tight subs / bne.n
loop which actually performs the delay, and the delay word is correct in both places, at 1 + 12860000 / 2
.
lib/measurement-token/src/lib.rs
Outdated
/// measurement is valid, or `false` if we exceeded `retry_count`. | ||
/// | ||
/// `delay_and_reset` should include a delay, to give the RoT time to boot. | ||
pub unsafe fn check(retry_count: u32, delay_and_reset: fn() -> !) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see us using the return value at the moment, what would you see the SP doing with this information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure yet, but I figured more information was better than less information.
For example, once the SP starts measuring the host, it could know not to bother sending the measurement to the RoT if it hasn't been measured itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/not to both sending/not to bother sending/
thinking about causes of failure to measure
The common cases for failure that come to mind are SWD errors, powered debug dongle attached to SP, and back-rev RoT Hubris image that is ignorant of this scheme.
The RoT may also be reset at any time thereby loosing the SP measurement. Normally this would be for RoT update, but some bug or alpha particles could also be involved.
The confidence that the SP has been measured needs to be occasionally refreshed by polling the attestation log to see if a measurement is present. Maybe there is an inventory refresh that will take care of this? Does the SP care for its own purposes?
Is there good use for the return value from check
?
If the SP knows that it has not been measured, that fact should be part of a status command or ereport
.
The SP can query the RoT attestation log if it wants a definitive answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The confidence that the SP has been measured needs to be occasionally refreshed by polling the attestation log to see if a measurement is present. Maybe there is an inventory refresh that will take care of this? Does the SP care for its own purposes?
We've placed this responsibility on higher-level software (bootstrap agent / control plane), see section §1.8 in the RFD. I don't think the SP itself cares!
7f6768b
to
d7b0ab8
Compare
drv/lpc55-swd/src/main.rs
Outdated
// Deposit the measurement token into SP memory | ||
if let Err(e) = self.write_single_target_addr( | ||
measurement_token::MEASUREMENT_BASE as u32, | ||
measurement_token::MEASUREMENT_TOKEN_VALID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be done in endoscope for less complexity here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed in a different thread https://github.com/oxidecomputer/hubris/pull/2138/files#r2198567405
lib/measurement-token/src/lib.rs
Outdated
|
||
// These are all magic numbers created by hashing various sentences. They have | ||
// no special significant, just 32 bits that are unlikely to be chosen by | ||
// accident. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're depositing these in a well known location (not scanning memory to find them). So, using values easy to read in a hex dump should have no negative consequences.
4-byte ASCII values like 'Done', 'Skip', and 'Cntr'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move this discussion to the measurement-token
crate PR
lib/measurement-token/src/lib.rs
Outdated
pub const MEASUREMENT_TOKEN_SKIP: u32 = 0x9f38bd71; | ||
const COUNTER_TAG: u32 = 0x4e423d17; | ||
|
||
pub const MEASUREMENT_BASE: usize = 0x2000_0000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Address conflict
0x2000_000 is the base of the stack for endoscope
. See lib/endoscope/scripts/stm32h753.x
SP/RoT shared memory map
Ideally there is a single source of truth for the SP memory layout shared with the RoT.
That implies a linker script fragment that endoscope also includes. or that lays out everything, or extracting the addresses from the endoscope artifact itself (See the prepare_endoscope
function in drv/lpc55-swd/build.rs
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, this isn't an address conflict: we load the measurement token after endoscope
has finished running, and the SP is reset into its usual firmware and waiting. We guarantee that the address is available in the SP by using a dedicated extern-region
.
I agree about the single shared source of truth, but disagree that endoscope
is the right place. We need to share a value between the SP, RoT, and Humility (see this branch), which suggests that it should be a crate external to the Hubris repo (e.g. in lp55_support
?)
In fact, endoscope
is not even a shared source of truth between the SP and RoT! The RoT is responsible for both loading the endoscope
binary into SP memory, and reading the resulting measurements back; the SP firmware is not involved at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error path I was thinking about is not a problem in practice (fault between running endoscope
and RoT writing token).
Endoscope+RoT would have to have a bug where it overwrote the token area and failed to report a valid result to the RoT. The RoT would reset the SP without trapping. The SP would boot normally, but it would look like a first boot to the SP which would then repeat. Since this doesn't already happen, it would require a change to endoscope
to introduce the bug. The RoT only writes the token on success.
endoscope
is STM32H753 firmware, not SP or RoT firmware. The swd
task derives addresses from the endoscope
ELF artifact at build time which were assigned via the stm32h753.x
file at endoscope
's link time. The agreed upon data formats are in lib/endoscope-abi/
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I disagree with any of those observations, but I can't tell if you're suggesting further changes to the PR.
Right now, my disposition is to add a new repo that defines the token values + address (oxidecomputer/lpc55_support#94), then use it in both Humility (oxidecomputer/humility#552) and this PR (98022a6).
lib/measurement-token/src/lib.rs
Outdated
/// measurement is valid, or `false` if we exceeded `retry_count`. | ||
/// | ||
/// `delay_and_reset` should include a delay, to give the RoT time to boot. | ||
pub unsafe fn check(retry_count: u32, delay_and_reset: fn() -> !) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/not to both sending/not to bother sending/
thinking about causes of failure to measure
The common cases for failure that come to mind are SWD errors, powered debug dongle attached to SP, and back-rev RoT Hubris image that is ignorant of this scheme.
The RoT may also be reset at any time thereby loosing the SP measurement. Normally this would be for RoT update, but some bug or alpha particles could also be involved.
The confidence that the SP has been measured needs to be occasionally refreshed by polling the attestation log to see if a measurement is present. Maybe there is an inventory refresh that will take care of this? Does the SP care for its own purposes?
Is there good use for the return value from check
?
If the SP knows that it has not been measured, that fact should be part of a status command or ereport
.
The SP can query the RoT attestation log if it wants a definitive answer.
chips/stm32h7/memory-large.toml
Outdated
@@ -14,6 +14,14 @@ read = true | |||
write = true | |||
execute = false # let's assume XN until proven otherwise | |||
|
|||
# We use DTCM for a handoff region |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turbo nit: it might be nice if this comment explained a bit more about what that meant, so that the reader knows where to go for more information?
/// In all cases, the reset line is not asserted when this function returns. | ||
/// Additionally, `DEMCR.VC_CORERESET` is always cleared, so future resets | ||
/// will not trigger a vector catch. | ||
fn reset_into_debug_halt(&mut self) -> Result<(), ()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there's a clippy lint about using ()
as an error type, which we should either allow here or turn off globally1
Footnotes
-
Personally, I don't love this lint. Especially in embedded projects where a meaningfully formattable error type might have some binary size cost and won't get formatted to a user anyway, it's just saying "go make a zero sized struct for your thing", which I think is often not actually worth the effort... ↩
drv/lpc55-swd/src/main.rs
Outdated
// Reset the SP into normal operation | ||
self.disable_halting_debug(); | ||
self.sp_reset_enter(); | ||
hl::sleep_for(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a place where i might want to see a comment about "where does 1ms come from", but maybe there isn't an obvious "it says at least this long in " for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding it weirdly hard to find a source for how long NRST
must be held low – maybe because RC conditions vary depending on design?
The internal reset generates a 20µs pulse according to this PDF, so 1 ms is presumably plenty.
lib/measurement-token/src/lib.rs
Outdated
let token = core::ptr::read_volatile(ptr); | ||
let tag = core::ptr::read_volatile(ptr.wrapping_add(1)); | ||
let counter = core::ptr::read_volatile(ptr.wrapping_add(2)); | ||
let check = core::ptr::read_volatile(ptr.wrapping_add(3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea that this diff changes anything is pretty weird..perhaps we could construct something in godbolt to simulate this?
eadb36a
to
98022a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has totally checked 588 files.
Valid | Invalid | Ignored | Fixed |
---|---|---|---|
587 | 1 | 0 | 0 |
Click to see the invalid file list
- lib/measurement-handoff/build.rs
a9566ab
to
d041323
Compare
d041323
to
8b21ed6
Compare
This PR implements RFD 568: SP resets and measurement
The goal of the RFD is to solve the cold boot problem, where the RoT takes several seconds to start, but must measure the SP at reset. For reasons described in the RFD, we choose to have the SP reset itself at startup, giving the RoT a chance to catch it. More specifically:
Together, these changes mean that if we power on a system, the SP will reset itself about about 12 times over the course of 2.5 seconds (with 200ms pauses in between) before the RoT boots up. It will then be measured by the RoT, and continue to boot normally. If the RoT isn't present, the SP stops resetting after about 4 seconds (20 resets).
There are a bunch of changes to make this possible!
kernel
now knows aboutextern-region
s and generatesBASE
/END
symbols in its linker script. This lets us specify ahandoff
region from which it reads the tokens. Unfortunately, this address has to be hard-coded in the RoT, since the RoT doesn't have a way of determining its location from the SP.lpc55-swd
to extractreset_into_debug_halt
into a standalone function, because it's now used for both measurements and to reset the SP before depositing the token into memory. There are various other cleanups here; in particular, I removed the statefulUndo
bitfield in favor of smaller functions which cleaning up before returning.Questions
u32
s because they can be written in a single SWD operation, and 1/4B seems like good odds, but we use 128-bit values for the Bootleby boot codes."handoff"
feature enabled. The alternate token tells the SP that it hasn't been measured, but it should continue booting regardless. This all works fine, but is very hard-coded; should we instead have something like "pre-main memory writes" in the manifest, or move the magic values to a common crate?