SP does not properly translate version in VersionedRotBootInfo message #2185

@lzrd

Problem

The VersionedRotBootInfo message retrieves unattested boot and image state from the RoT using a versioned protocol for backward/forward compatibility. MGS uses one-based versions while RoT uses zero-based versions, requiring translation by the SP.

The SP has two related issues:

  1. No version translation: Passes MGS version requests directly to RoT without converting one-based to zero-based
  2. No version limiting: Doesn't clamp requests to versions the SP can handle, allowing RoT to return response variants the SP cannot deserialize

This causes no production issues currently since MGS only requests HIGHEST_KNOWN_VERSION. However, adding a new RotBootInfo variant will cause deserialization failures during mixed-version deployments unless both issues are fixed.
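
A minimal sketch of the two steps the SP is missing, not the code from the fix-sp-rbi-translation branch. The helper `mgs_to_rot_version`, the `sp_highest_known` parameter, and the `TranslateError` type are hypothetical names chosen for illustration. The sketch clamps the one-based MGS request to what the SP can deserialize, then subtracts one to get the zero-based RoT version:

```rust
/// Hypothetical error for a request that has no zero-based equivalent.
#[derive(Debug, PartialEq, Eq)]
pub enum TranslateError {
    /// MGS asked for version 0; there is no RoT version -1 in a `u8`.
    Unsupported,
}

/// Translate a one-based MGS version into a zero-based RoT version.
///
/// `sp_highest_known` is the highest one-based version this SP build can
/// deserialize (an assumed parameter, not a name from the Hubris source).
pub fn mgs_to_rot_version(
    mgs_version: u8,
    sp_highest_known: u8,
) -> Result<u8, TranslateError> {
    // Clamp first so the RoT is never asked for a variant the SP cannot parse.
    let clamped = mgs_version.min(sp_highest_known);
    // Then convert one-based to zero-based; 0 has no valid mapping.
    clamped.checked_sub(1).ok_or(TranslateError::Unsupported)
}
```

Under these assumptions, with `sp_highest_known = 3`, requests for 3, 4, and 5 all map to RoT version 2, and a request for 0 is rejected before it reaches the RoT.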

Impact

Mixed-version deployments (master SP + new RoT) will experience failures:

  • HIGHEST_KNOWN_VERSION requests return deserialization errors
  • Firmware updates and health checks are affected

The proposed SP fix (fix-sp-rbi-translation) addresses both issues by properly translating and limiting version requests.

Testing

Note that in these tests, versions above and below the implemented versions are
used to test the corner conditions. Only the responses to requesting the
RotBootInfo::HIGHEST_KNOWN_VERSION would impact a production environment.
In the case of the current MGS main branch, this value is '3'. When a new
variant is introduced, it will be incremented to '4'.

The tables below come from tests run across combinations of the following branches:

  • main branch of management-gateway-service (current production deployment)
  • master branch of Hubris (current production SP/RoT)
  • fix is a Hubris branch that implements the proposed fix in the SP image (fix-sp-rbi-translation)
  • bdl is an MGS or Hubris branch that has both the SP fix and a new RotBootInfo variant implemented (boot-decision-log)

NOTE: In these tables, only the MGS versioning is used. The actual RoT version is one less than the MGS version.

Current Implementation

| MGS | SP | RoT | MGS Request Version | RoT Response/Error |
|---|---|---|---|---|
| main | master | master | HIGHEST_KNOWN_VERSION | V3 |
| main | master | master | 0 | Error response from SP: update: RoT boot info version is not supported |
| main | master | master | 1 | V1 |
| main | master | master | 2 | V2 |
| main | master | master | 3 | V3 |
| main | master | master | 4 | V3 |
| main | master | master | 5 | V3 |

This is our current implementation. There are no problems, since V3 is the highest version available, but V1 should return an error:
from the RoT's point of view, there is no version zero (MGS V1).

CRITICAL ISSUE: Adding new RoT response with current SP implementation

During update, the RoT and SP will temporarily have mismatched versions.
Because the SP is not clamping the version to the SP's own highest known version, the response can be a variant that the SP cannot deserialize.

| MGS | SP | RoT | MGS Request Version | RoT Response/Error |
|---|---|---|---|---|
| main | master | bdl | HIGHEST_KNOWN_VERSION | ❌ Error response from SP: sprot: failed to deserialize message |
| main | master | bdl | 0 | Error response from SP: update: RoT boot info version is not supported |
| main | master | bdl | 1 | V1 |
| main | master | bdl | 2 | V2 |
| main | master | bdl | 3 | ❌ Error response from SP: sprot: failed to deserialize message |
| main | master | bdl | 4 | ❌ Error response from SP: sprot: failed to deserialize message |
| main | master | bdl | 5 | ❌ Error response from SP: sprot: failed to deserialize message |

This is the critical case we want to avoid: the HIGHEST_KNOWN_VERSION request fails completely, which would break the update mechanisms and health checks that rely on this call.
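
To illustrate why an unclamped request ends in a deserialization failure, here is a toy decoder for an SP build that knows only two response variants. The `[tag, payload]` wire layout, the tag values, and the names `KnownResponse`, `DecodeError`, and `decode` are all assumptions made for this sketch; the actual sprot encoding is not shown in this issue.

```rust
/// Response variants known to an older SP build (hypothetical layout).
#[derive(Debug, PartialEq, Eq)]
pub enum KnownResponse {
    V2(u8),
    V3(u8),
}

/// Ways the toy decoder can fail.
#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    /// Fewer than two bytes were supplied.
    Truncated,
    /// The tag names a variant this build was not compiled with.
    UnknownVariant(u8),
}

/// Decode a `[tag, payload]` message. Tag values are illustrative only.
pub fn decode(bytes: &[u8]) -> Result<KnownResponse, DecodeError> {
    match bytes {
        [1, payload, ..] => Ok(KnownResponse::V2(*payload)),
        [2, payload, ..] => Ok(KnownResponse::V3(*payload)),
        // A newer RoT may send a tag that this SP has never heard of.
        [tag, _, ..] => Err(DecodeError::UnknownVariant(*tag)),
        _ => Err(DecodeError::Truncated),
    }
}
```

In this model, the only way for the older SP to avoid `UnknownVariant` is to never request a version above the ones it was built with, which is what clamping does.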

Current Implementation + Proposed SP Fix

With the proposed fix and only the SP updated, the deprecated RoT V1 variant is properly "not supported".

Here, the SP is refusing to map a requested V0 from MGS because there is no -1 in the
RoT's version: u8.

The deserialization failures that occur without this fix are avoided.

The fixed SP version does not impact any production code, but its changed
behavior can be probed with faux-mgs ... rot-boot-info -v$N

| MGS | SP | RoT | MGS Request Version | RoT Response/Error |
|---|---|---|---|---|
| main | fix | master | HIGHEST_KNOWN_VERSION | V3 |
| main | fix | master | 0 | Error response from SP: unsupported request for this SP component |
| main | fix | master | 1 | Error response from SP: update: RoT boot info version is not supported |
| main | fix | master | 2 | V2 |
| main | fix | master | 3 | V3 |
| main | fix | master | 4 | V3 |
| main | fix | master | 5 | V3 |

Current MGS + Proposed SP Fix + New RoT Response

A "fixed" SP that does not know about the new message variant and an RoT that
has the new variant still work with the older MGS, because MGS requests its
highest known version (3), which results in a V3 response.

The SP also clamps the version to V3, so faux-mgs probes of V4 and V5 also
return V3.

| MGS | SP | RoT | MGS Request Version | RoT Response/Error |
|---|---|---|---|---|
| main | fix | bdl | HIGHEST_KNOWN_VERSION | V3 |
| main | fix | bdl | 0 | Error response from SP: unsupported request for this SP component |
| main | fix | bdl | 1 | Error response from SP: update: RoT boot info version is not supported |
| main | fix | bdl | 2 | V2 |
| main | fix | bdl | 3 | V3 |
| main | fix | bdl | 4 | V3 |
| main | fix | bdl | 5 | V3 |

⚠️ MGS Compatibility Issue with New Response

If both the SP and RoT know about the new variant, then production code update
still works (returning a V3 response to MGS), but one can use faux-mgs to
elicit a V4 response that MGS does not understand.

The RPC timeout occurs because MGS cannot deserialize the V4 response
and adopts a retry strategy that ultimately fails when the overall
timeout expires.

| MGS | SP | RoT | MGS Request Version | RoT Response/Error |
|---|---|---|---|---|
| main | bdl | bdl | HIGHEST_KNOWN_VERSION | V3 |
| main | bdl | bdl | 0 | Error response from SP: unsupported request for this SP component |
| main | bdl | bdl | 1 | Error response from SP: update: RoT boot info version is not supported |
| main | bdl | bdl | 2 | V2 |
| main | bdl | bdl | 3 | V3 |
| main | bdl | bdl | 4 | ⚠️ RPC call failed (gave up after 5 attempts) |
| main | bdl | bdl | 5 | ⚠️ RPC call failed (gave up after 5 attempts) |

New MGS + Proposed SP Fix + New RoT Response

In the eventual stable configuration where all versions have been updated, we
see that the expected responses are received from all queries.

With all components updated to support the new BootDecisionLog variant,
full V4 support is achieved. The SP's version clamping results in a V4 response
when V5 and higher requests are made.

| MGS | SP | RoT | MGS Request Version | RoT Response/Error |
|---|---|---|---|---|
| bdl | bdl | bdl | HIGHEST_KNOWN_VERSION | V4 |
| bdl | bdl | bdl | 0 | Error response from SP: unsupported request for this SP component |
| bdl | bdl | bdl | 1 | Error response from SP: update: RoT boot info version is not supported |
| bdl | bdl | bdl | 2 | V2 |
| bdl | bdl | bdl | 3 | V3 |
| bdl | bdl | bdl | 4 | V4 |
| bdl | bdl | bdl | 5 | V4 |
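
The rows for a fixed SP in the tables above can be reproduced with a small model. This is a sketch inferred from the observed responses, not the Hubris implementation. The names `Outcome` and `fixed_sp_outcome` are hypothetical, and the RoT behavior (reject zero-based version 0, otherwise answer with the lower of the requested version and its own highest) is an assumption that matches the test results reported here. All three arguments use the one-based MGS numbering.

```rust
/// What MGS would see for a given request (hypothetical model type).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// A response variant, labeled with its one-based MGS version.
    Variant(u8),
    /// The SP rejected the request: MGS version 0 has no zero-based form.
    SpUnsupported,
    /// The RoT rejected the request: the version is not supported.
    RotVersionNotSupported,
}

/// Model of a fixed SP in front of an RoT (assumed behavior, see lead-in).
pub fn fixed_sp_outcome(mgs_request: u8, sp_highest: u8, rot_highest: u8) -> Outcome {
    // SP: clamp to its own highest known version, then go zero-based.
    let Some(rot_request) = mgs_request.min(sp_highest).checked_sub(1) else {
        return Outcome::SpUnsupported;
    };
    // RoT: zero-based version 0 (MGS V1) is not supported.
    if rot_request == 0 {
        return Outcome::RotVersionNotSupported;
    }
    // RoT: answer with the requested variant, or its own highest if lower.
    let rot_highest_zero_based = rot_highest.saturating_sub(1);
    Outcome::Variant(rot_request.min(rot_highest_zero_based) + 1)
}
```

The model covers only the SP and RoT. It does not capture whether MGS can deserialize the variant it gets back, which is the failure seen when a main-branch MGS probes V4 against bdl SP and RoT images.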
