net not initializing itself makes everyone else wait on it #2183

@Aaron-Hartwig

While debugging mfg-troubleshooting#110, we grew suspicious that the 2.5V PHY rail was not up. The failure mode was that we timed out when trying to run hiffy -c ControlPlaneAgent.set_startup_options. It was strange that everything else seemed fine, but looking at the task list, everyone was waiting on net:

$ pfexec humility tasks
humility: attached to 0483:3754:004E001F3133510337363734 via ST-Link V3
system time = 4415
ID TASK                       GEN PRI STATE
 0 jefe                         0   0 recv, notif: fault timer(T+85)
 1 net                          0   5 notif: bit31(T+3)
 2 sys                          0   1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40)
 3 spi2_driver                  0   3 recv
 4 i2c_driver                   0   3 recv
 5 spd                          0   2 notif: i2c1-irq(irq31/irq32)
 6 packrat                      0   1 recv
 7 thermal                      0   5 recv, notif: timer(T+858)
 8 power                        0   6 recv, notif: timer(T+864)
 9 hiffy                        0   5 notif: bit31(T+250)
10 gimlet_seq                   0   4 recv, notif: timer vcore
11 gimlet_inspector             0   6 wait: send to net/gen0
12 hash_driver                  0   2 recv
13 hf                           0   3 recv
14 update_server                0   3 recv
15 sensor                       0   4 recv
16 host_sp_comms                0   8 recv, notif: jefe-state-change usart-irq(irq82) multitimer(T+37) control-plane-agent
17 udpecho                      0   6 wait: send to net/gen0
18 udpbroadcast                 0   6 wait: send to net/gen0
19 control_plane_agent          0   7 wait: send to net/gen0
20 sprot                        0   4 recv
21 validate                     0   5 recv
22 vpd                          0   4 recv
23 user_leds                    0   2 recv, notif: timer
24 dump_agent                   0   6 wait: send to net/gen0
25 sbrmi                        0   4 recv
26 idle                         0   9 RUNNING
27 udprpc                       0   6 wait: send to net/gen0

net seems to be spending a lot of time in crappy_spin_until:

$ pfexec humility tasks -sl net
humility: attached to 0483:3754:004E001F3133510337363734 via ST-Link V3
system time = 83751
ID TASK                       GEN PRI STATE
 1 net                          0   5 notif: bit31(T+3)
   |
   +--->  0x24010f60 0x0807bc9a userlib::sys_recv_stub
                     @ /hubris/sys/userlib/src/lib.rs:368
          0x24010fa8 0x0807bd20 userlib::sys_get_timer
                     @ /hubris/sys/userlib/src/lib.rs:1130
          0x24010fa8 0x0807bcf8 userlib::hl::sleep_until
                     @ /hubris/sys/userlib/src/hl.rs:425
          0x24010fa8 0x0807bd20 userlib::hl::sleep_for
                     @ /hubris/sys/userlib/src/hl.rs:459
          0x24011f40 0x08072934 drv_stm32h7_eth::crappy_spin_until
                     @ /hubris/drv/stm32h7-eth/src/lib.rs:59
          0x24011f40 0x08072942 main
                     @ /hubris/task/net/src/main.rs:163

If the PHY isn't powered/responding, we seem to just be stuck in this indefinitely.

We should make this sort of problem fail louder or perhaps more gracefully.
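
One way to "fail louder" could be to bound the wait and return an error instead of spinning forever. The following is only a minimal host-side sketch of that idea, not the actual drv_stm32h7_eth code: the function name `spin_until_or_timeout`, the `SpinTimeout` type, and the `now`/`sleep` closure parameters are all hypothetical stand-ins for whatever timer and sleep primitives the task really uses.

```rust
/// Hypothetical error: the condition never became true before the deadline.
#[derive(Debug, PartialEq, Eq)]
pub struct SpinTimeout {
    /// Number of times the condition was polled before giving up.
    pub polls: u32,
}

/// Polls `cond` until it returns true or `timeout_ticks` have elapsed.
///
/// `now` and `sleep` are assumptions standing in for the task's timer and
/// sleep calls; they are passed in so this sketch can run on a host.
pub fn spin_until_or_timeout(
    mut cond: impl FnMut() -> bool,
    mut now: impl FnMut() -> u64,
    mut sleep: impl FnMut(u64),
    timeout_ticks: u64,
    poll_interval_ticks: u64,
) -> Result<(), SpinTimeout> {
    let deadline = now().saturating_add(timeout_ticks);
    let mut polls = 0u32;
    loop {
        polls = polls.saturating_add(1);
        if cond() {
            return Ok(());
        }
        // Give up once the deadline has passed rather than waiting forever.
        if now() >= deadline {
            return Err(SpinTimeout { polls });
        }
        sleep(poll_interval_ticks);
    }
}
```

With something like this, the caller gets an `Err` when the PHY never responds and can decide what to do next (for example, record the failure somewhere visible and keep servicing IPC so clients are not left blocked in a send). Which of those responses is appropriate for net is a design question this sketch does not answer.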
