add support of thunderbolt hotplug #842

Open
1 of 2 tasks
bdandy opened this issue Apr 29, 2025 · 1 comment
Labels
bug Something isn't working

bdandy commented Apr 29, 2025

NVIDIA Open GPU Kernel Modules Version

NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 570.144 Release Build (notroot@4edfd97358ec) Fri Apr 25 18:29:08 UTC 2025

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

CachyOS latest

Kernel Release

Linux cachyos 6.14.4-2-cachyos #1 SMP PREEMPT_DYNAMIC Fri, 25 Apr 2025 18:15:23 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-0fded5d6-1fad-c667-3302-31e117ab858a)

Describe the bug

I'm using a dGPU in a Thunderbolt 3 enclosure. The main issue is that during sleep/suspend, or when the device is reconnected, the kernel module goes into the "fallen off the bus" state and does not recover. The device cannot be used at all until reboot.

Hotplug works on Windows btw.

To Reproduce

Disconnect the dGPU from Thunderbolt and connect it again.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

bdandy added the bug label Apr 29, 2025

artlav commented May 12, 2025

Can confirm that this doesn't happen with the proprietary driver.
I have a 3070 in an "Intel Tamales Module 2" type eGPU case, and with the proprietary driver I can hot plug and unplug it without any issues.

With the open driver, unplugging triggers "NVRM: Xid (PCI:0000:09:00): 79, GPU has fallen off the bus.", followed by a bunch of

[   54.962032] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 10!
[   54.962034] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00010; hObject=0xbeef502d; paramsStatus=0x00000000; status=0x0000000f
[   54.962036] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
[   54.962039] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:259
[   54.962042] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1375

in dmesg.
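For anyone trying to confirm they are hitting the same failure, a small script can pull the Xid events out of a saved kernel log. This is only a sketch: the regular expression is inferred from the single Xid line quoted above, the names `XID_RE`, `find_xid_events` and `SAMPLE` are made up for illustration, and the sample text simply reuses two lines from this comment.

```python
import re

# Pattern inferred from the Xid line quoted in this comment (an assumption,
# not an official log format specification).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<msg>.*)"
)


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, message) for each Xid line."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("xid")), match.group("msg").strip())
            )
    return events


# Two lines copied from the report above; only the first is an Xid event.
SAMPLE = """\
NVRM: Xid (PCI:0000:09:00): 79, GPU has fallen off the bus.
[   54.962032] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 10!
"""

for pci, xid, msg in find_xid_events(SAMPLE):
    print(f"{pci}: Xid {xid}: {msg}")
```

To use it on a real machine, save the kernel log to a file (for example by redirecting `dmesg` output), read it into a string, and pass that string to `find_xid_events`.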

After that, replugging it does not work (it's detected, and nvidia-smi and nvtop work, but nothing runs on it).
Trying to run, say, vkcube, gives these errors:

[  119.470389] NVRM: nvAssertFailedNoLog: Assertion failed: _serverGetClientEntryByHandle(pServer, hClient2, 0, CLIENT_LIST_LOCK_UNLOCKED, ppClientEntry2nd) @ rs_server.c:3467
[  119.470401] NVRM: RmExportObject: pRmApi->DupObject(Dev, failed with error code 0x33 in RmExportObject

Also the system won't go into suspend and requires a reboot.

What makes this a more insidious problem than just hot plugging is that the link on these eGPU boxes can be a bit flaky under load.
The proprietary driver just rides through a link upset, with maybe a hiccup in game.
Meanwhile, the open driver fails completely, as if the GPU had been unplugged.

I'm on a Framework 16 laptop with Arch Linux, for context.
Linux frx 6.14.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 09 May 2025 17:36:18 +0000 x86_64 GNU/Linux
