Hi, I’m running Kubernetes with the latest nvidia/k8s-device-plugin:v0.17.3 on a Jetson AGX Orin node with JetPack 6.2, and I encountered a persistent crash issue when enabling the MPS Control Daemon.
I know that JetPack 6.1 added official support for MPS on Jetson, and I can confirm that MPS works correctly on my Jetson AGX Orin outside of containers: for example, I can start the MPS control daemon manually on the host with `nvidia-cuda-mps-control -d` without any issues.
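For reference, this is roughly how I verify MPS on the host (standard `nvidia-cuda-mps-control` commands; output trimmed):

```shell
# start the MPS control daemon on the host (works fine on JetPack 6.2)
nvidia-cuda-mps-control -d

# the daemon responds to control commands, e.g. querying the default active thread percentage
echo get_default_active_thread_percentage | nvidia-cuda-mps-control

# shut it down again
echo quit | nvidia-cuda-mps-control
```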
However, when I deploy the device plugin with MPS enabled and replicas set to 4, no GPU resources show up on the node at all.
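The sharing configuration I passed to the plugin looked roughly like this (a sketch following the plugin's documented `sharing.mps` config format; how it is mounted, e.g. via the Helm chart's config option, is omitted):

```yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```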
In addition, when I check the plugin pods with `kubectl get pods -n nvidia-device-plugin -o wide`, I see that some of the nvidia-device-plugin-mps-control-daemon pods are in a CrashLoopBackOff state.
Here is the error log from the crashing container (mps-control-daemon-ctr):
```
E0904 09:35:53.754750 312 main.go:84] error starting plugins: error getting daemons: error building device map: error building device map from config.resources: error building GPU device map: error visiting device: error building Device: error getting device paths: error getting GPU device minor number: Not Supported
```
To further investigate, I also tried switching to Time Slicing mode instead of MPS, and that worked perfectly: GPU resources were reported correctly, and no pods crashed.
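The time-slicing variant that worked was essentially the same config with the `mps` block swapped for `timeSlicing` (again a sketch of the documented format):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```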
So I’d like to ask:
1. Is it currently feasible to share GPU resources via MPS on Jetson Orin devices in Kubernetes? MPS-based sharing works well for me on traditional x86 platforms with discrete GPUs (e.g., RTX 2080, A100), but I'm unsure whether it is officially supported on Jetson-class embedded GPUs.
2. If it is currently unsupported, is there any workaround, or is it on the roadmap?
Thanks in advance!