Currently, it is not immediately obvious how to create a single-node ComputeDomain and launch multiple pods in it.
That's a valid use case for multiple reasons. It is also technically sound: a shared "node-local" IMEX channel works, and "same-node MNNVL" is a viable scenario.
We currently support this use case by manually managing the underlying ResourceClaim. A working example is here: https://github.com/jgehrcke/jpsnips-nv/tree/1a462d07b5ba22e78ae32156bf6c730bd94a133c/dra/single-node (for output, see below).
In the future, this scenario should be easier to implement and not be conceptually different from a ComputeDomain spread across multiple nodes. Ideally, we can follow through with the philosophy of "ComputeDomain follows workload placement" also when the workload comprises multiple pods on the same node.
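For orientation, the manual approach has roughly the following shape. This is a sketch only, not the exact manifests from the linked repository: the ComputeDomain spec fields, the device class name, and the container image are assumptions; the object names match those seen in the output below. The real claim in the example also carries driver-specific config tying it to the ComputeDomain (the CDUID printed by the script), which is omitted here.

```yaml
---
# Single-node ComputeDomain (field names are assumptions; see the linked
# example for the manifests that were actually used).
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: cd1node-compute-domain
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: cd1node-compute-domain-channel
---
# Manually managed ResourceClaim requesting one IMEX channel.
apiVersion: resource.k8s.io/v1beta1   # API version depends on the cluster
kind: ResourceClaim
metadata:
  name: cd1node-compute-domain-shared-channel
spec:
  devices:
    requests:
    - name: channel
      deviceClassName: compute-domain-default-channel.nvidia.com  # assumption
---
# Indexed Job with two pods; both reference the same pre-created ResourceClaim
# and therefore share one node-local IMEX channel.
apiVersion: batch/v1
kind: Job
metadata:
  name: cd1node
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
      - name: cd-channel
        resourceClaimName: cd1node-compute-domain-shared-channel  # shared, not templated
      containers:
      - name: cd1node
        image: ubuntu:24.04   # placeholder image
        command: ["bash", "-c", "ls -la /dev/nvidia-caps-imex-channels; sleep 3600"]
        resources:
          claims:
          - name: cd-channel
```

The important detail is `resourceClaimName` in the Job's pod template: both pods point at the one claim created above, instead of each getting a fresh claim generated from a ResourceClaimTemplate.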
Output:
$ bash 1-node-cd.sh
+ kubectl get resourceclaim
NAME STATE AGE
cd1node-compute-domain-shared-channel allocated,reserved 7m42s
repro2-compute-domain-shared-channel pending 2d1h
+ kubectl get computedomains.resource.nvidia.com
NAME AGE
cd1node-compute-domain 7m42s
+ kubectl delete -f 1-node-cd.yaml
job.batch "cd1node" deleted
+ kubectl delete resourceclaim/cd1node-compute-domain-shared-channel
resourceclaim.resource.k8s.io "cd1node-compute-domain-shared-channel" deleted
+ kubectl delete computedomains.resource.nvidia.com cd1node-compute-domain
computedomain.resource.nvidia.com "cd1node-compute-domain" deleted
+ set +x
computedomain.resource.nvidia.com/cd1node-compute-domain created
CDUID: 10dfde78-ffb2-434c-aea8-79fa6461c5a4
resourceclaim.resource.k8s.io/cd1node-compute-domain-shared-channel created
job.batch/cd1node created
+ sleep 5
+ kubectl wait --for=condition=Ready pods -l batch.kubernetes.io/job-completion-index=0,job-name=cd1node
pod/cd1node-0-7slzl condition met
+ kubectl wait --for=condition=Ready pods -l batch.kubernetes.io/job-completion-index=1,job-name=cd1node --timeout=40s
pod/cd1node-1-2w87z condition met
+ set +x
pods on nodes:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cd1node-0-7slzl 1/1 Running 0 30s 192.168.81.15 gb-nvl-043-bianca-7 <none> <none>
cd1node-1-2w87z 1/1 Running 0 30s 192.168.81.6 gb-nvl-043-bianca-7 <none> <none>
DAEMON_POD: cd1node-compute-domain-bz99b-dnzc9
IMEX daemon status:
READY
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
C - Connected - Ready for operation
3/28/2025 19:45:16.602
Nodes:
Node #0 - 10.115.131.12 - READY - Version: 570.124.06
Nodes From\To 0
0 C
Domain State: UP
READY stopAtReady: 0
keepGoing: 1
Finishing subscription
READY
leader log tail:
[pod/cd1node-0-7slzl/cd1node] total 0
[pod/cd1node-0-7slzl/cd1node] drwxr-xr-x 2 root root 60 Mar 28 19:45 .
[pod/cd1node-0-7slzl/cd1node] drwxr-xr-x 6 root root 480 Mar 28 19:45 ..
[pod/cd1node-0-7slzl/cd1node] crw-rw-rw- 1 root root 234, 0 Mar 28 19:45 channel0
follower log tail:
[pod/cd1node-1-2w87z/cd1node] total 0
[pod/cd1node-1-2w87z/cd1node] drwxr-xr-x 2 root root 60 Mar 28 19:45 .
[pod/cd1node-1-2w87z/cd1node] drwxr-xr-x 6 root root 480 Mar 28 19:45 ..
[pod/cd1node-1-2w87z/cd1node] crw-rw-rw- 1 root root 234, 0 Mar 28 19:45 channel0
IMEX daemon log:
[Mar 28 2025 19:45:04] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist. Assuming no previous importers.
[Mar 28 2025 19:45:04] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
[Mar 28 2025 19:45:04] [INFO] [tid 90] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 91] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 92] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 93] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
[Mar 28 2025 19:45:04] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[Mar 28 2025 19:45:04] [INFO] [tid 94] Connection established to node 0 with ip address 10.115.131.12. Number of times connected: 1
[Mar 28 2025 19:45:04] [INFO] [tid 39] GPU event successfully subscribed
So, basically, have the two pods reference the same ResourceClaim, which does not happen in the automatic flow. How do we hack creating multiple channels on a single node? This has been a frequent ask from internal users.
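In pod-spec terms, the difference boils down to which claim reference the pods carry (a sketch; the template name shown for the automatic flow is an assumption):

```yaml
# Automatic flow: each pod gets its own ResourceClaim generated from a
# template, so two pods on the same node end up with separate claims/channels.
resourceClaims:
- name: cd-channel
  resourceClaimTemplateName: cd1node-compute-domain-channel   # assumption
---
# Manual flow (as in the sketch above): both pods name the same pre-created
# claim and share one node-local IMEX channel.
resourceClaims:
- name: cd-channel
  resourceClaimName: cd1node-compute-domain-shared-channel
```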