
First-class the use case of multiple pods per node in a specific ComputeDomain #309


Open
jgehrcke opened this issue Mar 28, 2025 · 2 comments
Labels
feature issue/PR that proposes a new feature or functionality

Comments

@jgehrcke
Collaborator

jgehrcke commented Mar 28, 2025

Currently, it is not immediately obvious how to create a single-node ComputeDomain and launch multiple pods in it.

That's a valid use case for multiple reasons, and it is technically sound: a shared "node-local" IMEX channel works, and "same-node MNNVL" is a viable scenario.

We currently support this use case by manually managing the underlying ResourceClaim. A working example is here: https://github.com/jgehrcke/jpsnips-nv/tree/1a462d07b5ba22e78ae32156bf6c730bd94a133c/dra/single-node

(for output, see below)

In the future, this scenario should be easier to set up and not be conceptually different from a ComputeDomain spread across multiple nodes. Ideally, we can follow through on the philosophy of "ComputeDomain follows workload placement" also when the workload is composed of multiple pods on the same node.
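
For reference, the manual approach in the linked example boils down to roughly the following. This is a minimal sketch only; the ComputeDomain spec fields, API versions, and the device-class name are written from memory and may not match the installed CRDs exactly.

```yaml
# Sketch: a single-node ComputeDomain plus a manually managed ResourceClaim
# for the shared IMEX channel (field names illustrative; verify against the
# CRDs and device classes installed by your driver version).
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: cd1node-compute-domain
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: cd1node-compute-domain-channel
---
# Every pod that should join the node-local domain references this one claim.
apiVersion: resource.k8s.io/v1beta1   # use the DRA API version served by your cluster
kind: ResourceClaim
metadata:
  name: cd1node-compute-domain-shared-channel
spec:
  devices:
    requests:
    - name: channel
      deviceClassName: compute-domain-default-channel.nvidia.com  # check `kubectl get deviceclasses`
```
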

Output:

$ bash 1-node-cd.sh 
+ kubectl get resourceclaim
NAME                                    STATE                AGE
cd1node-compute-domain-shared-channel   allocated,reserved   7m42s
repro2-compute-domain-shared-channel    pending              2d1h
+ kubectl get computedomains.resource.nvidia.com
NAME                     AGE
cd1node-compute-domain   7m42s
+ kubectl delete -f 1-node-cd.yaml
job.batch "cd1node" deleted
+ kubectl delete resourceclaim/cd1node-compute-domain-shared-channel
resourceclaim.resource.k8s.io "cd1node-compute-domain-shared-channel" deleted
+ kubectl delete computedomains.resource.nvidia.com cd1node-compute-domain
computedomain.resource.nvidia.com "cd1node-compute-domain" deleted
+ set +x
computedomain.resource.nvidia.com/cd1node-compute-domain created
CDUID: 10dfde78-ffb2-434c-aea8-79fa6461c5a4
resourceclaim.resource.k8s.io/cd1node-compute-domain-shared-channel created
job.batch/cd1node created
+ sleep 5
+ kubectl wait --for=condition=Ready pods -l batch.kubernetes.io/job-completion-index=0,job-name=cd1node
pod/cd1node-0-7slzl condition met
+ kubectl wait --for=condition=Ready pods -l batch.kubernetes.io/job-completion-index=1,job-name=cd1node --timeout=40s
pod/cd1node-1-2w87z condition met
+ set +x


pods on nodes:
NAME              READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
cd1node-0-7slzl   1/1     Running   0          30s   192.168.81.15   gb-nvl-043-bianca-7   <none>           <none>
cd1node-1-2w87z   1/1     Running   0          30s   192.168.81.6    gb-nvl-043-bianca-7   <none>           <none>
DAEMON_POD: cd1node-compute-domain-bz99b-dnzc9


IMEX daemon status:
READY
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
C - Connected - Ready for operation

3/28/2025 19:45:16.602
Nodes:
Node #0   - 10.115.131.12   - READY                - Version: 570.124.06

 Nodes From\To  0  
       0        C  
Domain State: UP
READY stopAtReady: 0
keepGoing: 1
Finishing subscription
READY


leader log tail:
[pod/cd1node-0-7slzl/cd1node] total 0
[pod/cd1node-0-7slzl/cd1node] drwxr-xr-x 2 root root     60 Mar 28 19:45 .
[pod/cd1node-0-7slzl/cd1node] drwxr-xr-x 6 root root    480 Mar 28 19:45 ..
[pod/cd1node-0-7slzl/cd1node] crw-rw-rw- 1 root root 234, 0 Mar 28 19:45 channel0


follower log tail:
[pod/cd1node-1-2w87z/cd1node] total 0
[pod/cd1node-1-2w87z/cd1node] drwxr-xr-x 2 root root     60 Mar 28 19:45 .
[pod/cd1node-1-2w87z/cd1node] drwxr-xr-x 6 root root    480 Mar 28 19:45 ..
[pod/cd1node-1-2w87z/cd1node] crw-rw-rw- 1 root root 234, 0 Mar 28 19:45 channel0


IMEX daemon log:
[Mar 28 2025 19:45:04] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist.  Assuming no previous importers.
[Mar 28 2025 19:45:04] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
[Mar 28 2025 19:45:04] [INFO] [tid 90] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 91] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 92] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 93] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
[Mar 28 2025 19:45:04] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[Mar 28 2025 19:45:04] [INFO] [tid 94] Connection established to node 0 with ip address 10.115.131.12. Number of times connected: 1
[Mar 28 2025 19:45:04] [INFO] [tid 39] GPU event successfully subscribed
@guptaNswati
Contributor

guptaNswati commented Apr 23, 2025

So basically, have the two pods reference the same ResourceClaim, which does not happen in the automatic flow. How do we hack together creating multiple channels on a single node? This has been a frequent ask from internal users.
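
For illustration, the manual pattern amounts to both pods of the job naming the same pre-created claim via the standard DRA pod-spec fields. A rough sketch (image and command are placeholders, not from the original example):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cd1node
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cd1node
        image: nvidia/cuda:12.4.1-base-ubuntu22.04        # placeholder workload image
        command: ["ls", "-la", "/dev/nvidia-caps-imex-channels"]  # placeholder command
        resources:
          claims:
          - name: shared-channel
      # Both pods reference the same pre-created ResourceClaim (instead of a
      # per-pod claim generated from a template), so they share one node-local
      # IMEX channel.
      resourceClaims:
      - name: shared-channel
        resourceClaimName: cd1node-compute-domain-shared-channel
```

Because a single allocated channel claim is node-local, sharing it should also keep both pods on the same node without any explicit affinity.
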

@guptaNswati
Contributor

BTW, this is a good single-node IMEX channel test: cudaMallocAsync -t ipc_mempools_basic

@klueska added the feature label on May 8, 2025