
First-class the use case of multiple pods per node in a specific ComputeDomain #309


Open
jgehrcke opened this issue Mar 28, 2025 · 2 comments
Labels
feature issue/PR that proposes a new feature or functionality

Comments

@jgehrcke
Collaborator

jgehrcke commented Mar 28, 2025

Currently, it is not immediately obvious how to create a single-node ComputeDomain and launch multiple pods in it.

That's a valid use case for multiple reasons, and it is technically sound: a shared "node-local" IMEX channel works, and "same-node MNNVL" is a viable scenario.

We currently support this use case by manually managing the underlying ResourceClaim. A working example is here: https://github.com/jgehrcke/jpsnips-nv/tree/1a462d07b5ba22e78ae32156bf6c730bd94a133c/dra/single-node

(for output, see below)

In the future, this scenario should be easier to set up and not be conceptually different from a ComputeDomain spread across multiple nodes. Ideally, we can follow through on the philosophy of "ComputeDomain follows workload placement" also when the workload is composed of multiple pods on the same node.
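
For reference, the manual approach in the linked example boils down to roughly the following. This is a minimal sketch only; the ComputeDomain spec fields, API versions, and the device-class name are written from memory and may not match the installed CRDs exactly.

```yaml
# Sketch: a single-node ComputeDomain plus a manually managed ResourceClaim
# for the shared IMEX channel (field names illustrative; verify against the
# CRDs and device classes installed by your driver version).
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: cd1node-compute-domain
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: cd1node-compute-domain-channel
---
# Every pod that should join the node-local domain references this one claim.
apiVersion: resource.k8s.io/v1beta1   # use the DRA API version served by your cluster
kind: ResourceClaim
metadata:
  name: cd1node-compute-domain-shared-channel
spec:
  devices:
    requests:
    - name: channel
      deviceClassName: compute-domain-default-channel.nvidia.com  # check `kubectl get deviceclasses`
```
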

Output:

$ bash 1-node-cd.sh 
+ kubectl get resourceclaim
NAME                                    STATE                AGE
cd1node-compute-domain-shared-channel   allocated,reserved   7m42s
repro2-compute-domain-shared-channel    pending              2d1h
+ kubectl get computedomains.resource.nvidia.com
NAME                     AGE
cd1node-compute-domain   7m42s
+ kubectl delete -f 1-node-cd.yaml
job.batch "cd1node" deleted
+ kubectl delete resourceclaim/cd1node-compute-domain-shared-channel
resourceclaim.resource.k8s.io "cd1node-compute-domain-shared-channel" deleted
+ kubectl delete computedomains.resource.nvidia.com cd1node-compute-domain
computedomain.resource.nvidia.com "cd1node-compute-domain" deleted
+ set +x
computedomain.resource.nvidia.com/cd1node-compute-domain created
CDUID: 10dfde78-ffb2-434c-aea8-79fa6461c5a4
resourceclaim.resource.k8s.io/cd1node-compute-domain-shared-channel created
job.batch/cd1node created
+ sleep 5
+ kubectl wait --for=condition=Ready pods -l batch.kubernetes.io/job-completion-index=0,job-name=cd1node
pod/cd1node-0-7slzl condition met
+ kubectl wait --for=condition=Ready pods -l batch.kubernetes.io/job-completion-index=1,job-name=cd1node --timeout=40s
pod/cd1node-1-2w87z condition met
+ set +x


pods on nodes:
NAME              READY   STATUS    RESTARTS   AGE   IP              NODE                  NOMINATED NODE   READINESS GATES
cd1node-0-7slzl   1/1     Running   0          30s   192.168.81.15   gb-nvl-043-bianca-7   <none>           <none>
cd1node-1-2w87z   1/1     Running   0          30s   192.168.81.6    gb-nvl-043-bianca-7   <none>           <none>
DAEMON_POD: cd1node-compute-domain-bz99b-dnzc9


IMEX daemon status:
READY
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
C - Connected - Ready for operation

3/28/2025 19:45:16.602
Nodes:
Node #0   - 10.115.131.12   - READY                - Version: 570.124.06

 Nodes From\To  0  
       0        C  
Domain State: UP
READY stopAtReady: 0
keepGoing: 1
Finishing subscription
READY


leader log tail:
[pod/cd1node-0-7slzl/cd1node] total 0
[pod/cd1node-0-7slzl/cd1node] drwxr-xr-x 2 root root     60 Mar 28 19:45 .
[pod/cd1node-0-7slzl/cd1node] drwxr-xr-x 6 root root    480 Mar 28 19:45 ..
[pod/cd1node-0-7slzl/cd1node] crw-rw-rw- 1 root root 234, 0 Mar 28 19:45 channel0


follower log tail:
[pod/cd1node-1-2w87z/cd1node] total 0
[pod/cd1node-1-2w87z/cd1node] drwxr-xr-x 2 root root     60 Mar 28 19:45 .
[pod/cd1node-1-2w87z/cd1node] drwxr-xr-x 6 root root    480 Mar 28 19:45 ..
[pod/cd1node-1-2w87z/cd1node] crw-rw-rw- 1 root root 234, 0 Mar 28 19:45 channel0


IMEX daemon log:
[Mar 28 2025 19:45:04] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist.  Assuming no previous importers.
[Mar 28 2025 19:45:04] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
[Mar 28 2025 19:45:04] [INFO] [tid 90] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 91] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 92] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 93] Started processing of incoming messages.
[Mar 28 2025 19:45:04] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
[Mar 28 2025 19:45:04] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[Mar 28 2025 19:45:04] [INFO] [tid 94] Connection established to node 0 with ip address 10.115.131.12. Number of times connected: 1
[Mar 28 2025 19:45:04] [INFO] [tid 39] GPU event successfully subscribed
@guptaNswati
Contributor

guptaNswati commented Apr 23, 2025

So basically, have the two pods reference the same ResourceClaim, which does not happen in the automatic flow. How do we hack together creating multiple channels on a single node? This has been a frequent ask from internal users.
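
For illustration, the manual pattern amounts to both pods of the job naming the same pre-created claim via the standard DRA pod-spec fields. A rough sketch (image and command are placeholders, not from the original example):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cd1node
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cd1node
        image: nvidia/cuda:12.4.1-base-ubuntu22.04        # placeholder workload image
        command: ["ls", "-la", "/dev/nvidia-caps-imex-channels"]  # placeholder command
        resources:
          claims:
          - name: shared-channel
      # Both pods reference the same pre-created ResourceClaim (instead of a
      # per-pod claim generated from a template), so they share one node-local
      # IMEX channel.
      resourceClaims:
      - name: shared-channel
        resourceClaimName: cd1node-compute-domain-shared-channel
```

Because a single allocated channel claim is node-local, sharing it should also keep both pods on the same node without any explicit affinity.
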

@guptaNswati
Contributor

BTW, this is a good single-node IMEX channel test: cudaMallocAsync -t ipc_mempools_basic

@klueska added the feature label on May 8, 2025