New channels do not appear in ResourceSlices #354

Open

robertdavidsmith opened this issue May 12, 2025 · 5 comments
Labels
question Categorizes issue or PR as a support question.

Comments

@robertdavidsmith

robertdavidsmith commented May 12, 2025

Hi,

Currently, if you create new channels (e.g. use mknod to create files such as /dev/nvidia-caps-imex-channels/channel3), they don't show up in the ResourceSlices. This is true even after restarting all DRA driver pods. Do you have any plans to fix this? Alternatively, would you accept a PR that fixes it?

Thanks,

Rob

@jgehrcke added the question label on May 12, 2025
@jgehrcke
Collaborator

jgehrcke commented May 12, 2025

Hello, Rob!

if you create new channels (e.g. use mknod to create files such as /dev/nvidia-caps-imex-channels/channel3), they don't show up in the ResourceSlices

That is expected.

The ComputeDomain construct manages IMEX channels (and IMEX daemons, for that matter) under the hood. With the ComputeDomain primitive, we can treat anything-IMEX as an implementation detail. That is, as a user one would never go in and "create an IMEX channel". Orchestrating IMEX primitives is the responsibility of the ComputeDomain logic/implementation and generally one should not interfere with that.
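
For illustration, a minimal ComputeDomain object looks roughly like the sketch below (placeholder names; the field layout follows the current upstream examples and may differ between driver versions):

```yaml
# Sketch of a ComputeDomain resource (hypothetical names; verify field
# names against the examples shipped with your driver version).
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-compute-domain
spec:
  # Number of nodes the workload is expected to span.
  numNodes: 2
  channel:
    resourceClaimTemplate:
      # The driver creates a ResourceClaimTemplate with this name; pods
      # reference it to get the single IMEX channel (channel 0) injected.
      name: my-compute-domain-channel
```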

Is your motivation to offer more than one IMEX channel per single ComputeDomain? If so: what use case do you have in mind for that?

Currently -- by design -- one ComputeDomain is backed by precisely one IMEX channel (we picked channel zero for that). Also by current design, there is no further sub-division within one ComputeDomain (processes associated with one ComputeDomain are meant to see each other via that single shared IMEX channel).

The ComputeDomain concept is still in its infancy and we are certainly looking forward to making it more flexible and robust and powerful in the future. For example, we might be looking into using different channels as part of supporting more than one ComputeDomain per node (#353).

@robertdavidsmith
Author

robertdavidsmith commented May 13, 2025

Hello @jgehrcke,

Thanks for your comment, understood.

The use case is that we have multiple namespaces which may be running jobs at once. For example, we may have two 32-GPU jobs and two 2-GPU jobs, from four different namespaces, running on one NVL72. Ideally there would be a security boundary between the namespaces.

Being new to IMEX, I was thinking of making either a ComputeDomain per namespace, or a ComputeDomain per job. Then I wanted to do the following for each namespace (there will be ~100 namespaces).

I now understand the above won't work for 2-GPU jobs sharing a k8s node. Would it even work if the smallest job took a whole 4-GPU node?

What would you recommend doing to implement separation between namespaces? If we just put all namespaces on one IMEX channel, how big a security concern is this?

Thanks,

Rob

@robertdavidsmith
Author

(Note also the closely related #351.)

@jgehrcke
Collaborator

jgehrcke commented May 16, 2025

The use case is that we have multiple namespaces which may be running jobs at once [...] Ideally there would be a security boundary between the namespaces.

Perfect. The ComputeDomain (CD) primitive exists precisely to provide that security boundary. The security isolation between jobs in different CDs in different namespaces is strong. That is our ambition.

Being new to IMEX, I was thinking of making either a ComputeDomain per namespace, or a ComputeDomain per job.

A CD is really meant to be tied to a specific workload (to "one job").

Our idea is for a ComputeDomain to form around a workload on the fly.

This magic for automatic creation and teardown of a CD is enabled by the ResourceClaimTemplate approach as shown in this example (note how the pod spec refers to a resourceClaimTemplate, which is also defined in the same YAML document).
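
Roughly, the pattern in that example looks like the following sketch (placeholder names; the pod-level resourceClaims syntax assumes a recent Kubernetes DRA release, so refer to the linked example for the exact, tested manifest):

```yaml
# Sketch only (hypothetical names). The ComputeDomain provides a
# ResourceClaimTemplate; the pod spec in the same YAML document
# references it.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-job-cd
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: my-job-cd-channel
---
apiVersion: v1
kind: Pod
metadata:
  name: my-job-worker
spec:
  containers:
  - name: worker
    image: my-workload-image   # placeholder
    resources:
      claims:
      - name: imex-channel     # refers to the pod-level claim below
  resourceClaims:
  - name: imex-channel
    # Recent Kubernetes DRA syntax; older releases nest this under a
    # "source" field.
    resourceClaimTemplateName: my-job-cd-channel
```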

In that case, the CD is formed automatically and dynamically around the job (the k8s pods). That implies forming a short-lived, single-channel IMEX domain under the hood, which is properly torn down upon job completion.

Then I wanted to do the following for each namespace (there will be ~100 namespaces): [...]

I believe and hope that none of what you wrote next is actually required! :)

Once the CD is formed, the containers that use it (across pods and across nodes) all have IMEX channel 0 injected and can use it.

What would you recommend doing to implement separation between namespaces? If we just put all namespaces on one imex channel how big a security concern is this?

The general idea is that when you use ComputeDomains as intended, isolation is done for you.

Next, let me respond a little more in-depth about the relationship between CDs and k8s namespaces, and about actual security.

GPU memory can only be shared among containers (via NVLink/IMEX) when those containers are all within the same CD. For clarity:

  • When they are in the same CD, they automatically have access to a shared IMEX channel.
  • When they are not in the same CD, there is no shared IMEX channel (there might be physical NVLink connectivity, but it cannot be misused; the lack of IMEX connectivity guarantees that).

A user that has access to the k8s namespace that a job and CD are deployed in can, of course, inject their own workloads into that namespace and thereby access the GPU memory of that job.

That is, to enforce the boundary that a CD provides (from an actual security perspective), one needs to make sure that a bad actor does not have access to the same k8s namespace that the (to-be-secured) job and CD are deployed in.

So, your plan to have many namespaces to isolate users and jobs from each other is exactly in alignment with our security/threat model.

In other words:

  • CDs in separate namespaces provide actual security isolation -- a user that has access to one k8s namespace and runs jobs in that namespace cannot reach into GPU memory shared within an IMEX domain belonging to a CD deployed in a different k8s namespace.
  • CDs in the same namespace provide "don't step onto each other's toes" security (often very useful, too).

@robertdavidsmith
Author

robertdavidsmith commented May 19, 2025

Thank you for your very detailed reply.

We will create a ComputeDomain per job, as you suggest. This could be done by a small Armada code change or by a k8s controller. Then, as you say, IMEX channel 0 is enough, and we can close this ticket.

Would you also recommend additional security measures, such as the following?

  • Using a Cilium network policy to block pod access to the IMEX SERVER_PORT (50000)? (A rough sketch follows after this list.)
  • Configuring IMEX_ENABLE_AUTH_ENCRYPTION in the nvidia-imex config?
  • Configuring Kerberos in the nvidia-imex config?
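
For the first bullet, something along the lines of the sketch below is what I had in mind (assuming the nvidia-imex daemons listen on TCP 50000 on the node network; selectors and deny semantics would need validating against our Cilium setup):

```yaml
# Sketch only: assumes IMEX daemons listen on TCP 50000 on the host
# network; adjust selectors/entities to the actual deployment.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: deny-imex-port-from-pods
spec:
  endpointSelector: {}   # all workload pods; narrow with labels if needed
  egressDeny:
  - toEntities:
    - host
    - remote-node
    toPorts:
    - ports:
      - port: "50000"
        protocol: TCP
```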

Thanks,

Rob
