P2P over 2 nodes (pcie over fiber) #34

Open
2 tasks
chardog opened this issue Mar 18, 2025 · 2 comments
Labels
bug Something isn't working

Comments

chardog commented Mar 18, 2025

NVIDIA Open GPU Kernel Modules Version

Na

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu

Kernel Release

Na

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

4090

Describe the bug

If I connect two nodes using two Mellanox cards (PCIe over fiber), with 4090 GPUs in both servers, will P2P communication work correctly over NCCL and RDMA?

To Reproduce

Q

Bug Incidence

Once

nvidia-bug-report.log.gz

Q

More Info

If I connect two nodes using two Mellanox cards (PCIe over fiber), with 4090 GPUs in both servers, will P2P communication work correctly over NCCL and RDMA?

@chardog chardog added the bug Something isn't working label Mar 18, 2025

chardog commented Mar 18, 2025

If I connect two nodes using two Mellanox cards (PCIe over fiber), with 4090 GPUs in both servers, will P2P communication work correctly over NCCL and RDMA?

@zvorinji

No.

  1. NCCL will block it, although you can get around this by modifying NCCL.
  2. Once NCCL is modified, you will hit the next problem: the NIC can't access the GPU memory, and you get a local protection error / bad address. I think the address needs to be translated into a BAR1 address (somehow, somewhere) for this to work. You may also need to change page sizes and modify the peermem module to incorporate a BAR1 workaround, given the lack of GPUDirect RDMA support.
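Before attempting any of the modifications described in this reply, it can help to confirm whether a peer-memory kernel module is loaded on each node at all. The following is a minimal sketch, assuming a Linux host where loaded modules are listed in `/proc/modules` and the module is named either `nvidia_peermem` (shipped with the NVIDIA driver) or `nv_peer_mem` (the older Mellanox-provided module). The function name `loaded_peermem_modules` is made up for this illustration. A loaded module does not by itself mean the NIC can register 4090 memory; it only rules out one obvious cause.

```python
from pathlib import Path
from typing import List

# Assumed module names: nvidia_peermem (bundled with the NVIDIA driver)
# and nv_peer_mem (legacy module distributed for Mellanox NICs).
PEERMEM_MODULES = ("nvidia_peermem", "nv_peer_mem")


def loaded_peermem_modules(proc_modules_text: str) -> List[str]:
    """Return the peer-memory modules found in /proc/modules-style text.

    Only the first column (the module name) of each line is inspected,
    so a module that merely appears in another module's dependency
    column is not counted as loaded.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in PEERMEM_MODULES if name in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    found = loaded_peermem_modules(text)
    if found:
        print("peer-memory module(s) loaded:", ", ".join(found))
    else:
        print("no peer-memory module loaded")
```

Running the script on both servers shows whether the peermem side is even present. Running the NCCL job with the `NCCL_DEBUG=INFO` environment variable set should then show which transport NCCL actually selects.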
