Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Ubuntu
Kernel Release
Na
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
I am running on a stable kernel release.
Hardware: GPU
4090
Describe the bug
If I connect two nodes with two Mellanox cards (PCIe over fiber), and I have 4090 GPUs in both servers, will P2P communication work correctly over NCCL and RDMA?
To Reproduce
Q
Bug Incidence
Once
nvidia-bug-report.log.gz
Q
More Info
If I connect two nodes with two Mellanox cards (PCIe over fiber), and I have 4090 GPUs in both servers, will P2P communication work correctly over NCCL and RDMA?
If I connect two nodes with two Mellanox cards (PCIe over fiber), and I have 4090 GPUs in both servers, will P2P communication work correctly over NCCL and RDMA?
No.
1) NCCL will block it, although you can get around this by modifying NCCL.
2) Once you've modified it, you will run into the problem that the NIC can't access the GPU memory: you get a local protection error / bad address. I think the address needs to be translated into a BAR1 address (somehow, somewhere) for this to work. You may also need to change page sizes and modify the peermem module to incorporate a BAR1 workaround, given the lack of GPUDirect RDMA support.
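If you just want to see where a given machine stands before patching anything, a small check like the one below may help. It is only a sketch under assumptions: it assumes Linux, it assumes the peer memory module shows up in /proc/modules as `nvidia_peermem` (or the older out-of-tree `nv_peer_mem`), and the helper names are made up for illustration. A loaded module does not mean GPUDirect RDMA works on a 4090; the registration failure described above can still happen. `NCCL_NET_GDR_LEVEL=LOC` is the documented NCCL setting for never using GPU Direct RDMA, in which case NCCL should stage network transfers through host memory instead.

```python
from __future__ import annotations

from pathlib import Path

# Module names are an assumption: nvidia_peermem ships with recent drivers,
# nv_peer_mem is the older out-of-tree module.
PEERMEM_NAMES = {"nvidia_peermem", "nv_peer_mem"}


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return module names from /proc/modules-style text (name is the first field)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def peermem_loaded(proc_modules_text: str) -> bool:
    """True if a peer memory module is listed. Necessary, not sufficient, for GPUDirect RDMA."""
    return bool(loaded_modules(proc_modules_text) & PEERMEM_NAMES)


def nccl_env_without_gdr() -> dict[str, str]:
    """NCCL settings that ask it not to use GPU Direct RDMA and to log transport choices."""
    return {
        "NCCL_NET_GDR_LEVEL": "LOC",  # LOC: never use GPU Direct RDMA
        "NCCL_DEBUG": "INFO",         # log which transport NCCL actually picks
    }


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    print("peer memory module loaded:", peermem_loaded(text))
    for key, value in nccl_env_without_gdr().items():
        print(f"export {key}={value}")
```

Run it on each node and export the printed variables before launching the NCCL job. The `NCCL_DEBUG=INFO` output should then show which network transport NCCL selected.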