Support RTX 5090 #29
Comments
I am also very interested in this, but first the 5090 needs to be available to all of us :) If you have one, please check whether large BAR is still there; if not, it does not make sense to try. |
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2b87 (rev a1) (prog-if 00 [VGA controller]) |
The 5090 seems to have some issues with CUDA and NCCL library adaptation. At present, it seems to cause an illegal memory access error when NCCL P2P runs. |
Sat Mar 8 06:06:45 2025
+-----------------------------------------------------------------------------------------+
I have two 5090D cards, in case anyone can get P2P working on them~ |
I updated CUDA to 12.8.1, but simpleP2P and p2pBandwidthLatencyTest still have problems. |
The problem is not about CUDA and NCCL. The driver needs to be modified to fully unlock P2P capability on the RTX 5090. I made a quick fix to the driver to enable P2P support on the RTX 5090:
diff --git a/kernel-open/common/inc/nv-linux.h b/kernel-open/common/inc/nv-linux.h
index e9daf8a9..fa60e5d4 100644
--- a/kernel-open/common/inc/nv-linux.h
+++ b/kernel-open/common/inc/nv-linux.h
@@ -531,6 +531,7 @@ static inline void *nv_ioremap_cache(NvU64 phys, NvU64 size)
 static inline void *nv_ioremap_wc(NvU64 phys, NvU64 size)
 {
     void *ptr = NULL;
+    return nv_ioremap_nocache(phys, size);
 #if IS_ENABLED(CONFIG_INTEL_TDX_GUEST) && defined(NV_IOREMAP_DRIVER_HARDENED_WC_PRESENT)
     ptr = ioremap_driver_hardened_wc(phys, size);
 #elif defined(NV_IOREMAP_WC_PRESENT)
diff --git a/src/nvidia/inc/libraries/mmu/gmmu_fmt.h b/src/nvidia/inc/libraries/mmu/gmmu_fmt.h
index 31ec4249..581dc897 100644
--- a/src/nvidia/inc/libraries/mmu/gmmu_fmt.h
+++ b/src/nvidia/inc/libraries/mmu/gmmu_fmt.h
@@ -342,7 +342,14 @@ gmmuFieldSetAperture
     NvU8 *pMem
 )
 {
-    nvFieldSetEnum(&pAperture->_enum, value, pMem);
+    GMMU_APERTURE new_value;
+    if (value == GMMU_APERTURE_PEER) {
+        new_value = GMMU_APERTURE_SYS_NONCOH;
+    }
+    else {
+        new_value = value;
+    }
+    nvFieldSetEnum(&pAperture->_enum, new_value, pMem);
 }
 /*!
The patch was applied to the open driver 570.133.07 (you still need to apply the tinygrad patch first). In addition, as P2P usually requires, disable IOMMU and PCIe ACS in your BIOS. Some quick tests:
NCCL 2.26.2-1:
|
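A quick way to sanity-check the IOMMU part of the advice above from a running Linux system is to look at `/sys/kernel/iommu_groups`, which is normally populated only when an IOMMU is active. The following is a minimal sketch of that heuristic, not something posted in this thread; the helper names `iommu_groups` and `iommu_looks_enabled` are hypothetical, and an empty directory is only a hint, not proof, that the BIOS setting took effect. It does not check PCIe ACS.

```python
import os


def iommu_groups(sysfs_root="/sys/kernel/iommu_groups"):
    """Return the IOMMU group names found under sysfs_root, or [] if none are visible."""
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        # Directory missing or unreadable: treat as "no groups visible".
        return []
    # Group names are decimal numbers; sort them numerically without parsing.
    return sorted(entries, key=lambda name: (len(name), name))


def iommu_looks_enabled(sysfs_root="/sys/kernel/iommu_groups"):
    """Heuristic: the kernel exposes at least one IOMMU group when an IOMMU is in use."""
    return len(iommu_groups(sysfs_root)) > 0


if __name__ == "__main__":
    groups = iommu_groups()
    if groups:
        print(f"IOMMU appears to be enabled ({len(groups)} groups)")
    else:
        print("No IOMMU groups visible; IOMMU appears to be disabled")
```

The `sysfs_root` parameter exists only so the check can be pointed at a different directory for testing.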
@huanzhang12 Sorry to bother you, but is it possible to package that into a .run file? I would be glad to test on a 5090. I get an illegal memory access when trying to use P2P, even with cards that aren't Blackwell 2.0 (i.e. 4090 or A6000). EDIT: Managed to apply it and build the driver, but no luck with cards that aren't the 5090.
|
@huanzhang12 while you were testing simpleP2P, did you see any error messages in dmesg like below: |
@legezywzh I saw that error message with 575 driver. The driver structure changed and a new patch would be needed. |
@huanzhang12 Thanks for the quick reply. So with 570.133.20, you don't see these error messages? |
No errors on the 570 branch. I upgraded to 570.144 and it worked fine too. |
Thanks for your patch!! I am using the patched driver 570.133.07 on an AMD Turin CPU with an RTX 5090. When I disable Resizable BAR in the BIOS, turn off IOMMU, and disable PCIe ACS, I can see P2P working successfully. However, when running some tests, CUDA errors occur.
But when I enable Resizable BAR again, keeping all other settings the same as before, nvidia-smi fails to recognize the device and shows "No devices were found." However, lspci -vv shows that the system does recognize the NVIDIA device. Is enabling Resizable BAR necessary? Any help would be greatly appreciated! |
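When comparing the Resizable BAR on and off cases, it helps to read the BAR sizes out of `lspci -vv` rather than eyeballing them. Below is a minimal sketch that parses the `Region N: Memory at ... [size=...]` lines; on NVIDIA GPUs Region 1 (BAR1) is normally the framebuffer aperture. The `SAMPLE` text is illustrative only and was not posted by anyone in this thread, and the function name `parse_memory_bars` is hypothetical.

```python
import re

# Binary multipliers for the size suffixes that lspci prints.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

# Matches e.g. "Region 1: Memory at 6000000000 (64-bit, prefetchable) [size=32G]".
_REGION = re.compile(
    r"Region (\d+): Memory at \S+ \(([^)]*)\).*?\[size=(\d+)([KMGT]?)\]"
)

# Illustrative input only (assumed shape of `lspci -vv` output, not real thread data).
SAMPLE = """\
\tRegion 0: Memory at f0000000 (32-bit, non-prefetchable) [size=64M]
\tRegion 1: Memory at 6000000000 (64-bit, prefetchable) [size=32G]
\tRegion 3: Memory at 6800000000 (64-bit, prefetchable) [size=32M]
"""


def parse_memory_bars(lspci_text):
    """Map BAR index -> {'flags': str, 'size_bytes': int} for each memory region."""
    bars = {}
    for match in _REGION.finditer(lspci_text):
        index = int(match.group(1))
        size_bytes = int(match.group(3)) * _UNITS[match.group(4)]
        bars[index] = {"flags": match.group(2), "size_bytes": size_bytes}
    return bars


if __name__ == "__main__":
    for index, bar in sorted(parse_memory_bars(SAMPLE).items()):
        print(f"BAR{index}: {bar['size_bytes'] / (1 << 30):.3f} GiB ({bar['flags']})")
```

To use it on a real machine, pass in the text captured from `lspci -vv -s <bus address>` for the GPU; a BAR1 of only a few hundred MiB would suggest that Resizable BAR is not in effect.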
Hi Doctor, do you have any idea how to get P2P working on the 4090 48G? I tried, but it just failed. It seems the BAR1 size can only be set to about 32G, but the 4090 48G may need a BAR1 size of up to 48G?
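On the BAR1 size question: PCI BAR sizes are always powers of two, so a 48G BAR is not a valid setting. Assuming the P2P path needs BAR1 to cover all of VRAM (an assumption based on the "large BAR" remarks in this thread, not something confirmed here), the smallest BAR that would fit 48 GiB is 64 GiB. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def min_bar_size_gib(vram_gib):
    """Smallest power-of-two BAR size (in GiB) that can map vram_gib of memory.

    Assumes the whole VRAM must be visible through the BAR, which is an
    assumption for illustration, not a documented driver requirement.
    """
    if vram_gib <= 0:
        raise ValueError("vram_gib must be positive")
    size = 1
    while size < vram_gib:
        size *= 2
    return size


if __name__ == "__main__":
    for vram in (24, 32, 48):
        print(f"{vram} GiB VRAM -> {min_bar_size_gib(vram)} GiB BAR")
```

Whether a given motherboard and a modified 48G card can actually expose a 64 GiB BAR1 is a separate question that this calculation does not answer.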
NVIDIA Open GPU Kernel Modules Version
565.57.01-p2p
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
ubuntu 22.04
Kernel Release
NA
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
RTX 5090
Describe the bug
It would be great to enable P2P on the RTX 5090, since it supports PCIe 5.0, which is twice as fast as PCIe 4.0. In fact, PCIe 5.0 could reach the speed of first-generation NVLink. Supporting P2P for the RTX 5090 could therefore theoretically scale to 8 GPUs without much performance degradation for distributed training / tensor-parallel inference.
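The bandwidth comparison can be made concrete with a back-of-the-envelope calculation. The sketch below uses the published link rates (16 GT/s per lane for PCIe 4.0, 32 GT/s for PCIe 5.0, both with 128b/130b encoding) and the 20 GB/s per direction per link figure for first-generation NVLink with 4 links per GPU, as on the P100. These figures are not from this thread, and the result is a theoretical upper bound that ignores protocol overhead.

```python
def pcie_gbytes_per_s(gt_per_s, lanes=16, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s (packet overhead ignored)."""
    return gt_per_s * lanes * encoding / 8


PCIE4_X16 = pcie_gbytes_per_s(16.0)  # about 31.5 GB/s per direction
PCIE5_X16 = pcie_gbytes_per_s(32.0)  # about 63 GB/s per direction

NVLINK1_PER_LINK = 20.0  # GB/s per direction per link (first-generation NVLink)
NVLINK1_LINKS = 4        # links per GPU on the P100
NVLINK1_TOTAL = NVLINK1_PER_LINK * NVLINK1_LINKS  # 80 GB/s per direction

if __name__ == "__main__":
    print(f"PCIe 4.0 x16: {PCIE4_X16:.1f} GB/s per direction")
    print(f"PCIe 5.0 x16: {PCIE5_X16:.1f} GB/s per direction")
    print(f"NVLink 1.0 (4 links): {NVLINK1_TOTAL:.1f} GB/s per direction")
    print(f"PCIe 5.0 x16 is worth about {PCIE5_X16 / NVLINK1_PER_LINK:.1f} NVLink 1.0 links")
```

Under these assumptions a PCIe 5.0 x16 slot lands in the same range as first-generation NVLink (roughly three of its four links) rather than matching it exactly.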
To Reproduce
Unable to support P2P on RTX 5090 at the moment.
Bug Incidence
Always
nvidia-bug-report.log.gz
NA
More Info
No response