Support RTX 5090 #29

Open
1 of 2 tasks
hxssgaa opened this issue Feb 2, 2025 · 13 comments
Labels
bug Something isn't working

Comments

@hxssgaa

hxssgaa commented Feb 2, 2025

NVIDIA Open GPU Kernel Modules Version

565.57.01-p2p

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04

Kernel Release

NA

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX 5090

Describe the bug

It would be great to enable P2P on the RTX 5090, since it supports PCIe 5.0, which is twice as fast as PCIe 4.0. In fact, PCIe 5.0 can reach the speed of first-generation NVLink. Supporting P2P on the RTX 5090 could therefore theoretically scale to 8 GPUs without much performance degradation for distributed training / tensor-parallel inference.
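For a rough sense of scale (my own back-of-the-envelope numbers, not from this issue), the raw per-direction bandwidth of an x16 link can be sketched as:

```python
# Raw per-direction bandwidth of a PCIe x16 link (sketch only; real-world
# throughput is lower due to protocol and flow-control overhead).
def pcie_x16_gbps(gt_per_s: float) -> float:
    lanes = 16
    encoding = 128 / 130  # 128b/130b encoding used by PCIe Gen3 and later
    return gt_per_s * encoding * lanes / 8  # GT/s per lane -> GB/s total

gen4 = pcie_x16_gbps(16.0)
gen5 = pcie_x16_gbps(32.0)
print(f"Gen4 x16: {gen4:.1f} GB/s, Gen5 x16: {gen5:.1f} GB/s")
# prints: Gen4 x16: 31.5 GB/s, Gen5 x16: 63.0 GB/s
```

So Gen5 x16 (~63 GB/s per direction) does double Gen4 x16 (~31.5 GB/s), which is the basis of the claim above.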

To Reproduce

Unable to support P2P on RTX 5090 at the moment.

Bug Incidence

Always

nvidia-bug-report.log.gz

NA

More Info

No response

@hxssgaa hxssgaa added the bug Something isn't working label Feb 2, 2025
@ilovesouthpark

I am also very interested in this, but first the 5090 actually needs to be available to us :) If you get one, please check whether the large BAR is still there; if not, it doesn't make sense to try.

@ilovesouthpark

01:00.0 VGA compatible controller: NVIDIA Corporation Device 2b87 (rev a1) (prog-if 00 [VGA controller])
Memory at 80000000 (32-bit, non-prefetchable) [size=64M]
Memory at 4000000000 (64-bit, prefetchable) [size=32G]
Memory at 4800000000 (64-bit, prefetchable) [size=32M]
5090D. Thanks to the CHH user for sharing.

@fulloo5

fulloo5 commented Mar 7, 2025

The 5090 seems to have some issues with CUDA and NCCL library adaptation. At present, it seems to cause illegal memory access errors when NCCL P2P runs.

@lambo111-x86

Sat Mar 8 06:06:45 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.10 Driver Version: 570.86.10 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 D Off | 00000000:8A:00.0 Off | N/A |
| 54% 71C P1 574W / 575W | 4663MiB / 32607MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 D Off | 00000000:C3:00.0 Off | N/A |
| 78% 84C P1 575W / 575W | 4621MiB / 32607MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2334 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3843 G ./GUI.for.SingBox 6MiB |
| 0 N/A N/A 3861 G ...bkit2gtk-4.1/WebKitWebProcess 16MiB |
| 0 N/A N/A 104924 C ...quai/rigel-1.21.0-linux/rigel 4598MiB |
| 1 N/A N/A 2334 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 104924 C ...quai/rigel-1.21.0-linux/rigel 4598MiB |
+-----------------------------------------------------------------------------------------+
root@lam-MU72-SU0-00:~# lspci -vvvs c3:00.0|grep -i bar
Vector table: BAR=0 offset=00b90000
PBA: BAR=0 offset=00ba0000
Capabilities: [134 v1] Physical Resizable BAR
BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
Capabilities: [140 v1] Virtual Resizable BAR
BAR 2: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 256TB 512TB 1PB 2PB 4PB 8PB 16PB 32PB 64PB 128PB 256PB 512PB 1EB 2EB 4EB 8EB

I do have 2 pcs of the 5090D, if anyone can make P2P work on them~
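Aside: on Linux, BAR sizes like the ones in the lspci dump above can also be changed from userspace through the kernel's resizable-BAR sysfs interface, which takes the size as a power-of-two exponent in MB. A sketch of that arithmetic (my reading of the `resourceN_resize` sysfs ABI; double-check against your kernel's documentation before writing to it):

```python
import math

def bar_resize_exponent(size_bytes: int) -> int:
    # /sys/bus/pci/devices/<bdf>/resourceN_resize takes log2 of the BAR
    # size in MB (assumption based on the kernel sysfs-bus-pci ABI docs).
    size_mb = size_bytes // (1 << 20)
    exp = int(math.log2(size_mb))
    assert (1 << exp) == size_mb, "BAR sizes must be powers of two"
    return exp

print(bar_resize_exponent(32 << 30))   # 32 GB -> prints 15
print(bar_resize_exponent(256 << 20))  # 256 MB -> prints 8
```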

@LuosjDD

LuosjDD commented Mar 14, 2025

The 5090 seems to have some issues with cuda and NCCL Lib adaptation. At present, it seems to cause illegal memory access error when NCCL P2P runs.

I updated the cuda version to 12.8.1, but the simpleP2P and p2pBandwidthLatencyTest still have problems.

@huanzhang12

The problem is not about CUDA and NCCL. The driver needs to be modified to fully unlock P2P capability in RTX 5090.

I made a quick fix to the driver to enable P2P support on RTX 5090. In gmmuFieldSetAperture, I change the aperture value to GMMU_APERTURE_SYS_NONCOH when it is set to GMMU_APERTURE_PEER. In nv_ioremap_wc, I force-disabled write combining to remove a kernel warning about PAT, but this may not be necessary. More investigation is needed to make the patch better, but at least we have something working now.

diff --git a/kernel-open/common/inc/nv-linux.h b/kernel-open/common/inc/nv-linux.h
index e9daf8a9..fa60e5d4 100644
--- a/kernel-open/common/inc/nv-linux.h
+++ b/kernel-open/common/inc/nv-linux.h
@@ -531,6 +531,7 @@ static inline void *nv_ioremap_cache(NvU64 phys, NvU64 size)
 static inline void *nv_ioremap_wc(NvU64 phys, NvU64 size)
 {
     void *ptr = NULL;
+    return nv_ioremap_nocache(phys, size);
 #if IS_ENABLED(CONFIG_INTEL_TDX_GUEST) && defined(NV_IOREMAP_DRIVER_HARDENED_WC_PRESENT)
     ptr = ioremap_driver_hardened_wc(phys, size);
 #elif defined(NV_IOREMAP_WC_PRESENT)
diff --git a/src/nvidia/inc/libraries/mmu/gmmu_fmt.h b/src/nvidia/inc/libraries/mmu/gmmu_fmt.h
index 31ec4249..581dc897 100644
--- a/src/nvidia/inc/libraries/mmu/gmmu_fmt.h
+++ b/src/nvidia/inc/libraries/mmu/gmmu_fmt.h
@@ -342,7 +342,14 @@ gmmuFieldSetAperture
     NvU8                      *pMem
 )
 {
-    nvFieldSetEnum(&pAperture->_enum, value, pMem);
+    GMMU_APERTURE new_value;
+    if (value == GMMU_APERTURE_PEER) {
+        new_value = GMMU_APERTURE_SYS_NONCOH;
+    }
+    else {
+        new_value = value;
+    }
+    nvFieldSetEnum(&pAperture->_enum, new_value, pMem);
 }
 
 /*!

The patch was applied to open driver 570.133.07 (you still need to apply tinygrad's patch first). In addition, as is usually required for P2P, disable IOMMU and PCIe ACS in your BIOS.

Some quick tests:

$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 5090, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 5090, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 1516.99  31.81  31.58  32.20 
     1  32.05 1548.61  32.22  32.30 
     2  32.14  32.41 1550.10  32.40 
     3  32.03  31.76  31.70 1547.03 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3 
     0 1515.52  56.46  56.44  56.44 
     1  56.46 1536.38  56.47  56.47 
     2  56.46  56.47 1540.97  56.44 
     3  56.46  56.44  56.44 1539.46 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 1527.32  33.98  33.93  33.96 
     1  33.97 1540.88  34.25  33.98 
     2  33.96  34.27 1537.85  34.08 
     3  34.07  34.15  34.01 1545.43 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 1527.32 111.38 111.31 111.39 
     1 111.38 1537.07 111.39 111.39 
     2 111.39 111.34 1535.58 111.39 
     3 111.39 111.39 111.39 1535.56 
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3 
     0   2.08  12.63  12.69  12.61 
     1  12.60   2.07  12.68  12.64 
     2  12.60  12.56   2.08  12.72 
     3  12.73  12.74  12.65   2.09 

   CPU     0      1      2      3 
     0   2.16   5.65   5.58   5.58 
     1   5.57   1.95   5.53   5.48 
     2   5.47   5.43   1.95   5.49 
     3   5.47   5.41   5.48   1.96 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3 
     0   2.10   0.51   0.51   0.49 
     1   0.53   2.08   0.59   0.58 
     2   0.52   0.52   2.08   0.51 
     3   0.59   0.49   0.50   2.07 

   CPU     0      1      2      3 
     0   2.11   1.55   1.55   1.57 
     1   1.61   2.01   1.56   1.61 
     2   1.61   1.57   2.01   1.55 
     3   1.61   1.56   1.62   2.01 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

NCCL 2.26.2-1:

$ NCCL_P2P_LEVEL=SYS ./all_reduce_perf -b 64M -e 8G -f 2 -g 4
# nThread 1 nGpus 4 minBytes 67108864 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3743 on huan-genoa-server device  0 [0000:01:00] NVIDIA GeForce RTX 5090
#  Rank  1 Group  0 Pid   3743 on huan-genoa-server device  1 [0000:41:00] NVIDIA GeForce RTX 5090
#  Rank  2 Group  0 Pid   3743 on huan-genoa-server device  2 [0000:81:00] NVIDIA GeForce RTX 5090
#  Rank  3 Group  0 Pid   3743 on huan-genoa-server device  3 [0000:c1:00] NVIDIA GeForce RTX 5090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108864      16777216     float     sum      -1   2209.2   30.38   45.57      0   2165.7   30.99   46.48      0
   134217728      33554432     float     sum      -1   4387.4   30.59   45.89      0   4290.7   31.28   46.92      0
   268435456      67108864     float     sum      -1   8778.9   30.58   45.87      0   8622.8   31.13   46.70      0
   536870912     134217728     float     sum      -1    17471   30.73   46.09      0    17227   31.16   46.75      0
  1073741824     268435456     float     sum      -1    34876   30.79   46.18      0    34448   31.17   46.75      0
  2147483648     536870912     float     sum      -1    69601   30.85   46.28      0    68874   31.18   46.77      0
  4294967296    1073741824     float     sum      -1   139182   30.86   46.29      0   137664   31.20   46.80      0
  8589934592    2147483648     float     sum      -1   278357   30.86   46.29      0   275357   31.20   46.79      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 46.4009 
#
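As a sanity check on the table above: for all-reduce, nccl-tests derives the busbw column from algbw with the ring factor 2*(n-1)/n, which is 1.5 for 4 GPUs. A minimal sketch:

```python
def allreduce_busbw(algbw: float, n_gpus: int) -> float:
    # nccl-tests reports busbw = algbw * 2*(n-1)/n for all-reduce
    # (see the nccl-tests performance notes).
    return algbw * 2 * (n_gpus - 1) / n_gpus

# First row of the table: 30.38 GB/s algbw on 4 GPUs
print(f"{allreduce_busbw(30.38, 4):.2f}")  # prints 45.57
```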

@Panchovix

Panchovix commented Apr 5, 2025

@huanzhang12 Sorry to bother you, but is it possible to package that into a .run file? I would be glad to test on a 5090. I get an illegal memory access when trying to use P2P even with cards that aren't Blackwell 2.0 (i.e. 4090 or A6000).

EDIT: Managed to apply it and build the driver, but no luck with cards that aren't the 5090.

pancho@fedora:~/cuda-samples/Samples/0_Introduction/simpleP2P$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 5090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 5090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 5090 (GPU2) : No

@legezywzh

The problem is not about CUDA and NCCL. The driver needs to be modified to fully unlock P2P capability in RTX 5090.

I made a quick fix to the driver to enable P2P support on RTX 5090. In gmmuFieldSetAperture, I change the aperture value to GMMU_APERTURE_SYS_NONCOH when it is set to GMMU_APERTURE_PEER. In nv_ioremap_wc, I force-disabled write combining to remove a kernel warning about PAT, but this may not be necessary. More investigation is needed to make the patch better, but at least we have something working now.

@huanzhang12 while you were testing simpleP2P, did you see any error messages in dmesg like below:
[4115826.157667] NVRM: iovaspaceDestruct_IMPL: 4 left-over mappings in IOVAS 0x3e00
[4115826.254141] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:592
[4115826.254149] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:601
[4115826.254315] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:592
[4115826.254318] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:601
[4115826.255854] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:592
[4115826.255857] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:601
[4115826.255860] NVRM: nvAssertFailedNoLog: Assertion failed: Sysmemdesc outlived its attached pGpu @ mem_desc.c:1509
[4115826.255932] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:592
[4115826.255935] NVRM: nvAssertFailedNoLog: Assertion failed: pIOVAS != NULL @ io_vaspace.c:601

@huanzhang12

@legezywzh I saw that error message with 575 driver. The driver structure changed and a new patch would be needed.
I am currently using 570.133.20 where the patch should still work.

@legezywzh

@legezywzh I saw that error message with 575 driver. The driver structure changed and a new patch would be needed. I am currently using 570.133.20 where the patch should still work.

@huanzhang12 Thanks for the quick reply. So with 570.133.20, you don't see these error messages?

@huanzhang12

@huanzhang12 Thanks for the quick reply. So with 570.133.20, you don't see these error messages?

No errors on the 570 branch. I upgraded to 570.144 and it worked fine too.

@xxxzsgxxx

The problem is not about CUDA and NCCL. The driver needs to be modified to fully unlock P2P capability in RTX 5090.

I made a quick fix to the driver to enable P2P support on RTX 5090. In gmmuFieldSetAperture, I change the aperture value to GMMU_APERTURE_SYS_NONCOH when it is set to GMMU_APERTURE_PEER. In nv_ioremap_wc, I force-disabled write combining to remove a kernel warning about PAT, but this may not be necessary. More investigation is needed to make the patch better, but at least we have something working now.


Thanks for your patch!!

I am using the patched driver 570.133.07 on an AMD Turin CPU with RTX 5090. When I disable the Resizable BAR in the BIOS, turn off IOMMU, and disable PCIe ACS, I can see P2P working successfully. However, when running some tests, CUDA errors occur.

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
Cuda failure p2pBandwidthLatencyTest.cu:189: 'mapping of buffer object failed'

But when I enable Resizable BAR again, keeping all other settings the same as before, the nvidia-smi tool fails to recognize the device, showing "No devices were found." However, when checking with lspci -vv, the NVIDIA device is recognized by the system.

Is enabling Resizable BAR necessary here?

Any help would be greatly appreciated!

@lambo111-x86

@huanzhang12 Thanks for the quick reply. So with 570.133.20, you don't see these error messages?

No errors on the 570 branch. I upgraded to 570.144 and it worked fine too.

Hi Doctor, do you have any idea how to get P2P working on the 4090 48G? I tried, but failed. It seems BAR1 can only be set as large as 32G, but the 4090 48G may need a BAR1 size of up to 48G?

9 participants