Skip to content

Script on Fedora 41, with RTX 4090/4090/A6000 with 570.133.04 (and 5090) #1

Closed
@Panchovix

Description

@Panchovix

NVIDIA Open GPU Kernel Modules Version

570.133.07

Operating System and Version

Fedora Linux 41 (KDE Plasma)

Kernel Release

Linux fedora 6.13.9-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Mar 29 01:29:31 UTC 2025 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Build Command

Used the ./install.sh file

Terminal output/Build Log

TODO

More Info

Hi there, thanks for your great work! I have tried to apply the patch from the install.sh file, but it seems it's not working so far.

I have a Gigabyte X670 Aorus Master and a Ryzen 7 7800X3D, and disabled IOMMU from BIOS. Can't directly disable ACS, so added pcie_acs_override=downstream into the grub

BOOT_IMAGE=(hd9,gpt2)/vmlinuz-6.13.9-200.fc41.x86_64 root=UUID=20305397-eb89-422b-83f6-af7d095f7768 ro rootflags=subvol=root rhgb quiet iommu=off pcie_acs_override=downstream nvidia-drm.modeset=1 nvidia-drm.fbdev=1

The 5090 is not direcly applicable, but applied tinygrad#29 (comment)

I have done the next tests

pancho@fedora:~/cuda-samples/Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.8 / 12.8
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24092 MBytes (25262096384 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP:    16384 CUDA Cores
  GPU Max Clock rate:                            2625 MHz (2.62 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA GeForce RTX 4090"
  CUDA Driver Version / Runtime Version          12.8 / 12.8
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 24092 MBytes (25262096384 bytes)
  (128) Multiprocessors, (128) CUDA Cores/MP:    16384 CUDA Cores
  GPU Max Clock rate:                            2520 MHz (2.52 GHz)
  Memory Clock rate:                             10501 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 75497472 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 20 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "NVIDIA GeForce RTX 5090"
  CUDA Driver Version / Runtime Version          12.8 / 12.8
  CUDA Capability Major/Minor version number:    12.0
  Total amount of global memory:                 32117 MBytes (33677180928 bytes)
  (170) Multiprocessors, (128) CUDA Cores/MP:    21760 CUDA Cores
  GPU Max Clock rate:                            2655 MHz (2.65 GHz)
  Memory Clock rate:                             14001 Mhz
  Memory Bus Width:                              512-bit
  L2 Cache Size:                                 100663296 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "NVIDIA RTX A6000"
  CUDA Driver Version / Runtime Version          12.8 / 12.8
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 48550 MBytes (50908823552 bytes)
  (084) Multiprocessors, (128) CUDA Cores/MP:    10752 CUDA Cores
  GPU Max Clock rate:                            1800 MHz (1.80 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 11 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 5090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 5090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 5090 (GPU2) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.8, CUDA Runtime Version = 12.8, NumDevs = 4
Result = PASS
pancho@fedora:~/cuda-samples/Samples/1_Utilities/deviceQuery$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-15    0               N/A
GPU1    PHB      X      PHB     PHB     0-15    0               N/A
GPU2    PHB     PHB      X      PHB     0-15    0               N/A
GPU3    PHB     PHB     PHB      X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
pancho@fedora:~/cuda-samples/Samples/1_Utilities/deviceQuery$ cd ~
mkdir -p cuda_p2p_test
cd cuda_p2p_test

cat > p2p_check.cu << 'EOL'
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("CUDA Device Count: %d\n", deviceCount);

    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }

    printf("\nP2P Connectivity Matrix:\n");
    for (int i = 0; i < deviceCount; i++) {
        for (int j = 0; j < deviceCount; j++) {
            if (i == j) {
                printf("  X");
            } else {
                int canAccessPeer;
                cudaDeviceCanAccessPeer(&canAccessPeer, i, j);
                printf("  %d", canAccessPeer);
            }
        }
        printf("\n");
    }

    return 0;
}
EOL

nvcc -o p2p_check p2p_check.cu
./p2p_check
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
CUDA Device Count: 4
Device 0: NVIDIA GeForce RTX 4090
Device 1: NVIDIA GeForce RTX 4090
Device 2: NVIDIA GeForce RTX 5090
Device 3: NVIDIA RTX A6000

P2P Connectivity Matrix:
  X  0  0  0
  0  X  0  0
  0  0  X  0
  0  0  0  X

I think the 5090 is not expected to work, but no luck with the 4090 (both of them)

Do I have to do something different for Fedora, or is not applicable for this OS? Or am I missing disabling an option? I'm new to this, so sorry in advance if I'm missing something. Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions