Description
NVIDIA Open GPU Kernel Modules Version
570.133.07
Operating System and Version
Fedora Linux 41 (KDE Plasma)
Kernel Release
Linux fedora 6.13.9-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Mar 29 01:29:31 UTC 2025 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Build Command
Ran the repository's ./install.sh script.
Terminal output/Build Log
TODO
More Info
Hi there, thanks for your great work! I tried to apply the patch via the install.sh script, but it doesn't seem to be working so far.
My system is a Gigabyte X670 Aorus Master with a Ryzen 7 7800X3D. I disabled the IOMMU in the BIOS; ACS can't be disabled there directly, so I added pcie_acs_override=downstream to the kernel command line via GRUB:
BOOT_IMAGE=(hd9,gpt2)/vmlinuz-6.13.9-200.fc41.x86_64 root=UUID=20305397-eb89-422b-83f6-af7d095f7768 ro rootflags=subvol=root rhgb quiet iommu=off pcie_acs_override=downstream nvidia-drm.modeset=1 nvidia-drm.fbdev=1
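In case it helps, these are the generic checks I can use to confirm the setup actually took effect (my own list, not from the repo, so let me know if these aren't the right places to look):
# Confirm the running kernel picked up the parameters
cat /proc/cmdline
# See what the kernel logged about the IOMMU and the ACS override at boot
sudo dmesg | grep -iE 'iommu|acs'
# Inspect the ACS control bits on the PCIe bridges ("+" flags mean ACS is still being enforced)
sudo lspci -vvv | grep -i acsctl
# Confirm the patched open kernel module (570.133.07) is the one actually loaded
cat /proc/driver/nvidia/version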
The patch doesn't directly apply to the 5090, but I applied the workaround from tinygrad#29 (comment).
I ran the following tests:
pancho@fedora:~/cuda-samples/Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 4 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 4090"
CUDA Driver Version / Runtime Version 12.8 / 12.8
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 24092 MBytes (25262096384 bytes)
(128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
GPU Max Clock rate: 2625 MHz (2.62 GHz)
Memory Clock rate: 10501 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 75497472 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA GeForce RTX 4090"
CUDA Driver Version / Runtime Version 12.8 / 12.8
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 24092 MBytes (25262096384 bytes)
(128) Multiprocessors, (128) CUDA Cores/MP: 16384 CUDA Cores
GPU Max Clock rate: 2520 MHz (2.52 GHz)
Memory Clock rate: 10501 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 75497472 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 20 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "NVIDIA GeForce RTX 5090"
CUDA Driver Version / Runtime Version 12.8 / 12.8
CUDA Capability Major/Minor version number: 12.0
Total amount of global memory: 32117 MBytes (33677180928 bytes)
(170) Multiprocessors, (128) CUDA Cores/MP: 21760 CUDA Cores
GPU Max Clock rate: 2655 MHz (2.65 GHz)
Memory Clock rate: 14001 Mhz
Memory Bus Width: 512-bit
L2 Cache Size: 100663296 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 3: "NVIDIA RTX A6000"
CUDA Driver Version / Runtime Version 12.8 / 12.8
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 48550 MBytes (50908823552 bytes)
(084) Multiprocessors, (128) CUDA Cores/MP: 10752 CUDA Cores
GPU Max Clock rate: 1800 MHz (1.80 GHz)
Memory Clock rate: 8001 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 11 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 5090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 5090 (GPU2) : No
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA GeForce RTX 5090 (GPU2) -> NVIDIA RTX A6000 (GPU3) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : No
> Peer access from NVIDIA RTX A6000 (GPU3) -> NVIDIA GeForce RTX 5090 (GPU2) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.8, CUDA Runtime Version = 12.8, NumDevs = 4
Result = PASS
pancho@fedora:~/cuda-samples/Samples/1_Utilities/deviceQuery$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB PHB PHB 0-15 0 N/A
GPU1 PHB X PHB PHB 0-15 0 N/A
GPU2 PHB PHB X PHB 0-15 0 N/A
GPU3 PHB PHB PHB X 0-15 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
pancho@fedora:~/cuda-samples/Samples/1_Utilities/deviceQuery$ cd ~
mkdir -p cuda_p2p_test
cd cuda_p2p_test
cat > p2p_check.cu << 'EOL'
#include <stdio.h>
#include <cuda_runtime.h>
int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("CUDA Device Count: %d\n", deviceCount);
    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }
    printf("\nP2P Connectivity Matrix:\n");
    for (int i = 0; i < deviceCount; i++) {
        for (int j = 0; j < deviceCount; j++) {
            if (i == j) {
                printf(" X");
            } else {
                int canAccessPeer;
                cudaDeviceCanAccessPeer(&canAccessPeer, i, j);
                printf(" %d", canAccessPeer);
            }
        }
        printf("\n");
    }
    return 0;
}
EOL
nvcc -o p2p_check p2p_check.cu
./p2p_check
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
CUDA Device Count: 4
Device 0: NVIDIA GeForce RTX 4090
Device 1: NVIDIA GeForce RTX 4090
Device 2: NVIDIA GeForce RTX 5090
Device 3: NVIDIA RTX A6000
P2P Connectivity Matrix:
X 0 0 0
0 X 0 0
0 0 X 0
0 0 0 X
I understand the 5090 is probably not expected to work, but I had no luck with either of the 4090s either.
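If it helps with debugging, I can also run this small variant, a quick sketch of my own in the same style as above, which tries to actually enable peer access and prints the exact CUDA error instead of just the 0/1 flag:
cat > p2p_enable_check.cu << 'EOL'
#include <stdio.h>
#include <cuda_runtime.h>

// For every ordered pair of devices, attempt to enable peer access and
// report the exact error string the runtime returns.
int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; i++) {
        for (int j = 0; j < deviceCount; j++) {
            if (i == j) continue;
            cudaSetDevice(i);
            cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
            printf("Enable peer access %d -> %d: %s\n", i, j, cudaGetErrorString(err));
            // Clean up so repeated runs start from the same state
            if (err == cudaSuccess) cudaDeviceDisablePeerAccess(j);
        }
    }
    return 0;
}
EOL
nvcc -o p2p_enable_check p2p_enable_check.cu
./p2p_enable_check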
Do I have to do something different on Fedora, or is the patch simply not applicable to this OS? Or did I miss disabling some option? I'm new to this, so apologies in advance if I'm overlooking something obvious. Thanks!