NVIDIA Engineer Fixes an AMD Bug

An NVIDIA engineer recently submitted a patch to the Linux kernel that resolves a performance regression affecting AMD Radeon integrated and dedicated GPUs on the latest kernel versions. The regression stemmed from an earlier change by the same engineer, who intended to make room for more than 10 TiB of PCI BAR space but inadvertently triggered a performance bottleneck. The root cause was quickly identified and fixed, showcasing the efficient collaboration within the open-source community.

Here’s the background: Last week, the engineer attempted to optimize the Linux kernel’s memory management by raising the PCI BAR (Base Address Register) space limit from its default value to 10 TiB, to accommodate large-memory systems. However, the change unintentionally reduced KASLR entropy on consumer-grade x86 devices. KASLR (Kernel Address Space Layout Randomization) is a security mechanism that hardens the system by randomizing where the kernel is loaded. The adjustment reduced that randomness and artificially extended the kernel’s accessible memory range (direct_map_physmem_end) to 64 TiB.
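The link between a larger reserved range and lower entropy can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel’s actual algorithm: the function name entropy_bits, the 128 TiB window, the 1 TiB baseline, and the 1 GiB alignment are all hypothetical figures chosen for illustration. The idea it shows is that the more address space is set aside up front, the fewer aligned positions remain for a randomized base.

```python
import math

GIB = 1 << 30
TIB = 1 << 40


def entropy_bits(window: int, reserved: int, align: int = GIB) -> float:
    """Toy estimate: bits of randomness available when one region base is
    picked among aligned slots in the space left after reservations."""
    slack = window - reserved
    if slack < align:
        return 0.0
    return math.log2(slack // align)


# Hypothetical numbers, for illustration only.
WINDOW = 128 * TIB                              # randomizable address window
small_reserve = entropy_bits(WINDOW, 1 * TIB)   # range sized to modest RAM
large_reserve = entropy_bits(WINDOW, 64 * TIB)  # range reserved at 64 TiB

print(f"small reservation: {small_reserve:.2f} bits")
print(f"64 TiB reservation: {large_reserve:.2f} bits")
```

In this toy setup the larger reservation leaves fewer slots and therefore fewer bits; the real figures depend on the kernel’s paging mode and region layout.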

Linux divides memory into multiple zones, with the “zone device” designated for hardware such as GPUs. When the kernel initialized this zone for Radeon GPUs, a critical variable, max_pfn (the highest page frame number the kernel tracks, and so its upper bound on addressable RAM), was incorrectly raised to the equivalent of 64 TiB. Because ordinary GPUs cannot actually address such a vast range, dma_addressing_limited() returned true, forcing the GPU to use the DMA32 zone. That zone provides only 4 GB of memory, which significantly reduces data transfer efficiency and directly hurts games and other graphics-intensive applications through lower frame rates and higher latency.
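The check described above can be modeled in a few lines. This is a simplified sketch, not the kernel’s code: the helper names required_mask and addressing_limited are hypothetical, the 44-bit device mask and the 32 GiB desktop are assumptions for illustration, and real kernels also account for bus limits and IOMMU translation.

```python
PAGE_SHIFT = 12  # 4 KiB pages, as on x86-64
GIB = 1 << 30
TIB = 1 << 40


def required_mask(max_pfn: int) -> int:
    """Smallest all-ones mask that covers the highest page's address."""
    highest_addr = (max_pfn - 1) << PAGE_SHIFT
    return (1 << highest_addr.bit_length()) - 1


def addressing_limited(dma_mask_bits: int, max_pfn: int) -> bool:
    """True when the device cannot reach every address up to max_pfn."""
    device_mask = (1 << dma_mask_bits) - 1
    return device_mask < required_mask(max_pfn)


# Assumed for illustration: a GPU with a 44-bit DMA mask (reaches 16 TiB).
GPU_MASK_BITS = 44

normal_max_pfn = (32 * GIB) >> PAGE_SHIFT    # desktop with 32 GiB of RAM
inflated_max_pfn = (64 * TIB) >> PAGE_SHIFT  # max_pfn pushed to 64 TiB

print(addressing_limited(GPU_MASK_BITS, normal_max_pfn))    # False
print(addressing_limited(GPU_MASK_BITS, inflated_max_pfn))  # True
```

With max_pfn reflecting real RAM, the device covers everything and no restriction applies. Once max_pfn jumps to 64 TiB, the same device looks limited even though no RAM exists at those addresses, which is the situation the article describes.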

The open-source community discovered the problem. Users running early builds of Linux 6.15 noticed abnormal AMD GPU performance, with game performance degrading significantly when KASLR was enabled. Investigation traced the regression directly to the code changes submitted by the NVIDIA engineer. Notably, the engineer did not shirk responsibility but promptly submitted a fix, restoring the normal level of KASLR entropy and ensuring that the GPU was no longer confined to the inefficient DMA32 zone.

This isn’t the first time memory management adjustments have caused GPU issues in the Linux kernel. Back in 2023, with Linux 6.4, some AMD RX 6000 series graphics card users reported similar performance degradation, which was ultimately resolved by rolling back the kernel or adjusting display settings. The timing of this fix matters because the Linux 6.15 merge window closed with the release of 6.15-rc1 on April 6. The fix has been included in a pull request and is expected to land in the official release. Under the Linux kernel development cycle, a new version typically takes 6 to 8 weeks to go from the first release candidate to a stable release, so Linux 6.15 stable is expected in late May or early June, at which point AMD GPU users will be free of this issue.
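The release estimate is plain date arithmetic. A minimal sketch, assuming the April 6 date falls in 2025 (the article does not state the year) and taking the 6 to 8 week range at face value:

```python
from datetime import date, timedelta

# Assumption: the 6.15-rc1 date quoted in the article is in 2025.
rc1 = date(2025, 4, 6)

earliest = rc1 + timedelta(weeks=6)
latest = rc1 + timedelta(weeks=8)

print(earliest.isoformat())  # 2025-05-18
print(latest.isoformat())    # 2025-06-01
```

The resulting window runs from the second half of May to the first day of June, in line with the estimate above.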

From a technical perspective, expanding the PCI BAR space was meant to support high-end servers and professional workstations, which often have tens of terabytes of memory and multiple GPUs. On consumer-grade hardware, however, the adjustment exposed compatibility issues between the kernel and GPU drivers. After the fix, the PCI BAR space limit adjustment is more flexible: it takes effect only when the CONFIG_PCI_P2PDMA option is enabled, avoiding unnecessary impact on ordinary users. Additionally, Linux 6.15 will bring other hardware support improvements, such as the GC 11.5.2 and 11.5.3 modules of AMD’s RDNA3.5 architecture and preliminary compatibility for the Intel Lunar Lake platform, paving the way for next-generation devices.
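Readers who want to know whether their own kernel was built with this option can inspect the kernel configuration. The helper below is a hypothetical sketch that parses config text in the usual CONFIG_NAME=value form; where the text comes from (many distributions ship it as /boot/config-<version>, though that is not guaranteed) is left to the caller.

```python
def config_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is set to y or m in kernel config text."""
    for raw in config_text.splitlines():
        line = raw.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name == option:
            return value in ("y", "m")
    return False


# Illustrative config fragments, not taken from a real system.
sample_on = "CONFIG_PCI=y\nCONFIG_PCI_P2PDMA=y\n"
sample_off = "CONFIG_PCI=y\n# CONFIG_PCI_P2PDMA is not set\n"

print(config_enabled(sample_on, "CONFIG_PCI_P2PDMA"))   # True
print(config_enabled(sample_off, "CONFIG_PCI_P2PDMA"))  # False
```

Matching on the full option name avoids false positives from options that merely share a prefix.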

This incident reflects the unique charm of the open-source ecosystem. Despite the intense competition between NVIDIA and AMD in the GPU market, their engineers can collaborate on the same platform through FOSS (Free and Open Source Software). The NVIDIA engineer who introduced the problem also proactively fixed it, ensuring that AMD users benefited. This willingness to fix one’s own mistakes exemplifies the responsible spirit of the open-source community. The Linux kernel, as an open project, accepts contributions from developers worldwide, and every change undergoes rigorous review, so problems are resolved quickly once they are exposed.

For AMD GPU users, this fix is significant. Take the Radeon RX 7900 XT as an example: its gaming performance under Linux previously rivaled Windows, but on early 6.15 kernels, frame rates could drop from a stable 60 fps to below 20 fps. After the fix, performance will return to normal, giving users a smooth experience on the open-source system. Additionally, Linux 6.15 will support more new hardware, such as the high-precision mode of the AMD Instinct MI350X accelerator, providing stronger computing power for professional users.

Currently, Linux users can experience the fix early by updating to the latest kernel release candidate or wait for the stable release. In any case, this episode once again proves that the collaborative power of the open-source community is sufficient to address complex technical challenges, and that Linux, as a flexible and efficient operating system, continues to evolve, bringing more possibilities to technology enthusiasts and professional users.
