
NVIDIA and AMD Submit Latest MLPerf GPU Performance Results

·829 words·4 mins
Blackwell B200 MI325X MLPerf Inference

In today’s rapidly advancing AI technology landscape, GPU performance has become a key indicator for measuring hardware strength. Recently, NVIDIA and AMD submitted their latest GPU performance results—Blackwell B200 and Instinct MI325X—in the MLPerf Inference v5.0 benchmark test. This test, developed by MLCommons, evaluates AI inference task performance, considering not only raw hardware computing power but also software optimization and support for new AI workloads. The test results show that NVIDIA continues to lead in performance, while AMD demonstrates strong catch-up momentum with its new products.

NVIDIA’s Blackwell B200 performed particularly impressively in this test. The flagship GB200 NVL72 system integrates 72 Blackwell GPUs interconnected via fifth-generation NVLink, effectively acting as one massive GPU. In the Llama 3.1 405B ultra-large language model test, the GB200 NVL72 achieved a throughput of 869,200 tokens/s, a 30-fold increase over the previous-generation H200 NVL8 (an 8-GPU configuration). This leap comes from a more-than-3x increase in single-GPU performance combined with a 9x expansion of the NVLink interconnect domain. In the Llama 2 70B Interactive test, a DGX B200 system with 8 B200 GPUs delivered 3 times the performance of the H200. This test imposes stricter latency requirements, tightening the allowed time to first token (TTFT) by 4.4x and the time per output token (TPOT) by 5x to simulate a more real-time user experience.
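The reported 30-fold leap can be roughly cross-checked from the two factors cited above; a minimal sketch, treating the "over 3x" per-GPU gain and the 9x NVLink-domain growth as simple multipliers (an illustrative approximation, not an official NVIDIA derivation):

```python
# Rough sanity check of the reported 30x leap from H200 NVL8 to GB200 NVL72.
per_gpu_speedup = 3.0          # "single GPU performance increase of over 3 times"
nvlink_domain_growth = 72 / 8  # 9x larger NVLink domain (72 GPUs vs 8)

combined = per_gpu_speedup * nvlink_domain_growth
print(combined)  # 27.0
```

Taking the ">3x" factor literally at 3.0 already yields 27x, consistent with the ~30x figure once the per-GPU gain exceeds 3.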

NVIDIA and AMD GPU Performance Comparison

The Blackwell B200’s specifications explain where this performance comes from. A single B200 carries 180 GB of HBM3e memory with up to 8 TB/s of bandwidth, supports FP4-precision computing, and reaches a peak of 4.5 petaFLOPS of dense FP8 compute. In an 8-GPU configuration, it reached 98,858 tokens/s in the offline Llama 2 70B scenario and 98,443 tokens/s in the server scenario, far surpassing competitors. This performance suits large-scale data-center inference, and the Blackwell architecture also brings new features to the NVIDIA GeForce RTX 50 series, such as 4:2:2 video-format support in Adobe Premiere Pro and Media Encoder, meeting the needs of professional content creators.


Meanwhile, AMD’s Instinct MI325X also performed well in the tests. As the first AI GPU equipped with 256 GB of HBM3e memory, the MI325X leads NVIDIA’s H200 (141 GB) and B200 (180 GB) in memory capacity, with a bandwidth of 6 TB/s. In an 8 GPU configuration, its Llama 2 70B test results were 33,928 tokens/s offline and 30,724 tokens/s server, closely matching the H200’s 34,081 tokens/s and 30,420 tokens/s. This indicates that the MI325X is competitive in memory-intensive tasks, especially suitable for handling large-parameter language models. AMD also revealed that the next-generation MI355X, based on the new CDNA 4 architecture, will be launched in the second half of 2025, with memory increased to 288 GB, support for FP4 and FP6 formats, and a theoretical peak computing power of 9.2 petaFLOPS, reaching 20.8 petaFLOPS in a single system (8 GPUs), directly targeting NVIDIA’s B200.
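A quick way to compare these 8-GPU results is on a per-GPU basis; a minimal sketch using only the offline Llama 2 70B numbers quoted above:

```python
# Per-GPU offline throughput on Llama 2 70B, derived from the 8-GPU
# results quoted in the article (tokens/s).
results_8gpu_offline = {
    "B200":   98_858,
    "H200":   34_081,
    "MI325X": 33_928,
}

per_gpu = {name: tps / 8 for name, tps in results_8gpu_offline.items()}
for name, tps in sorted(per_gpu.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {tps:,.1f} tokens/s per GPU")
```

On this basis the MI325X trails the H200 by less than half a percent, while the B200 holds roughly a 2.9x lead over both.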


In the Stable Diffusion XL text-to-image generation task, the Blackwell B200 continued to lead, with an 8 GPU configuration achieving 30.38 samples/s offline and 28.44 samples/s server. In contrast, the MI325X achieved 17.10 samples/s and 16.18 samples/s, lagging behind the B200 but close to the H100 (18.37 samples/s and 16.04 samples/s). The H200, with 19.45 samples/s and 18.30 samples/s, was in the middle, demonstrating NVIDIA’s continuous progress in software optimization—its inference performance has improved by approximately 50% compared to last year.


The competition between the two companies extends beyond hardware specifications; software ecosystem support is equally crucial. NVIDIA, with its mature CUDA platform and Triton inference server, holds an advantage in performance tuning. For example, Blackwell uses the Quasar quantization system to run FP4-precision inference, sustaining high throughput while meeting accuracy requirements. AMD continues to improve its software stack through the ROCm platform, and the MI325X’s near-linear scaling (throughput nearly doubles each time the GPU count doubles, from one GPU up to eight) demonstrates the effectiveness of that optimization. With the arrival of the MI355X, AMD is expected to further narrow the gap with NVIDIA in hardware-software synergy.
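Near-linear scaling is typically quantified as measured multi-GPU throughput divided by the ideal (single-GPU throughput times GPU count). A minimal sketch, where the single-GPU figure is hypothetical since the article quotes only 8-GPU results:

```python
# Scaling efficiency: measured throughput vs. ideal linear scaling.
def scaling_efficiency(single_gpu_tps: float, n_gpus: int,
                       measured_tps: float) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return measured_tps / (single_gpu_tps * n_gpus)

# Hypothetical single-GPU figure of 4,400 tokens/s vs. the measured
# 33,928 tokens/s on eight MI325X GPUs (the article's offline number).
eff = scaling_efficiency(4_400, 8, 33_928)
print(f"{eff:.1%}")  # 96.4% of ideal -- "near-linear" scaling
```

Efficiencies above ~90% are generally considered near-linear; anything much lower points to interconnect or software bottlenecks.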

From a market perspective, NVIDIA’s Blackwell series has entered mass production, with over 200 configurations available, covering a wide range of scenarios from data centers to edge computing. Systems like the GB200 NVL72 are seen as the core of “AI factories,” continuously transforming massive data into real-time intelligence. AMD, on the other hand, accelerates its catch-up through the strategy of updating the Instinct series annually. The MI325X began shipping in early 2025, and the subsequent MI355X has also been pre-announced.

Behind this GPU performance race is the rapid expansion of AI model sizes and the diversified demands of application scenarios. Whether it’s a giant model like Llama 3.1 405B with 405 billion parameters or a generative task like Stable Diffusion XL, GPU memory capacity, bandwidth, and computing power are decisive factors. The improvement of AI inference performance no longer relies solely on hardware stacking; software optimization and architectural innovation are equally indispensable.
