
AI Chip War: Google Ironwood TPU, AMD MI350, and NVIDIA GB10


As a bellwether for technology trends in the global chip industry, the 2025 Hot Chips conference brought together top tech companies for a head-to-head showdown. Google, AMD, and NVIDIA showcased their latest AI accelerators and core technologies. From Google’s Ironwood TPU built specifically for large-scale AI inference, to AMD’s MI350 series based on the CDNA 4 architecture and aimed at massive clusters, to NVIDIA’s GB10 chip based on the Blackwell architecture and focused on compact AI workstations—each of these products has a distinct position and technical path. Yet, together, they outline the core development trajectory of the AI hardware field in 2025.


Google Ironwood TPU

At the 2025 Hot Chips conference, Google focused on its latest Tensor Processing Unit (TPU), codenamed Ironwood. Officially released months prior, this product is Google’s first TPU designed specifically for large-scale AI inference, a departure from previous TPUs that focused on AI training. It is heavily optimized for Large Language Models (LLMs), Mixture-of-Experts (MoE) models, and reasoning models.

A single Ironwood SuperPod can integrate up to 9,216 chips, providing peak FP8 compute of 42.5 Exaflops. Despite a power consumption of 10 MW, its energy efficiency is twice that of the previous-generation Trillium TPU. However, the chip is used only inside Google Cloud services and is not sold externally.

The Ironwood TPU incorporates multiple innovative technologies:

  • Massive SuperPod Architecture: It uses optical circuit switches (OCS) to enable inter-rack memory sharing, expanding a single SuperPod from 4,096 chips to 9,216. It also comes equipped with 1.77 PB of directly addressable HBM memory.
  • Improved Energy Efficiency: Google claims a 2x increase in energy efficiency.
  • Enhanced RAS Features: This generation of TPUs places special emphasis on optimizing RAS (Reliability, Availability, and Serviceability) to ensure stable system operation.
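The quoted SuperPod totals can be cross-checked per chip. The per-chip figures below are derived from the totals above, not quoted directly:

```python
# Cross-check the quoted SuperPod totals; per-chip values are derived.
CHIPS_PER_POD = 9_216
POD_FP8_FLOPS = 42.5e18     # 42.5 Exaflops at FP8
POD_HBM_BYTES = 1.77e15     # 1.77 PB directly addressable HBM

fp8_per_chip = POD_FP8_FLOPS / CHIPS_PER_POD   # ~4.6 PFLOPS per chip
hbm_per_chip = POD_HBM_BYTES / CHIPS_PER_POD   # ~192 GB per chip

print(f"{fp8_per_chip / 1e15:.2f} PFLOPS FP8, {hbm_per_chip / 1e9:.0f} GB HBM per chip")
```

The ~192 GB per chip is consistent with the per-chip HBM3e capacity Google quotes for Ironwood later in the article.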


While TPUv4 used OCS to enable memory sharing across 4,096 chips, Ironwood more than doubles this scale to 9,216 chips. OCS not only supports configuring nodes into cuboid topologies of different sizes but also switches automatically around node failures, reconfiguring compute resources via a checkpoint-recovery mechanism.
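As a toy illustration (not Google's implementation) of the checkpoint-recovery pattern the OCS fabric enables: on a node failure, healthy chips are re-linked into a new slice and the job resumes from its last checkpoint instead of restarting from scratch.

```python
# Sketch of checkpoint-recovery: on failure, roll back to the last
# checkpoint and continue (in the real system, OCS reroutes around
# the failed node before the job resumes).
def run_with_recovery(total_steps, ckpt_interval, step_fn, fail_at=None):
    state, step = 0, 0
    ckpt_step, ckpt_state = 0, 0
    while step < total_steps:
        if step == fail_at:
            fail_at = None                       # simulate a single node failure
            step, state = ckpt_step, ckpt_state  # roll back to last checkpoint
            continue
        state = step_fn(state, step)
        step += 1
        if step % ckpt_interval == 0:
            ckpt_step, ckpt_state = step, state  # take a checkpoint
    return state

# Summing 0..9 with a simulated failure at step 7 still yields 45.
result = run_with_recovery(10, 4, lambda s, i: s + i, fail_at=7)
print(result)  # → 45
```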


With 1.77 PB of shared HBM memory, Google has set a new record for shared memory capacity.


In FP8 precision, Ironwood’s machine learning compute power has made a significant leap.


Compared to the TPUv4, Ironwood’s energy efficiency has increased by nearly 6 times.


Ironwood utilizes Google’s third-generation liquid cooling solution, which uses a multi-loop water path design to keep the cold plates clean and prevent blockages.


Ironwood supports Google’s latest generation SparseCore feature, which can significantly improve the efficiency of embeddings and collectives.


In addition to energy efficiency, Google also focused on optimizing power stability. Through a co-design of hardware and software, Ironwood can smooth out power consumption fluctuations, reducing the operational complexity for power suppliers. The large-scale deployment of Ironwood is already underway.


Ironwood’s Architectural Design

Google upgraded its System-on-a-Chip (SoC) architecture to break through the limitations of single-wafer size. Each Ironwood chip contains two compute chiplets.

To better meet the high memory demands of LLMs, Ironwood is equipped with eight stacks of HBM3e memory, with a total capacity of 192 GB and a memory bandwidth of up to 7.3 TB/s. Ironwood doesn’t just chase bigger headline specifications; it also integrates more reliability and security features, such as an integrated root of trust, built-in self-test functions, and modules to detect silent data corruption. It can even verify arithmetic results in real time while a workload is running.
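The real-time verification happens in hardware on Ironwood; as a toy illustration of the underlying duplicate-and-compare idea (not Google's mechanism):

```python
# Toy illustration of duplicate-and-compare checking for silent data
# corruption (SDC): run an operation on a trusted reference path and a
# path under test, and flag any mismatch.
def checked_mul(a, b, mul_fn):
    ref = a * b             # trusted reference path
    out = mul_fn(a, b)      # path being verified
    if out != ref:
        raise ArithmeticError(f"SDC detected: {out} != {ref}")
    return out

print(checked_mul(3, 7, lambda a, b: a * b))  # → 21
```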

Google states, “We have integrated a large number of features into the chip, essentially to create the most energy-efficient chip.” AI technology also contributed to the chip’s design. The design and layout optimization of Ironwood’s ALU circuits were done using AI technology, a result of collaboration with the AlphaChip team.


The TPU architecture features several key upgrades and enhancements: HBM3e memory, interconnect hardware that scales up to 9,216 chips, and a distributed architecture that scales to dozens of SuperPods.


Ironwood adds several new features to support confidential computing, including an integrated root of trust, secure boot, and secure test/debug.


Ironwood uses a tray-based design; each tray holds four liquid-cooled Ironwood TPUs.


Each rack holds 16 TPU trays, for a total of 64 TPUs, and is paired with 16 CPU hosts. All interconnects within a rack use copper cables, while connections between racks go through OCS.
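The rack figures compose as follows (the racks-per-pod value is derived from the numbers above, not quoted):

```python
# Rack math for an Ironwood deployment; the 144-rack figure is derived.
TPUS_PER_TRAY = 4
TRAYS_PER_RACK = 16
TPUS_PER_RACK = TPUS_PER_TRAY * TRAYS_PER_RACK   # 64 TPUs per rack
RACKS_PER_POD = 9_216 // TPUS_PER_RACK           # 144 racks for a full SuperPod

print(TPUS_PER_RACK, RACKS_PER_POD)
```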


Optimizing energy efficiency across the entire chain, from the chip to the data center, is crucial. Power consumption is a key factor that constrains overall performance, requiring not only excellent hardware design but also a data center-level power awareness and regulation system to ensure the entire facility runs efficiently.



AMD MI350

AMD publicly shared details of its new CDNA 4 architecture for the first time. This architecture powers AMD’s new MI350 series of accelerators. Like the previous MI300, AMD built this “behemoth” chip with 3D chip stacking: up to eight Accelerator Compute Dies (XCDs) are stacked on top of two I/O base dies, resulting in a superchip with 185 billion transistors.


The use of LLMs is exploding, and models are becoming increasingly complex. They are not only growing in size but also demanding longer context lengths for inference.


To run these models at high performance, accelerators need greater memory bandwidth and capacity, and energy efficiency is also critical. Moreover, it must be possible to cluster multiple GPUs to support ultra-large-scale models.


The AMD MI350 series began shipping this year, and the company emphasizes that the product is fully on track with its roadmap.


The MI350 series includes two platforms: MI350X for air-cooled systems and MI355X for liquid-cooled systems.


With 185 billion transistors, the MI350 continues AMD’s use of chiplet and chip stacking technology. Consistent with the MI300, four compute chiplets are stacked on each base die. The liquid-cooled version has a total board power consumption of 1.4 kW. Currently, the MI350’s I/O base die is manufactured using a 6nm process, while the compute chiplets use TSMC’s latest 3nm N3P process to optimize performance per watt.


To accommodate the smaller number of base dies in the MI350, AMD adjusted its Infinity Fabric (AMD’s proprietary high-speed interconnect technology). The two-base die design reduces the number of chip-to-chip interconnects and supports wider, lower-frequency die-to-die (D2D) connections, which effectively ensures energy efficiency. Each socket is equipped with seven Infinity Fabric links.


Overall, the fourth-generation Infinity Fabric delivers 2 TB/s more bandwidth than the third generation used in the MI300. Additionally, the extra-large memory capacity reduces the number of GPUs needed, thereby lowering synchronization overhead.


Looking at the cache and storage hierarchy, the MI350’s Local Data Share (LDS) capacity has doubled compared to the MI300.


Each new, larger I/O base die can accommodate four compute chiplets, and a single MI350 contains eight XCDs. The engine’s maximum frequency is 2.4 GHz, and each XCD is equipped with a 4 MB L2 cache that maintains coherency with other XCDs.


The CDNA 4 architecture nearly doubles the throughput for various data types and adds hardware support for FP6 and FP4.
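To see why FP4 support matters, note that a 4-bit float can encode only 16 values. A minimal sketch of rounding to the FP4 (E2M1) grid, assuming the OCP-style E2M1 value set (real hardware also applies per-block scaling, which is not shown here):

```python
# Round values to the nearest FP4 (E2M1) representable magnitude.
# Grid assumes the OCP Microscaling E2M1 value set: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x):
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))  # nearest magnitude
    return mag if x >= 0 else -mag                      # restore sign

print([quantize_fp4(v) for v in (0.26, 2.4, -5.1, 9.0)])  # → [0.5, 2.0, -6.0, 6.0]
```

Values beyond ±6 simply saturate to the largest representable magnitude, which is why per-block scale factors are essential in practice.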


By increasing the math throughput for AI data types by almost two times, AMD claims its performance is more than two times that of competing accelerators.


Moving from the chip to a system-level perspective, the MI350 can be configured as a single NUMA domain or two NUMA domains. Accessing HBM connected to another base die incurs some latency loss. The dual-NUMA domain design can limit the memory access range of the XCD to its local memory, which bypasses this issue.


In addition to memory partitioning, XCDs can be split into multiple compute partitions. All XCDs can be integrated into a single compute domain, or each XCD can be configured as a separate GPU.
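As a rough illustration of how the memory and compute partition modes compose. The mode names (NPS1/NPS2, SPX/CPX) follow AMD's MI300-era terminology and are my assumption here, not quoted from the talk:

```python
# Illustrative sketch of MI350 partitioning; mode names assumed from
# AMD's MI300-era terminology.
XCDS = 8

numa_modes = {"NPS1": 1, "NPS2": 2}      # NUMA memory domains per socket
compute_modes = {"SPX": 1, "CPX": XCDS}  # logical GPUs exposed per socket

for mode, gpus in compute_modes.items():
    print(f"{mode}: {gpus} logical GPU(s), {XCDS // gpus} XCD(s) each")
```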


Furthermore, multi-socket systems can integrate up to eight GPUs on a single baseboard, with a fully connected topology via Infinity Fabric. PCIe is used to connect to the host CPU and network cards.


AMD uses the standard OAM specification to package the MI350 GPU.


A single universal baseboard can accommodate up to eight OAM modules.


The MI350X can serve as a “plug-and-play” upgrade for existing air-cooled systems built around the MI300 and MI325, without large-scale infrastructure changes.


The liquid-cooled MI355X platform offers higher performance, with a single GPU power consumption of 1.4 kW. It also uses OAM modules but swaps traditional air-cooled heatsinks for more compact direct liquid-cooled cold plates.


The two MI350 series platforms have the same memory capacity and bandwidth, but their compute performance differs, which is mainly due to their different operating frequencies.


For ultra-large-scale compute platforms, the liquid-cooled solution can configure 96 or 128 GPUs per rack, while the air-cooled solution supports 64 GPUs per rack.


If a user needs a complete rack solution, AMD offers a reference rack that features all core chips (GPU, CPU, and scalable network cards) from AMD, allowing for co-optimization of hardware and software.


AMD’s ROCm software ecosystem is gradually maturing. In the process of improving overall performance, software-level performance gains are just as important as hardware performance improvements.


AMD again emphasized the reliability of its product roadmap and stated that it will continue to move forward as planned, with next year’s MI400 continuing this strategy.


AMD also revealed that the MI400 series, to be launched next year, will improve the performance of cutting-edge AI models by up to 10 times.


NVIDIA GB10

NVIDIA did not focus on future hardware at this event but instead provided an in-depth analysis of its latest, already-shipping hardware: the GB10 System-on-a-Chip (SoC). As the heart of NVIDIA’s DGX Spark compact workstation (formerly Project DIGITS), the GB10 is a multi-chip, single-package solution for high-performance Arm-based workstations. The GPU chip is based on the Blackwell architecture, while the CPU chip is designed by MediaTek and integrates 20 Arm CPU cores. Both chips use TSMC’s 3nm process, making the GB10, in process-node terms, the most advanced member of NVIDIA’s Blackwell family.


NVIDIA is fundamentally a GPU-centric company, and the Blackwell architecture is the “soul” of the GB10. Although the architecture has been streamlined into a mini configuration, it retains all core features, including support for FP4 precision.


The GB10 adds several new technical innovations on top of Blackwell: a low-power chip-to-chip (C2C) interconnect, a unified memory architecture (single physical/logical memory), and a design that integrates both CPU and GPU chips on a 2.5D interposer.


With 128 GB of unified LPDDR5X system memory, it can support fine-tuning of models with up to 70 billion parameters. Equipped with a ConnectX-7 network card, it supports interconnecting two DGX Spark systems to handle larger-scale models. NVIDIA explicitly states that the DGX Spark is positioned as an entry-level development device. Once model development and testing are completed on the Spark platform, it can be seamlessly deployed to the DGX Cloud. Its biggest highlight is that it can be powered by a standard power outlet, which is a significant advantage compared to server-level equipment.
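A back-of-envelope check of the 70-billion-parameter claim, using my own precision assumptions rather than NVIDIA's sizing:

```python
# Rough memory math for a 70B-parameter model (assumed precisions).
PARAMS = 70e9
weights_fp16_gb = PARAMS * 2 / 1e9    # ~140 GB: FP16 weights alone exceed 128 GB
weights_4bit_gb = PARAMS * 0.5 / 1e9  # ~35 GB: 4-bit weights fit comfortably

print(f"FP16: {weights_fp16_gb:.0f} GB, 4-bit: {weights_4bit_gb:.0f} GB")
```

This suggests fine-tuning at that scale on 128 GB relies on quantization and/or parameter-efficient methods rather than full-precision full fine-tuning, which is my inference, not a claim from NVIDIA.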


The DGX Spark offers 20 CPU cores, up to 4 TB of SSD storage, 128 GB of unified LPDDR5X memory, and more.


Both chips use TSMC’s 3nm process. The GPU supports all new Blackwell features, including DLSS and Ray Tracing, providing 31 TFLOPS of FP32 performance or 1,000 TFLOPS of FP4 performance. The CPU cores are based on the Arm v9.2 architecture (confirmed to be an off-the-shelf core design, though it’s not disclosed whether it’s a Cortex or Neoverse solution). They are divided into two 10-core clusters, with each core having its own private L2 cache. The 256-bit LPDDR5X-9400 memory interface achieves a memory bandwidth of approximately 301 GB/s.
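The ~301 GB/s figure follows directly from the quoted interface parameters:

```python
# Derive memory bandwidth from the quoted bus width and transfer rate.
BUS_BITS = 256
TRANSFER_RATE_MTPS = 9400   # LPDDR5X-9400: 9,400 MT/s

# bytes per transfer across the bus, times transfers per second
bandwidth_gbs = BUS_BITS / 8 * TRANSFER_RATE_MTPS / 1000

print(f"{bandwidth_gbs:.1f} GB/s")  # → 300.8 GB/s
```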


The GPU chip supports up to four display outputs (three DisplayPort interfaces + one HDMI 2.1a interface). The GB10’s total thermal design power (TDP) is 140 watts.


The GPU chip integrates a large 24 MB L2 cache, which also implements CPU/GPU cache coherency. Hardware-level coherency management reduces performance overhead and simplifies the developer’s work. The implementation of Address Translation Services (ATS) allows the entire graphics L2 cache to have physical tags. The operating system recognizes it as a PCIe device, supports SR-IOV virtualization technology, and is equipped with NVDEC/NVENC codec engines.


Each DGX Spark has a built-in ConnectX-7 network card that supports dual-system interconnection. The SoC connects to the network card over a PCIe 5.0 x8 back channel (note: this means the card’s usable one-way bandwidth is only about 200 Gbps, so the two ports cannot run at full speed simultaneously).
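The bottleneck in that note follows from PCIe arithmetic: the host link's usable throughput falls well short of what two 200 Gb/s ports would demand.

```python
# Why both ConnectX-7 ports can't run at line rate simultaneously:
# the PCIe 5.0 x8 host link caps usable throughput near 252 Gb/s.
PCIE5_GTPS = 32           # PCIe 5.0 raw rate per lane, GT/s
LANES = 8
ENCODING = 128 / 130      # 128b/130b line coding overhead

link_gbps = PCIE5_GTPS * LANES * ENCODING   # ~252 Gb/s usable
ports_gbps = 2 * 200                        # 400 Gb/s if both ports ran flat out

print(f"host link ~{link_gbps:.0f} Gb/s vs {ports_gbps} Gb/s port demand")
```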


As mentioned earlier, this project was jointly developed by NVIDIA and MediaTek, with the latter providing the CPU chip (S-die). Since the memory controller is integrated into the CPU chip, NVIDIA is highly dependent on MediaTek to deliver a reliable memory subsystem. MediaTek also integrated some of NVIDIA’s IPs, including the display controller and C2C interconnect technology. A lot of verification work was done in the early stages of the project to ensure that the first tape-out was successful without needing design corrections.


With the GB10 as the core chip of the DGX Spark, NVIDIA is applying it to compact workstations to support various AI workloads within the CUDA ecosystem. This small chip is enabling the development and validation of large tasks, providing a foundational platform for subsequent large-scale cloud deployment.
