As a bellwether for technology trends in the global chip industry, the 2025 Hot Chips conference brought together top tech companies for a head-to-head showdown. Google, AMD, and NVIDIA showcased their latest AI accelerators and core technologies. From Google’s Ironwood TPU built specifically for large-scale AI inference, to AMD’s MI350 series based on the CDNA 4 architecture and aimed at massive clusters, to NVIDIA’s GB10 chip based on the Blackwell architecture and focused on compact AI workstations—each of these products has a distinct position and technical path. Yet, together, they outline the core development trajectory of the AI hardware field in 2025.
Google Ironwood TPU #
At the 2025 Hot Chips conference, Google focused on its latest Tensor Processing Unit (TPU), codenamed Ironwood. Officially released months prior, this product is Google’s first TPU designed specifically for large-scale AI inference tasks, a departure from previous TPUs that focused on AI training. It is tuned to deliver high performance on Large Language Models (LLMs), Mixture-of-Experts (MoE) models, and reasoning models.
A single Ironwood pod can integrate up to 9,216 chips, delivering 42.5 exaflops of peak FP8 compute. Despite a power draw of 10 MW, its energy efficiency is twice that of the previous-generation Trillium TPU. The chip is used only inside Google Cloud services and is not sold externally.
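As a quick sanity check on these figures, dividing the pod-level number by the chip count gives the implied per-chip rate (back-of-the-envelope arithmetic, not an official spec):

```python
# Implied per-chip FP8 throughput from the published pod-level figures.
POD_FP8_FLOPS = 42.5e18   # 42.5 EFLOPS peak FP8 for a full pod
CHIPS_PER_POD = 9216

per_chip_pflops = POD_FP8_FLOPS / CHIPS_PER_POD / 1e15
print(f"~{per_chip_pflops:.2f} PFLOPS FP8 per chip")  # ~4.61 PFLOPS
```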
The Ironwood TPU incorporates multiple innovative technologies:
- Massive SuperPod Architecture: It uses optical circuit switches (OCS) to enable inter-rack memory sharing, expanding a single SuperPod from 4,096 chips to 9,216. It also comes equipped with 1.77 PB of directly addressable HBM memory.
- Improved Energy Efficiency: Google claims a 2x increase in energy efficiency.
- Enhanced RAS Features: This generation of TPUs places special emphasis on optimizing RAS (Reliability, Availability, and Serviceability) to ensure stable system operation.
While TPUv4 used OCS to enable memory sharing across 4,096 chips, Ironwood more than doubles this scale to 9,216 chips. OCS technology not only supports configuring nodes into cuboid topologies of different sizes but also automatically reroutes around node failures, restoring compute resources via a checkpoint-recovery mechanism.
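The recovery flow can be sketched as a generic training loop: run steps, checkpoint periodically, and when the slice reports a failure, let the OCS re-form the slice and resume from the last checkpoint. This is a minimal illustration with hypothetical helper callables, not Google's internal API:

```python
# Sketch of the checkpoint-recovery pattern that OCS-based fail-over
# relies on. All helper callables are hypothetical placeholders.
def train_with_recovery(total_steps, ckpt_every, state, run_step,
                        save_ckpt, load_ckpt,
                        slice_healthy, reconfigure_slice):
    step = 0
    while step < total_steps:
        if not slice_healthy():
            reconfigure_slice()        # OCS routes around the failed node
            state, step = load_ckpt()  # roll back to the last checkpoint
            continue
        state = run_step(state)
        step += 1
        if step % ckpt_every == 0:
            save_ckpt(state, step)
    return state
```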
With 1.77 PB of shared HBM memory, Google has set a new record for shared memory capacity.
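This pod-level figure lines up with the per-chip memory described later in the article: spreading 1.77 PB over 9,216 chips works out to about 192 GB each (assuming decimal units):

```python
# Per-chip HBM implied by the pod-level shared-memory figure.
POD_HBM_BYTES = 1.77e15   # 1.77 PB of directly addressable HBM
CHIPS_PER_POD = 9216

per_chip_gb = POD_HBM_BYTES / CHIPS_PER_POD / 1e9
print(f"~{per_chip_gb:.0f} GB of HBM per chip")  # ~192 GB
```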
In FP8 precision, Ironwood’s machine learning compute power has made a significant leap.
Compared to the TPUv4, Ironwood’s energy efficiency has increased by nearly 6 times.
Ironwood utilizes Google’s third-generation liquid cooling solution, which uses a multi-loop water path design to keep the cold plates clean and prevent blockages.
Ironwood supports Google’s latest generation SparseCore feature, which can significantly improve the efficiency of embeddings and collectives.
In addition to energy efficiency, Google also focused on optimizing power stability. Through a co-design of hardware and software, Ironwood can smooth out power consumption fluctuations, reducing the operational complexity for power suppliers. The large-scale deployment of Ironwood is already underway.
Ironwood’s Architectural Design #
Google upgraded its System-on-a-Chip (SoC) architecture to break past the size limits of a single die. Each Ironwood chip contains two compute chiplets.
To better meet the high memory demands of LLMs, Ironwood is equipped with eight groups of HBM3e memory, with a total capacity of 192 GB and a memory bandwidth of up to 7.3 TB/s. Ironwood doesn’t just pursue parameter increases; it also integrates more reliability and security features, such as an integrated root of trust, built-in self-test functions, and modules to detect silent data corruption. It can even perform real-time verification of arithmetic operation results while a workload is running.
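Taken together with the compute figures, the bandwidth number gives a rough roofline break-even point for a single chip, illustrating why inference workloads pushed the design toward HBM3e (a sketch derived from the pod-level specs, not an official figure):

```python
# Roofline break-even arithmetic intensity for one Ironwood chip:
# per-chip FP8 compute (pod figure / chip count) over HBM bandwidth.
chip_flops = 42.5e18 / 9216    # ~4.6 PFLOPS FP8 per chip
hbm_bytes_per_s = 7.3e12       # 7.3 TB/s of HBM3e bandwidth

breakeven = chip_flops / hbm_bytes_per_s
print(f"compute-bound only above ~{breakeven:.0f} FLOP/byte")  # ~632
```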
Google states, “We have integrated a large number of features into the chip, essentially to create the most energy-efficient chip.” AI technology also contributed to the chip’s design. The design and layout optimization of Ironwood’s ALU circuits were done using AI technology, a result of collaboration with the AlphaChip team.
The TPU architecture features several key upgrades and enhancements: HBM3e memory, interconnect hardware that scales up to 9,216 chips, and a distributed architecture that scales to dozens of SuperPods.
Ironwood adds several new features to support confidential computing, including an integrated root of trust, secure boot, and secure test/debug.
Ironwood uses a tray-based design, with each tray holding four liquid-cooled Ironwood TPUs.
Each rack holds 16 TPU trays, i.e., 64 TPUs, paired with 16 CPU hosts. All interconnects within a rack use copper cables, while connections between racks go through OCS.
Optimizing energy efficiency across the entire chain, from the chip to the data center, is crucial. Power consumption is a key factor that constrains overall performance, requiring not only excellent hardware design but also a data center-level power awareness and regulation system to ensure the entire facility runs efficiently.
AMD MI350 #
AMD publicly shared details of its new CDNA 4 architecture for the first time. CDNA 4 powers AMD’s new MI350 series of accelerators. Like the previous MI300, AMD built this “behemoth” chip using 3D chip stacking: up to eight Accelerator Compute Dies (XCDs) sit on top of two I/O base dies, yielding a superchip with 185 billion transistors.
The use of LLMs is exploding, and models are becoming increasingly complex. They are not only growing in size but also demanding longer context lengths for inference.
To run these models at high performance, they need greater memory bandwidth and capacity, and energy efficiency is also critical. They must also be able to cluster many GPUs to support ultra-large-scale models.
The AMD MI350 series has already been delivered this year, and the company emphasizes that the product is fully on track with its roadmap.
The MI350 series includes two platforms: MI350X for air-cooled systems and MI355X for liquid-cooled systems.
With 185 billion transistors, the MI350 continues AMD’s use of chiplets and die stacking. As in the MI300, compute chiplets are stacked on base dies; here, four sit atop each of the two base dies. The liquid-cooled version has a total board power of 1.4 kW. The MI350’s I/O base die is manufactured on a 6nm process, while the compute chiplets use TSMC’s latest 3nm N3P process to optimize performance per watt.
To accommodate the smaller number of base dies in the MI350, AMD adjusted its Infinity Fabric (AMD’s proprietary high-speed interconnect technology). The two-base die design reduces the number of chip-to-chip interconnects and supports wider, lower-frequency die-to-die (D2D) connections, which effectively ensures energy efficiency. Each socket is equipped with seven Infinity Fabric links.
Overall, the fourth-generation Infinity Fabric provides 2 TB/s more bandwidth than the third generation used in the MI300. In addition, the extra-large memory capacity reduces the number of GPUs needed, thereby lowering synchronization overhead.
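The capacity argument is simple division: the fewer GPUs a model must be sharded across, the less collective traffic the fabric has to carry. A sketch with illustrative numbers (the model size, per-GPU capacities, and overhead factor below are hypothetical):

```python
import math

def gpus_needed(model_bytes, hbm_per_gpu, overhead=1.2):
    # overhead is an illustrative allowance for activations/KV cache
    return math.ceil(model_bytes * overhead / hbm_per_gpu)

# Hypothetical 520 GB model on 192 GB vs. 288 GB of HBM per GPU
print(gpus_needed(520e9, 192e9))  # 4
print(gpus_needed(520e9, 288e9))  # 3
```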
Looking at the cache and storage hierarchy, the MI350’s Local Data Share (LDS) capacity has doubled compared to the MI300.
Each new, larger I/O base die can accommodate four compute chiplets, and a single MI350 contains eight XCDs. The compute engines run at up to 2.4 GHz, and each XCD carries a 4 MB L2 cache that stays coherent with the other XCDs.
The CDNA 4 architecture nearly doubles the throughput for various data types and adds hardware support for FP6 and FP4.
By increasing the math throughput for AI data types by almost two times, AMD claims its performance is more than two times that of competing accelerators.
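FP4 here plausibly refers to the 4-bit E2M1 layout used in OCP microscaling formats (an assumption; the presentation does not spell out the encoding). Its entire value grid is small enough to enumerate, which makes nearest-value quantization easy to sketch:

```python
# E2M1 (1 sign, 2 exponent, 1 mantissa bit) magnitudes; the per-block
# scaling factors that real MX formats add on top are omitted here.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_VALUES = sorted({s * m for m in FP4_MAGNITUDES for s in (1.0, -1.0)})

def quantize_fp4(x):
    """Round x to the nearest representable FP4 (E2M1) value."""
    return min(FP4_VALUES, key=lambda v: abs(v - x))

print([quantize_fp4(x) for x in (0.3, 1.2, 2.6, 7.0)])  # [0.5, 1.0, 3.0, 6.0]
```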
Moving from the chip to a system-level perspective, the MI350 can be configured as a single NUMA domain or two NUMA domains. Accessing HBM connected to another base die incurs some latency loss. The dual-NUMA domain design can limit the memory access range of the XCD to its local memory, which bypasses this issue.
In addition to memory partitioning, XCDs can be split into multiple compute partitions. All XCDs can be integrated into a single compute domain, or each XCD can be configured as a separate GPU.
Furthermore, multi-socket systems can integrate up to eight GPUs on a single baseboard, with a fully connected topology via Infinity Fabric. PCIe is used to connect to the host CPU and network cards.
AMD uses the standard OAM specification to package the MI350 GPU.
A single universal baseboard can accommodate up to eight OAM modules.
The MI350X can serve as a “plug-and-play” upgrade for existing air-cooled MI300 and MI325X systems, without needing large-scale infrastructure changes.
The liquid-cooled MI355X platform offers higher performance, with a single GPU power consumption of 1.4 kW. It also uses OAM modules but swaps traditional air-cooled heatsinks for more compact direct liquid-cooled cold plates.
The two MI350 series platforms have the same memory capacity and bandwidth, but their compute performance differs, which is mainly due to their different operating frequencies.
For ultra-large-scale compute platforms, the liquid-cooled solution can configure 96 or 128 GPUs per rack, while the air-cooled solution supports 64 GPUs per rack.
If a user needs a complete rack solution, AMD offers a reference rack that features all core chips (GPU, CPU, and scalable network cards) from AMD, allowing for co-optimization of hardware and software.
AMD’s ROCm software ecosystem is gradually maturing. In the process of improving overall performance, software-level performance gains are just as important as hardware performance improvements.
AMD again emphasized the reliability of its product roadmap and stated that it will continue to move forward as planned, with next year’s MI400 continuing this strategy.
AMD also revealed that the MI400 series, to be launched next year, will improve the performance of cutting-edge AI models by up to 10 times.
NVIDIA GB10 #
NVIDIA did not focus on future hardware at this event but instead provided an in-depth analysis of its latest, already-available hardware: the GB10 System-on-a-Chip (SoC). As the chip at the heart of NVIDIA’s DGX Spark compact workstation (formerly the DIGITS project), the GB10 is a multi-chip, single-package solution for high-performance Arm-based workstations. The GPU chip is based on the Blackwell architecture, while the CPU chip is designed by MediaTek and integrates 20 Arm CPU cores. Both chips use TSMC’s 3nm process, making the GB10 the most advanced product in NVIDIA’s Blackwell family from a process standpoint.
NVIDIA is fundamentally a GPU-centric company, and the Blackwell architecture is the “soul” of the GB10. Although the architecture has been streamlined into a mini configuration, it retains all core features, including support for FP4 precision.
The GB10 adds several new technical innovations on top of Blackwell: a low-power chip-to-chip (C2C) interconnect, a unified memory architecture (single physical/logical memory), and a design that integrates both CPU and GPU chips on a 2.5D interposer.
With 128 GB of unified LPDDR5X system memory, it can support fine-tuning of models with up to 70 billion parameters. Equipped with a ConnectX-7 network card, it supports interconnecting two DGX Spark systems to handle larger-scale models. NVIDIA explicitly states that the DGX Spark is positioned as an entry-level development device. Once model development and testing are completed on the Spark platform, it can be seamlessly deployed to the DGX Cloud. Its biggest highlight is that it can be powered by a standard power outlet, which is a significant advantage compared to server-level equipment.
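A weights-only footprint estimate shows why 128 GB pairs with 70 billion parameters as the stated ceiling; full fine-tuning adds optimizer state and activations on top, so parameter-efficient or quantized workflows are implicitly assumed here:

```python
# Weights-only memory for a 70B-parameter model at several precisions,
# compared against DGX Spark's 128 GB unified memory.
PARAMS = 70e9
BUDGET_GB = 128

for bits, label in ((16, "FP16"), (8, "FP8"), (4, "FP4")):
    gb = PARAMS * bits / 8 / 1e9
    verdict = "fits" if gb <= BUDGET_GB else "exceeds the budget"
    print(f"{label}: {gb:.0f} GB of weights, {verdict}")
```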
The DGX Spark’s configuration includes the 20 CPU cores, up to 4 TB of SSD storage, and the 128 GB of unified LPDDR5X memory.
Both chips use TSMC’s 3nm process. The GPU supports all new Blackwell features, including DLSS and ray tracing, providing 31 TFLOPS of FP32 performance or 1,000 TFLOPS of FP4 performance. The CPU cores are based on the Arm v9.2 architecture (confirmed to be an off-the-shelf core design, though whether it is a Cortex or Neoverse solution is not disclosed). They are split into two 10-core clusters, with each core having a private L2 cache. The 256-bit LPDDR5X-9400 memory interface delivers roughly 301 GB/s of bandwidth.
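The quoted ~301 GB/s follows directly from the interface width and transfer rate:

```python
# LPDDR5X-9400 on a 256-bit bus: bytes per transfer times transfer rate.
bus_bytes = 256 / 8            # 32 bytes moved per transfer
transfers_per_s = 9400e6       # 9,400 MT/s

bw_gb_s = bus_bytes * transfers_per_s / 1e9
print(f"~{bw_gb_s:.0f} GB/s")  # ~301 GB/s
```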
The GPU chip supports up to four display outputs (three DisplayPort interfaces + one HDMI 2.1a interface). The GB10’s total thermal design power (TDP) is 140 watts.
The GPU chip integrates a large 24 MB L2 cache that also maintains CPU/GPU cache coherency. Hardware-level coherency management reduces performance overhead and simplifies the developer’s work. Address Translation Services (ATS) support allows the entire graphics L2 cache to be physically tagged. The operating system sees the GPU as a PCIe device; it supports SR-IOV virtualization and is equipped with NVDEC/NVENC codec engines.
Each DGX Spark has a built-in ConnectX-7 network card that supports dual-system interconnection. The SoC connects to the network card over a PCIe 5.0 x8 back channel (note: this caps the card’s one-way bandwidth at 200 Gbps; the two ports cannot run at full speed simultaneously).
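The bottleneck is plain link arithmetic: a PCIe 5.0 x8 link tops out near 252 Gbps of usable one-way bandwidth after 128b/130b line coding (further protocol overhead ignored), enough for one 200G port but well short of two:

```python
# Usable one-way bandwidth of a PCIe 5.0 x8 link vs. dual 200G ports.
lane_gbps = 32 * 128 / 130     # 32 GT/s per lane, 128b/130b encoding
link_gbps = 8 * lane_gbps      # x8 link, one direction

print(f"x8 one-way: ~{link_gbps:.0f} Gbps")  # ~252 Gbps
print("two 200G ports need 400 Gbps:",
      "ok" if link_gbps >= 400 else "bottlenecked")
```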
As mentioned earlier, this project was jointly developed by NVIDIA and MediaTek, with the latter providing the CPU chip (S-die). Since the memory controller is integrated into the CPU chip, NVIDIA is highly dependent on MediaTek to deliver a reliable memory subsystem. MediaTek also integrated some of NVIDIA’s IPs, including the display controller and C2C interconnect technology. A lot of verification work was done in the early stages of the project to ensure that the first tape-out was successful without needing design corrections.
With the GB10 as the core chip of the DGX Spark, NVIDIA is applying it to compact workstations to support various AI workloads within the CUDA ecosystem. This small chip is enabling the development and validation of large tasks, providing a foundational platform for subsequent large-scale cloud deployment.