The AWS-customized GB200 incorporates many unique designs that align with cloud business requirements. Significant consideration has gone into aspects like PCIe interconnects, rack reliability, and heat dissipation. There is also a highly specific design for converging the ScaleOut and FrontEnd networks, which we will discuss in detail.
Let’s Discuss NPUs #
In reality, whether it’s Google TPU, AWS Trainium, or even Huawei’s Ascend, these NPUs present challenges for the IaaS delivery interface of GPU clouds.
The essence of the cloud is the securitization of computing power: any non-standard, over-the-counter transaction carries significant cost impacts. One impact is the technical debt caused by adaptation, particularly issues arising from the long-term evolution of training and inference frameworks. The other is that pricing non-standard products involves too much bargaining and too many internal customer cost-accounting problems, which hurts their liquidity (elasticity) and prevents users from consuming them flexibly on demand.
With CUDA having become a de facto IaaS delivery interface, it may be wiser for other self-developed XPUs to maintain some degree of compatibility with PTX instructions.
The SIMT abstraction itself is quite elegant, but adding TC does increase complexity. One approach is to start from SIMT and add DSA units; the other, taking a slight step back, is to add a SIMT frontend on top of a DSA. Does that work, and does it satisfy the habits of operator developers? Of course, there are still many issues to address around memory hierarchy and warp scheduling.
Put differently: one path gradually adds DSA units like TC to SIMT and then abstracts them as task-based tensor operations; the other starts from a DSA and adds SIMT frontends compatible with user habits and programming requirements. Both approaches feel like they converge on the same destination.
Overview of AWS Blackwell Release #
AWS’s official article on the release of Blackwell yesterday, “AWS AI infrastructure with NVIDIA Blackwell: Two powerful compute solutions for the next frontier of AI,” also explains much of the operational logic of GPU clouds.
From the CPU ecosystem perspective, the de facto IaaS delivery interface is the X86 instruction set. Although the ARM ecosystem is gradually flourishing, many workloads are still on X86. AWS discussed this issue when explaining why they offer B200 in addition to GB200. This is also the logic behind delivering standardized computing power focused on the ecosystem in the cloud.
A long subsequent section, “Innovation built on AWS core strengths,” elaborates on innovation built on AWS’s core advantages. It mainly revolves around the following points:
- Security/Stability (Robust instance security and stability)
- Performance/Scale/Elasticity (Reliable performance at massive scale)
- Efficiency/Cost (Infrastructure efficiency)
AWS GB200 Rack Architecture #
The AWS GB200 rack is a completely customized structure. Unlike the standard single-rack NVL72, AWS uses a dual-rack NVL36×2 architecture.
Compared to a single NVL72 rack, the advantages are that the complexity of the CableTray is halved and the blast radius of a failure is halved. The sales specifications can also be split into 36-card and 72-card options. When hardware fails, at most one rack is affected, which significantly increases the sellable rate and reduces repair downtime.
There is no independent CDU in the rack. Each ComputeTray is 2U high, and a single rack includes 9 ComputeTrays and 9 SwitchTrays; the SwitchTrays of the two racks are connected back-to-back using copper cables.
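To keep the counts straight, here is a minimal Python sketch that just tallies the figures quoted above (tray and GPU counts come from this section; nothing else is assumed):

```python
# Minimal accounting sketch of the AWS dual-rack NVL36x2 layout.
# All counts are taken from the description above.

COMPUTE_TRAYS_PER_RACK = 9   # each 2U, with 2 Grace CPUs + 4 Blackwell GPUs
SWITCH_TRAYS_PER_RACK = 9    # NVLink switch trays, cabled back-to-back across racks
GPUS_PER_COMPUTE_TRAY = 4
RACKS = 2

gpus_per_rack = COMPUTE_TRAYS_PER_RACK * GPUS_PER_COMPUTE_TRAY
total_gpus = gpus_per_rack * RACKS

print(f"GPUs per rack: {gpus_per_rack}")   # 36 -> the 36-card sellable unit
print(f"GPUs per pair: {total_gpus}")      # 72 -> the 72-card sellable unit
```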
AWS GB200 ComputeTray Architecture #
AWS has modified many aspects of the standard ComputeTray. Specifically, the ScaleOut and FrontEnd networks are merged onto 8× 400Gbps Nitro cards. In effect, a single ComputeTray provides 3.2Tbps of bandwidth, equivalent to a GB200 configured with 4 CX8s.
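A quick arithmetic check of that equivalence claim, using the per-port figures quoted in this section (the 800Gbps CX-8 figure appears in the standard-tray discussion below):

```python
# Per-ComputeTray uplink bandwidth, AWS Nitro layout vs. a full CX8 layout.
NITRO_CARDS_PER_TRAY = 8     # data-path Nitro cards per ComputeTray
NITRO_PORT_GBPS = 400
CX8_PER_TRAY = 4             # full CX8 configuration of the reference tray
CX8_PORT_GBPS = 800          # per-GPU ScaleOut via CX-8 (see below)

nitro_total = NITRO_CARDS_PER_TRAY * NITRO_PORT_GBPS   # 3200 Gbps
cx8_total = CX8_PER_TRAY * CX8_PORT_GBPS               # 3200 Gbps
assert nitro_total == cx8_total
print(f"Per-tray uplink: {nitro_total / 1000} Tbps either way")
```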
The standard ComputeTray is configured with 2 Grace CPUs and 4 B200 GPUs, along with a BF3 DPU and 4 CX7 NICs, as shown below:
The PCIe topology of the ComputeTray can be seen in the “NVIDIA GB200 NVL Multi-Node Tuning Guide,” as follows:
Blackwell is connected to Grace via a PCIe Gen6 x16 link, and both the BF3 and the CX7s hang off Grace. The GB200 theoretically supports 800Gbps of ScaleOut per GPU via CX-8, as shown below:
AWS, however, adopted a configuration consisting of 9 400Gbps Nitro cards, as shown below:
The PCIe topology comes from the document “Maximize network bandwidth on Amazon EC2 instances with multiple network cards,” as follows:
One Nitro card uses 200Gbps as a DPU to provide elastic bare metal capabilities. It allocates 100Gbps as the Primary NIC (NCI 0), with the remaining 60Gbps for EBS and 40Gbps for Nitro’s own management. This interface is configured to support ENA only (EFA-SRD is not supported).
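A small sketch of that carve-up, using only the figures just quoted:

```python
# Bandwidth carve-up of the DPU-role Nitro card (figures from the text above).
dpu_card_gbps = 200
allocation = {
    "primary_nic_ena": 100,   # NCI 0, ENA only (no EFA-SRD)
    "ebs": 60,
    "nitro_management": 40,
}
assert sum(allocation.values()) == dpu_card_gbps
for role, gbps in allocation.items():
    print(f"{role:>18}: {gbps} Gbps")
```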
Compared to the official CX7 version, where NIC traffic must traverse Grace over PCIe, AWS adds a PCIe Switch that connects directly to Blackwell. Each 400Gbps Nitro card provides an x16 interface to the PCIe Switch and an x8 interface to Grace. As a result, two network cards can be created on a single Nitro: a 400Gbps card for GPU use and a 200Gbps card for Grace use.
Cumulatively, a single Grace can draw 4×200Gbps of bandwidth. While a single B200 logically appears to have 2×400Gbps, the 400Gbps NIC facing the GPU and the 200Gbps NIC facing the CPU on the same card share a single 400Gbps physical network port. Furthermore, there is a passage in the article:
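To make the sharing concrete, here is a toy model of one data-path Nitro card. The card counts and port rate come from the text above; the proportional split under contention is purely an illustrative assumption, not documented AWS behaviour:

```python
# Toy model of one data-path Nitro card: two logical NICs, one physical port.
PHYS_PORT_GBPS = 400   # one 400Gbps physical port per data-path Nitro card

def card_share(gpu_demand: float, grace_demand: float) -> tuple[float, float]:
    """Split one physical port between the GPU-facing and Grace-facing NICs.
    The proportional split is an illustrative assumption."""
    total = gpu_demand + grace_demand
    if total <= PHYS_PORT_GBPS:
        return gpu_demand, grace_demand
    scale = PHYS_PORT_GBPS / total
    return gpu_demand * scale, grace_demand * scale

print(card_share(400, 0))     # GPU alone gets the full port
print(card_share(400, 200))   # oversubscribed: both logical NICs are squeezed
print(card_share(200, 200))   # a 200/200 provisioning always fits
```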
In essence, when NCI1 and NCI3 are both provisioned at 400Gbps, the B200 cannot run at 800Gbps, only 400Gbps. The author suspects this is because the current PCIe Switch only supports Gen5, or perhaps there is a temporary speed reduction due to compatibility issues between the PCIe Switch and the B200. It is also possible that Astera Labs’ PCIe Gen6 switch chips have not been fully delivered yet, which would mean the current B200 link to the PCIe Switch runs only at Gen5 and therefore provides 400Gbps. A phased upgrade that swaps in new PCIe Switch modules will presumably follow.
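One way to sanity-check the Gen5 hypothesis is to compare raw PCIe x16 bandwidth against the NIC rates. The per-lane rates below are standard PCIe figures; the Gen6 efficiency factor is an approximation:

```python
# Rough PCIe x16 bandwidth check (per-lane rates are standard PCIe figures;
# protocol overheads are approximated, so treat these as ballpark numbers).
def pcie_x16_gbps(gt_per_s: float, encoding_efficiency: float) -> float:
    return gt_per_s * encoding_efficiency * 16  # per direction

gen5 = pcie_x16_gbps(32, 128 / 130)   # ~504 Gbps
gen6 = pcie_x16_gbps(64, 0.95)        # ~973 Gbps (FLIT overhead approximated)

print(f"Gen5 x16 ~ {gen5:.0f} Gbps -> enough for 400G, not for 2x400G")
print(f"Gen6 x16 ~ {gen6:.0f} Gbps -> could feed ~800G of NIC bandwidth")
```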
It is worth noting that, based on some simulation results from ShallowSim, the best-practice GB200 ComputeTray still needs the full 4× CX8 configuration to reach its performance. AWS seems to have already considered this in the hardware and elegantly designed the integrated FrontEnd and ScaleOut architecture.
To address the bandwidth contention issue, AWS offers two recommendations: give the GPUs 4× 400Gbps network cards, or give them 8× 200Gbps network cards. The author believes the 8× 200Gbps option, with the remaining 1.6Tbps allocated to Grace, is a good choice, as it significantly benefits scenarios such as KVCache transfer and Agent execution.
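A quick comparison of the two recommended layouts. This sketch assumes, per the author's reading, that whatever the GPUs do not use on the shared ports can be handed to Grace:

```python
# Comparing the two recommended NIC layouts for one ComputeTray
# (a sketch based on the recommendations quoted above).
PHYS_PORT_GBPS, DATA_CARDS, GPUS = 400, 8, 4

def summarize(name: str, gpu_nics: int, gpu_nic_gbps: int) -> None:
    gpu_total = gpu_nics * gpu_nic_gbps
    grace_total = DATA_CARDS * PHYS_PORT_GBPS - gpu_total   # leftover to Grace
    print(f"{name}: GPUs {gpu_total/1000:.1f} Tbps "
          f"({gpu_total//GPUS} Gbps/GPU), Grace up to {grace_total/1000:.1f} Tbps")

summarize("Option A: 4 x 400Gbps NICs for GPUs", 4, 400)
summarize("Option B: 8 x 200Gbps NICs for GPUs", 8, 200)
```

Both options give the GPUs 1.6Tbps in aggregate; the 8× 200Gbps layout additionally spreads each GPU across two Nitro cards, which ties into the redundancy point discussed below.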
AWS GB200 Network #
First, let’s look at the ScaleUp NVLink network. Unlike the standard single-rack NVL72, AWS uses a dual-rack model, so there are 18 SwitchTrays in total (9 per rack), and the SwitchTrays of the two racks are connected back-to-back via external copper cables. Although the cost is slightly higher than the single-rack version, the dual-rack configuration halves the cable density in the CableTray, which significantly improves reliability. Furthermore, there is more space in each ComputeTray, which helps heat dissipation, and the maximum blast radius of a failure is limited to a single 36-card rack.
Next, let’s look at the ScaleOut and FrontEnd networks. AWS has fully integrated the two, with the CPU- and GPU-facing bandwidth sharing the same 400Gbps Nitro cards. The 3.2Tbps uplink per ComputeTray matches their Trainium 2 specification, allowing the entire network to reuse the 10p10u infrastructure.
Another crucial point is that the ScaleOut TORs are placed inside the rack, 3 at the top and 3 at the bottom. The Nitro cards connect to the TORs via copper cables, and the TORs uplink via optical fibers. The first hop is therefore much more reliable (higher MTBF) because it uses copper, and the traffic impact of a failed optical uplink port on a TOR is much smaller.
Furthermore, since AWS EFA supports multi-path forwarding using SRD, there is no need to construct a dedicated multi-rail topology. We can also see that each GPU has two Nitro cards carrying its traffic, which significantly improves real-world reliability: even if one Nitro card fails, 400Gbps of ScaleOut capability is still available through the other card.
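A minimal failover sketch of that redundancy claim; the re-steering logic is illustrative, assuming SRD can shift a GPU's ScaleOut traffic onto the surviving card's full physical port:

```python
# Failover sketch: each GPU's ScaleOut traffic is spread over two Nitro cards
# (figures from the text above; the re-steer logic is an illustration only).
PHYS_PORT_GBPS = 400
GPU_SCALEOUT_DEMAND_GBPS = 400

def gpu_scaleout_gbps(healthy_cards: int) -> int:
    # SRD sprays the flow across paths, so a surviving card's full physical
    # port can still carry the GPU's ScaleOut demand.
    return min(GPU_SCALEOUT_DEMAND_GBPS, healthy_cards * PHYS_PORT_GBPS)

print(gpu_scaleout_gbps(2))  # normal operation: 400 Gbps
print(gpu_scaleout_gbps(1))  # one Nitro card down: still 400 Gbps
print(gpu_scaleout_gbps(0))  # both down: 0
```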
AWS GB200 Management Node #
We noticed a special node in the video:
It appears to be a dual-socket X86 server configured with two Nitro network cards. The front also has 2 switch chips providing 24 external ports.
From the actual deployment, the 9 NVLink Switches are wired to this node, and at least 7 cables are connected to its left side. Judging by the cable type (especially the connectors), they appear to be PCIe cables.
It is likely a management node. Some Fabric Manager related software for the NVLink Switch may be remotely extended to this management node via PCIe.
AWS GB200 Thermal Design #
Another noteworthy aspect is the thermal design of the AWS GB200. It does not use in-rack CDUs, but instead adopts an IRHX (In-Row Heat Exchanger) system that can reuse existing data center infrastructure. The IRHX system circulates cooling liquid near the server rows and uses scalable fan cooling, which also improves water resource utilization.
IRHX is deployed parallel to the compute racks (in-row):
It consists of three components: Water Distribution Cabinet, Pump Cabinet, and Fan Cabinet.
Specifically, the IRHX can add or remove fan cabinets based on the heat-rejection requirements of the row. Compared to other GPU clouds that build brand-new data centers, AWS’s approach minimizes modifications to existing infrastructure.
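As an illustration of that scaling model only: the per-cabinet capacity and row load below are placeholders, not figures from AWS.

```python
# Illustration of the IRHX scaling idea: add fan cabinets until the row's
# heat load is covered. Both numbers below are hypothetical placeholders.
import math

FAN_CABINET_KW = 125.0   # hypothetical heat-rejection capacity per fan cabinet

def fan_cabinets_needed(row_heat_load_kw: float) -> int:
    return math.ceil(row_heat_load_kw / FAN_CABINET_KW)

# e.g. a row of four ~120kW-class liquid-cooled racks (also illustrative)
print(fan_cabinets_needed(4 * 120.0))  # -> 4 cabinets for this assumed load
```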
Analysis from a Cloud Perspective #
We note that Dell has already delivered GB300 to CoreWeave, while AWS’s GB200 has only just launched, a few months later by comparison. However, AWS’s overall system design has many aspects worth studying.
From the perspective of cloud elastic sales logic, it offers two specifications, 36-card and 72-card, and has a dedicated management node. Smaller specifications may be offered in the future.
Furthermore, compared to the original factory’s high-density deployment of a single NVL72 rack, it uses a dual-rack configuration, resulting in a smaller blast radius during failures. For example, a CableTray failure or NVSwitch failure in a single rack will affect at most 36 cards, while the remaining 36 cards can continue to be used.
The 9 Nitro cards provide 3.4Tbps of bandwidth, of which 3.2Tbps can be used for the integrated ScaleOut and FrontEnd network. This is highly valuable for KVCache transfer and Agent execution during inference. Compared to NVIDIA’s reference design, where storage has only a single 400Gbps BF3, Amazon FSx for Lustre can be given far more storage bandwidth.
The integration of Storage/VPC and ScaleOut can be said to be the biggest highlight of AWS GB200. Since EFA SRD supports multi-path forwarding, the switching network does not adopt a multi-rail deployment. Instead, all Nitro cards in a single rack are connected to the TOR via copper cables. The TORs then uplink via multiple optical fibers into the data center network. This improves the MTBF of the first hop significantly because it avoids the optical interconnects of the traditional multi-rail approach. Furthermore, since each GPU is configured with two Nitro cards, if one Nitro fails, the other Nitro can still be used.
The relatively symmetrical Clos topology is also easier to deploy. A small thought exercise: how many rails can be created on the GB200? What are the limitations of PXN? Why did AWS SRD choose this approach?
From a business perspective, AWS not only offers P6e instances with GB200 but also introduces P6 instances with 8 B200 cards, addressing users whose CPU code still runs on x86 as well as smaller workloads. They also emphasized hot-upgrade capabilities in this announcement. All of this demonstrates the deep thinking and customization choices of a mature cloud service provider.
Objectively, AWS also has some shortcomings. For example, EFA-SRD is incompatible with RDMA RC Verbs in the ecosystem, making it difficult for open-source ecosystems like DeepEP/IBGDA to support it. These are areas that need improvement.