Scale-Out Backend Network: Inter-node GPU Communication #
Figure 2 shows a logical view of an AI Training (AIT) cluster, which consists of six nodes, each equipped with four GPUs, for a total of 24 GPUs. Each GPU has a dedicated RDMA-enabled network interface card (NIC) that typically operates at 400 to 800 Gbps.
An RDMA-NIC can directly read from and write to a GPU’s VRAM without involving the host CPU or triggering interrupts. In this sense, the RDMA-NIC acts as a hardware accelerator that offloads data transfer operations from the CPU, reducing latency and freeing up compute resources.
GPUs with the same local rank number across different nodes connect to the same rail of the Scale-Out Backend network. For example, all GPUs with a local rank of 0 connect to Rail 0, while GPUs with a local rank of 1 connect to Rail 1.
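To make the rail mapping concrete, the sketch below pins each per-GPU training process to the RDMA-NIC on its rail, so that all GPUs sharing a local rank communicate over the same rail. The NIC names (mlx5_0 through mlx5_3), the four-GPU-per-node layout, and the use of the NCCL_IB_HCA variable are assumptions made for illustration; production launchers typically derive this binding from the PCIe topology automatically.

```python
import os

# Illustrative sketch: bind each per-GPU process to the RDMA-NIC on its rail.
# NIC names (mlx5_0..mlx5_3) and the four-GPU-per-node layout are assumptions.
RAIL_NICS = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]  # one RDMA-NIC per rail

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # set by the job launcher
os.environ["NCCL_IB_HCA"] = RAIL_NICS[local_rank]     # restrict NCCL to this rail's NIC

print(f"local rank {local_rank} -> Rail {local_rank} via {RAIL_NICS[local_rank]}")
```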
The Scale-Out Backend network is used for inter-node GPU communication and must support low-latency, lossless RDMA message transmission. Its physical topology depends on the scale and scalability requirements of the implementation. Leaf switches can be dedicated to a single rail or support multiple rails through bundled ports, with each port group mapping one-to-one to a rail. Traffic between rails is typically routed through Spine switches. In larger-scale implementations, the network often uses a routed two-tier (3-stage) Clos topology or a pod-based three-tier (5-stage) topology.
The Scale-Out Backend network is primarily used to transmit the results of neural network activation functions to the next layer during the forward pass and to support collective communication for gradient synchronization during the backward pass. However, the communication patterns between GPUs in different nodes depend on the chosen parallelization strategy.
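A minimal sketch of the backward-pass collective described above: each rank averages its local gradients with every other rank using an all-reduce. It assumes a PyTorch job launched with torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK) and the NCCL backend, which carries the collective over the Scale-Out Backend network; the tensor here simply stands in for a real gradient.

```python
import os
import torch
import torch.distributed as dist

# Sketch of gradient synchronization during the backward pass (data parallelism).
# Assumes launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

grad = torch.randn(1024, 1024, device="cuda")  # stand-in for a local gradient tensor
dist.all_reduce(grad, op=dist.ReduceOp.SUM)    # sum gradients across all GPUs
grad /= dist.get_world_size()                  # average for data-parallel SGD

dist.destroy_process_group()
```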
Traffic on the Scale-Out Backend network is characterized by high latency sensitivity, burstiness, low-entropy flows, and a small number of long-lived Elephant Flows. Because link utilization commonly reaches full capacity during communication phases, efficient congestion control mechanisms must be implemented.
Scale-Up Network: Intra-node GPU Communication #
Intra-node GPU communication occurs over a high-bandwidth, low-latency Scale-Up network, which often uses technologies like NVIDIA NVLink, NVSwitch, and AMD Infinity Fabric, depending on the GPU vendor and server architecture. There are also standards-based, vendor-agnostic solutions like UES (UEC) and UALink (UA Alliance).
These interconnects form a Scale-Up communication channel that allows GPUs within the same node to exchange data directly, bypassing the host CPU and system memory. Solutions like NVLink offer higher bandwidth and lower latency compared to PCIe-based communication.
In a typical NVLink topology, GPUs are connected point-to-point in a mesh or ring arrangement, enabling direct data transfer between GPU pairs. In systems equipped with an NVSwitch, all GPUs within a node are interconnected through a centralized switching fabric, which provides uniform access latency and bandwidth for any GPU pair.
Because communication happens directly through GPU interconnects, Scale-Up communication is generally faster and more efficient than inter-node communication via the Scale-Out Backend network.
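The sketch below probes which intra-node GPU pairs can exchange data directly over the Scale-Up fabric and performs a direct device-to-device copy. It only checks peer access in general; whether the underlying link is NVLink, NVSwitch, or PCIe is not distinguished here (nvidia-smi topo -m shows the actual link type).

```python
import torch

# Sketch: check direct peer access between intra-node GPUs and copy data GPU-to-GPU.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU{src} -> GPU{dst}: peer access {'yes' if ok else 'no'}")

# A direct copy between devices uses the fastest available intra-node path.
if n >= 2:
    x = torch.randn(4096, 4096, device="cuda:0")
    y = x.to("cuda:1")  # GPU0 -> GPU1 transfer over the Scale-Up interconnect
```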
Frontend Network: User Inference #
The frontend network in modern large-scale AI training clusters typically uses a routed Clos structure to provide scalable and reliable connectivity for user access, orchestration, and inference workloads. The main function of the frontend network is to handle user interactions with deployed AI models and process inference requests.
For multi-tenancy, modern frontend networks often use BGP EVPN as the control plane and VXLAN as the data-plane encapsulation mechanism to enable virtual network isolation. Data transfer typically runs over TCP. Multi-tenancy can also be used to create secure, isolated network segments for training job initialization, where GPUs join a job and receive initial model parameters from a primary node.
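As an illustration of that initialization step, the sketch below has workers rendezvous with a primary node over TCP and receive the initial model parameters via a broadcast from rank 0. The Gloo backend, the rendezvous address 10.0.0.1:29500, and the tiny linear model are assumptions chosen for illustration.

```python
import os
import torch
import torch.distributed as dist

# Sketch of job initialization over the frontend network: TCP rendezvous with the
# primary node, then a broadcast of initial parameters from rank 0.
# Backend, address, and model are illustrative assumptions.
dist.init_process_group(
    backend="gloo",                      # TCP-based backend for this control-plane step
    init_method="tcp://10.0.0.1:29500",  # assumed frontend address of the primary node
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

model = torch.nn.Linear(1024, 1024)
for p in model.parameters():
    dist.broadcast(p.data, src=0)        # every worker starts from rank 0's weights

dist.destroy_process_group()
```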
Unlike the Scale-Out backend, which connects GPUs across nodes with a dedicated RDMA-NIC for each GPU, the frontend network uses shared NICs that typically operate at 100 Gbps.
Traffic on the frontend network is characterized by bursty, irregular communication patterns, primarily consisting of short-lived, high-entropy Mouse Flows involving multiple unique IP and port combinations. These flows are moderately latency-sensitive, especially in interactive inference scenarios. Despite the burstiness, the average link utilization remains relatively low compared to the Scale-Out or Scale-Up fabrics.
Management Network #
The management network is a dedicated or logically isolated network for AI cluster orchestration, control, and management. It provides secure and reliable connectivity between management servers, compute nodes, and auxiliary systems. These auxiliary systems often include time synchronization servers (NTP/PTP), authentication and authorization services (e.g., LDAP or Active Directory), license servers, telemetry collectors, remote management interfaces (e.g., IPMI, Redfish), and configuration automation platforms.
Traffic on the management network is typically low-bandwidth but highly sensitive, requiring robust security policies, high reliability, and low-latency access to ensure stability and operational continuity. It supports management operations such as remote access, configuration changes, service monitoring, and software updates.
To ensure isolation from user, training, and storage traffic, management traffic is often carried over separate physical interfaces or logically isolated using VLANs or VRFs.
Typical use cases include:
- Cluster orchestration and scheduling: Facilitates communication between the orchestration system and compute nodes for job scheduling, resource allocation, and lifecycle management.
- Job initialization and coordination: Handles the exchange of metadata and service coordination needed to boot up distributed training jobs and synchronize GPUs across multiple nodes.
- Firmware and software lifecycle management: Supports remote OS patching, BIOS or firmware upgrades, driver installations, and configuration rollouts.
- Monitoring and telemetry collection: Enables the collection of logs, hardware metrics, software health metrics, and real-time alerts to a centralized observability platform.
- Remote access and troubleshooting: Provides secure access for administrators via SSH, IPMI, or Redfish for diagnostics, configuration, or out-of-band management.
- Security and segmentation: Ensures that the control plane and management traffic remain isolated from data plane workloads, maintaining performance and security boundaries.
The design of the management network typically prioritizes operational stability and fault tolerance. While bandwidth requirements are modest, low latency and high availability are critical for maintaining cluster health and responsiveness.
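To make the out-of-band management path concrete, the sketch below queries a node's BMC over the management network using the Redfish REST API mentioned above. The BMC address and credentials are placeholders, and disabling TLS verification is shown only for brevity; production setups should use proper certificates.

```python
import requests

# Sketch: query a node's power state via its BMC using the Redfish REST API.
# BMC address and credentials are placeholders.
BMC = "https://10.10.0.15"
session = requests.Session()
session.auth = ("admin", "password")  # assumed credentials
session.verify = False                # illustration only; use real certificates in production

# Walk from the Redfish service root to the first ComputerSystem resource.
root = session.get(f"{BMC}/redfish/v1/").json()
systems = session.get(f"{BMC}{root['Systems']['@odata.id']}").json()
first_system = systems["Members"][0]["@odata.id"]
state = session.get(f"{BMC}{first_system}").json().get("PowerState")
print(f"{first_system}: PowerState={state}")
```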
Storage Network #
The storage network connects compute nodes (including GPUs) to the underlying storage infrastructure that holds training datasets, model checkpoints, and inference data.
Primary use cases include:
- High-performance data access: Streaming large datasets from distributed or centralized storage systems (e.g., NAS, SAN, or parallel file systems like Lustre or GPFS) to GPUs during training.
- Data preprocessing and caching: Supporting fast read/write access for intermediate caching layers and preprocessing pipelines that prepare training data.
- Shared storage for distributed training: Providing a consistent and accessible view of the file system across multiple nodes to facilitate synchronization and checkpointing.
- Model deployment and inference: Delivering trained model files to inference services and storing input/output data for auditing or analysis.
Due to the high capacity and throughput requirements of training data access, the storage network is typically designed to be high-bandwidth, low-latency, and scalable. It can utilize protocols such as NVMe over Fabrics (NVMe-oF), Fibre Channel, or high-speed Ethernet with RDMA support (RoCE).
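The sketch below shows one common pattern on top of such a storage fabric: rank 0 writes a training checkpoint to a shared file system mount that every node can read, while the other ranks wait at a barrier. The mount point /mnt/shared_fs is a placeholder for whatever parallel file system (e.g., Lustre or GPFS) is exposed over the storage network.

```python
import os
import torch
import torch.distributed as dist

# Sketch: checkpoint to shared storage reachable from every node over the storage network.
# Assumes torch.distributed is already initialized; the mount point is a placeholder
# for a parallel file system (e.g., Lustre, GPFS).
CKPT_DIR = "/mnt/shared_fs/checkpoints"

def save_checkpoint(model, optimizer, step):
    if dist.get_rank() == 0:                       # only one rank writes the file
        os.makedirs(CKPT_DIR, exist_ok=True)
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            os.path.join(CKPT_DIR, f"step_{step}.pt"),
        )
    dist.barrier()                                 # all ranks wait until the checkpoint exists
```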