In the last 4 articles, we have learned why GPUs are critical for accelerating AI/ML workloads. The discussion of memory latency, CPU clock speed, and data transfer becomes critical in HPC and GPU-intensive AI/ML training. Utilizing RDMA, whether over InfiniBand or RoCE, is a must-have for any AI/ML GPU-to-GPU connectivity. RDMA over InfiniBand or RoCE is also a must-have for storage connectivity (NVMe-oF storage fabrics are the optimal choice) so the compute nodes can load the training data. We will now do a deep dive into the two network design options for supporting multi-node GPU-to-GPU communications and high-speed storage for training data.
Traditional data center networks today
Data center networking integrates resources like load balancing, switching, analytics, and routing to store and process data and applications. It connects computing and storage units within a data center, facilitating efficient internet or network data transmission.
A data center network (DCN) is vital for connecting data across multiple sites, clouds, and private or hybrid environments. Modern DCNs ensure robust security, centralized management, and operational consistency by employing full-stack networking and security virtualization. These networks are crucial in establishing stable, scalable, and secure infrastructure, meeting evolving communication needs, and supporting cloud computing and virtualization.
Traditional Three-Tier Topology Architecture
The traditional data center architecture shown on the left uses a three-layer topology designed for general-purpose networks, usually segmented into pods that constrain the location of devices such as virtual servers. The architecture consists of core layer 3 switches, distribution switches (layer 2 or layer 3), and access switches. The access layer runs from the data center to the individual nodes (computers) where users connect to the network. The three-tier design has been around for more than three decades and is still prevalent in up to half of today’s data centers. It is resilient and scalable, and by using MLAG or vPC it offers high-speed, redundant connectivity to servers and other switches. The challenge is that it was designed for north-south traffic patterns and does not handle the east-west traffic of HPC or AI/ML workloads well.
Leaf-Spine Topology Architecture
The leaf-spine topology is a streamlined network design that moves data in an east-west flow. It has only two layers: the leaf layer and the spine layer. The leaf layer consists of access switches that connect to devices like servers, firewalls, and edge routers. The spine layer is the backbone of the network, where every leaf switch is interconnected with every spine switch. Using ECMP, the leaf-spine topology allows all connections to run at the same speed, and the mesh of fiber links creates a high-capacity network resource shared by all attached devices.
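To make that ECMP fan-out concrete, here is a minimal Python sketch of a hypothetical leaf-spine fabric (the switch counts are illustrative, not from this article): because every leaf connects to every spine, any leaf-to-leaf flow has one equal-cost path per spine, and ECMP can spread traffic across all of them.

```python
# Hypothetical sketch: counting equal-cost paths in a 2-tier leaf-spine fabric.
# The sizes below (4 spines, 8 leaves) are illustrative examples only.

LEAVES = [f"leaf{i}" for i in range(1, 9)]
SPINES = [f"spine{i}" for i in range(1, 5)]

def equal_cost_paths(src_leaf, dst_leaf):
    """Every leaf connects to every spine, so any leaf-to-leaf flow has
    one equal-cost path per spine: src_leaf -> spine -> dst_leaf."""
    return [(src_leaf, spine, dst_leaf) for spine in SPINES]

paths = equal_cost_paths("leaf1", "leaf7")
print(f"{len(paths)} equal-cost paths from leaf1 to leaf7:")
for p in paths:
    print(" -> ".join(p))
```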
The new challenge to support AI/ML Workloads
The significant difference between HPC workloads and AI/ML workloads is that an AI/ML job runs across all of the GPU-accelerated compute nodes at once. In an AI/ML workload, any kind of slowdown in the network will stall ALL of the GPUs, because progress is gated by the slowest path.
With increasing data and model complexities, the time required to train neural networks has become prohibitively large. A typical neural network has hundreds of hidden layers and thousands of neural nodes, each performing a mathematical operation on the output of the previous node.
Each node comprises inputs, weights, and a function that combines them to derive an output. That output may then go on to another node with its own function.
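As a rough illustration (the input values, weights, and ReLU activation below are invented for the example), a single node's computation can be sketched in a few lines of Python:

```python
# Minimal sketch of a single neural-network node: weighted inputs, a bias,
# and an activation function producing an output passed to the next node.
# All numbers and the ReLU activation are illustrative choices.

def neuron(inputs, weights, bias):
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, weighted_sum)  # ReLU activation

layer1_out = neuron([0.5, -1.2, 3.0], weights=[0.8, 0.1, 0.4], bias=0.2)
layer2_out = neuron([layer1_out], weights=[1.5], bias=-0.3)  # feeds the next node
print(layer1_out, layer2_out)
```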
These neural networks are commonly written in languages such as Python and C/C++ using frameworks such as PyTorch, TensorFlow, and JAX, which are built to take advantage of GPUs natively. The AI/ML engineer determines how the nodes are placed across multi-GPU systems using parallelism.
Users are turning to data-parallel neural networks (DPNN) and large-scale distributed resources on computer clusters to address the exponential rise in training time. Current DPNN approaches implement the network parameter updates by synchronizing and averaging across all processes, using blocking communication operations after each forward-backward pass. This synchronization is the central algorithmic bottleneck.
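As a rough sketch of that pattern, assuming a PyTorch job where torch.distributed has already been initialized (the function name and the commented training-loop steps are illustrative, not a specific vendor recipe), the per-step gradient synchronization looks roughly like this:

```python
# Illustrative PyTorch sketch of data-parallel gradient synchronization:
# after each forward-backward pass, every process averages its gradients
# with all other processes via a blocking all-reduce. Names and the NCCL
# backend choice are assumptions made for this example.
import torch
import torch.distributed as dist

def synchronize_gradients(model: torch.nn.Module) -> None:
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Blocking collective: every rank waits here until all ranks arrive.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # average across all processes

# Typical training step (after dist.init_process_group("nccl") and model setup):
#   loss = model(batch)
#   loss.backward()
#   synchronize_gradients(model)   # the blocking sync described above
#   optimizer.step()
```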
A typical cycling flow pattern of data through the neural network shows the processing performed on each batch. When multiple GPUs are used, there is first a local sync among the GPUs on the physical compute node (typically 8 GPUs per node), whether the job runs on one node or across many.
What is important to note is the global send to all GPUs and the global receive from all GPUs. Picture a 128- or 256-GPU system connected in a full mesh. As each GPU finishes its local sync, it performs a global send of ALL the data it holds for that batch job to ALL the other GPUs in the mesh to synchronize. As with all GPU mesh jobs, each GPU executes its portion of the job, and once it finishes, it notifies all the other GPUs in an all-to-all collective. We now have a full mesh of every GPU sending notifications to every other GPU in the collective.
The next portion is where the synchronization problem lies. All GPUs must synchronize, and each GPU must wait for all the others to complete, which it learns from the notification messages. The AI/ML workload computation stalls waiting for the slowest path, so the JCT, or Job Completion Time, is set by the worst-case latency. This is analogous to a TCP sender waiting for an acknowledgment before it can transmit again. If the buffers fill, packets are dropped, forcing the sending host to retransmit and cut its TCP congestion window in half, further increasing latency and JCTs.
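A toy calculation makes the point (all timings below are invented): because of the barrier, the iteration time for the whole cluster equals the slowest GPU's time, not the average.

```python
# Toy illustration: job completion time is bounded by the slowest GPU's
# compute + communication time, because the collective barrier forces
# every GPU to wait. All timings are made-up numbers.

per_gpu_ms = {
    "gpu0": 95, "gpu1": 97, "gpu2": 96,
    "gpu3": 180,  # one congested path / hot link
}

iteration_time = max(per_gpu_ms.values())   # everyone waits for the straggler
ideal_time = sum(per_gpu_ms.values()) / len(per_gpu_ms)
print(f"iteration time: {iteration_time} ms (ideal ~{ideal_time:.0f} ms)")
# One slow path nearly doubles the step time for the whole cluster.
```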
To put it in perspective, a typical GPU cluster that we’re seeing our customers looking to deploy at max performance has roughly as much network traffic traversing it every second as there is in all of the internet traffic across America. To understand the economics of an AI data center, know that GPU servers can cost as much as $400,000 each. So, maximizing GPU utilization and minimizing GPU idle time are two of the most critical drivers of AI data center design.
Looking at the traditional DC traffic pattern versus an AI/ML pattern, we see consistent flows during the AI/ML workload job execution and spikes in traffic leading to congestion. This, in turn, causes our jobs to stall in the synchronization phase, waiting on the notifications. Depending on the load balancing algorithms in our DC architecture, we will have hot spots where we have congestion and underutilized links, leading to increased job completion times (JCTs). Besides the memory latency issue, this is the most important thing to understand when designing AI/ML network architectures. Hot spots and congestion cause jobs to stall and JCT to dramatically increase with long-lived flows. In scenarios where multiple high-bandwidth flows are hashed to share the same uplink, they can exceed the available bandwidth of that specific link, leading to congestion issues.
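As a rough illustration of how those hot spots form (the flow tuples, link count, and bandwidths below are invented), a static per-flow hash can land several long-lived elephant flows on the same uplink while other uplinks sit idle:

```python
# Sketch of how static hashing can pile several long-lived "elephant" flows
# onto the same uplink. Flow tuples, link count, and bandwidths are invented
# for illustration.

UPLINKS = 4

flows = [  # (src_ip, dst_ip, src_port, dst_port, gbps)
    ("10.0.1.11", "10.0.2.21", 49152, 4791, 90),
    ("10.0.1.12", "10.0.2.22", 49153, 4791, 90),
    ("10.0.1.13", "10.0.2.23", 49154, 4791, 90),
    ("10.0.1.14", "10.0.2.24", 49155, 4791, 90),
]

load = [0] * UPLINKS
for src, dst, sport, dport, gbps in flows:
    link = hash((src, dst, sport, dport)) % UPLINKS  # static per-flow hash
    load[link] += gbps

print(load)  # e.g. [180, 0, 90, 90]: two flows collide and oversubscribe one 100G link
```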
Today’s typical networks connecting compute and GPU nodes use two main designs: 3-tier and spine-leaf. Both share the same buffer issue: the network switches have ingress and egress buffers that hold data packets before they are placed on the wire between switches. If the buffers are full, the switch cannot transmit or receive, and the data is dropped. The TCP retransmission algorithm then halves the sender’s congestion window and retransmits. This leads to severe inefficiency, and the problem snowballs very quickly. JCTs increase because all GPUs that depend on that link cannot send or receive.
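A toy sketch of that behavior (the window sizes and loss points are invented) shows how each loss event halves the congestion window and how slowly throughput recovers:

```python
# Toy AIMD sketch: each loss (from a full switch buffer) halves the sender's
# congestion window, so throughput on that path collapses and recovers slowly.
# The window sizes and loss points are illustrative.

cwnd = 64          # congestion window in segments
history = []
for rtt in range(12):
    loss = rtt in (4, 8)          # pretend buffers overflowed on these RTTs
    cwnd = max(cwnd // 2, 1) if loss else cwnd + 1   # multiplicative decrease / additive increase
    history.append(cwnd)

print(history)  # [65, 66, 67, 68, 34, 35, 36, 37, 18, 19, 20, 21]
```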
If we examine a typical 3-tier vPC or MLAG design, we can quickly have an issue with large, long-lived flows. The layer 2 hashing algorithm will place the flow on a single uplink from the sending ToR switch to the distribution layer switch, and the distribution layer switch will place the flow on a single downlink to the receiving ToR. It is easy to see that buffer overruns will occur with these long-lived, high-throughput flows, causing drops, TCP retransmissions, and shrinking congestion windows. These all add latency, severely impacting JCTs. In this 3-tier design, output queuing techniques can minimize the issues, and moving nodes to other ToR switches may help. Ultimately, this design will not support large-scale AI/ML training using multi-node GPUs.
The preferred network design for AI/ML deep learning is a spine-leaf architecture using Layer 3 ECMP to spread the flows across multiple uplinks to multiple spines. While the spine-leaf architecture alleviates most of the problems, we can still run into ingress buffer overruns. Remember, a slowdown on one GPU affects all of the other GPUs whenever it comes time for a global synchronization. This is the single biggest issue and will lead to increased JCTs.
Distributing the workloads across the GPUs and then synching them to train the AI model requires a new type of network that can accelerate “job completion time” (JCT) and reduce the time that the system is waiting for that last GPU to finish its calculations (“tail latency”).
Therefore, data center networks optimized for AI/ML must have exceptional capabilities around congestion management, load balancing, latency, and, above all else, minimizing JCT. ML practitioners must accommodate more GPUs into their clusters as model sizes and datasets grow. The network fabric should support seamless scalability without compromising performance or introducing communication bottlenecks.
AI/ML multi-node GPU networking represents a once-in-a-generation inflection point that will present us with complex technical challenges for years. We are still in the infancy of these designs, and as network architects, we must keep our eyes and ears open for what is coming.
1. High Performance
Maximizing GPU utilization, the overarching economic factor in AI model training, requires a network that optimizes for JCT and minimizes tail latency. Faster model training means a shorter time to results and lower data center costs through better-optimized compute resources.
2. Open Infrastructure
Performance matters, which is why everyone invests in it. InfiniBand is still the leader in providing optimized AI/ML multi-node GPU networks. However, InfiniBand is a back-end network, and you must also have a front-end Ethernet network for server connectivity. At some point, economics takes over: first, InfiniBand is very expensive per port, and second, you have two networks to support, an Ethernet front end and an InfiniBand back end. Economics is driven by competition, and competition is driven by openness. We’ve seen this play out in our industry before. If I were a betting man, I would bet that Ethernet wins. Again. An open platform maximizes innovation. It’s not that proprietary technologies don’t have their roles to play, but seldom does a single purveyor of technology out-innovate the rest of the market. And it simply never happens in environments with so much at stake.
3. Experience-first Operations
Data center networks are becoming increasingly complex, and new protocols must be added to the fabric to meet AI workload performance demands. While complexity will continue increasing, intent-based automation shields the network operator from that complexity. Visibility is critical in an AI/ML training infrastructure; the ability to see hot spots and congestion is essential for monitoring and managing the fabric.