Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 5-Parallel GPU Node Design Challenges for the Network Engineer

In the last four articles, we learned why GPUs are critical for accelerating AI/ML workloads, and why memory latency, CPU clock ticks, and data-transfer speed become critical in HPC and GPU-intensive AI/ML training. Utilizing RDMA, whether over InfiniBand or RoCE, is a must-have for any AI/ML GPU-to-GPU connectivity. RDMA over InfiniBand or RoCE is also a must-have for storage connectivity (NVMe-oF storage fabrics are the optimal choice) so the compute nodes can load the training data. We will now do a deep dive into the two network design options for supporting multi-node GPU-to-GPU communications and high-speed storage for training data.

Traditional data center networks today

Data center networking integrates resources like load balancing, switching, analytics, and routing to store and process data and applications. It connects computing and storage units within a data center, facilitating efficient internet or network data transmission.

A data center network (DCN) is vital for connecting data across multiple sites, clouds, and private or hybrid environments. Modern DCNs ensure robust security, centralized management, and operational consistency by employing full-stack networking and security virtualization. These networks are crucial in establishing stable, scalable, and secure infrastructure, meeting evolving communication needs, and supporting cloud computing and virtualization.

Traditional Three-Tier Topology Architecture

The traditional data center architecture uses a three-tier topology designed for general-purpose networks, usually segmented into pods that constrain the location of devices such as virtual servers. The architecture consists of core switches (layer 3), distribution switches (layer 2 or layer 3), and access switches. The access layer runs from the data center to the individual nodes (computers) where users connect to the network. The three-tier network design has been around for three-plus decades and is still prevalent in up to half of today's data centers. The design is resilient and scalable, and by using MLAG or vPC, it offers high-speed, redundant connectivity to servers and other switches. The challenge is that it is designed for North-South traffic patterns and does not support the East-West traffic of HPC or AI/ML workloads.

Leaf-Spine Topology Architecture

The leaf-spine topology architecture is a streamlined network design that moves data in an east-west flow. The design has only two layers: the leaf layer and the spine layer. The leaf layer consists of access switches that connect to devices like servers, firewalls, and edge routers. The spine layer is the backbone of the network, where every leaf switch is interconnected with every spine switch. Using ECMP, the leaf-spine topology allows all connections to run at the same speed, and the mesh of fiber links creates a high-capacity network resource shared by all attached devices.

The new challenge to support AI/ML Workloads

The significant difference between HPC workloads and AI/ML workloads is that AI/ML executes its jobs across all of the GPU-accelerated compute nodes at once. In an AI/ML workload, any kind of slowdown in the network will stall ALL of the GPUs, because progress is gated by the slowest path.

With increasing data and model complexities, the time required to train neural networks has become prohibitively large. A typical neural network has hundreds of hidden layers and thousands of neural nodes, each performing a mathematical function on the output of the previous node.

A typical neural network node comprises inputs, weights, and a function that combines these weighted inputs to derive an output. The output may then feed another node with another function.

These neural networks are commonly built with programming languages such as Python and C/C++ and frameworks such as PyTorch, TensorFlow, and JAX, which are designed to take advantage of GPUs natively. The AI/ML engineer determines how the work is placed across multi-GPU systems using parallelism.

Users are turning to data-parallel neural networks (DPNN) and large-scale distributed resources on compute clusters to address the exponential rise in training time. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes, using blocking communication operations after each forward-backward pass.

This synchronization is the central algorithmic bottleneck.
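
As a minimal sketch of what this blocking, all-reduce style of data parallelism looks like in practice (assuming PyTorch's DistributedDataParallel with the NCCL backend, launched via torchrun; the toy model and sizes are illustrative only):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torchrun has set RANK / WORLD_SIZE / MASTER_ADDR for us.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)      # toy stand-in for a real model
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for step in range(10):
    inputs = torch.randn(32, 1024, device=local_rank)
    loss = ddp_model(inputs).sum()
    loss.backward()      # gradients are averaged across every rank by an all-reduce here
    optimizer.step()     # no rank can take this step until the collective completes
    optimizer.zero_grad()

dist.destroy_process_group()
```

Every rank blocks on that gradient all-reduce, which is exactly why the network path between GPUs ends up gating the whole training job.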

The typical cyclical flow of data through the neural network shows the processing performed on each batch. If multiple GPUs are used, there is a local sync across the physical GPUs (local to the compute node, or across many compute nodes with 8 GPUs each).

What is important to note is the global send to all GPUs and the global receive from all GPUs. Picture a 128- or 256-GPU system connected in a full mesh. As each GPU completes its local sync, it performs a global send of ALL of the data on that GPU for that batch to ALL of the other GPUs in the mesh to synchronize. As with all GPU mesh jobs, each GPU runs its portion of the job by executing its instructions, and once it finishes, it notifies all of the other GPUs in an all-to-all collective type workload. We now have a full mesh of every GPU sending notifications to every other GPU in the collective.

The next portion is where the synchronization problem lies. All GPUs must synchronize, and each GPU must wait for all other GPUs to complete, which it learns from the notification messages. The AI/ML computation stalls waiting for the slowest path, so the JCT, or Job Completion Time, is set by the worst-case latency. This is analogous to a TCP sender waiting for an acknowledgment before it can transmit again: if buffers fill and packets are dropped, the sender must retransmit and cut its congestion window in half, further increasing latency and JCTs.
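
A toy calculation makes the tail-latency point concrete (all numbers are invented purely for illustration):

```python
import random

# Per-iteration time for each GPU: fixed compute time plus the network time its
# gradients spend in the collective. A few GPUs land on slower, congested paths.
NUM_GPUS = 256
compute_ms = [10.0] * NUM_GPUS
network_ms = [2.0 + random.expovariate(1.0) for _ in range(NUM_GPUS)]

per_gpu_ms = [c + n for c, n in zip(compute_ms, network_ms)]

# The synchronization barrier means the iteration finishes only when the slowest
# GPU finishes, so job completion time tracks the worst-case (tail) latency.
iteration_ms = max(per_gpu_ms)
idle_ms = [iteration_ms - t for t in per_gpu_ms]

print(f"iteration time (drives JCT): {iteration_ms:.1f} ms")
print(f"average GPU idle time:       {sum(idle_ms) / NUM_GPUS:.1f} ms")
```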

To put it in perspective, a typical GPU cluster that we’re seeing our customers looking to deploy at max performance has roughly as much network traffic traversing it every second as there is in all of the internet traffic across America. To understand the economics of an AI data center, know that GPU servers can cost as much as $400,000 each. So, maximizing GPU utilization and minimizing GPU idle time are two of the most critical drivers of AI data center design.

Looking at the traditional DC traffic pattern versus an AI/ML pattern, we see sustained, high-bandwidth flows during AI/ML job execution, with spikes in traffic that lead to congestion. This, in turn, causes our jobs to stall in the synchronization phase, waiting on the notifications. Depending on the load-balancing algorithms in our DC architecture, we will have hot spots with congestion alongside underutilized links, leading to increased job completion times (JCTs). Besides the memory latency issue, this is the most important thing to understand when designing AI/ML network architectures: hot spots and congestion cause jobs to stall, and JCT increases dramatically with long-lived flows. In scenarios where multiple high-bandwidth flows are hashed onto the same uplink, they can exceed the available bandwidth of that specific link, leading to congestion.
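
To see why hashing creates those hot spots, consider the rough sketch below (a simplified stand-in for a switch's ECMP/port-channel hash, not any vendor's actual algorithm; addresses and ports are made up):

```python
import zlib
from collections import Counter

UPLINKS = 4

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Simplified 5-tuple hash standing in for a switch's ECMP/LAG hash."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % UPLINKS

# A handful of long-lived, high-bandwidth flows between GPU nodes.
flows = [(f"10.0.0.{i}", "10.0.1.9", 49152 + i, 4791) for i in range(8)]
placement = Counter(pick_uplink(*flow) for flow in flows)

# Several elephant flows can land on the same uplink while other uplinks sit idle.
print(dict(placement))
```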

Today's typical networks connecting compute and GPU nodes follow two main designs: 3-tier and spine-leaf. Both share the same buffer issue: the network switches have ingress and egress buffers that hold packets before they are placed on the wire between switches. If the buffers are full, the switch cannot transmit or receive data, and packets are dropped. TCP's congestion control then retransmits the dropped segments and halves the congestion window. This leads to severe inefficiency, and the problem snowballs very quickly. JCTs increase because the GPUs on that link can neither send nor receive at full rate.

If we examine a typical 3-tier vPC or MLAG design, we can quickly have an issue with large, long-lived flows. The layer 2 hashing algorithm will place the flow on a single uplink from the sending TOR to the distribution-layer switch, and the distribution-layer switch will place the flow on a single downlink to the receiving TOR. It is easy to see that buffer overruns will occur on these long-lived, high-throughput flows, causing drops, TCP retransmissions, and congestion windows to shrink. These all add latency, severely impacting JCTs. In this 3-tier design, output queuing techniques can be used to minimize the issues, and moving nodes to other TOR switches may help. Ultimately, this design will not support large-scale AI/ML training using multi-node GPUs.

The preferred network design for AI/ML deep learning is a spine-leaf architecture using Layer 3 ECMP to spread flows across multiple uplinks to multiple spines. While the spine-leaf architecture alleviates most of the problems, we can still run into ingress buffer overruns. Remember, a slowdown on one GPU affects all of the other GPUs whenever it comes time for a global synchronization. This is the single biggest issue and will lead to increased JCTs.

Distributing the workloads across the GPUs and then synching them to train the AI model requires a new type of network that can accelerate “job completion time” (JCT) and reduce the time that the system is waiting for that last GPU to finish its calculations (“tail latency”).

Therefore, data center networks optimized for AI/ML must have exceptional capabilities around congestion management, load balancing, latency, and, above all else, minimizing JCT. ML practitioners must accommodate more GPUs into their clusters as model sizes and datasets grow. The network fabric should support seamless scalability without compromising performance or introducing communication bottlenecks.

AI/ML multi-node GPU networking represents a once-in-a-generation inflection point that will present us with complex technical challenges for years. We are still in the infancy of these designs, and as network architects, we must keep our eyes and ears open for what is coming.

1. High Performance 
Maximizing GPU utilization, the overarching economic factor in AI model training, requires a network that optimizes for JCT and minimizes tail latency. Faster model training means a shorter time to results and lower data center costs through better-optimized compute resources.

2. Open Infrastructure
Performance matters, which is why everyone invests in it. InfiniBand is still the leader in providing optimized multi-node AI/ML GPU networks. However, InfiniBand is a back-end network, and you must also have a front-end Ethernet network for server connectivity. At some point, economics takes over: first, InfiniBand is very expensive per port, and second, you have two networks to support, the Ethernet front end and the InfiniBand back end. Economics is driven by competition, and competition is driven by openness. We've seen this play out in our industry before. And if I am a betting man, I am betting that Ethernet wins. Again. An open platform maximizes innovation. It's not that proprietary technologies don't have their roles to play, but seldom does a single purveyor of technology out-innovate the rest of the market. And it simply never happens in environments with so much at stake.

3. Experience-first Operations
Data center networks are becoming increasingly complex, and new protocols must be added to the fabric to meet AI workload performance demands. While complexity will continue increasing, intent-based automation shields the network operator from that complexity. Visibility is critical in an AI/ML training infrastructure, and the ability to see hot spots and congestion is critically important to monitor and manage.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 7-Cisco’s Blueprint for AI/ML Networking

Cisco Data Center Networking Blueprint for AI/ML Applications

Widely available GPU-accelerated servers, combined with better hardware and popular programming languages like Python and C/C++, along with frameworks such as PyTorch, TensorFlow, and JAX, simplify the development of GPU-accelerated ML applications. These applications serve diverse purposes, from medical research to self-driving vehicles, relying on large datasets and GPU clusters for training deep neural networks. Inference frameworks apply knowledge from trained models to new data, with optimized clusters for performance.

The learning cycles involved in AI workloads can take days or weeks, and high-latency communication between server clusters can significantly impact completion times or result in failure. AI workloads demand low-latency, lossless networks, requiring appropriate hardware, software features, and configurations. Cisco Nexus 9000 switches, with their capabilities and tools like Cisco Nexus Dashboard Insights and Nexus Dashboard Fabric Controller, offer an ideal platform for building a high-performance AI/ML network fabric.

The Network Design parameters

A two-tier, spine-switch-leaf-switch design with Cisco Nexus switches is recommended for building a non-blocking network with low latency and scalability. For a GPU cluster with 1024 GPUs, utilizing 8 GPUs per server, and 2 ports of 100Gbps per server, Cisco Nexus 93600CD-GX switches at the leaf layer and Nexus 9332D-GX2B switches at the spine layer are proposed.

Each Cisco Nexus 93600CD-GX switch has 28 ports of 100G for server connections and 8 uplinks of 400G, ensuring a non-blocking switch. To accommodate 256 server ports, 10 leaf switches are required, providing redundancy by dual-homing servers to two separate leaf switches. The network design also allows for connecting storage devices and linking the AI/ML server cluster to the enterprise network.

The spine switches, Cisco Nexus 9332D-GX2B, each have 20 x 400G ports, leaving 12 ports free for future expansion. Redundancy is achieved with four spine switches. The system is designed to support scalability, ensuring additional leaf switches can be added without compromising the non-blocking nature of the network.

The Cisco Nexus 93600CD-GX leaf switches and Cisco Nexus 9332D-GX2B spine switches have a low latency of 1.5 microseconds, providing a maximum end-to-end latency of ~4.5 microseconds for traffic crossing both leaf and spine switches. Congestion is managed by WRED ECN, preserving RoCEv2 transport latency on endpoints.
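
A quick back-of-the-envelope check of the sizing and latency figures quoted above (just the arithmetic; the port counts are those of the switch models named in this design):

```python
import math

GPUS, GPUS_PER_SERVER, PORTS_PER_SERVER = 1024, 8, 2
servers = GPUS // GPUS_PER_SERVER                  # 128 servers
server_ports = servers * PORTS_PER_SERVER          # 256 x 100G leaf ports needed

LEAF_DOWN, LEAF_UP = 28, 8                         # 93600CD-GX: 28x100G down, 8x400G up
leaves = math.ceil(server_ports / LEAF_DOWN)       # 10 leaf switches
non_blocking = LEAF_UP * 400 >= LEAF_DOWN * 100    # 3200G up vs 2800G down -> True

SPINES = 4
uplinks_per_spine = leaves * LEAF_UP // SPINES     # 80 links / 4 spines = 20 x 400G each

HOPS, SWITCH_LATENCY_US = 3, 1.5                   # leaf -> spine -> leaf
end_to_end_us = HOPS * SWITCH_LATENCY_US           # ~4.5 us

print(servers, server_ports, leaves, non_blocking, uplinks_per_spine, end_to_end_us)
```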

The network can easily expand by adding more leaf switches or doubling spine capacity with Cisco Nexus 9364D-GX2A switches. Alternatively, a three-tier design can interconnect multiple non-blocking network fabrics.

Designed for AI/ML workloads with a massively scalable data center (MSDC) approach, the network utilizes BGP as the control plane to Layer 3 leaf switches. For multi-tenant environments, an MP-BGP EVPN VXLAN network can be employed, allowing network separation between tenants. The principles described are suitable for both simple Layer 3 and VXLAN designs.

RoCEv2 as Transport for AI Clusters

RDMA is a widely used technology in high-performance computing and storage networking, offering high throughput and low-latency memory-to-memory communication between compute nodes. It minimizes CPU involvement by offloading the transfer function to network adapter hardware, bypassing the operating system’s network stack and resulting in reduced power requirements.

AI Clusters Require Lossless Networks

For RoCEv2 transport, a network needs to ensure high throughput and low latency while preventing traffic drops during congestion. Cisco Nexus 9000 switches, designed for data center networks, offer low latency and up to 25.6Tbps of bandwidth per ASIC, meeting the demands of AI/ML clusters using RoCEv2 transport. Additionally, these switches provide support and visibility in maintaining a lossless network environment through software and hardware telemetry for ECN and PFC.

Explicit Congestion Notification (ECN)

In scenarios requiring end-to-end congestion information propagation, Explicit Congestion Notification (ECN) is employed for congestion management. The ECN bits are set to 0x11 in the IP header’s Type of Service (TOS) field when congestion is detected, triggering a congestion notification packet (CNP) from the receiver to the sender. Upon receiving the congestion notification, the sender adjusts the flow accordingly. This built-in, efficient process occurs in the data path between two ECN-enabled endpoints.

Nexus 9000 switches, in the case of network congestion, mark packets with ECN bits using a congestion-avoidance algorithm. The switch utilizes Weighted Random Early Detection (WRED) on a per-queue basis. WRED sets two thresholds in the queue: a minimum threshold that signals minor congestion and a maximum threshold that signals severe congestion. As buffer utilization grows between the two, WRED marks outgoing packets based on a drop probability, expressed as a percentage of all outgoing packets. If congestion persists beyond the maximum threshold, the switch sets ECN to 0x11 on every outgoing packet, enabling endpoints to detect and respond to congestion.
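
Conceptually, the per-packet marking decision looks something like the sketch below (illustrative pseudologic only, not the switch ASIC implementation; the threshold and probability values are made up):

```python
import random

WRED_MIN, WRED_MAX = 40, 80   # queue depth thresholds, % of buffer (made-up values)
MARK_PROBABILITY = 0.10       # marking probability as the queue approaches WRED_MAX

def ecn_mark(queue_depth_pct: float) -> bool:
    """Return True if this outgoing packet should be marked ECN 0x11 (CE)."""
    if queue_depth_pct < WRED_MIN:
        return False                     # no congestion, no marking
    if queue_depth_pct >= WRED_MAX:
        return True                      # severe congestion, mark every packet
    # Between the thresholds, marking probability ramps up with queue depth.
    ramp = (queue_depth_pct - WRED_MIN) / (WRED_MAX - WRED_MIN)
    return random.random() < ramp * MARK_PROBABILITY

marked = sum(ecn_mark(65) for _ in range(10_000))
print(f"marked {marked} of 10000 packets at 65% queue depth")
```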

The GPUs on Hosts 1-4 are sending RoCE traffic to the GPU on Host 9. The egress port connected to Host 9 is oversubscribed, and the buffer on Leaf5 reaches the WRED minimum. Leaf5 starts marking some of the packets headed to Host 9 with ECN 0x11 to indicate congestion in the data path. The NIC on Host 9 then starts sending Congestion Notification Packets (CNPs) back to the sources (Hosts 1-4). Because only some of the data packets were marked with congestion-experienced bits, the sources reduce the traffic throughput for those flows and continue to send packets. If congestion persists and buffer usage exceeds the WRED maximum threshold, the switch marks every packet with congestion-experienced bits.

This results in the senders receiving numerous CNP packets, prompting their algorithms to significantly reduce the data-rate transmission toward the destination. This proactive measure helps alleviate congestion, allowing the buffer to gradually drain. As this occurs, the traffic rate should increase until the next congestion signal is triggered. Remember that, for this to work properly, every device in the path MUST support ECN, and the endpoints must be able to send and process CNP packets to reduce the traffic. Also, the WRED thresholds must be configured properly so we don't mark ECN too early or too late, keeping traffic flowing properly.

How PFC Works

Enabling PFC on the Cisco Nexus 9000 switch designates a specific class of service for lossless transport, treating its traffic differently from other classes. Any port configured with PFC is assigned a dedicated no-drop queue and a dedicated buffer. For lossless capabilities, the queue features two thresholds. The xOFF threshold, positioned higher in the buffer, triggers the generation of a PFC pause frame when reached, which is sent back toward the traffic source. Once the buffer starts draining and falls below the xON threshold, pause frames are no longer sent, indicating the system's assessment that congestion has subsided.
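
A toy model of that xOFF/xON hysteresis on a single no-drop queue (buffer numbers are invented; real thresholds are platform- and configuration-specific):

```python
XOFF = 800   # KB of buffer in use that triggers a pause frame toward the source
XON  = 400   # KB of buffer in use below which pause frames stop being sent

class NoDropQueue:
    def __init__(self):
        self.depth_kb = 0
        self.pausing_upstream = False

    def enqueue(self, kb: int):
        self.depth_kb += kb
        if self.depth_kb >= XOFF and not self.pausing_upstream:
            self.pausing_upstream = True       # send PFC pause toward the source
            print(f"depth {self.depth_kb} KB >= xOFF: sending PFC pause upstream")

    def dequeue(self, kb: int):
        self.depth_kb = max(0, self.depth_kb - kb)
        if self.depth_kb <= XON and self.pausing_upstream:
            self.pausing_upstream = False      # congestion has subsided, stop pausing
            print(f"depth {self.depth_kb} KB <= xON: resuming upstream traffic")

q = NoDropQueue()
for _ in range(9):
    q.enqueue(100)    # a burst fills the buffer and crosses xOFF
for _ in range(6):
    q.dequeue(100)    # the buffer drains below xON and traffic resumes
```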

Leaf5 receives traffic coming from Hosts 1-4, which leads to congestion on the port toward Host 9. At this point, the switch uses its dedicated buffer to absorb the incoming traffic. The traffic is buffered by Leaf5, and after the PFC xOFF threshold is reached, Leaf5 sends a pause frame to the upstream hop, in this diagram Spine-1. After Spine-1 receives the pause frame, it stops transmitting traffic, which prevents further congestion on Leaf5. At the same time, Spine-1 starts buffering traffic in its no-drop queue, and after that buffer reaches the xOFF threshold, it sends pause frames down to Leaf1, which is sending traffic toward the spine switch. This hop-by-hop propagation of pause behavior continues until Leaf1 sends pause frames to Hosts 1-4.

After the senders receive a pause frame, the stream slows down, which allows the buffers to drain on all the switches in the network. As each device falls below the xON threshold, that device stops propagating pause frames. After the congestion is mitigated and pause frames are no longer sent, Host 9 starts receiving traffic again.

The pause mechanism, triggered by network congestion or an endpoint with PFC, ensures that all devices in the path receive a pause frame before halting transmission to prevent packet drops. In rare instances, a PFC storm may occur if a misbehaving host continuously sends PFC frames, potentially causing network disruption. To address this, the PFC watchdog feature sets an interval to monitor if packets in a no-drop queue drain within a specified time. If the time is exceeded, all outgoing packets on interfaces matching the problematic PFC queue are dropped, preventing a network PFC deadlock.

Remember that, in order to make this work properly, everything in the path MUST support PFC and the ability to reduce the traffic. Also, the PFC buffers must be configured properly so we don't trigger PFC too early or too late, keeping traffic flowing properly.

Using ECN and PFC Together to Build Lossless Ethernet Networks

As demonstrated in previous examples, both ECN and PFC excel at managing congestion independently, but their combined effectiveness surpasses individual performance. ECN reacts initially to alleviate congestion, while PFC acts as a fail-safe to prevent traffic drops if ECN’s response is delayed and buffer utilization increases. This collaborative congestion management approach is known as Data Center Quantized Congestion Notification (DCQCN), developed for RoCE networks.

The synergy between PFC and ECN ensures efficient end-to-end congestion control. During minor congestion with moderate buffer usage, WRED with ECN seamlessly manages congestion. However, severe congestion or microburst-induced high buffer usage triggers PFC intervention for effective congestion management. To optimize the functionality of WRED and ECN, appropriate thresholds must be set. In the example, WRED’s minimum and maximum thresholds are configured to address congestion initially, while the PFC threshold serves as a safety net post-ECN intervention. Both ECN and PFC operate on a no-drop queue, guaranteeing lossless transport.

Using Approximate Fair Drop (AFD)

An alternative method for congestion management involves utilizing advanced QoS algorithms, such as approximate fair drop (AFD) present in Cisco Nexus 9000 switches. AFD intelligently distinguishes high-bandwidth (elephant flows) from short-lived, low-bandwidth flows (mice flows). By identifying the elephant flows, AFD selectively marks ECN bits with 0x11 values, proportionally to the bandwidth utilized. This targeted marking efficiently regulates the flows contributing most to congestion, optimizing performance for minimal latency.

Compared to WRED, AFD offers the advantage of granular flow differentiation, marking only high-bandwidth elephant flows while sparing mice flows to maintain their speed. In AI clusters, allowing short-lived communications to proceed without long data transfers or congestion slowdowns is beneficial. By discerning and regulating elephant flows, AFD ensures faster completion of many transactions while avoiding packet drops and maintaining system efficiency.

Using NDFC to Build Your AI/ML Network

Regardless of the chosen network architecture, whether Layer 3 to the leaf switch or employing a VXLAN overlay, the Cisco Nexus Dashboard Fabric Controller (NDFC), also known as the Fabric Controller service, offers optimal configurations and automation features. With NDFC, the entire network, including QoS settings for PFC and ECN, can be configured rapidly. Additionally, the Fabric Controller service facilitates the automation of tasks such as adding new leaf or spine switches and modifying access port configurations.

To illustrate, consider the creation of a network fabric from the ground up using eBGP for constructing a Layer 3 network. This is achieved by utilizing the BGP fabric template.

To implement Quality of Service (QoS) across the entire fabric, the Advanced tab provides the option to select the template of your choice. This ensures uniform configuration across all switches, enabling consistent treatment of RoCEv2 traffic throughout the fabric. Some configurations need to be deployed using the freeform method, where native CLI commands are directly sent to the switch for setup. In this freeform configuration, the hierarchy and indentation must align with the structure of the running configuration on the switch.

Using Nexus Dashboard Insights to monitor and tune

Cisco Nexus Dashboard Insights offers detailed ECN mark counters at the device, interface, and flow levels. It also provides information on PFC packets issued or received on a per-class-of-service basis. This real-time congestion data empowers network administrators to fine-tune the network for optimal congestion response.

Utilizing the granular visibility of Cisco Nexus Dashboard Insights, administrators can observe drops and adjust and tune WRED or AFD thresholds to eliminate drops under normal traffic conditions—a crucial first step in ensuring effective handling of regular congestion in AI/ML networks. In scenarios of micro-burst conditions, where numerous servers communicate with a single destination, administrators can leverage counter data to refine WRED or AFD thresholds alongside PFC, achieving a completely lossless behavior. Once drops are mitigated, ECN markings and PFC RX/TX counters’ reports assist in further optimizing the system for peak performance.

The operational intelligence engine within Cisco Nexus Dashboard Insights employs advanced alerting, baselining, correlating, and forecasting algorithms, utilizing telemetry data from the networking and compute components. This automation streamlines troubleshooting, facilitating rapid root-cause analysis and early remediation. The unified network repository and compliance rules ensure alignment with operator intent, maintaining the network state.

Summary

The Cisco Data Center Networking Blueprint for AI/ML Applications white paper offers a comprehensive guide to designing and implementing a non-blocking, lossless fabric that supports RoCEv2, allowing for RDMA GPUDirect operations.

Utilizing the Cisco Nexus 93600CD-GX leaf switches and the Cisco Nexus 9332D-GX2B spine switches offers an end-to-end latency of ~4.5 us in a spine-leaf topology. Utilizing the proper number and speed of uplinks ensures a non-blocking fabric. Utilizing ECN and PFC to implement Data Center Quantized Congestion Notification (DCQCN) is necessary to create a lossless fabric. We can also utilize WRED or the more powerful AFD (switches must support it) to provide the building blocks of that lossless fabric.

Nexus Dashboard Fabric Controller (NDFC) should be utilized for ease of deployment of the fabric, as well as for the advanced queuing templates available for AI/ML. Finally, once the fabric is up and running, the WRED or AFD queues need to be monitored and tuned to eliminate hot spots of congestion in the fabric. Nexus Dashboard Insights is the tool of choice for this: we can look at Flow Table Events (FTE) from the Nexus switches, monitor the links for increasing ECN and PFC counts, and tune the queues appropriately. This will be an iterative process of tuning the fabric as well as how the GPUs are used inter-node for training runs.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 6-Infiniband and RoCE primer for the Network Engineer

AI/ML multi-node clusters have stringent infrastructure latency and packet-loss requirements. The infrastructure connecting the GPU compute nodes plays an important role in making large AI/ML jobs complete more quickly and, if designed correctly, mitigates the risk of large AI/ML jobs failing due to high latency or packet drops. To help network administrators optimize their AI/ML network for best performance and predict issues before they become service-impacting, it is imperative that the network provides deep visibility into the congestion-management algorithms as well as overall network system health.

Remote direct memory access (RDMA) is a well-known technology used in high-performance computing (HPC) and storage networking environments to bypass the CPU and gain direct memory access. The advantages of RDMA are high-throughput, low-latency transfer of information between compute nodes at the memory-to-memory level, without burdening the CPU. The transfer function is offloaded to the network adapter hardware, bypassing the operating system's software network stack. This technology also has the advantage of reduced power requirements.

AI applications take advantage of–and expect–low latency, lossless networks. To achieve this, network administrators need to deploy the right hardware and software features, along with a configuration that supports AI application needs. AI applications also need networks that can provide visibility into hot spots so they can be tuned as necessary. Finally, AI applications should take advantage of automation frameworks to make sure the entire network fabric is configured correctly and there is no configuration drift.

The network design has a considerable influence on the overall performance of an AI/ML cluster. In addition to the congestion-management tools described earlier in this document, the network design should (must) provide a non-blocking fabric to accommodate all the throughput that GPU workloads require. This leaves less work for the congestion-management algorithms (and for manual tuning), which in turn allows for faster completion of AI/ML jobs.

There are several ways to build a non-blocking network, but a two-tier, spine-switch-leaf-switch design provides the lowest latency and best scalability. To illustrate this, we will use the example of a company that wants to create a GPU cluster with 1024 GPUs: 128 servers, each with 8 GPUs and 2 x 100G server ports, along with storage devices that hold the data to be processed by the GPUs.

This means that 256 server ports are required at the leaf layer for server connectivity. To make this a non-blocking network, the uplinks to the spine switches must have the same total bandwidth capacity as the front-panel server-facing ports.

Infiniband overview

As we mentioned earlier:

“The de facto standard of supercomputer and GPU connectivity is InfiniBand and RDMA. While it works perfectly, it is a proprietary solution from Mellanox, acquired by NVIDIA, the world's leading GPU maker. Also, InfiniBand is a back-end network that connects compute nodes for GPU-to-GPU clustering using RDMA and storage connectivity. InfiniBand also requires a front-end network to provide connectivity to the outside world.”

I can't stress this enough: “While RDMA over Converged Ethernet (RoCE) offers an alternative to InfiniBand, it requires careful setup, monitoring, and QoS management to prevent hot spots.”

So let’s deep dive into Infiniband and see what makes it especially suitable for AI/ML workloads.

Subnet Manager

InfiniBand was the first to implement true SDN, using the Subnet Manager. The InfiniBand Subnet Manager (SM) is a centralized entity that discovers, activates, and manages the subnet. Each subnet needs one Subnet Manager, running either on the switch system itself (system-based) or on one of the nodes connected to the fabric (host-based). The SM applies network traffic-related configurations such as QoS, routing, and partitioning to the fabric devices. You can view and configure the Subnet Manager parameters via the CLI/WebUI.

Adaptive Routing (AR) enables the switch to select the output port based on the port’s load. It assumes no constraints on output port selection (free AR). The NVIDIA subnet manager (SM) enables and configures the AR mechanism on fabric switches. It scans all the fabric switches and identifies which support AR, then it configures the AR functionality on these switches.

Fast Link Fault Recovery (FLFR) configured on the subnet manager enables the switch to select the alternative output port when the output port provided in the Linear Forwarding Table is not in an Armed/Active state. This mode allows the fastest traffic recovery for switch-to-switch port failures occurring due to link flaps or neighbor switch reboots without SM intervention.
Fault Routing Notification (FRN) enables the switch to report to neighbor switches that an alternative output port for the traffic to a specific destination LID should be selected to avoid sending traffic to the affected switch.

High Bandwidth

GPU-to-GPU connectivity requires very high bandwidth and low latency. With NDR at 400Gb/s shipping today and the InfiniBand Trade Association roadmap projecting 3,200Gb/s by 2030, InfiniBand speeds “should” keep up with Ethernet speeds for the foreseeable future. InfiniBand's ultra-low latencies, with measured delays of 600ns end-to-end, accelerate today's mainstream HPC and AI applications. Most of today's Ethernet cut-through switches offer roughly 4us end-to-end, making an IB fabric the lower-latency option.

CPU offload, GPUDirect, and RDMA

As we discussed in Part 4, RDMA can be used to bypass the CPU and allow remote applications to communicate directly.

RDMA is a crucial feature of InfiniBand that provides a way for devices to access the memory of other devices directly. With RDMA, data can be transferred between machines without intermediate software involvement, significantly reducing latency and increasing performance. RDMA in InfiniBand is implemented as a two-step process. First, the sending device sends a message to the receiving device requesting access to its memory. The receiving device then grants access to its memory and returns a response to the sender. Once access is granted, the sender can read or write directly to the memory of the receiving device.

GPUDirect, offered by NVIDIA, allows GPUs to communicate directly inside a node using NVLink or across the InfiniBand network using GPUDirect RDMA. Other vendors' solutions must use MPI libraries to implement the equivalent of GPUDirect, but it is supported.

Scalability

InfiniBand can scale to 48,000 nodes and beyond by using an InfiniBand router and multiple subnets.

SHARP

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) allows the InfiniBand network to process data aggregation and reduction operations as data traverses the network, eliminating the need to send data multiple times between endpoints. This innovative approach not only decreases the amount of data traversing the network but offers additional benefits, including freeing valuable CPU resources for computation rather than using them to process communication overhead. It also progresses such communication asynchronously, independent of the host processing state.

Without SHARP, data must be sent from each endpoint to the switch, back down to the endpoints where the collectives are computed, back up to the switch, and then back to the endpoints. That is four traversal steps. With SHARP, the collective operations happen in the switch, so the number of traversal steps is halved: from endpoints to the switch and back. This leads to a 2x improvement in bandwidth for such collective operations and a 7x reduction in MPI allreduce latency. The SHARP technology is integrated into most of the open-source and commercial MPI software packages as well as OpenSHMEM, NCCL, and other I/O frameworks.

QOS

The InfiniBand architecture defined by IBTA includes Quality of Service and Congestion Control features that are tailored perfectly to the needs of Data Center Networks. Through the use of features available today in InfiniBand Architecture-based hardware devices, InfiniBand can address QoS requirements in Data Center applications in an efficient and simpler way, better than any other interconnect technology available today.

Unlike the more familiar Ethernet domain, where QoS and the associated queuing and congestion-handling mechanisms are used to rectify and enhance the best-effort nature of the delivery service, InfiniBand starts with a slightly different paradigm. The InfiniBand architecture includes QoS mechanisms inherently; they are not extensions, as in the case of IEEE 802.3-based Ethernet, which only defines a best-effort delivery service. Because QoS is inherent in the base InfiniBand specification, there are two levels to QoS implementation in InfiniBand hardware devices:

  1. QoS mechanisms are inherently built into the basic service delivery mechanism supported by the hardware, and
  2. Queuing services and management for prioritizing flows and guaranteeing service levels or bandwidths

InfiniBand is a lossless fabric; that is, it does not drop packets during regular operation. Packets are dropped only in instances of component failure. As such, the disastrous effects of retries and timeouts on data center applications are non-existent. When starting to work with InfiniBand QoS, you will need to understand two basic concepts: Service Levels (SLs) and Virtual Lanes (VLs).

Service Level (SL)
A field in the Local Routing Header (LRH) that defines the requested QoS. It permits a packet to operate at one of 16 SLs. The assignment of SLs is a function of each node's communication manager (CM) and its negotiation with the SM; the architecture does not define any method for traffic classification in sending nodes. This is left to the implementer.

Virtual Lane (VL)

Virtual Lanes provide a mechanism for creating multiple channels within a single physical link. They enable multiple independent data flows to share the same link, with separate buffering and flow control for each flow. A VL arbiter controls link usage by the appropriate data flow; virtual lane arbitration is the mechanism an output port uses to select which virtual lane to transmit from. IBA specifies a dual-priority weighted round-robin (WRR) scheme.

  • Each VL uses a different buffer
  • In the standard, there are 16 possible VLs.
    • VLs 0-14 – used for traffic (0-7 are the only VLs currently implemented by Mellanox switches)
    • VL 15 – used for management traffic only

Packets are assigned to a VL on each IB device they traverse based on their SL, input port, and output port. This mapping allows (among other things) seamless interoperability of devices that support different numbers of VLs. The mapping is architected for very efficient HW implementation.
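
A toy model of that mapping step (purely illustrative; the table contents and port numbers are invented, and on a real fabric the SL2VL tables are programmed by the Subnet Manager):

```python
# SL2VL mapping: on each device, (input_port, output_port) selects a 16-entry table
# that maps the packet's SL to a data VL. VL15 is reserved for subnet management
# traffic and is not assigned through this table.
sl2vl = {
    (1, 2): [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7],
    (1, 3): [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7],
}

def assign_vl(in_port: int, out_port: int, sl: int) -> int:
    """Pick the virtual lane for a data packet based on its SL and the port pair it crosses."""
    return sl2vl[(in_port, out_port)][sl]

print(assign_vl(1, 2, sl=4))   # -> VL2 with this example table
```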

It’s important to understand that Infiniband is NOT a dead technology and should be evaluated for any HPC and AI/ML multi-node GPU infrastructure.

Ethernet using RDMA over Converged Ethernet (RoCE)

RDMA, a key technology in high-performance computing and storage networking, ensures high throughput and low-latency memory-to-memory information transfer between compute nodes. It offloads the transfer function to network adapter hardware, reducing CPU burden and power requirements. InfiniBand (IB) initially pioneered RDMA, becoming the preferred high-performance computing transport. However, InfiniBand necessitated a separate, costly network for enterprise networks requiring HPC workloads.

RDMA offers various implementations, including Ethernet-based RDMA over Converged Ethernet (RoCE). RoCEv2, introduced in 2014, extends InfiniBand benefits and allows routing over Ethernet networks. RoCE is an extension of InfiniBand with Ethernet forwarding. RoCEv2 encapsulates IB transport in Ethernet, IP, and UDP headers, so it can be routed over Ethernet networks.

Its implementation in data center networks is driven by Ethernet’s familiarity, affordability, and the possibility of creating a converged fabric for regular enterprise traffic and RDMA workloads. In AI/ML clusters, RDMA is used to communicate memory-to-memory between GPUs over the network using GPUDirect RDMA. RoCEv2 in AI/ML clusters, particularly for GPUDirect RDMA, demands a lossless fabric achieved through explicit congestion notification (ECN) and priority flow control (PFC) algorithms.

Explicit Congestion Notification (ECN)

ECN is a feature between two ECN-enabled endpoints, such as switches, which can mark packets with ECN bits during network congestion. An ECN-capable network node employs a congestion avoidance algorithm, marking traffic contributing to congestion after reaching a specified threshold. ECN facilitates end-to-end congestion notification between ECN-enabled senders and receivers in TCP/IP networks. However, proper ECN functionality requires enabling ECN on all intermediate devices, and any device lacking ECN support breaks end-to-end ECN.

ECN notifies networks of congestion, aiming to reduce packet loss and delay by prompting the sending device to decrease the transmission rate without dropping packets. WRED operates on a per-queue level, with two thresholds in the queue. The WRED minimum threshold indicates minor congestion, marking outgoing packets based on drop probability. If congestion persists and buffer utilization surpasses the WRED maximum threshold, the switch marks ECN on every outgoing packet. This WRED ECN mechanism enables endpoints to learn about congestion and react accordingly.

ECN is marked in the network node where congestion occurs within the IP header’s Type of Service (TOS) field’s two least significant bits. When a receiver receives a packet with ECN congestion experience bits set to 0x11, it sends a Congestion Notification Packet (CNP) back to the sender. Upon receiving the notification, the sender adjusts the flow to alleviate congestion, constituting an efficient end-to-end congestion management process. If congestion persists beyond the WRED maximum threshold, every packet is marked with congestion bits, significantly reducing data rate transmission. This process alleviates congestion, allowing the buffer to drain and traffic rates to recover until the next congestion signal.

Priority Flow Control (PFC)

Priority Flow Control was introduced in Layer 2 networks as the primary mechanism to enable lossless Ethernet. Flow control is driven by the class of service (COS) value in the Layer 2 frame, and congestion is signaled and managed using pause frames and a pause mechanism. However, building scalable Layer 2 networks can be a challenging task for network administrators, which is why network designs have mostly evolved into Layer 3 routed fabrics.

Priority-Based Flow Control (PFC) is a congestion relief feature ensuring lossless transport by offering precise link-level flow control for each IEEE 802.1p code point (priority) on a full-duplex Ethernet link. When the receive buffer on a switch interface reaches a threshold, the switch transmits a pause frame to the sender, temporarily halting transmission to prevent buffer overflow. The buffer threshold must allow the sender to stop transmitting frames before overflow occurs, and the switch automatically configures queue buffer thresholds to avoid frame loss.

In cases where congestion prompts the pause of one priority on a link, other priorities continue to send frames, avoiding a complete transmission halt. Only frames of the paused priority are held back. When the receive buffer falls below a specific threshold, the switch sends a message to resume the flow. However, pausing traffic, depending on link traffic or priority assignments, may lead to ingress port congestion and propagate congestion through the network.

With PFC enabled, a dedicated class of service ensures lossless transport, treating traffic in this class differently from other classes. Ports configured with PFC receive a dedicated no-drop queue and buffer. To maintain lossless capabilities, the queue has two thresholds: the xOFF threshold, set higher in the buffer, triggers PFC frames to halt traffic towards the source, and the xON threshold, reached after the buffer starts draining, stops the transmission of pause frames, indicating the system’s belief that congestion has subsided.

Because RoCEv2 supports routing, PFC is adapted to work with Differentiated Services Code Point (DSCP) priorities for signaling congestion between routed hops. DSCP, used for classifying network traffic on IP networks, employs the 6-bit differentiated services field in the IP header for packet classification. Layer 3 marking ensures traffic maintains its classification semantics across routers. PFC frames, which use link-local addressing, enable pause signaling for both routed and switched traffic. PFC is transmitted per hop, from the congestion point back toward the traffic source. While this hop-by-hop process may take time to reach the source, PFC is the primary tool for congestion management in RoCEv2 transport.
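
Since the DSCP and ECN bits share the same byte of the IP header, a quick sketch of how that byte is assembled may help (the DSCP value 26 is only an example marking for RoCEv2 traffic, not a recommendation):

```python
def tos_byte(dscp: int, ecn: int) -> int:
    """Build the IP ToS / Traffic Class byte: 6 bits of DSCP, 2 bits of ECN."""
    assert 0 <= dscp < 64 and 0 <= ecn < 4
    return (dscp << 2) | ecn

ECT = 0b10   # ECN-capable transport, set by the sender
CE  = 0b11   # congestion experienced, set by a congested switch

print(hex(tos_byte(26, ECT)))   # the flow as the sender marks it
print(hex(tos_byte(26, CE)))    # the same flow after a congested hop marks it
```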

Data Center Quantized Congestion Notification (DCQCN)

The collaborative approach of Priority-Based Flow Control (PFC) and Explicit Congestion Notification (ECN), managing congestion together, is termed Data Center Quantized Congestion Notification (DCQCN), explicitly developed for RoCE networks. This joint effort of PFC and ECN ensures efficient end-to-end congestion management. During minor congestion with moderate buffer usage, WRED with ECN seamlessly handles congestion. PFC gets activated to manage congestion in cases of more severe congestion or microbursts causing high buffer usage. To optimize the functioning of both PFC and ECN, appropriate thresholds must be set.

In the example provided, WRED minimum and maximum thresholds get configured for lower buffer utilization to address congestion initially, while the PFC threshold is set higher as a safety net to manage congestion after ECN. Both ECN and PFC operate on a no-drop queue, ensuring lossless transport.

Data Center Quantized Congestion Notification (DCQCN) combines ECN and PFC to achieve end-to-end lossless Ethernet. ECN aids in overcoming the limitations of PFC for achieving lossless Ethernet. The concept behind DCQCN is to enable ECN to control flow by reducing the transmission rate when congestion begins, minimizing the time PFC is triggered, which would otherwise halt the flow entirely.

DCQCN Requirements

Achieving optimal performance for Data Center Quantized Congestion Notification (DCQCN) involves balancing conflicting requirements, ensuring neither premature nor delayed triggering of Priority-Based Flow Control (PFC). Three crucial parameters must be carefully calculated and configured:

1. **Headroom Buffers:** When a PAUSE message gets sent upstream, it takes time to be effective. The PAUSE sender should reserve sufficient buffer capacity to process incoming packets during this interval to prevent packet drops. QFX5000 Series switches allocate headroom buffers on a per-port, per-priority basis, carved from the global shared buffer. Control over headroom buffer allocation can be exercised using MRU and cable length parameters in the congestion notification profile. In case of minor ingress drops persisting post-PFC triggering, resolving them involves increasing headroom buffers for the specific port and priority combination.

2. **PFC Threshold:** Representing an ingress threshold, this threshold signifies the maximum size an ingress priority group can attain before triggering a PAUSE message to the upstream device. Each PFC priority has its priority group at each ingress port. PFC thresholds typically consist of two components—the MIN threshold and the shared threshold. PFC is activated for a corresponding priority when both MIN and shared thresholds are reached, with a RESUME message sent when the queue falls below these PFC thresholds.

3. **ECN Threshold:** This egress threshold equals the WRED start-fill-level value. Once an egress queue surpasses this threshold, the switch initiates ECN marking for packets in that queue. For effective DCQCN operation, this threshold must be lower than the ingress PFC threshold, ensuring ECN marking precedes PFC triggering. Adjusting the WRED fill level, such as setting it at 10 percent with default shared buffer settings, enhances ECN marking probability. However, a higher fill level, like 50 percent, may impede ECN marking, as ingress PFC thresholds would be met first in scenarios with two ingress ports transmitting lossless traffic to the same egress port.
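
As a rough sanity check of how these three parameters relate to one another (a sketch only, with invented numbers; real values depend on the platform, buffer model, and cable lengths):

```python
# Made-up example values, in KB, for a single port/priority.
ingress_pfc_min     = 100    # MIN threshold of the ingress priority group
ingress_pfc_shared  = 500    # shared threshold of the ingress priority group
headroom_kb         = 200    # absorbs in-flight traffic after the PAUSE is sent
egress_ecn_start_kb = 300    # WRED start-fill-level where ECN marking begins

pfc_trigger_kb = ingress_pfc_min + ingress_pfc_shared

# For DCQCN to behave, ECN marking must begin before PFC fires, and the headroom
# must cover roughly a cable round trip's worth of in-flight data.
assert egress_ecn_start_kb < pfc_trigger_kb, "ECN should react before PFC pauses the link"

link_gbps, cable_rtt_us = 400, 2.0    # example: 400G link, 2 us cable round trip
in_flight_kb = link_gbps * 1e9 * cable_rtt_us * 1e-6 / 8 / 1024
print(f"~{in_flight_kb:.0f} KB in flight; headroom of {headroom_kb} KB "
      f"{'covers' if headroom_kb >= in_flight_kb else 'does NOT cover'} it")
```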

Summary

In summary, the interconnecting “network” between GPU nodes must meet the following requirements:

  • Offer high bandwidth and low latency; be congestion-free, non-blocking, and not oversubscribed.
  • Offer support for RDMA.
  • Provide high visibility and the ability to adjust to reduce hot spots.
  • Use a spine-leaf design for the lowest latency and best scalability.
  • Provide an end-to-end solution from host to host (Ethernet NICs or InfiniBand HCAs supporting RDMA and priority flow control (PFC)).
  • 100Gbps to the host, 400 or 800Gbps uplinks from leaf to spine.
  • The number and backplane throughput of the spines must be greater than or equal to the front-panel port capacity of the connected leafs (non-blocking).
  • L3 utilizing ECMP, or even better, a switch platform with ASICs that can do bit spraying across the links (few options today).
  • InfiniBand is potentially a better turn-key solution; however, it requires a dedicated network just for node-to-node GPU connectivity.
  • RoCE is a viable alternative but requires a lot of custom tuning of ECN and PFC to eliminate hot spots of congestion during full training runs.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 4-DMA and RDMA for the Network Engineer

The Business needs for Machine Learning and Large Language Models (LLM)

The foundation of open models such as ChatGPT, Google Bard, Meta's models, and others is the large language model (LLM). This computer algorithm processes natural language inputs and predicts the next word based on what it has already seen, then the next word, and the next, until its answer is complete. LLMs are a type of AI trained (Machine Learning, or ML) on a massive trove of articles, Wikipedia entries, books, internet-based resources, and other input to produce human-like responses to natural language queries. That is an immense amount of data and requires massive computing resources. These open interfaces sit on top of underlying AI functionality known, in AI terms, as a model. An AI model is a mathematical representation, implemented as an algorithm or process, that generates new data that will (hopefully) resemble a data set you already have.

One of the most important things to remember here is that while there is human intervention in the training process, most learning and adapting happens automatically. Millions, billions, and even trillions of iterations are required to get the models to the point where they produce exciting results, so automation is essential. The process is highly computationally intensive, and much of the recent explosion in AI capabilities has been driven by advances in GPU computing power and techniques for implementing parallel processing on these chips.

The biggest challenge with these cloud-based open model interfaces is that they are available to everyone. Whatever you input into ChatGPT and other public engines is also used to train the public LLM. If proprietary data is fed into one of these open models, that data set is then on the public internet for everyone to see and use. Most companies have a policy on employee use of these models in their jobs and may block these general sites or set strict usage guidelines to prevent private data from becoming public. Today, hundreds of large open-source LLMs are available, and that number will grow exponentially.

Because of this growth and the security ramifications, companies and organizations must front-end these LLMs and fine-tune them with private data for task-specific goals. Companies and organizations can benefit from using these LLMs with their own private data, and adopting AI/ML will be paramount to a company's future. There is no question about this fact. As we saw with retail, the companies that did not adopt an online presence are either bankrupt or gone. Adopting AI/ML is a must-have for any company that wants to compete and be successful moving forward.

The exciting thing about ChatGPT and other interfaces to public AI/ML models is that, in the past, only the data scientists would drive the AI discussion within a company. Today, everybody engages with AI: sales, marketing, finance, almost everyone. It has been democratized to such an extent that now everybody, from the CEO to the average worker, has a point of view on how it can be used. So, companies must be on their front foot and have a strategy around what that means for the business, not just speak to the shiny objects they're doing in the organization.

The challenge now is to “train” these open LLMs with your own private data in-house, which requires massive computational resources, mainly in graphic processing units or GPUs. Connecting these GPUs together requires high-speed connectivity between the computer housing the GPUs and other computer/GPU resources in the data center.

High-Performance Computing (HPC) and Machine Learning (ML) workloads demand intensive processing, often relying on hardware acceleration to efficiently solve large and complex problems. There are two common acceleration types: interconnect acceleration, which provides high bandwidth and low latency, and compute acceleration, which leverages highly parallel GPU compute engines. The two can be combined using GPUDirect RDMA or RDMA with MPI, allowing advanced HPC and ML applications to use the cross-host coupling of multiple GPUs for scalable problem-solving.

DMA (Direct Memory Access)

DMA (Direct Memory Access) is a computer system feature facilitating data transfer between memory and peripheral devices, such as hard drives, without CPU intervention. Unlike programmed I/O, where a general-purpose processor manages the data transfer, DMA minimizes CPU involvement in large data transfers, eliminating the need to watch status bits and feed data into the controller register byte by byte.

To initiate a DMA transfer, the host writes a command block into memory specifying the source, destination, and byte count. Command blocks can be chained to handle non-contiguous addresses. After initiating the transfer, the CPU is free for other tasks while the DMA controller independently operates the memory bus, placing addresses on it without CPU intervention. Modern computers typically incorporate a standard DMA controller.
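
To make the command block concrete, here is a toy representation of the fields the host fills in (illustrative only; real descriptors are hardware-specific structures defined by the DMA controller):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DmaCommandBlock:
    """Toy DMA descriptor: what the host writes into memory before starting a transfer."""
    source: int                  # physical address to read from
    destination: int             # physical address to write to
    byte_count: int              # number of bytes to move
    next_block: Optional["DmaCommandBlock"] = None   # chaining covers non-contiguous transfers

# Two chained blocks describing a scatter-gather transfer of 8 KB in total.
second = DmaCommandBlock(source=0x9000_0000, destination=0x1000_1000, byte_count=4096)
first = DmaCommandBlock(source=0x8000_0000, destination=0x1000_0000, byte_count=4096,
                        next_block=second)
```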

GPUDirect P2P and GPUDirect RDMA

GPUDirect P2P and GPUDirect RDMA are components within the GPUDirect family of NVIDIA technologies, facilitating direct data exchange between GPUs, third-party network adapters, and various devices using PCIe standard features. It encompasses peer-to-peer (P2P) transfers between GPUs and remote direct memory access (RDMA). GPUDirect P2P allows direct data exchange between GPU memories on a single host, reducing the need to copy data to host memory and relieving the host CPU. If we had 8 GPUs in a single server, this would significantly reduce CPU burden and speed up the transfers by eliminating extra clock ticks, which reduces latency.


In the absence of GPUDirect P2P, the host CPU is required to shuttle data from the memory of one GPU to the host memory and then into the memory of the second GPU. GPUDirect P2P eliminates the need for host CPU intervention and allows direct data transfer between GPUs without involving host memory.
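
A small PyTorch sketch of the difference (assuming a machine with at least two CUDA GPUs; whether the direct copy actually travels peer-to-peer depends on the PCIe/NVLink topology and driver):

```python
import torch

assert torch.cuda.device_count() >= 2, "example assumes at least two GPUs"

# Ask the driver whether GPU 0 can access GPU 1's memory directly (P2P).
print("P2P available:", torch.cuda.can_device_access_peer(0, 1))

src = torch.randn(64 * 1024 * 1024, device="cuda:0")   # ~256 MB of float32 on GPU 0

# Device-to-device copy: with P2P enabled this does not bounce through host memory.
dst = src.to("cuda:1")

# The non-P2P path stages the same data through host memory first.
staged = src.cpu().to("cuda:1")

torch.cuda.synchronize()
```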

Remote Direct Memory Access (RDMA)

Remote Direct Memory Access (RDMA) is a widely recognized technology in high-performance computing (HPC) and storage networking environments. RDMA offers notable advantages, primarily characterized by its capacity for high throughput and low-latency information transfer between compute nodes.

What sets RDMA apart is its ability to operate at the memory-to-memory level, circumventing CPU involvement. The I/O transfer function is offloaded to the network adapter hardware, bypassing the conventional operating system's software network stack. This allows the remote applications to communicate directly, eliminating the sockets, transport driver, and NIC driver buffers. Bypassing these buffers saves clock ticks and memory bandwidth. This not only contributes to elevated efficiency but also minimizes the burden on the CPU. Another noteworthy advantage of RDMA is its capability to significantly reduce power requirements, adding to its appeal for resource-efficient, high-performance computing environments.

GPUDirect RDMA

GPUDirect RDMA, a multi-host version, enables a Host Channel Adapter (HCA) to directly read and write GPU memory data buffers, transferring the data through a remote HCA to a GPU on another host without involving the host CPU or copying data to host memory. GPUDirect RDMA demonstrates significant performance improvements for GPU-to-GPU accelerated High-Performance Computing (HPC) and Machine Learning/Deep Learning (ML/DL) applications.

In the absence of GPUDirect RDMA, the host CPU is responsible for copying data from GPU memory to host memory and transmitting it via RDMA (InfiniBand) to the remote host. GPUDirect RDMA eliminates host CPU involvement by enabling direct transfer of data from GPU memory to the remote host.

Remember the discussion in part 2 of this series about the speed of electricity across a motherboard? Using GPUDirect RDMA, we bypass the CPU and system memory by giving the InfiniBand or Ethernet adapter direct access to GPU memory over the PCIe bus. Roughly 20 mm of circuit-board trace costs one clock tick, so eliminating 100 mm or more of path reduces memory latency, CPU load, and overall inefficiency.
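Conceptually, GPUDirect RDMA means the verbs memory registration shown earlier can target GPU memory instead of host memory. The sketch below assumes a supporting HCA and driver with the NVIDIA peer-memory kernel module loaded; it is illustrative rather than a complete setup.

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Sketch: with GPUDirect RDMA (nvidia-peermem loaded and a supporting
 * HCA/driver), device memory can be registered with the verbs API directly,
 * so the NIC reads/writes GPU memory without staging through host RAM. */
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes, void **gpu_buf)
{
    /* Allocate the buffer in GPU memory rather than host memory. */
    if (cudaMalloc(gpu_buf, bytes) != cudaSuccess)
        return NULL;

    /* Register the device pointer; the peer-memory driver resolves the
     * GPU pages so the HCA can target them directly over PCIe. */
    return ibv_reg_mr(pd, *gpu_buf, bytes,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}
```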

NVLink Bus Protocol and NVSwitch

Before we discuss NVLink, we must brush up on SLI, or “Scalable Link Interface,” an older technology acquired and developed by NVIDIA that allows you to link together multiple similar GPUs (up to four). SLI theoretically lets you use all the GPUs together to complete specific computational tasks faster, without building a whole new computer for each GPU or waiting for a new generation of GPU hardware.

It works as a master-slave system in which the “master” GPU (generally the topmost/first GPU in the computer) controls and directs the “slave” GPUs, which are connected using SLI bridges. The master acts as a central hub, ensuring the slaves communicate and carry out the current task efficiently and stably. The master then collects the scattered results and combines them into a coherent whole.

However, SLI has withered on the vine and has been phased out by NVIDIA.

GPUDirect P2P technology can significantly improve the GPU communication performance of a single GPU server. Still, it is limited by the PCI Express bus protocol and its topology, which cap the achievable bandwidth. To solve this problem, NVIDIA introduced the NVLink bus protocol.

NVLink has gone through multiple generations. Fourth-generation NVLink provides 900 GB/s of bidirectional bandwidth per GPU, 1.5x greater than the prior generation and more than 5.6x higher than first-generation NVLink. That 900 GB/s is also more than 7x the bandwidth of PCIe Gen 5, the interconnect used in conventional x86 servers, and NVLink delivers 5x the energy efficiency of PCIe Gen 5, thanks to data transfers that consume just 1.3 picojoules per bit.

NVLink did away with the master-slave system we discussed in favor of mesh networking, in which every GPU is independent and can talk directly with the CPU and every other GPU.

This, along with a higher pin count and a newer signaling protocol, gives NVLink far lower latency and allows pooled resources such as VRAM to be used for far more complex calculations.
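As a small illustration of this mesh, the CUDA runtime lets an application query peer-to-peer attributes for every GPU pair. The sketch below prints access support, native-atomic support, and relative performance rank; it does not name the physical link, so a tool such as nvidia-smi topo -m is still needed to confirm whether a given pair is NVLink- or PCIe-connected.

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch: report peer-to-peer attributes between every pair of GPUs.
 * NVLink-attached peers typically report native-atomic support and a
 * better performance rank than PCIe-only peers. */
int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            if (a == b) continue;
            int access = 0, atomics = 0, rank = 0;
            cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       a, b);
            cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, a, b);
            cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       a, b);
            printf("GPU %d -> GPU %d: access=%d atomics=%d perf_rank=%d\n",
                   a, b, access, atomics, rank);
        }
    }
    return 0;
}
```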

Servers like the NVIDIA DGX H100 use this technology to deliver greater scalability for ultrafast deep learning training. The fourth generation of NVIDIA® NVLink® technology provides 1.5X higher bandwidth and improved scalability for multi-GPU system configurations. A single NVIDIA H100 Tensor Core GPU supports up to 18 NVLink connections for a total bandwidth of 900 gigabytes per second (GB/s)—over 7X the bandwidth of PCIe Gen5. 

The third generation of NVIDIA NVSwitch™ builds on the advanced communication capability of NVLink to deliver higher bandwidth and reduced latency for compute-intensive workloads. Each NVSwitch has 64 NVLink ports equipped with engines for NVIDIA Scalable Hierarchical Aggregation Reduction Protocol (SHARP) for in-network reductions and multicast acceleration to enable high-speed, collective operations.

NVLink is a direct GPU-to-GPU interconnect that scales multi-GPU input/output (I/O) within the server. NVSwitch combines multiple NVLinks to provide all-to-all GPU communication at full NVLink speed within a single node and between nodes.

The NVSwitch enables eight GPUs in an NVIDIA DGX H100 system to cooperate in a cluster with full-bandwidth connectivity.

NVSwitch can extend NVLink connections across nodes to create seamless, high-bandwidth GPU-to-GPU connectivity, effectively turning NVLink into a data-center-sized GPU. By adding a second tier of NVLink Switches externally to the servers, future NVLink Switch Systems can connect up to 256 GPUs and deliver a staggering 57.6 terabytes per second (TB/s) of all-to-all bandwidth, making it possible to rapidly tackle even the largest AI jobs.

One thing to note is that NVLink and NVSwitch are proprietary to NVIDIA. NVIDIA builds its own servers around NVLink and NVSwitch to bypass the PCIe bus for GPU-to-GPU communications. These servers also feature NVIDIA ConnectX-7 InfiniBand and VPI (InfiniBand or Ethernet) adapters, each running at 400 gigabits per second (Gb/s), to create a high-speed fabric for large-scale AI workloads. DGX A100 systems are also available with ConnectX-6 adapters.

The MPI Library and multi-node GPU support

Without a doubt, NVIDIA has a more advanced multi-GPU strategy with GPUDirect, NVLink, and NVSwitch. The other GPU vendors, such as Intel and AMD (AMD does offer Infinity Fabric), rely on the MPI library to provide CPU and PCIe bus bypass for GPU-to-GPU communications.

GPU data transfer using traditional MPI GPU-to-GPU remote methods

The MPI library requires much more care from developers to get GPU-to-GPU balancing correct. In the Naive model, developers handle data movements between the host and the device. In this model, applications making frequent MPI communications might suffer from excessive data movements between the host and the device.

In the GPU-aware MPI model, the underlying MPI library is aware of GPU buffers, so developers are relieved of the task of explicit data movement between the host and the device.

The kernel-based collectives model supports the execution of MPI scale-up kernels directly on the device. This approach helps eliminate the host-device data transfers that would otherwise be required to execute MPI calls on the host.
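The difference between the naive and GPU-aware models can be seen in a few lines. The sketch below assumes a CUDA-aware MPI build (for example, Open MPI compiled with CUDA support) and two ranks, each owning one GPU; buffer sizes and tags are illustrative.

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;
    float *d_buf;                                   /* device buffer */
    cudaMalloc((void **)&d_buf, count * sizeof(float));

    /* Naive model: the developer stages data through host memory, e.g.
     *   cudaMemcpy(h_buf, d_buf, ..., cudaMemcpyDeviceToHost);
     *   MPI_Send(h_buf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);       */

    /* GPU-aware model: pass the device pointer straight to MPI and let the
     * library (with GPUDirect RDMA underneath, where available) move GPU
     * memory to the remote rank directly.                                  */
    if (rank == 0)
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Launched with a standard mpirun across two ranks, the GPU-aware path avoids the host staging copies entirely.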

AMD's solutions

AMD offers its Instinct MI300X platform, which integrates 8 fully connected MI300X GPU OAM modules onto an industry-standard OCP design via 4th-Gen AMD Infinity Fabric™ links, delivering up to 1.5TB of HBM3 capacity for low-latency AI processing. AMD ROCm is an open software stack with a broad set of programming models, tools, compilers, libraries, and runtimes for AI and HPC solution development on AMD GPUs. The AMD RCCL communications library provides high-performance cross-GPU operations, such as gather and scatter, used for distributed training.

At each layer of the stack, AMD has built software libraries (ROCm, RCCL) or networking infrastructure (Infinity Fabric) or adopted existing networking infrastructure (Infiniband or RoCE) to support workloads like LLM training. If you’re familiar with NVIDIA’s platform, you’ll see that many components of AMD’s platform directly map to those of NVIDIA’s. 
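As a rough sketch of how that mapping looks in code, RCCL exposes the same collective API as NCCL. The fragment below performs an all-reduce of gradients across GPUs, the core collective of data-parallel training. It assumes the rank count and the unique communicator ID have already been distributed (typically via MPI), and the exact header path and build details vary by ROCm release.

```c
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>      /* on some ROCm releases the header is <rccl.h> */

/* Sketch: sum gradients in place across all GPUs with an RCCL all-reduce. */
int allreduce_gradients(float *d_grads, size_t count,
                        int rank, int nranks, ncclUniqueId id)
{
    ncclComm_t comm;
    hipStream_t stream;

    hipSetDevice(rank);                 /* one GPU per rank, for simplicity */
    hipStreamCreate(&stream);
    ncclCommInitRank(&comm, nranks, id, rank);

    /* All ranks contribute and receive the summed gradients. */
    ncclAllReduce(d_grads, d_grads, count, ncclFloat, ncclSum, comm, stream);

    hipStreamSynchronize(stream);
    ncclCommDestroy(comm);
    hipStreamDestroy(stream);
    return 0;
}
```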

InfiniBand and RDMA

The de facto standard for supercomputer and GPU connectivity is InfiniBand with RDMA. While it works extremely well, it is a proprietary solution from Mellanox, now part of NVIDIA, the world's leading GPU maker. InfiniBand serves as a back-end network that connects compute nodes for GPU-to-GPU clustering using RDMA and for storage connectivity; it still requires a front-end network to provide connectivity to the outside world.

The surge in large AI models, exemplified by ChatGPT in 2023, has significantly amplified the focus on InfiniBand technology, because the networks used to train GPT models are built on InfiniBand, developed by NVIDIA.

In its initial deployment, InfiniBand (IB) emerged as the pioneering application that brought the comprehensive advantages of Remote Direct Memory Access (RDMA) to the forefront. This groundbreaking technology introduced high throughput and CPU bypass capabilities, significantly reducing latency. InfiniBand also incorporated congestion management directly into its protocol, making it the preferred transport for high-performance computing (HPC) applications. RDMA enables GPU-to-GPU communication via GPUDirect on NVIDIA GPUs and via MPI on Intel and AMD GPUs.

InfiniBand is a communication link for data flow between processors and I/O devices, supporting up to 64,000 addressable devices. InfiniBand Architecture (IBA) is the industry-standard specification that defines a point-to-point switching input/output framework, typically used for interconnecting servers, communication infrastructure, storage devices, and embedded systems. InfiniBand features universality, low latency, high bandwidth, and low management costs. It is an ideal connection network for single-connection multiple data streams (clustering, communication, storage, management), with interconnected nodes reaching thousands.

InfiniBand networks utilize a point-to-point connection where each node communicates directly with other nodes through dedicated channels, reducing network congestion and improving overall performance. This architecture supports Remote Direct Memory Access (RDMA) technology, enabling data transfer directly between memories without involving the host CPU, further enhancing transfer efficiency.

The smallest complete unit in the InfiniBand Architecture is a subnet, and multiple subnets are connected by routers to form an extensive InfiniBand network. Each subnet comprises end nodes, switches, links, and subnet managers. InfiniBand networks find applications in scenarios such as data centers, cloud computing, high-performance computing (HPC), machine learning, and artificial intelligence. The core objectives include maximizing network utilization, CPU utilization, and application performance.

However, as enterprise networks began to integrate HPC workloads, using InfiniBand necessitated the establishment of dedicated networks tailored to leverage its specific benefits. While they optimize performance, purpose-built InfiniBand back-end networks introduce additional complexity and cost to enterprise infrastructures.

RDMA over Converged Ethernet (RoCE)

Recognizing the need for a more accessible and versatile implementation of RDMA for network transport, several alternatives were explored. One such implementation, RDMA over Converged Ethernet (RoCE), emerged as a significant evolution in this space. Standardized by the InfiniBand Trade Association (IBTA) and introduced in 2010, RoCE operates seamlessly within the same Layer 2 broadcast domain, offering a practical and efficient solution.

The introduction of RoCE version 2 in 2014 marked a pivotal advancement by enabling traffic routing and expanding its applicability. Unlike its predecessor, RoCEv2 allows traffic to traverse Layer 3 boundaries, introducing greater flexibility in network design. Essentially, RoCE is an extension of the InfiniBand framework, integrating the benefits of RDMA with Ethernet forwarding capabilities.

To facilitate interoperability with Ethernet networks, RoCEv2 encapsulates InfiniBand transport within Ethernet, IP, and UDP headers. This enables the routing of RoCE traffic over traditional Ethernet networks, providing a seamless bridge between InfiniBand and Ethernet environments. As a result, RoCE emerges as a versatile and efficient solution, allowing organizations to leverage the benefits of RDMA while aligning with the existing Ethernet infrastructure.
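The struct below is an illustrative-only view of that encapsulation, showing the header stack a RoCEv2 packet carries on the wire; field sizes are simplified and the layout is not intended for parsing real frames.

```c
#include <stdint.h>

/* Illustrative view of RoCEv2 encapsulation: InfiniBand transport
 * (Base Transport Header + payload) carried inside ordinary Ethernet/IP/UDP,
 * which is what makes the traffic routable across Layer 3 boundaries. */
struct rocev2_frame_sketch {
    uint8_t eth_header[14];   /* dst MAC, src MAC, EtherType (IPv4/IPv6)       */
    uint8_t ip_header[20];    /* routable L3 header; DSCP/ECN bits live here    */
    uint8_t udp_header[8];    /* destination port 4791 identifies RoCEv2        */
    uint8_t ib_bth[12];       /* InfiniBand Base Transport Header: opcode,
                                 destination QP, packet sequence number         */
    /* ... RDMA payload follows, ending with an invariant CRC (ICRC) ...        */
};
```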

Because of the vendor lock-in of InfiniBand and NVIDIA, and the advances in 400 and 800 Gb Ethernet switching, we can use RoCEv2 and tune the Ethernet network with PFC and ECN to create a lossless GPU fabric that rivals or even outperforms InfiniBand (more on this in future articles). Due to power and space requirements, most customers must squeeze this new AI/ML infrastructure into existing data center space, which makes converged Ethernet attractive. It looks like most, if not all, customers will use 400 and 800 Gb switches with RoCEv2 and careful tuning to support their AI/ML initiatives.

To give you an idea of how important this is, it is estimated that by 2027, 1 in 5 Ethernet ports deployed in the data center will be used exclusively for AI/ML infrastructure. That is a massive opportunity for vendors and data center design engineers to architect this new infrastructure properly.

Storage Protocols for AI/ML Infrastructures

We have focused on how GPUs work and how to connect them inside the same host or across InfiniBand or RoCE networks using RDMA. Mission-critical enterprise AI/ML workloads also require that data be correctly written to the storage device and not corrupted as it travels from server to storage media. InfiniBand enables the highest levels of data integrity by performing cyclic redundancy checks (CRCs) at each fabric hop and end-to-end across the fabric to ensure the data is correctly read and written between the server and the storage device. Data on storage devices can be further protected by application-level standards such as T10 DIF (Data Integrity Field), which adds 8 bytes of protection information to each storage block, including a CRC field and a reference tag to ensure the correct data is written to the right location. Mellanox's ConnectX HCA fully offloads the DIF calculations to maintain peak performance.
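For illustration, the 8 bytes of T10 DIF protection information appended to each logical block can be pictured as the following C layout; it is a simplified sketch, not a definition taken from the standard.

```c
#include <stdint.h>

/* Illustrative layout of the 8 bytes of T10 DIF protection information
 * appended to each logical block (commonly 512 bytes of data + 8 bytes of PI). */
struct t10_dif_tuple {
    uint16_t guard_tag;   /* CRC-16 computed over the block's data             */
    uint16_t app_tag;     /* application/owner tag                             */
    uint32_t ref_tag;     /* reference tag, typically the low 32 bits of the
                             target LBA, which catches misdirected writes      */
};
```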

InfiniBand storage area networks can be implemented while protecting previous investments in legacy Fibre Channel, iSCSI, and NAS storage devices by using IB-to-FC and IB-to-IP gateway products from vendors such as Cisco, QLogic, and Voltaire.

There are two block-level protocols available for storage applications on InfiniBand:

  • SRP (SCSI RDMA Protocol)
  • iSER (iSCSI Extensions for RDMA)

In addition, the NFSoRDMA protocol provides high-performance file-level storage access on InfiniBand.

SRP is an open ANSI T10.org standard protocol that encapsulates SCSI commands and controls data transfer over RDMA-capable fabrics such as InfiniBand.

iSER is an IETF standard that maps the block-level iSCSI protocol over RDMA-capable fabrics like InfiniBand. Like SRP, this permits data to be transferred directly between server and storage devices without intermediate data copies or involvement of the operating system.

NFSoRDMA is an IETF RFC that leverages the benefits of RDMA to offload protocol processing and avoid data copies, significantly lowering CPU utilization and increasing data throughput. NFSoRDMA is already part of the Linux kernel and ships with commercial Linux distributions.

Storage vendors are quickly introducing solutions based on InfiniBand to address the strong market demand led by financial and clustered database applications and parallel technical computing applications. Both block-level and file-level storage solutions are available in the market today.

NVMe over Fabric (NVMe-oF)

NVMe over Fabrics is the storage protocol that extends NVMe by enabling host access to remote NVMe storage pools over a network fabric. The first NVMe-oF 1.0 specification, released in June 2016, extended NVMe technology to transports beyond PCIe, such as Ethernet, Fibre Channel, and InfiniBand. The NVMe-oF 1.1 specification, released in 2019, added finer-grained I/O resource management, end-to-end flow control, support for NVMe/TCP, and improved fabric communication. With several protocol choices available for implementing NVMe-oF, each with its own advantages and use cases, organizations can build high-performance storage networks with performance close to directly attached solutions.
The simple NVMe-oF diagram below helps visualize the available transport options.

The three main NVMe-oF storage fabrics supported by NVMe are:

  1. NVMe over Fabrics using Fibre Channel (FC-NVMe)
    FC-NVMe is a protocol that extends NVMe over traditional Fibre Channel networks by encapsulating the NVMe command set in FC frames. It relies on standard FC processes such as zoning and integrates seamlessly into the current FC protocol. It provides low latency, high bandwidth, and reliable connectivity, making it a consideration for enterprise storage environments where Fibre Channel is already deployed.
  2. NVMe over Fabrics using RDMA (NVMe-oF RDMA)
    Remote Direct Memory Access (RDMA) is a technology that allows data to be transferred directly between the memory of two systems without involving the CPU or OS, reducing latency and CPU overhead. NVMe-oF RDMA uses RDMA protocols, such as InfiniBand or RDMA over Converged Ethernet (RoCE), to enable high-performance, low-latency access to NVMe storage devices over the network. It is suitable for applications that require ultra-low latency and high throughput, such as high-performance computing (HPC), data analytics, and real-time databases. The three transport protocols below all use NVMe RDMA:
    a. InfiniBand is a well-established high-speed interconnect technology. With native RDMA support, InfiniBand allows high-speed, low-latency, and efficient communication between servers and remote NVMe storage devices. InfiniBand is widely adopted in dedicated networks such as high-performance computing (HPC) environments, while its adoption in the enterprise space is limited.
    b. RDMA over Converged Ethernet (RoCEv2) enables efficient, routable RDMA communication over UDP on Ethernet networks. It allows for low-latency, high-throughput, and low-CPU-utilization data transfers, making it ideal for HPC, data center, and storage networking applications. RoCEv2 has superseded RoCEv1, which only supported non-routable Layer 2 networking with limited data center bridging feature support. While RoCEv1 is still available, it is essentially obsolete within the data center environment.
    c. Internet Wide Area RDMA Protocol (iWARP). Running RDMA over TCP/IP, iWARP was positioned as highly scalable and functional across standard Ethernet infrastructure without the complexity of Data Center Bridging (DCB) and Priority Flow Control (PFC), making it easy to deploy and maintain. One exception regarding infrastructure is the requirement for iWARP-capable NICs. Early adoption by vendors was limited.
  3. NVMe over Fabrics using TCP (NVMe/TCP)
    NVMe/TCP uses standard TCP/IP networking protocols to transport NVMe commands and data over Ethernet networks. It leverages existing Ethernet infrastructure and is easy to deploy, making it a cost-effective option for data centers with Ethernet networks. NVMe/TCP is typically positioned for less latency-sensitive workloads on stable Layer 3 Ethernet networks.

NVMe over Fabrics using RDMA (NVMe-oF RDMA)


Ethernet is the most popular transport technology in the enterprise sector, relegating InfiniBand to a niche offering.
a. InfiniBand: Traditionally the protocol of choice in HPC environments, InfiniBand would require a forklift upgrade of an existing Ethernet infrastructure, likely making it too costly for most enterprise data centers. Given its high level of adoption in today's larger AI/ML systems for multi-node GPU-to-GPU communication, however, InfiniBand is a choice many customers will make, and NVMe over InfiniBand is a natural fit if we are already building large InfiniBand networks for AI.
b. RoCEv2 is the most popular method for deploying RDMA in Ethernet-based data centers. RoCE provides high performance and low latency at the adapter (in the 1-5 µs range) but requires a lossless Ethernet network to achieve low-latency operation. This means the Ethernet switches in the network must support data center bridging and priority flow control mechanisms to keep traffic lossless, so existing infrastructure will likely need reconfiguring to support RDMA traffic. RDMA-capable NICs (RNICs) and drivers are additional considerations for supporting AI/ML GPU-to-GPU traffic. Soft RoCE does allow RDMA functionality on systems without dedicated RDMA hardware by emulating the RoCE protocol in software, which can be helpful when dedicated RDMA hardware is impractical or cost-prohibitive (potentially for a development or non-production AI/ML system). It is important to note that Soft RoCE may introduce additional latency compared to hardware-based RoCE implementations, since it relies on the host system's CPU to process RDMA operations.
The challenge with a lossless or converged Ethernet environment is that those configurations are complex. Without Layer 3 congestion control mechanisms like ECN, scalability can be very limited in a modern enterprise context. Because Ethernet lacks the scheduling that InfiniBand offers, converged Ethernet requires that buffers, QoS, and ECN be managed and adjusted for hot spots as data models are trained. RoCEv2 is also not a plug-and-play solution that will necessarily fit into an existing data center infrastructure without significant cost and skill-set investment.

iWARP

iWARP was initially positioned as an alternative to RoCE but has seen little industry adoption and limited vendor support. Its stated advantages are its ability to support RDMA while integrating into existing TCP network infrastructure, with reasonable latencies of 10-15 µs, and its ability to scale more easily than RoCE. That integration, however, involves additional, complex translation layers in the transport section of the network stack. This complexity introduced inefficiencies and compromised the fundamental targets of high throughput, low latency, and low CPU utilization. Ultimately, iWARP proved more complex and expensive to implement, requiring dedicated hardware (NICs), while being less performant than RoCE. Given its low level of adoption, iWARP will not be discussed further in this document.

Summary

We have learned how DMA and RDMA can bypass the CPU and host memory to allow GPU-to-GPU communications that speed up training. NVLink and NVSwitch from NVIDIA allow even faster GPU-to-GPU transfers and training runs. We have also learned about InfiniBand and converged Ethernet with RoCE, both of which use the RDMA protocol to enable GPU-to-GPU communications. Finally, storage connectivity for our AI/ML infrastructure was discussed and must not be overlooked. The following article will examine the various AI compute node connectivity options and best practices for each.

At WWT, we are building vendor-agnostic AI/ML reference architectures for our customers and engineering staff to stay at the forefront of this revolution and help our customers navigate this exciting and disruptive technology.