Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs, Part 7 - Cisco's Blueprint for AI/ML Networking

Cisco Data Center Networking Blueprint for AI/ML Applications

Widely available GPU-accelerated servers, popular programming languages such as Python and C/C++, and frameworks such as PyTorch, TensorFlow, and JAX simplify the development of GPU-accelerated ML applications. These applications serve diverse purposes, from medical research to self-driving vehicles, and rely on large datasets and GPU clusters for training deep neural networks. Inference frameworks then apply the knowledge captured in trained models to new data, running on clusters optimized for inference performance.

The learning cycles involved in AI workloads can take days or weeks, and high-latency or lossy communication between the servers in a cluster can significantly lengthen completion times or cause jobs to fail. AI workloads therefore demand low-latency, lossless networks, which in turn require the right hardware, software features, and configuration. Cisco Nexus 9000 switches, together with tools such as Cisco Nexus Dashboard Insights and Nexus Dashboard Fabric Controller, provide a strong platform for building a high-performance AI/ML network fabric.

The Network Design Parameters

A two-tier spine-leaf design built with Cisco Nexus switches is recommended for a non-blocking network with low latency and scalability. For a GPU cluster of 1024 GPUs, with 8 GPUs per server and 2 x 100Gbps ports per server, Cisco Nexus 93600CD-GX switches are proposed at the leaf layer and Nexus 9332D-GX2B switches at the spine layer.

Each Cisco Nexus 93600CD-GX switch provides 28 x 100G ports for server connections and 8 x 400G uplinks, so the server-facing bandwidth never exceeds the uplink bandwidth and each leaf remains non-blocking. Accommodating 256 server ports requires 10 leaf switches, which also provides redundancy by dual-homing each server to two separate leaf switches. The design leaves room to connect storage devices and to link the AI/ML server cluster to the enterprise network.
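As a quick back-of-the-envelope check, the port math behind this sizing (using only the numbers already given above) works out as follows:

```
1024 GPUs / 8 GPUs per server                = 128 servers
128 servers x 2 x 100G ports                 = 256 x 100G server-facing ports
256 ports / 28 x 100G ports per leaf         = ~9.2  -> 10 leaf switches (and dual-homing)
Per leaf: 28 x 100G = 2.8 Tbps down  <  8 x 400G = 3.2 Tbps up  -> non-blocking
10 leaves x 8 x 400G uplinks = 80 links / 4 spines = 20 x 400G ports used per spine
```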

The Cisco Nexus 9332D-GX2B spine switches each use 20 of their 32 x 400G ports for leaf uplinks, leaving 12 ports free for future expansion. Redundancy is achieved with four spine switches. The design supports scalability, so additional leaf switches can be added without compromising the non-blocking nature of the network.

The Cisco Nexus 93600CD-GX leaf switches and Cisco Nexus 9332D-GX2B spine switches each introduce about 1.5 microseconds of latency, giving a maximum end-to-end latency of roughly 4.5 microseconds for traffic that crosses a leaf, a spine, and another leaf. Congestion is managed with WRED and ECN, preserving the low latency of RoCEv2 transport at the endpoints.

The network can easily expand by adding more leaf switches or doubling spine capacity with Cisco Nexus 9364D-GX2A switches. Alternatively, a three-tier design can interconnect multiple non-blocking network fabrics.

Designed for AI/ML workloads with a massively scalable data center (MSDC) approach, the network uses BGP as the control plane, with Layer 3 routed down to the leaf switches. For multi-tenant environments, an MP-BGP EVPN VXLAN network can be employed instead, providing network separation between tenants. The principles described here apply to both the simple Layer 3 design and the VXLAN design.

RoCEv2 as Transport for AI Clusters

RDMA is a widely used technology in high-performance computing and storage networking, offering high-throughput, low-latency memory-to-memory communication between compute nodes. It minimizes CPU involvement by offloading the transfer function to the network adapter hardware, bypassing the operating system's network stack and reducing CPU and power requirements. RoCEv2 (RDMA over Converged Ethernet version 2) carries this RDMA traffic inside UDP/IP packets over a standard Ethernet fabric, which is why the fabric itself must behave losslessly.

AI Clusters Require Lossless Networks

For RoCEv2 transport, the network must deliver high throughput and low latency while preventing traffic drops during congestion. Cisco Nexus 9000 switches, designed for data center networks, offer low latency and up to 25.6 Tbps of bandwidth per ASIC, meeting the demands of AI/ML clusters that use RoCEv2 transport. In addition, these switches provide the visibility needed to maintain a lossless network environment through software and hardware telemetry for ECN and PFC.

Explicit Congestion Notification (ECN)

When congestion information must be propagated end to end, Explicit Congestion Notification (ECN) is used for congestion management. When congestion is detected, the two ECN bits in the IP header's Type of Service (TOS) field are set to 11 (Congestion Experienced), which triggers a congestion notification packet (CNP) from the receiver back to the sender. Upon receiving the congestion notification, the sender slows the offending flow. This built-in, efficient process happens entirely in the data path between two ECN-enabled endpoints.

When congestion occurs, Nexus 9000 switches mark packets with the ECN bits using a congestion avoidance algorithm: Weighted Random Early Detection (WRED), applied on a per-queue basis. WRED defines two thresholds in the queue, a minimum threshold for minor congestion and a maximum threshold. As buffer utilization grows between the two thresholds, WRED marks outgoing packets according to a configured drop (marking) probability, expressed as a percentage of all outgoing packets. If congestion pushes buffer usage beyond the maximum threshold, the switch sets the ECN bits to 11 on every outgoing packet, so the endpoints can detect and respond to the congestion.
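To make this concrete, a WRED-with-ECN queuing policy on a Nexus 9000 looks roughly like the sketch below. The policy name, queue class, and threshold values are placeholders for illustration only and must be tuned to the platform's buffers and traffic profile; the classification that steers RoCEv2 traffic into that queue is assumed and not shown.

```
! Illustrative sketch only - names and thresholds are placeholders
policy-map type queuing custom-8q-out-policy
  class type queuing c-out-8q-q3
    bandwidth remaining percent 60
    ! Mark ECN between the WRED minimum and maximum thresholds instead of dropping
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
  class type queuing c-out-8q-q-default
    bandwidth remaining percent 40
```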

As an example, suppose the GPUs in Hosts 1-4 are sending RoCE traffic to a GPU in Host 9. The egress port connected to Host 9 is oversubscribed, and the buffer on Leaf5 reaches the WRED minimum threshold. Leaf5 starts marking some of the packets headed to Host 9 with ECN bits set to 11 to indicate congestion in the data path. The NIC in Host 9 then starts sending congestion notification packets (CNPs) back to the sources (Hosts 1-4). Because only some of the data packets were marked with the congestion-experienced bits, each source reduces the throughput of that flow but continues to send packets. If congestion persists and buffer usage exceeds the WRED maximum threshold, the switch marks every packet with the congestion-experienced bits.

The sender then receives numerous CNP packets, prompting its algorithm to reduce the transmission rate toward that destination much more aggressively. This helps alleviate the congestion and allows the buffer to drain gradually; as it does, the traffic rate increases again until the next congestion signal is triggered. Remember that for this to work properly, every device in the path must support ECN, and the endpoints must be able to send and process CNP packets to reduce the traffic. The WRED thresholds must also be configured correctly, so that ECN is not marked too early or too late and traffic keeps flowing smoothly.

How PFC Works

Enabling PFC on a Cisco Nexus 9000 switch designates a specific class of service for lossless transport, treating its traffic differently from other classes. Every port configured with PFC is assigned a dedicated no-drop queue and a dedicated buffer. To provide lossless behavior, the queue has two thresholds. The xOFF threshold, positioned higher in the buffer, triggers the generation of a PFC pause frame, which is sent back toward the traffic source, when it is reached. Once the buffer starts draining and falls below the xON threshold, pause frames are no longer sent, indicating that the switch considers the congestion to have subsided.
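A minimal sketch of what this can look like on a Nexus 9000 is shown below: a no-drop class for CoS 3 (a common choice for RoCEv2, assumed here) and PFC enabled on a host-facing port. Exact class names and options vary by platform and release, so treat this as an illustration rather than a definitive configuration.

```
! Illustrative sketch only - no-drop class for CoS 3 plus PFC on one port
policy-map type network-qos custom-8q-nq-policy
  class type network-qos c-8q-nq3
    pause pfc-cos 3          ! treat CoS 3 as a lossless (no-drop) class
    mtu 9216

interface Ethernet1/1
  priority-flow-control mode on
```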

Continuing the earlier example, Leaf5 receives traffic from Hosts 1-4, which causes congestion on the port toward Host 9. At this point the switch uses its dedicated buffer to absorb the incoming traffic. Leaf5 buffers the traffic and, once the PFC xOFF threshold is reached, sends a pause frame to the upstream hop, in this case Spine-1. When Spine-1 receives the pause frame, it stops transmitting toward Leaf5, which prevents further congestion there. At the same time, Spine-1 starts buffering traffic in its own no-drop queue, and once that buffer reaches the xOFF threshold, it sends pause frames down to the leaf switch sending traffic toward the spine, in this case Leaf1. This hop-by-hop propagation of pause behavior continues: Leaf1 receives the pause frames, pauses its transmission toward the spine switch, starts buffering traffic, and finally sends pause frames to Hosts 1-4.

When the senders receive a pause frame, they slow their streams, which allows the buffers to drain on all the switches in the network. As each device falls back below its xON threshold, it stops propagating pause frames. Once the congestion is mitigated and pause frames are no longer sent, Host 9 starts receiving traffic at full rate again.

The pause mechanism, whether triggered by congestion inside the network or by a PFC-enabled endpoint, ensures that every device in the path receives a pause frame and halts transmission before packets are dropped. In rare instances a PFC storm may occur, where a misbehaving host continuously sends PFC frames and potentially disrupts the network. To address this, the PFC watchdog feature monitors whether packets in a no-drop queue drain within a specified interval. If that time is exceeded, all outgoing packets on interfaces matching the problematic PFC queue are dropped, preventing a network-wide PFC deadlock.
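As an illustration, the PFC watchdog is enabled with a command along these lines; the exact keywords and the tunable interval options vary by NX-OS release, so check the platform QoS guide.

```
! Illustrative sketch only - enable the PFC watchdog globally
priority-flow-control watch-dog-interval on
```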

Remember that for this to work properly, every device in the path must support PFC and be able to reduce its traffic. The PFC buffers and thresholds must also be configured correctly, so that pause frames are not triggered too early or too late and traffic keeps flowing smoothly.

Using ECN and PFC Together to Build Lossless Ethernet Networks

As the previous examples show, both ECN and PFC manage congestion well on their own, but they are more effective together. ECN reacts first to alleviate congestion, while PFC acts as a fail-safe that prevents traffic drops if ECN's response is too slow and buffer utilization keeps rising. This combined congestion management approach is known as Data Center Quantized Congestion Notification (DCQCN) and was developed for RoCE networks.

The synergy between PFC and ECN provides efficient end-to-end congestion control. During minor congestion with moderate buffer usage, WRED with ECN handles the congestion on its own. Severe congestion or microburst-induced high buffer usage then triggers PFC to step in. For this to work, the thresholds must be set appropriately: WRED's minimum and maximum thresholds are configured to address congestion first, while the PFC xOFF threshold sits above them and serves as the safety net after ECN has intervened. Both mechanisms operate on the same no-drop queue, guaranteeing lossless transport.
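Putting the two together, a DCQCN-style setup attaches both of the earlier sketches fabric-wide so that WRED/ECN marking and the no-drop (PFC) behavior act on the same queue, with the WRED thresholds sitting below the PFC xOFF threshold so ECN reacts first. Reusing the placeholder policy names from the previous examples:

```
! Illustrative sketch only - apply both policies system-wide to the same no-drop class
system qos
  service-policy type network-qos custom-8q-nq-policy       ! no-drop class for RoCEv2 (PFC)
  service-policy type queuing output custom-8q-out-policy   ! WRED/ECN thresholds on that queue
```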

Using Approximate Fair Drop (AFD)

An alternative method for congestion management is to use the more advanced QoS algorithms available in Cisco Nexus 9000 switches, such as Approximate Fair Drop (AFD). AFD distinguishes long-lived, high-bandwidth flows (elephant flows) from short-lived, low-bandwidth flows (mice flows). By identifying the elephant flows, AFD selectively sets their ECN bits to 11, in proportion to the bandwidth they consume. This targeted marking throttles the flows contributing most to the congestion while keeping latency minimal.

Compared to WRED, AFD offers more granular flow differentiation, marking only the high-bandwidth elephant flows while sparing mice flows so they keep their speed. In AI clusters this is valuable: short-lived communications can complete without being slowed down by long data transfers or congestion. By discerning and regulating elephant flows, AFD ensures faster completion of many transactions while avoiding packet drops and maintaining system efficiency.
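Where the switch supports AFD, it takes the place of the random-detect statement in the queuing policy, along the lines of the sketch below. The queue-desired value is a placeholder, and the exact syntax should be verified against the platform QoS guide for your release.

```
! Illustrative sketch only - AFD with ECN marking instead of WRED
policy-map type queuing custom-8q-out-policy
  class type queuing c-out-8q-q3
    afd queue-desired 150 kbytes ecn
    bandwidth remaining percent 60
```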

Using NDFC to Build Your AI/ML Network

Regardless of the chosen network architecture, whether Layer 3 to the leaf switches or a VXLAN overlay, the Cisco Nexus Dashboard Fabric Controller (NDFC), also known as the Fabric Controller service, provides best-practice configurations and automation. With NDFC, the entire network, including the QoS settings for PFC and ECN, can be configured rapidly. The Fabric Controller service also automates tasks such as adding new leaf or spine switches and modifying access port configurations.

To illustrate, consider building a network fabric from the ground up that uses eBGP for a Layer 3 design. This is done with the BGP fabric template.

To apply Quality of Service (QoS) across the entire fabric, the Advanced tab lets you select the queuing template of your choice, ensuring a uniform configuration on all switches and consistent treatment of RoCEv2 traffic throughout the fabric. Some settings must be deployed using the freeform method, in which native CLI commands are sent directly to the switch. In a freeform configuration, the hierarchy and indentation must match the structure of the running configuration on the switch.
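For example, a freeform policy that enables PFC on a host-facing port might contain something like the following (the interface and commands are placeholders); note how the indentation mirrors what show running-config produces on the switch.

```
interface Ethernet1/1
  priority-flow-control mode on
  mtu 9216
```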

Using Nexus Dashboard Insights to Monitor and Tune

Cisco Nexus Dashboard Insights offers detailed ECN mark counters at the device, interface, and flow levels. It also provides information on PFC packets issued or received on a per-class-of-service basis. This real-time congestion data empowers network administrators to fine-tune the network for optimal congestion response.

Using this granular visibility, administrators can observe drops and adjust WRED or AFD thresholds to eliminate them under normal traffic conditions, a crucial first step toward handling regular congestion in AI/ML networks. For microburst conditions, where many servers send to a single destination at once, the counter data helps refine WRED or AFD thresholds together with PFC to achieve completely lossless behavior. Once drops are eliminated, the ECN-marking and PFC RX/TX counters help tune the system further for peak performance.
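Alongside Nexus Dashboard Insights, the same counters can be spot-checked directly on a switch with standard NX-OS show commands such as the ones below; the exact output format varies by platform and release.

```
! Per-queue statistics, including ECN-marked packets and WRED/AFD drops
show queuing interface ethernet 1/1
! PFC pause frames received and transmitted per class of service
show interface ethernet 1/1 priority-flow-control
```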

The operational intelligence engine within Cisco Nexus Dashboard Insights employs advanced alerting, baselining, correlating, and forecasting algorithms, utilizing telemetry data from the networking and compute components. This automation streamlines troubleshooting, facilitating rapid root-cause analysis and early remediation. The unified network repository and compliance rules ensure alignment with operator intent, maintaining the network state.

Summary

The Cisco Data Center Networking Blueprint for AI/ML Applications white paper offers a comprehensive guide to designing and implementing a non-blocking, lossless fabric that supports RoCEv2 and enables RDMA and GPUDirect RDMA operations.

Using Cisco Nexus 93600CD-GX leaf switches with Cisco Nexus 9332D-GX2B spine switches provides an end-to-end latency of roughly 4.5 microseconds in a spine-leaf topology, and provisioning the proper number and speed of uplinks keeps the fabric non-blocking. Combining ECN and PFC to implement Data Center Quantized Congestion Notification (DCQCN) is what makes the fabric lossless, with either WRED or the more granular AFD (on switches that support it) providing the ECN-marking building block.

Nexus Dashboard Fabric Controller (NDFC) should be used for ease of deployment of the fabric and for its advanced queuing templates for AI/ML. Finally, once the fabric is up and running, the WRED or AFD queues need to be monitored and tuned to eliminate hot spots of congestion. Nexus Dashboard Insights is the tool of choice here: it lets us examine Flow Table Events (FTE) from the Nexus switches, monitor the links for increasing ECN and PFC counters, and tune the queues appropriately. This is an iterative process, tuning both the fabric and how the GPUs are used inter-node during training runs.
