Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 5-Parallel GPU Node Design Challenges for the Network Engineer

In the last 4 articles, we have learned why GPUs are critical for accelerating AI/ML workloads. Memory latency, CPU clock speed, and data-transfer rates become critical in HPC and GPU-intensive AI/ML training. Utilizing RDMA, whether over Infiniband or RoCE, is a must-have for any AI/ML GPU-to-GPU connectivity. RDMA over Infiniband or RoCE is also a must-have for storage connectivity (NVMe-oF storage fabrics are the optimal choice) so the compute nodes can load the training data. We will now do a deep dive into the two network design options for supporting multi-node GPU-to-GPU communications and high-speed storage for training data.

Traditional data center networks today

Data center networking integrates resources like load balancing, switching, analytics, and routing to store and process data and applications. It connects computing and storage units within a data center, facilitating efficient internet or network data transmission.

A data center network (DCN) is vital for connecting data across multiple sites, clouds, and private or hybrid environments. Modern DCNs ensure robust security, centralized management, and operational consistency by employing full-stack networking and security virtualization. These networks are crucial in establishing stable, scalable, and secure infrastructure, meeting evolving communication needs, and supporting cloud computing and virtualization.

Traditional Three-Tier Topology Architecture

The traditional data center architecture uses a three-layer topology designed for general networks, usually segmented into pods that constrain the location of devices such as virtual servers. The architecture consists of core layer 3 switches, distribution switches (layer 2 or layer 3), and access switches. The access layer runs from the data center to the individual nodes (computers) where users connect to the network. The three-tier network design has been around for 3+ decades and is still prevalent in up to half of today’s data centers. The design is resilient and scalable, and by using MLAG or vPC, it offers high-speed, redundant connectivity to servers and other switches. The challenge is that it is designed for North-South traffic patterns and doesn’t support HPC or AI/ML workloads.

Leaf-Spine Topology Architecture

The leaf-spine topology architecture is a streamlined network design that moves data in an east-west flow. The new topology design has only two layers – the leaf layer and the spine layer. The leaf layer consists of access switches that connect to devices like servers, firewalls, and edge routers. And the spine layer is the backbone of the network, where every leaf switch is interconnected with every spine switch. Using ECMP, the leaf-spine topology allows all connections to run at the same speed, and the mesh of fiber links creates a high-capacity network resource shared with all attached devices.

The new challenge to support AI/ML Workloads

The significant difference between HPC workloads and AI/ML workloads is that AI/ML executes many jobs on all GPU-accelerated compute nodes. In an AI/ML workload, any kind of slowdown in the network will stall ALL of the GPUs based on the slowest path.

With increasing data and model complexities, the time required to train neural networks has become prohibitively large. A typical neural network has hundreds of hidden layers and thousands of neural nodes, each performing a mathematical algorithm on the data from the previous neural node.

A typical neural network comprises inputs, weighting, and a function to calculate these inputs and derive an output. The output may then go on to another node with another algorithmic function.

These neural networks commonly use programming languages such as Python and C/C++ and frameworks such as PyTorch, TensorFlow, and JAX, which are built to take advantage of GPUs natively, and the AI/ML engineer will determine how the nodes are placed across multi-GPU systems using parallelism.

Users are turning to data parallel neural networks (DPNN) and large-scale distributed resources on computer clusters to address the exponential rise in training time. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes, using blocking communication operations after each forward-backward pass.

This synchronization is the central algorithmic bottleneck.
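To make the bottleneck concrete, here is a minimal PyTorch sketch of one data-parallel training step. It assumes torch.distributed has already been initialized (for example via torchrun); model, batch, loss_fn, and optimizer are placeholders. The all-reduce at the end is the global synchronization point described above.

```python
# One data-parallel training step: each rank runs its own forward/backward
# pass, then gradients are summed and averaged across ALL ranks with an
# all-reduce before the optimizer step. Every GPU blocks here until the
# slowest participant has contributed its gradients.
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)    # forward pass (local to this GPU)
    loss.backward()                           # backward pass (local to this GPU)

    world_size = dist.get_world_size()
    for p in model.parameters():              # global gradient synchronization
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size              # average across all ranks

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Production frameworks such as PyTorch DistributedDataParallel bucket these all-reduces and overlap them with the backward pass, but the synchronization barrier itself does not go away.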

A typical cycling flow pattern of data through the neural network shows the processing performed on each batch. If multiple GPUs are used, there is first a local sync across the GPUs on the physical compute node (typically 8 GPUs per node) before synchronizing across the other compute nodes.

What is important to note is the global send to all GPUs and the global receive from all GPUs. Picture a 128 or 256 GPU system connected in a full mesh. As each GPU completes its local sync, it performs a global send of ALL the data on that GPU for that batch job to ALL the other GPUs in the mesh to synchronize. As with all GPU mesh jobs, every GPU runs its portion of the job by executing its instructions; once each GPU finishes its job, it notifies all the other GPUs in an all-to-all collective type workload. We now have a full mesh of each GPU sending notifications to all the other GPUs in a collective.

The next portion is where the synchronization problem lies. All GPUs must synchronize, and each GPU must wait for all other GPUs to complete, which it learns from the notification messages. The AI/ML workload computation will stall, waiting for the slowest path. The JCT, or Job Completion Time, becomes a function of the worst-case latency. This is analogous to a TCP sender waiting for acknowledgments before it can transmit again: if buffers fill and packets are dropped, the sender must retransmit and cut its congestion window in half, further increasing latency and JCTs.
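A toy calculation makes the point; the per-GPU timings below are invented purely for illustration.

```python
# Hypothetical per-GPU times (ms) for one compute + sync iteration.
# A single straggler sets the pace for everyone at the synchronization barrier.
per_gpu_ms = [10.1, 10.3, 9.9, 10.2, 10.0, 10.1, 10.2, 27.5]  # one congested path

iteration_time = max(per_gpu_ms)            # the barrier waits for the slowest GPU
idle = [iteration_time - t for t in per_gpu_ms]

print(f"iteration time: {iteration_time} ms")
print(f"average GPU idle time: {sum(idle)/len(idle):.1f} ms "
      f"({100*sum(idle)/(len(idle)*iteration_time):.0f}% of the iteration)")
```

Even one congested path more than doubles the iteration time and leaves every other GPU idle for over half of it.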

To put it in perspective, a typical GPU cluster that we’re seeing our customers looking to deploy at max performance has roughly as much network traffic traversing it every second as there is in all of the internet traffic across America. To understand the economics of an AI data center, know that GPU servers can cost as much as $400,000 each. So, maximizing GPU utilization and minimizing GPU idle time are two of the most critical drivers of AI data center design.

Looking at the traditional DC traffic pattern versus an AI/ML pattern, we see consistent flows during the AI/ML workload job execution and spikes in traffic leading to congestion. This, in turn, causes our jobs to stall in the synchronization phase, waiting on the notifications. Depending on the load balancing algorithms in our DC architecture, we will have hot spots where we have congestion and underutilized links, leading to increased job completion times (JCTs). Besides the memory latency issue, this is the most important thing to understand when designing AI/ML network architectures. Hot spots and congestion cause jobs to stall and JCT to dramatically increase with long-lived flows. In scenarios where multiple high-bandwidth flows are hashed to share the same uplink, they can exceed the available bandwidth of that specific link, leading to congestion issues.
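To see how easily that happens, here is a small sketch of hash-based flow placement. The hash function, addresses, and port counts are illustrative assumptions (real switch ASICs use their own 5-tuple hashes); the point is that placement is decided by the hash alone, regardless of how large each flow is.

```python
# Illustrative flow-to-uplink hashing. Placement depends only on the flow's
# 5-tuple hash, not on how much bandwidth the flow carries, so two elephant
# flows can easily land on the same uplink while other links sit idle.
# (zlib.crc32 stands in for a switch ASIC's vendor-specific hash.)
import zlib

UPLINKS = 4

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % UPLINKS

# Four long-lived RoCEv2 flows (UDP destination port 4791, varying source ports).
flows = [("10.0.0.1", "10.0.1.1", 49152, 4791),
         ("10.0.0.2", "10.0.1.2", 49153, 4791),
         ("10.0.0.3", "10.0.1.3", 49154, 4791),
         ("10.0.0.4", "10.0.1.4", 49155, 4791)]

for f in flows:
    print(f"{f[0]} -> {f[1]}: uplink {pick_uplink(*f)}")

# With 4 flows and 4 uplinks, the chance that all flows land on distinct
# links is only 4!/4**4, i.e. about 9% -- collisions are the common case.
```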

Today’s typical network connecting compute and GPU nodes has two main designs: 3-tier and spine-leaf. They both have the same buffer issue in common: the network switches have ingress and egress buffers to hold the data packet before it is placed on the wire between switches. If the buffers are full, the switch cannot transmit or receive data, and packets are dropped. TCP’s congestion control then halves the congestion window and retransmits. This leads to severe inefficiency, and the problem snowballs very quickly. JCTs increase because all GPUs cannot send or receive on that link.

If we examine a typical 3-tier vPC or MLAG design, we can quickly have an issue with large, long-lived flows. The layer 2 hashing algorithm will place the flow on a single uplink from the sending TOR to the distribution layer switch. The distribution layer switch will put the flow on a single downlink to the receiving TOR. It is easy to see that buffer overruns will occur on these long-lived, high-throughput flows, causing drops, TCP retransmissions, and shrinking send windows. These all add latency, severely impacting JCTs. In this 3-tier design, output queuing techniques can be used to minimize the issues, and moving nodes to other TOR switches may help. Ultimately, this design will not support large-scale AI/ML training using multi-node GPUs.

The preferred network design for AI/ML deep learning is a spine-leaf architecture using Layer 3 ECMP to move the flows across multiple uplinks to multiple spines. While the spine-leaf architecture will alleviate most of the problems, we can still run into ingress buffer overruns. Remember, 1 GPU slowdown affects all of the other GPUs whenever it comes time for a global synchronization. This is the single biggest issue and will lead to increased JCTs.

Distributing the workloads across the GPUs and then synching them to train the AI model requires a new type of network that can accelerate “job completion time” (JCT) and reduce the time that the system is waiting for that last GPU to finish its calculations (“tail latency”).

Therefore, data center networks optimized for AI/ML must have exceptional capabilities around congestion management, load balancing, latency, and, above all else, minimizing JCT. ML practitioners must accommodate more GPUs into their clusters as model sizes and datasets grow. The network fabric should support seamless scalability without compromising performance or introducing communication bottlenecks.

AI/ML multi-node GPU networking represents a once-in-a-generation inflection point that will present us with complex technical challenges for years. We are still in the infancy of these designs, and as network architects, we must keep our eyes and ears open for what is coming.

1. High Performance 
Maximizing GPU utilization, the overarching economic factor in AI model training, requires a network that optimizes for JCT and minimizes tail latency. Faster model training means a shorter time to results and lower data center costs with better-optimized compute resources.

2. Open Infrastructure
Performance matters, which is why everyone invests in it. Infiniband is still the leader in providing optimized multi-node AI/ML GPU networks. However, Infiniband is a back-end network, and you must also have a front-end Ethernet network for server connectivity. At some point, economics takes over: first, Infiniband is very expensive per port, and second, you have two networks to support, your Ethernet front end and your Infiniband back end. Economics is driven by competition, and competition is driven by openness. We’ve seen this play out in our industry before, and if I am a betting man, I am betting that Ethernet wins. Again. An open platform maximizes innovation. It’s not that proprietary technologies don’t have their roles to play, but seldom does a single purveyor of technology out-innovate the rest of the market. And it simply never happens in environments with so much at stake.

3. Experience-first Operations
Data center networks are becoming increasingly complex, and new protocols must be added to the fabric to meet AI workload performance demands. While complexity will continue increasing, intent-based automation shields the network operator from that complexity. Visibility is critical in an AI/ML training infrastructure, and the ability to see hot spots and congestion is critically important to monitor and manage.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 7-Cisco’s Blueprint for AI/ML Networking

Cisco Data Center Networking Blueprint for AI/ML Applications

Widely available GPU-accelerated servers, combined with better hardware and popular programming languages like Python and C/C++, along with frameworks such as PyTorch, TensorFlow, and JAX, simplify the development of GPU-accelerated ML applications. These applications serve diverse purposes, from medical research to self-driving vehicles, relying on large datasets and GPU clusters for training deep neural networks. Inference frameworks apply knowledge from trained models to new data, with optimized clusters for performance.

The learning cycles involved in AI workloads can take days or weeks, and high-latency communication between server clusters can significantly impact completion times or result in failure. AI workloads demand low-latency, lossless networks, requiring appropriate hardware, software features, and configurations. Cisco Nexus 9000 switches, with their capabilities and tools like Cisco Nexus Dashboard Insights and Nexus Dashboard Fabric Controller, offer an ideal platform for building a high-performance AI/ML network fabric.

The Network Design parameters

A two-tier, spine-switch-leaf-switch design with Cisco Nexus switches is recommended for building a non-blocking network with low latency and scalability. For a GPU cluster with 1024 GPUs, utilizing 8 GPUs per server, and 2 ports of 100Gbps per server, Cisco Nexus 93600CD-GX switches at the leaf layer and Nexus 9332D-GX2B switches at the spine layer are proposed.

Each Cisco Nexus 93600CD-GX switch has 28 ports of 100G for server connections and 8 uplinks of 400G, ensuring a non-blocking switch. To accommodate 256 server ports, 10 leaf switches are required, providing redundancy by dual-homing servers to two separate leaf switches. The network design also allows for connecting storage devices and linking the AI/ML server cluster to the enterprise network.

The spine switches, Cisco Nexus 9332D-GX2B, each use 20 of their 32 x 400G ports, leaving 12 ports free for future expansion. Redundancy is achieved with four spine switches. The system is designed to support scalability, ensuring additional leaf switches can be added without compromising the non-blocking nature of the network.
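The port math above is easy to sanity-check. Here is a small sketch using only the figures already quoted (28 x 100G server ports and 8 x 400G uplinks per Nexus 93600CD-GX leaf, 32 x 400G ports per Nexus 9332D-GX2B spine, four spines for redundancy):

```python
# Sanity-checking the leaf/spine port math quoted above.
import math

GPUS, GPUS_PER_SERVER, PORTS_PER_SERVER = 1024, 8, 2
LEAF_DOWN_PORTS, LEAF_DOWN_GBPS = 28, 100   # Nexus 93600CD-GX server-facing ports
LEAF_UP_PORTS, LEAF_UP_GBPS = 8, 400        # Nexus 93600CD-GX uplinks
SPINE_PORTS = 32                            # Nexus 9332D-GX2B 400G ports
SPINES = 4                                  # chosen for redundancy

servers = GPUS // GPUS_PER_SERVER                     # 128 servers
server_ports = servers * PORTS_PER_SERVER             # 256 x 100G ports
leaves = math.ceil(server_ports / LEAF_DOWN_PORTS)    # 10 leaf switches

down_gbps = LEAF_DOWN_PORTS * LEAF_DOWN_GBPS          # 2800 Gbps toward servers
up_gbps = LEAF_UP_PORTS * LEAF_UP_GBPS                # 3200 Gbps toward spines
print(f"leaves needed: {leaves}, non-blocking per leaf: {up_gbps >= down_gbps}")

ports_per_spine = leaves * LEAF_UP_PORTS // SPINES    # 20 uplinks land on each spine
print(f"ports used per spine: {ports_per_spine}, free: {SPINE_PORTS - ports_per_spine}")
```

Each leaf has 3,200 Gbps of uplink capacity against 2,800 Gbps of server-facing capacity, and each of the four spines terminates 20 of its 32 ports, matching the figures in the text.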

The Cisco Nexus 93600CD-GX leaf switches and Cisco Nexus 9332D-GX2B spine switches have a low latency of 1.5 microseconds, providing a maximum end-to-end latency of ~4.5 microseconds for traffic crossing both leaf and spine switches. Congestion is managed by WRED ECN, preserving RoCEv2 transport latency on endpoints.

The network can easily expand by adding more leaf switches or doubling spine capacity with Cisco Nexus 9364D-GX2A switches. Alternatively, a three-tier design can interconnect multiple non-blocking network fabrics.

Designed for AI/ML workloads with a massively scalable data center (MSDC) approach, the network utilizes BGP as the control plane to Layer 3 leaf switches. For multi-tenant environments, an MP-BGP EVPN VXLAN network can be employed, allowing network separation between tenants. The principles described are suitable for both simple Layer 3 and VXLAN designs.

RoCEv2 as Transport for AI Clusters

RDMA is a widely used technology in high-performance computing and storage networking, offering high throughput and low-latency memory-to-memory communication between compute nodes. It minimizes CPU involvement by offloading the transfer function to network adapter hardware, bypassing the operating system’s network stack and resulting in reduced power requirements.

AI Clusters Require Lossless Networks

For RoCEv2 transport, a network needs to ensure high throughput and low latency while preventing traffic drops during congestion. Cisco Nexus 9000 switches, designed for data center networks, offer low latency and up to 25.6Tbps of bandwidth per ASIC, meeting the demands of AI/ML clusters using RoCEv2 transport. Additionally, these switches provide support and visibility in maintaining a lossless network environment through software and hardware telemetry for ECN and PFC.

Explicit Congestion Notification (ECN)

In scenarios requiring end-to-end congestion information propagation, Explicit Congestion Notification (ECN) is employed for congestion management. The ECN bits are set to 0x11 in the IP header’s Type of Service (TOS) field when congestion is detected, triggering a congestion notification packet (CNP) from the receiver to the sender. Upon receiving the congestion notification, the sender adjusts the flow accordingly. This built-in, efficient process occurs in the data path between two ECN-enabled endpoints.

Nexus 9000 switches, in the case of network congestion, mark packets with ECN bits using a congestion avoidance algorithm. The switch utilizes Weighted Random Early Detection (WRED) on a per-queue basis. WRED sets two thresholds in the queue: a minimum threshold for minor congestion and a maximum threshold. As buffer utilization grows, WRED marks outgoing packets based on drop probability, represented as a percentage of all outgoing packets. If congestion persists beyond the maximum threshold, the switch marks ECN to 0x11 on every outgoing packet, enabling endpoints to detect and respond to congestion.
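The WRED/ECN behavior described above can be modeled in a few lines. The thresholds and marking probability below are illustrative placeholders, not recommended values; real deployments tune them per queue and per platform.

```python
# Toy model of WRED-based ECN marking on an egress queue.
import random

WRED_MIN_KB = 150        # illustrative minimum threshold
WRED_MAX_KB = 3000       # illustrative maximum threshold
MARK_PROB_AT_MAX = 0.07  # marking probability just below the maximum threshold

def should_mark_ecn(queue_depth_kb: float) -> bool:
    """Return True if this outgoing packet gets the ECN 0x11 (congestion) marking."""
    if queue_depth_kb < WRED_MIN_KB:
        return False                      # no congestion: never mark
    if queue_depth_kb >= WRED_MAX_KB:
        return True                       # sustained congestion: mark every packet
    # Between the thresholds the marking probability ramps up linearly.
    fill = (queue_depth_kb - WRED_MIN_KB) / (WRED_MAX_KB - WRED_MIN_KB)
    return random.random() < fill * MARK_PROB_AT_MAX

for depth in (100, 500, 1500, 2900, 3500):
    marks = sum(should_mark_ecn(depth) for _ in range(10_000))
    print(f"queue depth {depth:>4} KB -> ~{marks/100:.1f}% of packets ECN-marked")
```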

The GPUs on Hosts 1-4 are sending RoCE traffic to Host 9’s GPU. The egress traffic on the port connected to Host 9 is oversubscribed, and the buffer on Leaf5 reaches the WRED minimum. Leaf5 starts marking some of the packets headed to Host 9 with ECN 0x11 to indicate congestion in the data path. The NIC of Host 9 then starts sending Congestion Notification Packets (CNPs) back to the sources (Hosts 1-4). Because only some of the data packets were marked with congestion-experienced bits, the sources reduce traffic throughput for those flows and continue to send packets. If congestion persists and buffer usage exceeds the WRED maximum threshold, the switch marks every packet with congestion-experienced bits.

This results in the sender receiving numerous CNP packets, prompting its algorithm to significantly reduce the data rate toward the destination. This proactive measure helps alleviate congestion, allowing the buffer to gradually drain. As this occurs, the traffic rate should increase until the next congestion signal is triggered. Remember that for this to work properly, everything in the path MUST support ECN, and the endpoints must send and process CNP packets to reduce the traffic. Also, the WRED thresholds must be configured properly so we don’t mark ECN too early or too late, keeping traffic flowing properly.

How PFC Works

Enabling PFC on the Cisco Nexus 9000 switch designates a specific class of service for lossless transport, treating its traffic differently from other classes. Any port configured with PFC is assigned a dedicated no-drop queue and a dedicated buffer. For lossless capabilities, the queue features two thresholds. The xOFF threshold, positioned higher in the buffer, triggers the generation of a PFC pause frame when reached, which is sent back toward the traffic source. Once the buffer starts draining and falls below the xON threshold, pause frames are no longer sent, indicating the system’s assessment that congestion has subsided.
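The xOFF/xON behavior is essentially a hysteresis loop. Here is a toy model with made-up threshold values to show the sequencing:

```python
# Toy model of a PFC no-drop queue with xOFF/xON hysteresis.
class PfcQueue:
    def __init__(self, xoff_kb=700, xon_kb=500):
        self.xoff_kb, self.xon_kb = xoff_kb, xon_kb   # illustrative thresholds
        self.paused = False                           # are we pausing the sender?

    def update(self, depth_kb: float) -> str:
        if not self.paused and depth_kb >= self.xoff_kb:
            self.paused = True
            return "send PFC pause frame upstream (xOFF reached)"
        if self.paused and depth_kb <= self.xon_kb:
            self.paused = False
            return "stop sending pause frames (drained below xON)"
        return "no change"

q = PfcQueue()
for depth in (200, 650, 720, 680, 510, 490, 300):
    print(f"buffer depth={depth:>3} KB: {q.update(depth)}")
```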

Leaf5 receives traffic coming from Hosts 1-4, which leads to congestion on the port toward Host 9. At this point, the switch uses its dedicated buffer to absorb the incoming traffic. The traffic is buffered by Leaf5, and after the PFC xOFF threshold is reached, it sends a pause frame to the upstream hop, in this diagram Spine-1. After Spine-1 receives the pause frame, it stops transmitting traffic, which prevents further congestion on Leaf5. At the same time, Spine-1 starts buffering traffic in its no-drop queue, and after that buffer reaches the xOFF threshold, it sends pause frames down to Leaf1, which is sending traffic toward the spine switch. This hop-by-hop propagation of pause behavior continues: the leaf switches receive pause frames, pause transmission toward the spine, and start buffering traffic. This leads to Leaf1 sending pause frames to Hosts 1-4.

After the senders receive a pause frame, the stream slows down. This allows the buffers to drain on all the switches in the network. After each device falls below its xON threshold, that device stops propagating pause frames. Once the congestion is mitigated and pause frames are no longer sent, the destination host (Host 9 in this example) starts receiving traffic at the full rate again.

The pause mechanism, triggered by network congestion or an endpoint with PFC, ensures that all devices in the path receive a pause frame before halting transmission to prevent packet drops. In rare instances, a PFC storm may occur if a misbehaving host continuously sends PFC frames, potentially causing network disruption. To address this, the PFC watchdog feature sets an interval to monitor if packets in a no-drop queue drain within a specified time. If the time is exceeded, all outgoing packets on interfaces matching the problematic PFC queue are dropped, preventing a network PFC deadlock.

Remember that for this to work properly, everything in the path MUST support PFC and be able to react by reducing traffic. Also, the PFC buffers and thresholds must be configured properly so we don’t trigger PFC too early or too late, keeping traffic flowing properly.

Using ECN and PFC Together to Build Lossless Ethernet Networks

As demonstrated in previous examples, both ECN and PFC excel at managing congestion independently, but their combined effectiveness surpasses individual performance. ECN reacts initially to alleviate congestion, while PFC acts as a fail-safe to prevent traffic drops if ECN’s response is delayed and buffer utilization increases. This collaborative congestion management approach is known as Data Center Quantized Congestion Notification (DCQCN), developed for RoCE networks.

The synergy between PFC and ECN ensures efficient end-to-end congestion control. During minor congestion with moderate buffer usage, WRED with ECN seamlessly manages congestion. However, severe congestion or microburst-induced high buffer usage triggers PFC intervention for effective congestion management. To optimize the functionality of WRED and ECN, appropriate thresholds must be set. In the example, WRED’s minimum and maximum thresholds are configured to address congestion initially, while the PFC threshold serves as a safety net post-ECN intervention. Both ECN and PFC operate on a no-drop queue, guaranteeing lossless transport.

Using Approximate Fair Drop (AFD)

An alternative method for congestion management involves utilizing advanced QoS algorithms, such as approximate fair drop (AFD) present in Cisco Nexus 9000 switches. AFD intelligently distinguishes high-bandwidth (elephant flows) from short-lived, low-bandwidth flows (mice flows). By identifying the elephant flows, AFD selectively marks ECN bits with 0x11 values, proportionally to the bandwidth utilized. This targeted marking efficiently regulates the flows contributing most to congestion, optimizing performance for minimal latency.

Compared to WRED, AFD offers the advantage of granular flow differentiation, marking only high-bandwidth elephant flows while sparing mice flows to maintain their speed. In AI clusters, allowing short-lived communications to proceed without long data transfers or congestion slowdowns is beneficial. By discerning and regulating elephant flows, AFD ensures faster completion of many transactions while avoiding packet drops and maintaining system efficiency.

Using NDFC to Build Your AI/ML Network

Regardless of the chosen network architecture, whether Layer 3 to the leaf switch or employing a VXLAN overlay, the Cisco Nexus Dashboard Fabric Controller (NDFC), also known as the Fabric Controller service, offers optimal configurations and automation features. With NDFC, the entire network, including QoS settings for PFC and ECN, can be configured rapidly. Additionally, the Fabric Controller service facilitates the automation of tasks such as adding new leaf or spine switches and modifying access port configurations.

To illustrate, consider the creation of a network fabric from the ground up using eBGP for constructing a Layer 3 network. This is achieved by utilizing the BGP fabric template.

To implement Quality of Service (QoS) across the entire fabric, the Advanced tab provides the option to select the template of your choice. This ensures uniform configuration across all switches, enabling consistent treatment of RoCEv2 traffic throughout the fabric. Some configurations need to be deployed using the freeform method, where native CLI commands are directly sent to the switch for setup. In this freeform configuration, the hierarchy and indentation must align with the structure of the running configuration on the switch.

Using Nexus Dashboard Insights to monitor and tune

Cisco Nexus Dashboard Insights offers detailed ECN mark counters at the device, interface, and flow levels. It also provides information on PFC packets issued or received on a per-class-of-service basis. This real-time congestion data empowers network administrators to fine-tune the network for optimal congestion response.

Utilizing the granular visibility of Cisco Nexus Dashboard Insights, administrators can observe drops and adjust and tune WRED or AFD thresholds to eliminate drops under normal traffic conditions—a crucial first step in ensuring effective handling of regular congestion in AI/ML networks. In scenarios of micro-burst conditions, where numerous servers communicate with a single destination, administrators can leverage counter data to refine WRED or AFD thresholds alongside PFC, achieving a completely lossless behavior. Once drops are mitigated, ECN markings and PFC RX/TX counters’ reports assist in further optimizing the system for peak performance.

The operational intelligence engine within Cisco Nexus Dashboard Insights employs advanced alerting, baselining, correlating, and forecasting algorithms, utilizing telemetry data from the networking and compute components. This automation streamlines troubleshooting, facilitating rapid root-cause analysis and early remediation. The unified network repository and compliance rules ensure alignment with operator intent, maintaining the network state.

Summary

The Cisco Data Center Networking Blueprint for AI/ML Applications white paper offers a comprehensive guide to designing and implementing a non-blocking, lossless fabric that supports RoCEv2, allowing for RDMA GPUDirect operations.

Utilizing the Cisco Nexus 93600CD-GX leaf switches along with the Cisco Nexus 9332D-GX2B spines offers an end-to-end latency of ~4.5 us in a spine-leaf topology. Utilizing the proper number of uplinks and speeds ensures a non-blocking fabric. Utilizing ECN and PFC to implement Data Center Quantized Congestion Notification (DCQCN) is necessary to create a lossless fabric. We can also utilize WRED or the more powerful AFD (the switches must support it) as the building blocks of a lossless fabric.

Nexus Dashboard Fabric Controller (NDFC) should be used for ease of deployment of the fabric, as well as for the advanced queuing templates available for AI/ML. Finally, once the fabric is up and running, the WRED or AFD queues need to be monitored and tuned to eliminate hot spots of congestion in the fabric. Nexus Dashboard Insights is the tool of choice for this: we can look at Flow Table Events (FTE) from the Nexus switches, monitor the links for increasing ECN and PFC counters, and tune the queues appropriately. This will be an iterative process, tuning the fabric as well as how the GPUs are used inter-node for training runs.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 6-Infiniband and RoCE primer for the Network Engineer

AI/ML multi-node clusters have stringent infrastructure latency and packet loss requirements. The infrastructure connecting the GPU compute nodes plays an important role in making large AI/ML jobs complete more quickly and, if designed correctly, mitigates the risk of large AI/ML jobs failing due to high latency or packet drops. To help network administrators optimize their AI/ML network for best performance, as well as predict issues before they become service-impacting, it is imperative that the network system provide deep visibility into the congestion management algorithms as well as the overall network system health.

Remote direct memory access (RDMA) is a well-known technology used in high-performance computing (HPC) and storage networking environments to bypass the CPU and gain direct memory access. The advantages of RDMA are the high throughput and low latency transfer of information between compute nodes at the memory-to-memory level, without burdening the CPU. The transfer function is offloaded to the network adapter hardware to bypass the operating system’s software network stack. This technology also has the advantage of reduced power requirements.

AI applications take advantage of–and expect–low latency, lossless networks. To achieve this, network administrators need to deploy the right hardware and software features, along with a configuration that supports AI application needs. AI applications also need networks that can provide visibility into hot spots so they can be tuned as necessary. Finally, AI applications should take advantage of automation frameworks to make sure the entire network fabric is configured correctly and there is no configuration drift.

The network design has a considerable influence on the overall performance of an AI/ML cluster. In addition to the congestion management tools described earlier in this document, the network design should (must) provide a non-blocking fabric to accommodate all the throughput that GPU workloads require. This reduces the work needed from the congestion management algorithms (and from manual tuning), which in turn allows for faster completion of AI/ML jobs.

There are several ways to build a non-blocking network, but a two-tier, spine-switch-leaf-switch design provides the lowest latency and scalability. To illustrate this, we will use an example of a company that wants to create a GPU cluster with 1024 GPUs: 128 servers, each with 8 GPUs and 2 server ports, along with storage devices that hold the data to be processed by the GPUs.

This means that 256 x server ports are required at the leaf layer for server connectivity. To make this a non-blocking network, the uplinks to the spine switches must have the same bandwidth capacity as the front panel server-facing ports. 

Infiniband overview

As we mentioned earlier:

“The de facto standard of supercomputer and GPU connectivity is InfiniBand and RDMA. While it works perfectly, it is a proprietary solution from Mellanox, acquired by NVIDIA, the world’s leading GPU maker. Also, Infiniband is a back-end network that connects compute nodes for GPU-GPU clustering using RDMA and storage connectivity. Infiniband also requires a front-end network to provide connectivity to the outside world.”

I can’t stress this enough: “While RDMA over Converged Ethernet (RoCE) offers an alternative to Infiniband, it requires careful setup, monitoring, and QoS management to prevent hot spots.”

So let’s deep dive into Infiniband and see what makes it especially suitable for AI/ML workloads.

Subnet Manager

Infiniband was the first to implement true SDN using Subnet Manager. The InfiniBand Subnet Manager (SM) is a centralized entity running in the system. Each subnet needs one subnet manager to discover, activate, and manage the subnet. Each network requires a Subnet Manager to be running in either the system itself (system-based) or on one of the nodes that is connected to the fabric (host-based). The SM applies network traffic-related configurations such as QoS, routing, and partitioning to the fabric devices. You can view and configure the Subnet Manager parameters via the CLI/WebUI.

Adaptive Routing (AR) enables the switch to select the output port based on the port’s load. It assumes no constraints on output port selection (free AR). The NVIDIA subnet manager (SM) enables and configures the AR mechanism on fabric switches. It scans all the fabric switches and identifies which support AR, then it configures the AR functionality on these switches.

Fast Link Fault Recovery (FLFR) configured on the subnet manager enables the switch to select the alternative output port when the output port provided in the Linear Forwarding Table is not in an Armed/Active state. This mode allows the fastest traffic recovery for switch-to-switch port failures occurring due to link flaps or neighbor switch reboots without SM intervention.
Fault Routing Notification (FRN) enables the switch to report to neighbor switches that an alternative output port for the traffic to a specific destination LID should be selected to avoid sending traffic to the affected switch.

High Bandwidth

GPU-to-GPU connectivity requires very high bandwidth and low latency. With NDR at 400Gbps shipping today and the InfiniBand Trade Association predicting speeds of 3,200Gbps by 2030, Infiniband speeds “should” keep up with Ethernet speeds in the foreseeable future. InfiniBand’s ultra-low latencies, with measured delays of 600ns end-to-end, accelerate today’s mainstream HPC and AI applications. Most of today’s Ethernet cut-through switches offer roughly 4us end-to-end, making an IB fabric the lower-latency option.

CPU offload, GPUDirect, and RDMA

As we discussed in part 4, RDMA can be used to bypass the CPU and allow remote applications to communicate directly.

RDMA is a crucial feature of InfiniBand that provides a way for devices to access the memory of other devices directly. With RDMA, data can be transferred between machines without intermediate software copies, significantly reducing latency and increasing performance. RDMA in InfiniBand is implemented using a two-step process. First, the sending device sends a message to the receiving device requesting access to its memory. The receiving device then grants access to its memory and returns a reply to the sender. Once access is granted, the sender can read or write directly to the memory of the receiving device.

GPUDirect, offered by NVIDIA, allows the GPUs to communicate directly inside the node using NVLink or across the Infiniband network using GPUDirect RDMA. Other vendors’ solutions must use MPI libraries to implement the equivalent of GPUDirect, but it is supported.

Scalability

Infiniband can scale to 48,000 nodes and beyond using an Infiniband router and multiple subnets.

SHARP

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) allows the Infiniband network to process data aggregation and reduction operations as data traverses the network, eliminating the need to send data multiple times between endpoints. This innovative approach not only decreases the amount of data traversing the network but offers additional benefits, including freeing valuable CPU resources for computation rather than using them to process communication overhead. It also progresses such communication asynchronously, independent of the host processing state.

Without SHARP, data must be sent from each endpoint to the switch, back down to endpoints where collectives are computed, back up to the switch, and then back to the endpoints. That’s four traversal steps. With SHARP, the collective operations happen in the switch, so the number of traversal steps is halved: from endpoints to the switch and back. This leads to a 2x improvement in bandwidth for such collective operations and a reduction of 7x in MPI all reduce latency. The SHARP technology is integrated into most of the open-source and commercial MPI software packages as well as OpenSHMEM, NCCL, and other IO frameworks.

QOS

The InfiniBand architecture defined by IBTA includes Quality of Service and Congestion Control features that are tailored perfectly to the needs of Data Center Networks. Through the use of features available today in InfiniBand Architecture-based hardware devices, InfiniBand can address QoS requirements in Data Center applications in an efficient and simpler way, better than any other interconnect technology available today.

Unlike the more familiar Ethernet domain, where QoS and the associated queuing and congestion-handling mechanisms are used to rectify and enhance the best-effort nature of the delivery service, InfiniBand starts with a slightly different paradigm. First of all, the InfiniBand architecture includes QoS mechanisms inherently. This means that QoS mechanisms are not extensions, as in the case of IEEE 802.3-based Ethernet, which only defines a best-effort delivery service. Because of the inherent inclusion of QoS in the base-layer InfiniBand specification, there are two levels to QoS implementation in InfiniBand hardware devices:

  1. QoS mechanisms are inherently built into the basic service delivery mechanism supported by the hardware, and
  2. Queuing services and management for prioritizing flows and guaranteeing service levels or bandwidths

InfiniBand is a lossless fabric; that is, it does not drop packets during regular operation. Packets are dropped only in instances of component failure. As such, the disastrous effects of retries and timeouts on data center applications are non-existent. When starting to work with InfiniBand QoS, you will need to understand two basic concepts: Service Levels (SLs) and Virtual Lanes (VLs).

Service Level (SL)
A field in the Local Routing Header (LRH) that defines the requested QoS. It permits a packet to operate at one of 16 SLs. The assignment of SLs is a function of each node’s communication manager (CM) and its negotiation with the SM; it does not define any methods for traffic classification in sending nodes. This is left to the implementer.

Virtual Lane (VL)

Virtual lanes provide a mechanism for creating multiple channels within a single physical link. Virtual Lanes enable multiple independent data flows to share the same link and separate buffering and flow control for each flow. A VL Arbiter is used to control the link usage by the appropriate data flow. Virtual lane arbitration is the mechanism an output port utilizes to select from which virtual lane to transmit. IBA specifies a dual-priority weighted round-robin (WRR) scheme.

  • Each VL uses a different buffer
  • In the standard, there are 16 possible VLs.
    • VLs 0-14 – used for data traffic (0-7 are the only VLs currently implemented by Mellanox switches)
    • VL 15 – used for management traffic only

Packets are assigned to a VL on each IB device they traverse based on their SL, input port, and output port. This mapping allows (among other things) seamless interoperability of devices that support different numbers of VLs. The mapping is architected for very efficient HW implementation.
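As a rough illustration of the SL-to-VL mapping idea, here is a minimal sketch. The table contents are arbitrary placeholders; real SL2VL tables are programmed per input/output port pair by the Subnet Manager.

```python
# Illustrative SL -> VL mapping, keyed by (input_port, output_port).
# Real InfiniBand devices hold one such table per port pair, programmed by the SM.
SL2VL = {
    # (in_port, out_port): [VL for SL 0..15]
    (1, 2): [0, 0, 1, 1, 2, 2, 3, 3, 0, 1, 2, 3, 0, 1, 2, 3],
    (2, 1): [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
}

def map_sl_to_vl(in_port: int, out_port: int, sl: int) -> int:
    """Pick the data VL a packet will use on the egress link."""
    return SL2VL[(in_port, out_port)][sl]

print(map_sl_to_vl(1, 2, sl=6))   # -> VL 3 in this made-up table
```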

It’s important to understand that Infiniband is NOT a dead technology and should be evaluated for any HPC and AI/ML multi-node GPU infrastructure.

Ethernet using RDMA over Converged Ethernet (RoCE)

RDMA, a key technology in high-performance computing and storage networking, ensures high throughput and low-latency memory-to-memory information transfer between compute nodes. It offloads the transfer function to network adapter hardware, reducing CPU burden and power requirements. InfiniBand (IB) initially pioneered RDMA, becoming the preferred high-performance computing transport. However, InfiniBand necessitated a separate, costly network for enterprise networks requiring HPC workloads.

RDMA offers various implementations, including Ethernet-based RDMA over Converged Ethernet (RoCE). RoCEv2, introduced in 2014, extends InfiniBand benefits and allows routing over Ethernet networks. RoCE is an extension of InfiniBand with Ethernet forwarding. RoCEv2 encapsulates IB transport in Ethernet, IP, and UDP headers, so it can be routed over Ethernet networks.

Its implementation in data center networks is driven by Ethernet’s familiarity, affordability, and the possibility of creating a converged fabric for regular enterprise traffic and RDMA workloads. In AI/ML clusters, RDMA is used to communicate memory-to-memory between GPUs over the network using GPUDirect RDMA. RoCEv2 in AI/ML clusters, particularly for GPUDirect RDMA, demands a lossless fabric achieved through explicit congestion notification (ECN) and priority flow control (PFC) algorithms.

Explicit Congestion Notification (ECN)

ECN is a feature between two ECN-enabled endpoints, such as switches, which can mark packets with ECN bits during network congestion. An ECN-capable network node employs a congestion avoidance algorithm, marking traffic contributing to congestion after reaching a specified threshold. ECN facilitates end-to-end congestion notification between ECN-enabled senders and receivers in TCP/IP networks. However, proper ECN functionality requires enabling ECN on all intermediate devices, and any device lacking ECN support breaks end-to-end ECN.

ECN notifies networks of congestion, aiming to reduce packet loss and delay by prompting the sending device to decrease the transmission rate without dropping packets. WRED operates on a per-queue level, with two thresholds in the queue. The WRED minimum threshold indicates minor congestion, marking outgoing packets based on drop probability. If congestion persists and buffer utilization surpasses the WRED maximum threshold, the switch marks ECN on every outgoing packet. This WRED ECN mechanism enables endpoints to learn about congestion and react accordingly.

ECN is marked in the network node where congestion occurs within the IP header’s Type of Service (TOS) field’s two least significant bits. When a receiver receives a packet with ECN congestion experience bits set to 0x11, it sends a Congestion Notification Packet (CNP) back to the sender. Upon receiving the notification, the sender adjusts the flow to alleviate congestion, constituting an efficient end-to-end congestion management process. If congestion persists beyond the WRED maximum threshold, every packet is marked with congestion bits, significantly reducing data rate transmission. This process alleviates congestion, allowing the buffer to drain and traffic rates to recover until the next congestion signal.

Priority Flow Control (PFC)

Priority Flow Control was introduced in Layer 2 networks as the primary mechanism to enable lossless Ethernet. Flow control is driven by the class of service (COS) value in the Layer 2 frame, and congestion is signaled and managed using pause frames and a pause mechanism. However, building scalable Layer 2 networks can be a challenging task for network administrators. Because of this, network designs have mostly evolved into Layer 3 routed fabrics.

Priority-Based Flow Control (PFC) is a congestion relief feature ensuring lossless transport by offering precise link-level flow control for each IEEE 802.1p code point (priority) on a full-duplex Ethernet link. When the receive buffer on a switch interface reaches a threshold, the switch transmits a pause frame to the sender, temporarily halting transmission to prevent buffer overflow. The buffer threshold must allow the sender to stop transmitting frames before overflow occurs, and the switch automatically configures queue buffer thresholds to avoid frame loss.

In cases where congestion prompts the pause of one priority on a link, other priorities continue to send frames, avoiding a complete transmission halt. Only frames of the paused priority are held back. When the receive buffer falls below a specific threshold, the switch sends a message to resume the flow. However, pausing traffic, depending on link traffic or priority assignments, may lead to ingress port congestion and propagate congestion through the network.

With PFC enabled, a dedicated class of service ensures lossless transport, treating traffic in this class differently from other classes. Ports configured with PFC receive a dedicated no-drop queue and buffer. To maintain lossless capabilities, the queue has two thresholds: the xOFF threshold, set higher in the buffer, triggers PFC frames to halt traffic towards the source, and the xON threshold, reached after the buffer starts draining, stops the transmission of pause frames, indicating the system’s belief that congestion has subsided.

Priority Flow Control (PFC) was introduced in Layer 2 networks to enable lossless Ethernet, driven by the Class of Service (COS) value in Layer 2 frames. Congestion is signaled and managed through pause frames and a pause mechanism. However, building scalable Layer 2 networks poses challenges, leading to the evolution of network designs into Layer 3 routed fabrics. As RoCEv2 supports routing, PFC is adapted to work with Differentiated Services Code Point (DSCP) priorities for signaling congestion between routed hops. DSCP, used for classifying network traffic on IP networks, employs the 6-bit differentiated services field in the IP header for packet classification. Layer 3 marking ensures traffic maintains classification semantics across routers. PFC frames, utilizing link-local addressing, enable pause signaling for routed and switched traffic. PFC is transmitted per hop, from the congestion point to the traffic source. While this step-by-step process may take time to reach the source, PFC is the primary tool for congestion management in RoCEv2 transport.

Data Center Quantized Congestion Notification (DCQCN)

The collaborative approach of Priority-Based Flow Control (PFC) and Explicit Congestion Notification (ECN), managing congestion together, is termed Data Center Quantized Congestion Notification (DCQCN), explicitly developed for RoCE networks. This joint effort of PFC and ECN ensures efficient end-to-end congestion management. During minor congestion with moderate buffer usage, WRED with ECN seamlessly handles congestion. PFC gets activated to manage congestion in cases of more severe congestion or microbursts causing high buffer usage. To optimize the functioning of both PFC and ECN, appropriate thresholds must be set.

In the example provided, WRED minimum and maximum thresholds get configured for lower buffer utilization to address congestion initially, while the PFC threshold is set higher as a safety net to manage congestion after ECN. Both ECN and PFC operate on a no-drop queue, ensuring lossless transport.

Data Center Quantized Congestion Notification (DCQCN) combines ECN and PFC to achieve end-to-end lossless Ethernet. ECN aids in overcoming the limitations of PFC for achieving lossless Ethernet. The concept behind DCQCN is to enable ECN to control flow by reducing the transmission rate when congestion begins, minimizing the time PFC is triggered, which would otherwise halt the flow entirely.

DCQCN Requirements

Achieving optimal performance for Data Center Quantized Congestion Notification (DCQCN) involves balancing conflicting requirements, ensuring neither premature nor delayed triggering of Priority-Based Flow Control (PFC). Three crucial parameters must be carefully calculated and configured:

1. **Headroom Buffers:** When a PAUSE message gets sent upstream, it takes time to take effect. The PAUSE sender should reserve sufficient buffer capacity to absorb incoming packets during this interval to prevent packet drops (a rough sizing sketch follows this list). QFX5000 Series switches allocate headroom buffers on a per-port, per-priority basis, carved from the global shared buffer. Control over headroom buffer allocation can be exercised using MRU and cable length parameters in the congestion notification profile. If minor ingress drops persist after PFC is triggered, resolving them involves increasing headroom buffers for the specific port and priority combination.

2. **PFC Threshold:** Representing an ingress threshold, this threshold signifies the maximum size an ingress priority group can attain before triggering a PAUSE message to the upstream device. Each PFC priority has its priority group at each ingress port. PFC thresholds typically consist of two components—the MIN threshold and the shared threshold. PFC is activated for a corresponding priority when both MIN and shared thresholds are reached, with a RESUME message sent when the queue falls below these PFC thresholds.

3. **ECN Threshold:** This egress threshold equals the WRED start-fill-level value. Once an egress queue surpasses this threshold, the switch initiates ECN marking for packets in that queue. For effective DCQCN operation, this threshold must be lower than the ingress PFC threshold, ensuring ECN marking precedes PFC triggering. Adjusting the WRED fill level, such as setting it at 10 percent with default shared buffer settings, enhances ECN marking probability. However, a higher fill level, like 50 percent, may impede ECN marking, as ingress PFC thresholds would be met first in scenarios with two ingress ports transmitting lossless traffic to the same egress port.
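To give a feel for how headroom is sized, here is a rough back-of-the-envelope sketch. The formula and every number in it are illustrative assumptions (propagation delay, response allowance, and MTU vary by platform and cabling); real deployments should follow the switch vendor's per-ASIC guidance.

```python
# Rough PFC headroom sizing: after a pause frame is sent, the no-drop buffer
# must still absorb roughly one cable round trip of in-flight data plus a
# couple of maximum-size frames of sender/receiver response time.
# All constants below are illustrative assumptions, not vendor guidance.

def headroom_bytes(link_gbps: float, cable_m: float,
                   mtu_bytes: int = 9216, response_frames: int = 2) -> float:
    prop_delay_s = cable_m / 2.0e8                       # ~5 ns per meter of cable
    in_flight = 2 * prop_delay_s * link_gbps * 1e9 / 8   # one round trip of data
    return in_flight + response_frames * mtu_bytes

for cable_m in (3, 30, 100):
    kb = headroom_bytes(400, cable_m) / 1024
    print(f"{cable_m:>3} m cable @ 400G: ~{kb:.0f} KB of headroom per priority")
```

The longer the cable and the faster the link, the more headroom each no-drop priority needs, which is why the congestion notification profile exposes cable length and MRU as inputs.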

Summary

In summary, the interconnecting “network” between GPU nodes must meet the following requirements:

  • Offer high bandwidth and low latency; be congestion-free, non-blocking, and not oversubscribed.
  • Offer support for RDMA.
  • Provide high visibility and the ability to adjust to reduce hot spots.
  • Use a spine-leaf design for the lowest latency and scalability.
  • Provide an end-to-end solution from host to host (Ethernet NICs or Infiniband HCAs supporting RDMA and priority flow control (PFC)).
  • 100Gbps to the host, 400 or 800Gbps uplinks from leafs to spines.
  • The number and backplane throughput of the spines must be greater than the front-panel ports of the connected leafs (non-blocking).
  • L3 utilizing ECMP, or even better, a switch platform with ASICs that can do bit spraying across the links (few options today).
  • Infiniband is potentially a better turn-key solution; however, it requires a dedicated network just for node-to-node GPU connectivity.
  • RoCE is a viable alternative but requires a lot of custom tuning of ECN and PFC to eliminate hot spots of congestion during full training runs.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 2-CPU and GPU usage in AI/ML for the Network Engineer

Machine learning (ML) is advancing rapidly, enhancing artificial intelligence capabilities. The challenge for business owners exploring ML lies in understanding how to leverage it effectively. ML improves efficiency, productivity, and speed, yet its technical nature can be overwhelming. ML involves complex algorithms tailored to specific situations.

Various businesses may use similar algorithms in the current landscape for everyday purposes. The potential for algorithm sharing arises, but sensitive or proprietary information complicates this process. The question arises: can ML algorithms be easily shared among technologists, similar to open-source platforms like GitHub for coders? The answer is yes, with existing platforms where ML teams share training data sets and open-source libraries with generic algorithms from Google/TensorFlow, Azure, and AWS.

However, most ML algorithms are specific to each business, particularly how data is labeled. Companies, even within the same industry, operate uniquely. While collaborative sharing occurs in coding, ML presents challenges due to its business-specific nature. Despite differences, the potential for global collaboration in ML exists, similar to how open-source code collaboration improves efficiency and innovation. Like any technology, ML has pros and cons, with room for improvement through widespread partnership to create more effective and beneficial solutions for the world.

GPU vs CPU

We know that neural networks can be used to create complex mathematical algorithms, and it is up to the data scientists to develop the algorithms and implement the neural networks using mathematical techniques such as linear algebra, Fast Fourier Transforms, multivariate calculus, and statistics. These computations are performed at the various neurons in the neural network. Below is a typical flowchart of data input, computations, and saves. In this example, a GPU is inserted into the neural network flow to speed up memory transfer.

CPU only VS CPU and GPU

If we look at the timing of a neural network processing a single run through various neural nodes, we can see that offloading some of the work to the GPU significantly speeds up the process. Here, we see an approximate speed increase of 8X. Now remember this is a single run-through of a single neural network path from input to hidden layer to output. Now multiply this a million, billion, or even trillion times through the neural network. We can see a massive advantage in offloading neural networking functions onto a GPU. The question is, why is a GPU faster than a CPU?


GPUs excel in speed due to their matrix multiplication and convolution efficiency, primarily driven by memory bandwidth rather than raw parallelism. While CPUs are latency-optimized, GPUs are bandwidth-optimized, resembling a big truck compared to a Corvette. The CPU quickly fetches small amounts of memory, whereas the GPU fetches much larger amounts at once, leveraging its superior memory bandwidth, reaching up to 750GB/s compared to the CPU’s 50GB/s.

Thread parallelism is crucial for GPUs to overcome latency challenges. Multiple GPUs can function together, forming a fleet of trucks and enabling efficient memory access. This parallelism effectively conceals latency, allowing GPUs to provide high bandwidth for large memory chunks with minimal latency drawbacks. The second step involves transferring memory from RAM to local memory on the chip (L1 cache and registers). While less critical, this step still contributes to the overall GPU advantage.

In computation, GPUs outperform CPUs due to their efficient use of register memory. GPUs can allocate a small pack of registers for each processing unit, resulting in an aggregate GPU register size over 30 times larger than CPUs, operating at twice the speed. This allows GPUs to store substantial data in L1 caches and register files, facilitating the reuse of convolutional and matrix multiplication tiles. The hierarchy of importance for GPU performance lies in high-bandwidth main memory, hiding memory access latency under thread parallelism, and the presence of large, fast, and easily programmable register and L1 memory. These factors collectively make GPUs well-suited for deep learning.

To understand why we use GPUs in deep machine learning and neural networks, we need to see what happens at a super low level of loading numbers and functions from DRAM memory into the CPU so the CPU can perform the algorithm at that neural node.

Looking at a typical CPU and memory, we can see that the memory bandwidth is 200 GBytes/sec, or about 25 billion FP64 values per second, while the CPU can perform 2000 GFLOPS of FP64 math. FLOPS (floating-point operations per second) is a performance metric indicating the number of floating-point arithmetic calculations a computer processor can execute within one second. Floating-point arithmetic refers to calculations performed on representations of real numbers in computing. For the CPU to actually operate at 2000 GFLOPS, we have to be able to feed it from memory, but the memory can supply only 25 billion double-precision floating-point values (FP64) per second, while the CPU can process 2000 billion double-precision floating-point operations per second.

Difference between single and double precision values, or FP-32 vs FP-64

Compute intensity and memory bandwidth

Because we cannot feed the CPU fast enough from memory, the CPU would need to perform 80 operations on every number loaded from memory just to break even. There are very few, if any, algorithms that need 80 operations per value, so we are severely limited by memory bandwidth. This ratio is called compute intensity.

Let’s look at a simple function called DAXPY, which is simply alpha*x + y = z. It requires 2 CPU FLOPs (a multiply and an add) and 2 memory loads (x and y). This is such a core operation that the instruction set has a single instruction for it, FMA (fused multiply-add). Looking at the diagram, we have 2 loads from memory (DRAM) and a very long time (memory latency) for x and y to traverse the circuit board into the CPU and become ready. We perform the 2 flops of (alpha*x)+y very quickly and then output the result. The memory bottleneck hinders speed if we have an extensive neural network with massive datasets and complicated algorithms.
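As a quick illustration, here is a naive Python/NumPy rendering of DAXPY; the loop is deliberately written element by element so the 2-flops-per-16-bytes ratio is visible (a real implementation would of course vectorize it).

```python
# DAXPY (z = alpha*x + y): 2 flops (multiply + add) per 2 FP64 loads (16 bytes),
# an arithmetic intensity of only 0.125 flops/byte -- nowhere near the ~80
# operations per loaded value the CPU in the example above could sustain.
import numpy as np

def daxpy(alpha: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    z = np.empty_like(x)
    for i in range(len(x)):            # each iteration: load x[i], load y[i],
        z[i] = alpha * x[i] + y[i]     # one fused multiply-add, store z[i]
    return z

n = 100_000
x, y = np.random.rand(n), np.random.rand(n)
z = daxpy(2.0, x, y)

flops, bytes_loaded = 2 * n, 16 * n
print(f"arithmetic intensity = {flops / bytes_loaded:.3f} flops/byte")
```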

So why is there a memory bandwidth latency constraint? Physics! First, let’s look at the fundamentals of a computer motherboard. With a computer clock speed of 3GHz, and a signal speed in silicon of roughly 60 million m/s, electricity can travel about 20mm (0.8 inches) in 1 clock tick. If we look at a typical chip, which is 22x32 mm, and a copper trace length from the CPU to DRAM of 50-100mm, we can see it will take 5-10 clock ticks just for the ones and zeros of the data to traverse the CPU and motherboard.

If we put this together, we can calculate how much data can move across the memory bus. When performing DAXPY on a typical Intel Xeon 8280 CPU, the memory bandwidth is 131 GB/sec and the memory latency is 89 ns. In 89 ns the bus could move 11,659 bytes, yet a single DAXPY iteration moves only 16 bytes, so our memory efficiency is 0.14%. Put another way, the memory bus is idle 99.86% of the time!

To keep the memory bus busy, we must run roughly 729 iterations of DAXPY at once.
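Here is the arithmetic behind those figures (131 GB/sec and 89 ns are the Xeon 8280 numbers quoted above; a DAXPY iteration loads two 8-byte values):

```latex
131\ \text{GB/s} \times 89\ \text{ns} \approx 11{,}659\ \text{bytes in flight},\qquad
\frac{16\ \text{bytes}}{11{,}659\ \text{bytes}} \approx 0.14\%,\qquad
\frac{11{,}659}{16} \approx 729\ \text{concurrent iterations}
```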

Threads to the rescue!

Modifying the instructions to run operations in parallel lets us use multiple threads to keep the memory bus busy. How busy we can keep it is limited only by the maximum number of threads the processor or GPU supports and by the number of outstanding memory requests the instruction set allows.

One thing stands out when comparing a standard CPU with the NVIDIA A100 GPU: the huge difference in memory efficiency and thread count. The CPU has better memory efficiency but far fewer threads available; the GPU has much worse memory efficiency but roughly 5x more threads than it needs to keep the memory bus saturated. Both are trying to solve the same problem, keeping the memory bus busy, and if you take one thing away from this comparison, it is that the massive number of threads is what lets the GPU process work faster.

If we examine bandwidth and latency in an NVIDIA A100 GPU, we see that on-chip latency and bandwidth are significantly better than reaching off-chip memory across the PCIe bus: the L1 cache delivers 19,400 GB/sec of bandwidth, while PCIe delivers only 25 GB/sec! The hardware designers intentionally gave the GPU far more threads than it can execute at once so that the processing units stay fed and the entire memory system stays busy.

SMs, Warps, and oversubscription

A GPU contains more than a hundred streaming multiprocessors (SMs). Each SM can hold 64 warps, and each warp runs 32 threads. This is how a GPU can juggle hundreds of thousands of threads while a CPU handles fewer than 1,000. We can use parallelism (more on this later) to run tens of thousands of threads to keep the memory bus busy, and we can even oversubscribe the bus to keep more threads in flight. Each SM can issue 4 warps concurrently, so on every clock tick, 4 warps of 32 threads each can perform FLOPs. This is how GPU designers fought memory latency: with threads.

The GPU’s secret is oversubscription. A GPU can have 432 active warps (13,824 active threads) and 6,480 waiting warps (207,360 waiting threads). The GPU can switch from one warp to the next in a single clock cycle and begin processing the next 432 warps. The GPU is designed as a throughput machine that thrives on oversubscription, while the CPU is a latency machine for which oversubscription is terrible. We can compare the CPU to a car with one occupant that takes 45 minutes to reach its destination, and the GPU to a train that makes stops and takes on more occupants but moves more slowly. In the end, both reach the destination, but the GPU, while slower per occupant, carries far more occupants (threads). These threads are the computations needed in machine learning, such as DAXPY.
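Those warp and thread counts are internally consistent if we assume an A100-class GPU with 108 SMs, a maximum of 64 resident warps per SM, and 4 warps issuing per SM per clock (the per-SM limits are drawn from NVIDIA's published A100 specifications rather than stated in the article):

```latex
108 \times 4 = 432\ \text{active warps}, \qquad 432 \times 32 = 13{,}824\ \text{active threads} \\
108 \times 64 = 6{,}912\ \text{resident warps}, \qquad 6{,}912 - 432 = 6{,}480\ \text{waiting warps}, \qquad 6{,}480 \times 32 = 207{,}360\ \text{waiting threads}
```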

In machine learning, these computations access data in various patterns: element-wise in DAXPY, local in convolution, and all-to-all in a Fast Fourier Transform.

Let us understand how parallelism works and how we can use all these threads simultaneously. Say we have trained an AI to recognize a dog. We overlay the image of the dog with a grid, creating many blocks of work. The blocks execute independently, and the GPU is oversubscribed with blocks of work (the GPU is designed to handle oversubscription). Within each block, many threads work together, sharing data through local GPU memory (L1, L2, and HBM).

CUDA’s hierarchical execution model formalizes this: we use parallelism to break the work into blocks with equal numbers of threads. The blocks run independently and in parallel, the threads within a block can synchronize to exchange data, and everything stays on the GPU, taking advantage of thread oversubscription and on-chip memory.
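A minimal sketch of that grid/block hierarchy, reusing the DAXPY kernel from the earlier sketch (the problem size, block size, and buffer handling are assumptions for illustration):

```cuda
#include <cuda_runtime.h>

// Same DAXPY kernel as in the earlier sketch.
__global__ void daxpy(int n, double alpha, const double *x, const double *y, double *z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = alpha * x[i] + y[i];
}

int main()
{
    int n = 1 << 24;                          // ~16M elements (illustrative size)
    size_t bytes = (size_t)n * sizeof(double);

    double *d_x, *d_y, *d_z;                  // device buffers
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_z, bytes);
    // (In a real program, d_x and d_y would be filled via cudaMemcpy first.)

    int threadsPerBlock = 256;                                  // threads cooperate within a block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // enough blocks to cover all n elements

    // Launch far more blocks than the GPU has SMs: the hardware keeps thousands of
    // warps resident and switches between them every cycle to hide memory latency.
    daxpy<<<blocks, threadsPerBlock>>>(n, 2.0, d_x, d_y, d_z);
    cudaDeviceSynchronize();

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_z);
    return 0;
}
```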

Summary

This post shows how the GPU can significantly reduce execution times by utilizing parallelism and thread oversubscription. The GPU can beat the memory bus latency issue by using on-chip memory and threads to run work directly on the GPU. We will next discuss parallel computing and how we can leverage multiple onboard GPUs and, finally, GPUs on other computers via Infiniband or Ethernet networking.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 1-AI/ML and Neural Networks for the Network Engineer

Hopefully, everyone has been paying attention to the news and is somewhat familiar with artificial intelligence (AI). Today, interest in and applications of AI are at an all-time high, with breakthroughs happening daily. ChatGPT has celebrated its first birthday, and the amount of AI-related content has grown exponentially. For readers who have not been keeping up with AI, breakthroughs in AI and the consumption of AI-enhanced technologies are about to accelerate at a rate that will require companies and organizations to bring computing, storage, and networking for AI systems back in-house. Today, most companies and organizations are starting with cloud-based AI, using Machine Learning (ML) on what are known as Large Language Models (LLMs).

This series of 6 posts will guide the network engineer through understanding the basics of AI/ML, covering topics such as artificial neurons, neural networks, GPU and memory latency, parallel and distributed computing, and network design best practices.

Server and infrastructure costs will increase exponentially.

Everyone needs to take note of the following as it will affect data center design and the cost to the business. Tirias Research projects that by 2028, the combined infrastructure and operational expenses for generative AI data center servers alone will surpass $76 billion, posing challenges to the business models and profitability of emerging services like search, content creation, and business automation incorporating Generative AI. This forecast is based on an aggressive 4X improvement in hardware compute performance and anticipates a 50X increase in processing workloads, resulting in a substantial rise in data center power consumption to nearly 4,250 megawatts. This cost estimate, which exceeds twice the estimated annual operating cost of Amazon’s AWS, encompasses labor, power, cooling, ancillary hardware, and 3-year amortized server costs, excluding the data center building structure. The model utilized for this forecast is grounded in server benchmarks using 10 Nvidia GPU accelerators with specific power consumption details. Everyone from the C-Level down needs to plan for this infrastructure cost.

With the exponential growth in GenAI demand, relying on breakthroughs in processing or chip design appears challenging, given the slowdown of Moore’s Law. Meeting rising consumer expectations for improved GenAI output is not without trade-offs, as efficiency and performance gains may be offset by ever-larger models and heavier usage. As consumer usage escalates, costs will inevitably rise.

What are AI, ML, and DL?

Artificial Intelligence (AI) integrates human-like intelligence into machines through algorithms. “Artificial” denotes human-made or non-natural entities, while “Intelligence” relates to the ability to understand or think. AI, defined as the study of training machines (computers) to emulate human brain functions, focuses on learning, reasoning, and self-correction for maximum efficiency.

Machine Learning (ML), a subset of AI, enables computers to autonomously learn from experiences without explicit programming. ML develops programs that access data, autonomously making observations to identify patterns and enhance decision-making based on provided examples, aiming for systems to learn through experience without human intervention.

Deep Learning (DL), a sub-part of Machine Learning, leverages Neural Networks to mimic brain-like behavior. DL algorithms process information patterns akin to human cognition, identifying and classifying data on larger datasets. DL machines autonomously handle the prediction mechanism, operating on a broader scale than ML.

How imminent is the development of artificial intelligence that surpasses the human mind? The succinct response is not in the immediate future, yet the pace of progress has notably accelerated since the inception of the modern AI field in the 1950s. During the 1950s and 1960s, AI experienced significant advancements as computer scientists, mathematicians, and experts from diverse fields enhanced algorithms and hardware. Despite early optimism about the imminent realization of a thinking machine comparable to the human brain, this ambitious goal proved elusive, leading to a temporary decline in support for AI research. The field witnessed fluctuations until it experienced a resurgence around 2012, primarily fueled by the revolutionary advancements in deep learning and neural networks.

AI vs. Machine Learning vs. Deep Learning Examples:

Artificial Intelligence (AI): AI involves creating computer systems capable of human-like intelligence, tackling tasks that typically require human cognitive abilities. Examples include:

  1. Speech Recognition: Used in virtual assistants, call centers, and dictation systems to classify and transcribe speech using deep learning algorithms.
  2. Personalized Recommendations: Platforms like Amazon and Netflix use AI to analyze user history for customized product and content recommendations.
  3. Predictive Maintenance: AI-powered systems predict equipment failures by analyzing sensor data, reducing downtime and maintenance costs.
  4. Medical Diagnosis: AI analyzes medical images to aid doctors in making accurate diagnoses and treatment plans.
  5. Autonomous Vehicles: Self-driving cars use AI algorithms to make navigation, obstacle avoidance, and route planning decisions.
  6. Virtual Personal Assistants (VPA): Siri and Alexa use natural language processing to understand and respond to user requests.

Machine Learning (ML): ML, a subset of AI, involves algorithms and statistical models that enable computers to learn and improve performance without explicit programming. Examples include:

  1. Image Recognition: ML algorithms classify images, applied in self-driving cars, security systems, and medical imaging.
  2. Speech Recognition: ML algorithms transcribe speech in virtual assistants and call centers.
  3. Natural Language Processing (NLP): ML algorithms understand and generate human language, applied in chatbots and virtual assistants.
  4. Recommendation Systems: ML algorithms analyze user data for personalized e-commerce and streaming services.
  5. Sentiment Analysis: ML algorithms classify text or speech sentiment as positive, negative, or neutral.
  6. Predictive Maintenance: ML algorithms analyze sensor data to predict equipment failures and reduce downtime.

Deep Learning: Deep Learning, a type of ML, employs artificial neural networks with multiple layers for learning and decision-making. Examples include:

  1. Image and Video Recognition: Deep learning algorithms classify and analyze visual data in self-driving cars, security systems, and medical imaging.
  2. Generative Models: Deep learning algorithms create new content based on existing data, applied in image and video generation.
  3. Autonomous Vehicles: Deep learning algorithms analyze sensor data for decision-making in self-driving cars.
  4. Image Classification: Deep learning recognizes objects and scenes in images.
  5. Speech Recognition: Deep learning transcribes spoken words into text for voice-controlled interfaces and dictation software.
  6. Recommender Systems: Deep learning algorithms make personalized recommendations based on user behavior and preferences.
  7. Fraud Detection: Deep learning algorithms detect fraud patterns in financial transactions.
  8. Game-playing AI: Deep learning algorithms power game-playing AI that competes at superhuman levels.
  9. Time Series Forecasting: Deep learning forecasts future values in time series data, such as stock prices and weather patterns.

Machine Learning (ML) and neural networks

If we look at what AI consists of, we see computer programs that can learn and reason like humans. Part of AI is Machine Learning (ML): mathematical algorithms that can learn WITHOUT being explicitly programmed by humans. Inside machine learning we have deep learning, which uses artificial neural networks to adapt and learn from vast datasets. This learning process, performed on enormous datasets, is called training.

The concept of artificial neural networks originated as a representation of how neurons in the brain function, referred to as ‘connectionism,’ an approach that used interconnected circuits to emulate intelligent behavior. It was illustrated in 1943 with a simple electrical circuit by neurophysiologist Warren McCulloch and mathematician Walter Pitts. Donald Hebb expanded upon the concept in his 1949 book, “The Organization of Behavior,” suggesting that neural pathways strengthen with each repeated use, particularly between neurons that tend to activate simultaneously. This marked the initial steps in the extensive endeavor to quantify the intricate processes of the brain.

A neural network emulates densely interconnected brain cells in a computer, learning autonomously without explicit programming. Unlike a human brain, neural networks are software simulations on ordinary computers, utilizing traditional transistors and logic gates to mimic billions of interconnected cells. This simulation contrasts with the structure of a human brain, as no attempt has been made to build a computer mirroring the brain’s densely parallel arrangement. The analogy between a neural network and a human brain resembles the difference between a computer weather model and actual weather elements. Neural networks are essentially collections of algebraic variables and equations within a computer, devoid of meaning to the computer itself but significant to programmers.

Artificial neurons (Perceptron)

The brain has approximately 10 billion interconnected neurons, each linked to around 10,000 others. Neurons receive electrochemical inputs at dendrites, and if the cumulative signal surpasses a certain threshold, the neuron fires an electrochemical signal along the axon, transmitting it to connected neurons. The brain’s complexity arises from these interconnected electrochemical transmitting neurons, where each neuron functions as a simple processing unit, performing a weighted sum of inputs and firing a binary signal if the total information surpasses a threshold. Inspired by this model, artificial neural networks have shown proficiency in tasks challenging for traditional computers, like image recognition and predictions based on past knowledge. However, they have yet to approach the intricacy of the brain.

The Perceptron serves as a mathematical emulation of a biological neuron. Where the physical counterpart receives electrical signals at its dendrites, the Perceptron represents these signals numerically. The modulation of signals at the synapses is mimicked in the Perceptron by multiplying each input value by a weight. As in the biological process, a Perceptron generates an output signal only when the total strength of its input signals surpasses a set threshold: it computes the weighted sum of its inputs and applies a step function to determine the output. The output is then transmitted to other perceptrons, akin to the interconnected nature of biological neural networks.
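A minimal sketch of that weighted-sum-and-threshold behavior in plain host-side C++ (compatible with the CUDA examples above; the input, weight, and threshold values are made up for illustration):

```cuda
#include <cstdio>
#include <vector>

// A perceptron: weighted sum of inputs passed through a step (threshold) function.
int perceptron(const std::vector<double> &x, const std::vector<double> &w, double threshold)
{
    double sum = 0.0;
    for (size_t i = 0; i < x.size(); ++i)
        sum += x[i] * w[i];              // each input is scaled by its synaptic weight
    return (sum > threshold) ? 1 : 0;    // fire only if the total signal exceeds the threshold
}

int main()
{
    std::vector<double> x = {1.0, 0.0, 1.0};   // example input signals (assumed)
    std::vector<double> w = {0.5, 0.8, 0.2};   // example weights (assumed)
    // Weighted sum = 0.5 + 0.0 + 0.2 = 0.7 > 0.6, so the neuron fires (prints 1).
    printf("perceptron output = %d\n", perceptron(x, w, 0.6));
    return 0;
}
```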

A biological neuron
An artificial neuron (Perceptron)
Biological neurons compared to Artificial neurons

In artificial neural networks, a neuron functions as a mathematical entity mirroring the behavior of biological neurons. Typically, it calculates the weighted average of its inputs, passing this sum through a nonlinear activation function like the sigmoid. The neuron’s output then serves as input for neurons in another layer, perpetuating a similar computation (weighted sum and activation function). This computation entails multiplying a vector of input/activation states by a weight matrix and processing the resulting vector through the activation function.
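In symbols, the computation just described is (using the common convention of a weight matrix W, bias vector b, and sigmoid activation):

```latex
a^{(l+1)} = \sigma\!\left(W^{(l)} a^{(l)} + b^{(l)}\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

where a^(l) is the vector of activations in layer l, and the output a^(l+1) becomes the input to the next layer.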

Mathematical representation of a Neural Network Neuron.


AlexNet, a convolutional neural network (CNN) architecture, was crafted by Alex Krizhevsky in partnership with Ilya Sutskever and Geoffrey Hinton, Krizhevsky’s Ph.D. advisor at the University of Toronto. In 2012, Krizhevsky and Sutskever harnessed the capabilities of two GeForce NVIDIA GPU cards to create an influential visual recognition network, transforming the landscape of neural network research. Before this, neural networks were predominantly trained on CPUs, and the shift to GPUs marked a pivotal moment, paving the path for the advancement of sophisticated AI models.

Nvidia is credited with creating the first “true” GPU in 1999, marking a significant leap from earlier attempts in the 1980s to enhance output and video imagery. Nvidia’s research led to a thousandfold increase in computational speeds over a decade, with a notable impact in 2009 when it supported the “big bang of deep learning.” GPUs, crucial in machine learning, boast about 200 times more processors than CPUs, enabling parallel processing and significantly accelerating tasks like neural network training. Initially designed for gaming, GPUs have become instrumental in deep learning applications due to their similar processing capabilities and large on-board RAM.

Neural network overview

Neural networks are a transformative force in machine learning, driving advancements in speech recognition, image processing, and medical diagnosis. The rapid evolution of this technology has led to robust systems mimicking brain-like information processing. Their impact spans various industries, from healthcare to finance to marketing, addressing complex challenges innovatively. Despite these strides, we’ve only begun to unlock the full potential of neural networks. Diverse in structure, neural networks come in various forms, each uniquely tailored to address specific problems and data types, offering real-world applications across domains.

Neural networks, a subset of deep learning, emulate the human brain’s structure and function. Comprising interconnected neurons, they process and transmit information, learning patterns from extensive datasets. Often considered a black box due to their opaque workings, neural networks take various inputs (images, text, numerical data) and generate outputs, making predictions or classifications. Proficient at pattern recognition, neural networks excel in solving complex problems with large datasets, from stock market predictions to medical image analysis and weather forecasting.

Neural networks learn and improve through training, adjusting weights and biases to minimize prediction errors. In this process, neurons in the input layer represent different aspects of the data, such as age and gender in a movie-preference prediction scenario. Weights signify the importance of each input, with higher weights indicating greater significance. Biases, akin to weights, are added before the activation function, acting as default values that adjust predictions based on overall tendencies in the data. For instance, if the network learns gender-based preferences, it may add a positive bias when predicting movie preferences for women, reflecting patterns learned from the data.

A neural network typically comprises three layers: the input layer for data input, the hidden layer(s) for computation, and the output layer for predictions. The hidden layer(s) perform most calculations, with each neuron connected to those in the preceding layer. During training, weights and biases of these connections are adjusted to enhance network performance. The configuration of hidden layers and neurons varies based on the problem’s complexity and available data. Deep neural networks, featuring multiple hidden layers, excel in complex tasks like image recognition and natural language processing.

Boolean functions

The Perceptron, or artificial neuron, can implement Boolean logic gates such as AND, OR, NAND, and NOR. (XOR and XNOR are not linearly separable, so they require a small multi-layer network of perceptrons rather than a single neuron.)

We can build our neural networks using these functions in each neural node.
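For example, with two binary inputs and a step activation, one common textbook choice of weights and thresholds (an illustrative construction, not taken from the article) realizes AND and OR:

```latex
\text{AND}(x_1,x_2) = \operatorname{step}(x_1 + x_2 - 1.5),
\qquad
\text{OR}(x_1,x_2) = \operatorname{step}(x_1 + x_2 - 0.5),
\qquad
\operatorname{step}(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}
```

Negating the weights and thresholds yields NAND and NOR, while XOR and XNOR require stacking two such layers.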

Types of Neural Networks

In graphical representations of artificial neural networks, neurons are symbolized as circles, acting as “containers” for mathematical functions. Layers, comprised of one or more neurons, are often organized in vertical lines. In simpler feedforward neural networks, input-layer neurons process numeric data representations, apply activation functions, and pass outputs to subsequent layers. Hidden-layer neurons receive inputs from previous layers, perform similar functions, and transmit results to successive layers. Depending on the network’s purpose, output-layer neurons process information and generate outcomes ranging from binary classifications to numeric predictions. Neurons can belong to input (red circles), hidden (blue circles), or output (green circles) layers. In complex systems, an entire computer or cluster may be depicted as a single neuron in these diagrams.

If we look at neural networks, many varieties exist to perform the many functions required for deep learning, but they all use the same basic building block: the Perceptron model.

Feedforward Neural Networks consist of an input layer, one or more hidden layers, and an output layer, facilitating data flow in a forward direction. Commonly used for tasks like image and speech recognition, natural language processing, and predictive modeling, they adjust neuron weights and biases during training to minimize error. The Perceptron, one of the earliest neural networks, is a single-layer network suitable for linearly separable problems. The Multilayer Perceptron (MLP) is a feedforward network with multiple perceptron layers prevalent in image recognition and natural language processing.

Recurrent Neural Networks (RNNs) process sequential data with recurrent neurons, excelling in language translation and text generation tasks. Long Short-Term Memory (LSTM) neural networks, a type of RNN, handle long-term dependencies and are effective in natural language processing. Radial Basis Function (RBF) Neural Networks transform inputs using radial basis functions, commonly applied in pattern recognition and image identification.

Convolutional Neural Networks (CNNs) process grid-like data, such as images, with convolutional, pooling, and fully connected layers. CNNs are widely used for image and video recognition tasks. Autoencoder Neural Networks perform unsupervised learning for data compression and feature extraction, often used in image denoising and anomaly detection.

Sequence to Sequence (Seq2Seq) Models, with an encoder and decoder, are utilized in machine translation, summarization, and conversation systems. Modular Neural Networks (MNNs) allow multiple networks to collaborate on complex problems, enhancing flexibility and robustness in applications like computer vision and speech recognition.

Artificial neural networks function as weighted directed graphs, where neurons serve as nodes and the connections between neuron outputs and inputs are directed edges with weights. The network receives input signals, typically patterns or vectors representing images, denoted mathematically as x(n) for each input n.

In the example below, our brain (a biological neural network) receives input from our eyes of an object (a cat) and processes it to recognize that we saw a cat. The computerized equivalent, in this example a convolutional neural network, is used to examine CAT scans of the brain. Using the convolution and pooling layers and, subsequently, the fully connected neural network, we can determine whether the CAT scan image shows an infarction, hemorrhage, or tumor, significantly reducing the time needed to make a medical diagnosis.

In the following example, we see a neural network with an input layer (an image of George Washington), many hidden layers that each identify portions of the image’s pixels, and finally an output layer declaring that the image is George.

Summary

We now have a basic understanding of AI, ML, deep learning, and the basics of artificial neural networks. Please continue to my second part of this multi-part blog on high-performance networks for AI/ML infrastructures.

Note: some of this content was created or modified using AI tools. Can you tell what is AI-created?