Machine learning (ML) is advancing rapidly, enhancing artificial intelligence capabilities. The challenge for business owners exploring ML lies in understanding how to leverage it effectively. ML improves efficiency, productivity, and speed, yet its technical nature can be overwhelming. ML involves complex algorithms tailored to specific situations.
In the current landscape, different businesses may use similar algorithms for everyday purposes, which raises the possibility of sharing them. Sensitive or proprietary information complicates the process, however. The question arises: can ML algorithms be shared among technologists the way coders share on open-source platforms like GitHub? The answer is yes. Platforms already exist where ML teams share training data sets, and open-source libraries offer generic algorithms from Google (TensorFlow), Azure, and AWS.
However, most ML algorithms are specific to each business, particularly how data is labeled. Companies, even within the same industry, operate uniquely. While collaborative sharing occurs in coding, ML presents challenges due to its business-specific nature. Despite differences, the potential for global collaboration in ML exists, similar to how open-source code collaboration improves efficiency and innovation. Like any technology, ML has pros and cons, with room for improvement through widespread partnership to create more effective and beneficial solutions for the world.
GPU vs CPU
We know that neural networks can be used to create complex mathematical algorithms, and it is up to the data scientists to develop the algorithms and implement the neural networks using mathematical tools such as linear algebra, Fast Fourier Transforms, multivariate calculus, and statistics. These computations are performed at the various neurons in the neural network. Below is a typical flowchart of data input, computations, and saves. In this example, a GPU is inserted into the neural network flow to speed up memory transfer.
If we look at the timing of a neural network processing a single run through various neural nodes, we can see that offloading some of the work to the GPU significantly speeds up the process. Here, we see an approximate speed increase of 8X. Remember, this is a single run-through of a single neural network path from input to hidden layer to output. Multiply this a million, billion, or even trillion times through the neural network, and we can see a massive advantage in offloading neural network functions onto a GPU. The question is, why is a GPU faster than a CPU?
GPUs excel in speed due to their matrix multiplication and convolution efficiency, primarily driven by memory bandwidth rather than parallelism. While CPUs are latency-optimized, GPUs prioritize bandwidth, resembling a big truck compared to a Corvette. The CPU quickly fetches small amounts of memory, whereas the GPU fetches much larger amounts at once, leveraging its superior memory bandwidth, reaching up to 750 GB/s compared to the CPU’s 50 GB/s.
Thread parallelism is crucial for GPUs to overcome latency challenges. Multiple GPUs can function together, forming a fleet of trucks and enabling efficient memory access. This parallelism effectively conceals latency, allowing GPUs to provide high bandwidth for large memory chunks with minimal latency drawbacks. The second step involves transferring memory from RAM to local memory on the chip (L1 cache and registers). While less critical, this step still contributes to the overall GPU advantage.
In computation, GPUs outperform CPUs due to their efficient use of register memory. GPUs can allocate a small pack of registers for each processing unit, resulting in an aggregate GPU register size over 30 times larger than CPUs, operating at twice the speed. This allows GPUs to store substantial data in L1 caches and register files, facilitating the reuse of convolutional and matrix multiplication tiles. The hierarchy of importance for GPU performance lies in high-bandwidth main memory, hiding memory access latency under thread parallelism, and the presence of large, fast, and easily programmable register and L1 memory. These factors collectively make GPUs well-suited for deep learning.
To understand why we use GPUs in deep machine learning and neural networks, we need to see what happens at a super low level of loading numbers and functions from DRAM memory into the CPU so the CPU can perform the algorithm at that neural node.
Looking at a typical CPU and memory, we can see that the memory bandwidth is 200 GB/sec, which works out to 25 billion double-precision (FP64) values per second, while the CPU can perform 2000 GFLOPs of FP64 arithmetic. FLOPS (Floating-point Operations Per Second) is a performance metric indicating the number of floating-point arithmetic calculations a computer processor can execute within one second. Floating-point arithmetic refers to calculations performed on representations of real numbers in computing. For the CPU to actually operate at 2000 GFLOPs, we have to be able to feed it from memory. The memory can supply 25 billion double-precision floating-point values per second, while the CPU can process 2000 billion.
Compute intensity and memory bandwidth
Because we cannot feed the CPU fast enough from memory, for every number loaded from memory, the CPU would need to perform 80 operations on it just to break even. Very few, if any, algorithms perform 80 operations per loaded value, so we are severely limited by memory bandwidth. This ratio is called compute intensity.
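The break-even figure of 80 can be checked with a little arithmetic, using the bandwidth and FLOP rates quoted above:

```python
# Compute intensity: FLOPs the CPU must perform per value loaded
# to keep up with memory, using the figures quoted in the text.
peak_flops = 2000e9        # 2000 GFLOPs of FP64 per second
mem_bandwidth = 200e9      # 200 GB/sec
bytes_per_fp64 = 8

values_per_sec = mem_bandwidth / bytes_per_fp64   # 25 billion FP64 values/sec
compute_intensity = peak_flops / values_per_sec
print(compute_intensity)   # 80.0
```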
Let’s look at a simple function called DAXPY, which is simply z = alpha*x + y. It requires 2 CPU FLOPs (a multiply and an add) and 2 memory loads (x and y). This is such a core operation that CPUs have a dedicated instruction for it, FMA (fused multiply-add). Looking at the diagram, we have 2 loads from memory (DRAM) and a very long wait (memory latency) while the data traverses the circuit board into the CPU before x and y are ready. We then perform the 2 FLOPs of (alpha*x)+y very quickly and output the result. If we have an extensive neural network with massive datasets and complicated algorithms, this memory bottleneck hinders speed.
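A minimal scalar sketch of DAXPY makes the two-loads, two-FLOPs shape explicit (pure Python, one element per loop iteration, standing in for what the hardware fuses into a single FMA):

```python
def daxpy(alpha, x, y):
    """z[i] = alpha * x[i] + y[i]: per element, 2 loads (x[i], y[i]) and
    2 FLOPs (a multiply and an add), fused into one FMA on real CPUs."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

z = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(z)  # [12.0, 24.0, 36.0]
```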
So why is there a memory latency constraint? Physics! First, let’s look at the fundamentals of a computer motherboard. With a clock speed of 3 GHz, and an electrical signal travelling through silicon at roughly 60 million m/s, electricity can travel 20 mm (0.8 inches) in one clock tick. A typical chip is 22×32 mm, and the copper trace from the CPU to DRAM is 50-100 mm long, so it takes 5-10 clock ticks just for the ones and zeros of the data to traverse the CPU and motherboard.
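The 20 mm figure follows directly from the quoted clock speed and signal speed:

```python
# Distance an electrical signal covers in one clock tick,
# using the figures quoted above.
clock_hz = 3e9              # 3 GHz clock
signal_speed = 60e6         # ~60 million m/s through silicon

mm_per_tick = signal_speed / clock_hz * 1000   # metres -> mm
print(mm_per_tick)          # 20.0

# A 100 mm CPU-to-DRAM trace therefore costs several ticks each way.
ticks_for_trace = 100 / mm_per_tick
print(ticks_for_trace)      # 5.0
```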
If we put this together, we can calculate how much data can be moved across the memory bus. Running DAXPY on a typical Intel Xeon 8280 CPU, the memory bandwidth is 131 GB/sec and the memory latency is 89 ns, which means 11,659 bytes can be in flight on the bus during one 89 ns latency window. A single DAXPY iteration moves only 16 bytes per 89 ns of latency, so our memory efficiency is 0.14%! This also means the memory bus is idle 99.86% of the time!
To keep the memory bus busy, we must run 729 iterations of DAXPY at once.
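These efficiency and concurrency figures fall straight out of the quoted bandwidth and latency:

```python
# Xeon 8280 figures quoted above.
bandwidth = 131e9          # bytes/sec
latency = 89e-9            # seconds

bytes_in_flight = bandwidth * latency
print(round(bytes_in_flight))        # 11659 bytes per latency window

bytes_per_daxpy = 16       # two 8-byte FP64 loads (x and y)
efficiency = bytes_per_daxpy / bytes_in_flight
print(f"{efficiency:.2%}")           # 0.14%

iterations_needed = bytes_in_flight / bytes_per_daxpy
print(round(iterations_needed))      # 729 concurrent DAXPYs to fill the bus
```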
Threads to the rescue!
Modifying the instructions and running parallel operations allows us to use multiple threads to keep our memory bus busy. How busy is limited only by the processor’s or GPU’s maximum thread count and the memory requests the instruction set can issue.
One thing stands out when comparing a standard CPU with the NVIDIA A100 GPU: the huge difference in memory efficiency. The CPU has better memory efficiency but significantly fewer threads available; the GPU has much worse memory efficiency but has 5x more threads than it needs. Both are trying to solve the same problem of keeping the memory bus busy. This is the most critical point in comparing CPU and GPU, and if you take anything away, it is that the massive number of threads in a GPU is what allows faster processing.
If we examine bandwidth and latency in an NVIDIA A100 GPU, we see that the on-chip latency and bandwidth are significantly better than accessing off-chip DRAM via the PCIe bus. The L1 cache has 19,400 GB/sec of bandwidth, while the PCIe bus offers 25 GB/sec! The hardware designers intentionally gave the GPU far more threads than execution units so the compute units stay fed and the entire memory system stays busy.
SMs, Warps, and oversubscription
We can have hundreds of streaming multiprocessors (SMs) inside the GPU. Each SM holds 64 warps, and each warp runs 32 threads. This is how a GPU can process hundreds of thousands of threads while a CPU has fewer than 1,000. We can use the concept of parallelism (more on this later) to run tens of thousands of threads to keep the memory bus busy, and we can even oversubscribe the bus to get even more threads utilized. An SM can issue 4 warps concurrently, so every clock tick, 4 warps of 32 threads each can perform FLOPs. This is how the GPU designers fought memory latency: with threads.
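A quick tally shows how the hierarchy multiplies out. The 108-SM count below is the A100’s and is an assumption on my part, since the text only says "hundreds"; the per-SM figures are the ones quoted above:

```python
# GPU thread hierarchy: SMs x warps x threads-per-warp.
# 108 SMs is an assumption (A100-class); the text says "hundreds of SMs".
sms = 108
warps_per_sm = 64
threads_per_warp = 32

total_threads = sms * warps_per_sm * threads_per_warp
print(total_threads)   # 221184 -- hundreds of thousands of resident threads
```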
The GPU’s secret is oversubscription. In a GPU, we can have 432 active warps with 13,824 active threads, and 6,480 waiting warps with 207,360 waiting threads. The GPU can switch from one warp to the next in a single clock cycle, moving on to the next set of 432 warps. The GPU is designed as a throughput machine that thrives on oversubscription, while the CPU is a latency machine for which oversubscription is terrible. We can compare the CPU to a car with one occupant that takes 45 minutes to reach its destination, and the GPU to a train that makes stops and takes on more occupants but moves more slowly. Both reach the destination, but the GPU, while slower per trip, carries far more occupants (threads). These threads are the computations needed in machine learning, such as DAXPY.
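The oversubscription figures are consistent with the warp hierarchy: 432 active warps is 4 concurrently issued warps on each of 108 SMs, with the other 60 resident warps per SM waiting (the 108-SM count is an A100-based assumption, not stated in the text):

```python
# Oversubscription arithmetic for an A100-class GPU (108 SMs assumed).
sms = 108
threads_per_warp = 32

active_warps = sms * 4      # 4 warps issued per SM per clock
waiting_warps = sms * 60    # remaining 60 of 64 resident warps per SM

print(active_warps, active_warps * threads_per_warp)    # 432 13824
print(waiting_warps, waiting_warps * threads_per_warp)  # 6480 207360
```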
In machine learning, these threads access data in various patterns: element-wise in DAXPY, local in convolution, and all-to-all in the Fast Fourier Transform.
Let us understand how parallelism works and how we can use all these threads simultaneously. Say we have trained an AI to recognize a dog. We overlay the image of the dog with a grid, creating many blocks of work. The blocks execute independently, and the GPU is oversubscribed with blocks of work (the GPU is designed to handle oversubscription). Each block has many threads working together, using local GPU memory (L1, L2, and HBM) for data sharing.
Utilizing CUDA’s hierarchical execution model, we use parallelism to break the work into blocks with equal thread counts. The blocks then run independently and in parallel. Threads within a block run in parallel and can synchronize with each other to exchange data, while separate blocks proceed independently. All of this stays on the GPU, taking advantage of thread oversubscription and on-chip memory.
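A toy model of this decomposition, with plain Python standing in for CUDA’s grid of thread blocks (the image and block sizes are illustrative assumptions, not CUDA requirements):

```python
# Toy model of CUDA's hierarchical execution: an "image" is split into
# independent blocks, and each block's "threads" each handle one pixel.
width, height = 64, 64
block_dim = 16                   # 16x16 threads per block (illustrative)

blocks = [(bx, by)
          for by in range(0, height, block_dim)
          for bx in range(0, width, block_dim)]
print(len(blocks))               # 16 independent blocks of work

def run_block(bx, by):
    # One "thread" per pixel; within a block, threads could share data
    # through fast on-chip memory before writing results.
    return [(bx + tx, by + ty)
            for ty in range(block_dim) for tx in range(block_dim)]

pixels = sum(len(run_block(bx, by)) for bx, by in blocks)
print(pixels)                    # 4096 pixels: 16 blocks x 256 threads
```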
Summary
This post shows how the GPU can significantly reduce execution times by utilizing parallelism and thread oversubscription. The GPU can beat the memory bus latency issue by using on-chip memory and threads to run work directly on the GPU. We will next discuss parallel computing and how we can leverage multiple onboard GPUs and, finally, GPUs on other computers via Infiniband or Ethernet networking.