Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs-Part 2-CPU and GPU usage in AI/ML for the Network Engineer

Machine learning (ML) is advancing rapidly, enhancing artificial intelligence capabilities. The challenge for business owners exploring ML lies in understanding how to leverage it effectively. ML improves efficiency, productivity, and speed, yet its technical nature can be overwhelming. ML involves complex algorithms tailored to specific situations.

Various businesses may use similar algorithms in the current landscape for everyday purposes. The potential for algorithm sharing arises, but sensitive or proprietary information complicates this process. The question arises: can ML algorithms be easily shared among technologists, similar to open-source platforms like GitHub for coders? The answer is yes, with existing platforms where ML teams share training data sets and open-source libraries with generic algorithms from Google/TensorFlow, Azure, and AWS.

However, most ML algorithms are specific to each business, particularly how data is labeled. Companies, even within the same industry, operate uniquely. While collaborative sharing occurs in coding, ML presents challenges due to its business-specific nature. Despite differences, the potential for global collaboration in ML exists, similar to how open-source code collaboration improves efficiency and innovation. Like any technology, ML has pros and cons, with room for improvement through widespread partnership to create more effective and beneficial solutions for the world.

GPU vs CPU

We know that neural networks can be used to create complex mathematical algorithms, and it is up to the data scientists to develop the algorithms and implement the neural networks using complex mathematics such as linear algebra, Fast Fourier Transforms, multivariate calculus, and statistics. These algorithms are performed at the various neurons in the neural network. Below is a typical flowchart of data input, computations, and saves. In this example, a GPU is inserted into the neural network flow to speed up memory transfer.

CPU only vs. CPU and GPU

If we look at the timing of a neural network processing a single run through various neural nodes, we can see that offloading some of the work to the GPU significantly speeds up the process. Here, we see an approximate speed increase of 8X. Remember, this is a single run-through of a single neural network path from input to hidden layer to output. Now multiply this a million, billion, or even trillion times through the neural network, and we can see a massive advantage in offloading neural networking functions onto a GPU. The question is, why is a GPU faster than a CPU?


GPUs excel in speed due to their matrix multiplication and convolution efficiency, driven primarily by memory bandwidth rather than raw parallelism. While CPUs are latency-optimized, GPUs are bandwidth-optimized, like a big truck compared to a Corvette: the CPU (the Corvette) quickly fetches small amounts of memory, whereas the GPU (the truck) fetches much larger amounts at once, leveraging memory bandwidth of up to 750GB/s compared to the CPU's 50GB/s.

Thread parallelism is crucial for GPUs to overcome latency challenges. Multiple GPUs can function together, forming a fleet of trucks and enabling efficient memory access. This parallelism effectively conceals latency, allowing GPUs to provide high bandwidth for large memory chunks with minimal latency drawbacks. The second step involves transferring memory from RAM to local memory on the chip (L1 cache and registers). While less critical, this step still contributes to the overall GPU advantage.

In computation, GPUs outperform CPUs due to their efficient use of register memory. GPUs can allocate a small pack of registers for each processing unit, resulting in an aggregate GPU register size over 30 times larger than CPUs, operating at twice the speed. This allows GPUs to store substantial data in L1 caches and register files, facilitating the reuse of convolutional and matrix multiplication tiles. The hierarchy of importance for GPU performance lies in high-bandwidth main memory, hiding memory access latency under thread parallelism, and the presence of large, fast, and easily programmable register and L1 memory. These factors collectively make GPUs well-suited for deep learning.

To understand why we use GPUs in deep machine learning and neural networks, we need to see what happens at a super low level of loading numbers and functions from DRAM memory into the CPU so the CPU can perform the algorithm at that neural node.

Looking at a typical CPU and memory, we can see that the memory bandwidth is 200 GBytes/sec, which works out to about 25 billion FP64 values per second, while the CPU can perform 2000 GFLOPs of FP64 arithmetic. FLOPS (Floating-point Operations Per Second) is a performance metric indicating the number of floating-point arithmetic calculations a computer processor can execute within one second; floating-point arithmetic refers to calculations performed on representations of real numbers in computing. For the CPU to actually operate at 2000 GFLOPs, we have to be able to feed the CPU from memory, but the memory can supply only 25 billion double-precision floating-point values (FP64) per second while the CPU can process 2000 billion.

Difference between single- and double-precision values, or FP-32 vs. FP-64

Compute intensity and memory bandwidth

Because we cannot feed the CPU fast enough from memory, the CPU would need to perform 80 operations on every value loaded from memory just to break even. Very few, if any, algorithms need 80 operations per loaded value, so we are severely limited by memory bandwidth. This ratio is called compute intensity.
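
As a quick worked check of that number, using the figures above:

$$\frac{200\ \text{GB/s}}{8\ \text{bytes per FP64 value}} = 25\ \text{billion values/s}, \qquad \text{compute intensity} = \frac{2000\ \text{GFLOPs}}{25\ \text{Gvalues/s}} = 80\ \text{FLOPs per value loaded}$$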

Let's look at a simple function called DAXPY, which is simply z = alpha*x + y. Each iteration requires 2 CPU FLOPs (a multiply and an add) and 2 memory loads (x and y). This is such a common operation that processors provide a single instruction for it: FMA (fused multiply-add). Looking at the diagram, we have 2 loads from memory (DRAM) and a very long time (memory latency) to traverse the circuit board and into the CPU before x and y are ready. We then perform the 2 FLOPs of (alpha*x) + y very quickly and output the result. If we have an extensive neural network with massive datasets and complicated algorithms, this memory bottleneck hinders speed.
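
To make that imbalance concrete, here is a minimal sketch of the DAXPY loop in plain C++ (valid CUDA host code); the function name and array handling are illustrative, not from the original post:

```cuda
#include <cstddef>

// DAXPY on the CPU: z[i] = alpha * x[i] + y[i]
// Each iteration loads two 8-byte doubles (x[i] and y[i]), i.e. 16 bytes, from memory,
// but performs only two FLOPs (one multiply + one add), which the compiler typically
// fuses into a single FMA instruction. That is 2 FLOPs per 2 values loaded, far below
// the ~80 FLOPs per value needed to keep the CPU fully busy.
void daxpy_cpu(std::size_t n, double alpha,
               const double* x, const double* y, double* z) {
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = alpha * x[i] + y[i];  // compiles to a fused multiply-add (FMA)
    }
}
```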

So why is there a memory bandwidth and latency constraint? Physics! First, let's look at the fundamentals of a computer motherboard. With a computer clock speed of 3 GHz, and the speed of an electrical signal in silicon at roughly 60 million m/s, electricity can travel about 20 mm (0.8 inches) in one clock tick. If we look at a typical chip, which is 22 x 32 mm, and the length of the copper trace from the CPU to DRAM, which is 50-100 mm, we can see it will take 5-10 clock ticks just for the ones and zeros of the data to traverse the CPU and motherboard.
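
Working those numbers out (treating the 5-10 tick figure as a round trip, request out and data back, is my assumption):

$$\frac{60\times 10^{6}\ \text{m/s}}{3\times 10^{9}\ \text{ticks/s}} = 0.02\ \text{m} = 20\ \text{mm per tick}, \qquad \frac{50\text{-}100\ \text{mm}}{20\ \text{mm/tick}} \approx 2.5\text{-}5\ \text{ticks each way} \approx 5\text{-}10\ \text{ticks round trip}$$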

If we put this together, we can calculate how much data can be moved across the memory bus. When performing DAXPY on a typical Intel Xeon 8280 CPU, the memory bandwidth is 131 GB/sec, and the memory bus latency is 89 ns. This means we can move 11,659 bytes in 89 ns, yet a single DAXPY iteration moves only 16 bytes per 89 ns of latency, which means our memory efficiency is 0.14%! This also means the memory bus is idle 99.86% of the time!

To keep the memory bus busy, we must run 729 iterations of DAXPY at once.
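
The arithmetic behind those figures, as a quick check:

$$131\ \text{GB/s} \times 89\ \text{ns} \approx 11{,}659\ \text{bytes}, \qquad \frac{16\ \text{bytes}}{11{,}659\ \text{bytes}} \approx 0.14\%, \qquad \frac{11{,}659\ \text{bytes}}{16\ \text{bytes per iteration}} \approx 729\ \text{concurrent iterations}$$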

Threads to the rescue!

Modifying the instructions to run parallel operations allows us to use multiple threads to keep our memory bus busy. How busy is limited only by the maximum number of threads on the processor or GPU and by the memory requests the instruction stream can issue.

One thing stands out when comparing a standard CPU with the NVIDIA A100 GPU: the huge difference in memory efficiency and thread count. The CPU has better memory efficiency but significantly fewer threads available; the GPU has much worse memory efficiency but roughly 5x more threads than it needs to keep memory busy. Both are trying to solve the same problem of keeping the memory bus busy, and this is the most critical point in comparing CPU and GPU. If you take anything away, it is that the massive number of threads in a GPU is what allows faster processing.

If we examine bandwidth and latency in an NVIDIA A100 GPU, we see that the on-chip latency and bandwidth are significantly better than accessing off-chip DRAM via the PCIe bus. The L1 cache has 19,400 GB/sec of bandwidth, while PCIe offers 25 GB/sec! The hardware designers intentionally gave the GPU far more threads so the compute units on the GPU can stay fed and the entire memory system stays busy.

SMs, Warps, and oversubscription

We can have hundreds of streaming multiprocessors (SMs) inside the GPU. Each SM holds 64 warps, and each warp runs 32 threads. This is how a GPU can process hundreds of thousands of threads while a CPU has fewer than 1,000. We can use the concept of parallelism (more on this later) to run tens of thousands of threads to keep the memory bus busy, and we can even oversubscribe the bus to get even more threads utilized. Each SM can issue from 4 warps concurrently, so on every clock tick, 4 warps of 32 threads each can perform FLOPs. This is how the GPU designers fought memory latency: with threads.
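
Here is a minimal sketch of how that thread count is exposed to the programmer in CUDA; the array size and the 256-thread block size (8 warps per block) are illustrative choices, not figures from the post:

```cuda
// DAXPY as a CUDA kernel: one thread per array element.
// A launch such as daxpy_gpu<<<4096, 256>>> creates roughly a million threads;
// the hardware groups them into 32-thread warps, spreads them across the SMs,
// and switches warps every cycle so one warp's memory latency is hidden behind
// another warp's work.
__global__ void daxpy_gpu(size_t n, double alpha,
                          const double* x, const double* y, double* z) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        z[i] = alpha * x[i] + y[i];
    }
}

// Host-side launch (illustrative sizes), assuming d_x, d_y, d_z already live in GPU memory:
//   size_t n = 1 << 20;                                   // ~1M elements
//   int threadsPerBlock = 256;                            // 8 warps per block
//   int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);
//   daxpy_gpu<<<blocks, threadsPerBlock>>>(n, alpha, d_x, d_y, d_z);
```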

The GPU's secret is oversubscription. In a GPU, we can have 432 active warps (13,824 active threads) and 6,480 waiting warps (207,360 waiting threads). The GPU can switch from one warp to the next in a single clock cycle and keep working through the active warps. The GPU is designed as a throughput machine that thrives on oversubscription, while the CPU is a latency machine where oversubscription is terrible. We can compare the CPU to a car with one occupant that takes 45 minutes to reach its destination, and the GPU to a train that makes stops and takes on more occupants but moves more slowly. In the end, both reach the destination, but the GPU, while slower per trip, carries significantly more occupants (threads). These threads are the computations needed in machine learning, such as DAXPY.

In machine learning, these computations operate in various patterns: element-wise (as in DAXPY), local (as in convolution), and all-to-all (as in the Fast Fourier Transform).

Let us understand how parallelism works and how we can use all these threads simultaneously. Let's say we have trained an AI to recognize a dog. We overlay the dog image with a grid, creating many blocks of work. The blocks are executed independently, and the GPU is oversubscribed with blocks of work (the GPU is designed to handle oversubscription). Within each block, many threads work together, using local GPU memory (L1, L2, and HBM) for data sharing.

Utilizing CUDA's hierarchical execution, we can see how parallelism breaks the work into blocks with equal numbers of threads. The blocks then work independently and in parallel, and the threads within a block run in parallel and may or may not synchronize to exchange data with the other threads in their block. All of this stays on the GPU, taking advantage of thread oversubscription and on-chip memory.
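
A minimal sketch of that hierarchy in CUDA (the 256-thread blocks and the per-block sum are my own illustration, not the dog-recognition workload described above): threads within a block cooperate through on-chip shared memory and __syncthreads(), while separate blocks run independently and can heavily oversubscribe the GPU.

```cuda
// Each block computes a partial sum over its own tile of the input.
// Threads inside a block share data through on-chip shared memory and
// synchronize with __syncthreads(); separate blocks never synchronize with
// each other, so the GPU is free to schedule far more blocks than it has SMs.
__global__ void block_sum(const float* in, float* block_results, int n) {
    __shared__ float tile[256];              // per-block on-chip memory (assumes blockDim.x == 256)
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;      // each thread loads one element
    __syncthreads();                         // wait until the whole tile is loaded

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_results[blockIdx.x] = tile[0];  // one partial result per block
}
```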

Summary

This post shows how the GPU can significantly reduce execution times by utilizing parallelism and thread oversubscription. The GPU can beat the memory bus latency issue by using on-chip memory and threads to run work directly on the GPU. We will next discuss parallel computing and how we can leverage multiple onboard GPUs and, finally, GPUs on other computers via InfiniBand or Ethernet networking.

Optimizing AI/ML Performance: Cutting-Edge GPU Network Designs- Part 1-AI/ML and Neural Networks for the Network Engineer

Hopefully, everyone has been paying attention to the news and is somewhat familiar with artificial intelligence (AI). Today, interest in and applications of AI are at an all-time high, with breakthroughs happening daily. ChatGPT has celebrated a birthday and turned a year old, and the amount of AI-related content has grown exponentially. For those readers who have not been keeping up with AI, breakthroughs in AI and the consumption of AI-enhanced technologies by everyone on the planet are about to accelerate at a rate that will require companies and organizations to bring computing, storage, and networking for AI systems back in-house. Today, most companies and organizations are starting to use cloud-based AI using Machine Learning (ML) on what are known as Large Language Models (LLMs).

This series of 6 posts will guide the network engineer through understanding the basics of AI/ML, covering topics such as artificial neurons, neural networks, GPU and memory latency, parallel and distributed computing, and network design best practices.

Server and infrastructure costs will increase exponentially.

Everyone needs to take note of the following as it will affect data center design and the cost to the business. Tirias Research projects that by 2028, the combined infrastructure and operational expenses for generative AI data center servers alone will surpass $76 billion, posing challenges to the business models and profitability of emerging services like search, content creation, and business automation incorporating Generative AI. This forecast is based on an aggressive 4X improvement in hardware compute performance and anticipates a 50X increase in processing workloads, resulting in a substantial rise in data center power consumption to nearly 4,250 megawatts. This cost estimate, which exceeds twice the estimated annual operating cost of Amazon’s AWS, encompasses labor, power, cooling, ancillary hardware, and 3-year amortized server costs, excluding the data center building structure. The model utilized for this forecast is grounded in server benchmarks using 10 Nvidia GPU accelerators with specific power consumption details. Everyone from the C-Level down needs to plan for this infrastructure cost.

With the exponential growth in GenAI demand, relying on breakthroughs in processing or chip design appears to be challenging, given the slowdown of Moore’s Law. Meeting the rising consumer expectations for improved GenAI output is not without trade-offs, as efficiency and performance gains may be counteracted. As consumer usage escalates, inevitably, costs will also rise.

What are AI, ML, and DL?

Artificial Intelligence (AI) integrates human-like intelligence into machines through algorithms. “Artificial” denotes human-made or non-natural entities, while “Intelligence” relates to the ability to understand or think. AI, defined as the study of training machines (computers) to emulate human brain functions, focuses on learning, reasoning, and self-correction for maximum efficiency.

Machine Learning (ML), a subset of AI, enables computers to autonomously learn from experiences without explicit programming. ML develops programs that access data, autonomously making observations to identify patterns and enhance decision-making based on provided examples, aiming for systems to learn through experience without human intervention.

Deep Learning (DL), a sub-part of Machine Learning, leverages Neural Networks to mimic brain-like behavior. DL algorithms process information patterns akin to human cognition, identifying and classifying data on larger datasets. DL machines autonomously handle the prediction mechanism, operating on a broader scale than ML.

How imminent is the development of artificial intelligence that surpasses the human mind? The succinct response is not in the immediate future, yet the pace of progress has notably accelerated since the inception of the modern AI field in the 1950s. During the 1950s and 1960s, AI experienced significant advancements as computer scientists, mathematicians, and experts from diverse fields enhanced algorithms and hardware. Despite early optimism about the imminent realization of a thinking machine comparable to the human brain, this ambitious goal proved elusive, leading to a temporary decline in support for AI research. The field witnessed fluctuations until it experienced a resurgence around 2012, primarily fueled by the revolutionary advancements in deep learning and neural networks.

AI vs. Machine Learning vs. Deep Learning Examples:

Artificial Intelligence (AI): AI involves creating computer systems capable of human-like intelligence, tackling tasks that typically require human cognitive abilities. Examples include:

  1. Speech Recognition: Converts spoken language into text using deep learning algorithms, powering virtual assistants, dictation, and voice-controlled systems.
  2. Personalized Recommendations: Platforms like Amazon and Netflix use AI to analyze user history for customized product and content recommendations.
  3. Predictive Maintenance: AI-powered systems predict equipment failures by analyzing sensor data, reducing downtime and maintenance costs.
  4. Medical Diagnosis: AI analyzes medical images to aid doctors in making accurate diagnoses and treatment plans.
  5. Autonomous Vehicles: Self-driving cars use AI algorithms to make navigation, obstacle avoidance, and route planning decisions.
  6. Virtual Personal Assistants (VPA): Siri and Alexa use natural language processing to understand and respond to user requests.

Machine Learning (ML): ML, a subset of AI, involves algorithms and statistical models that enable computers to learn and improve performance without explicit programming. Examples include:

  1. Image Recognition: ML algorithms classify images, applied in self-driving cars, security systems, and medical imaging.
  2. Speech Recognition: ML algorithms transcribe speech in virtual assistants and call centers.
  3. Natural Language Processing (NLP): ML algorithms understand and generate human language, applied in chatbots and virtual assistants.
  4. Recommendation Systems: ML algorithms analyze user data for personalized e-commerce and streaming services.
  5. Sentiment Analysis: ML algorithms classify text or speech sentiment as positive, negative, or neutral.
  6. Predictive Maintenance: ML algorithms analyze sensor data to predict equipment failures and reduce downtime.

Deep Learning: Deep Learning, a type of ML, employs artificial neural networks with multiple layers for learning and decision-making. Examples include:

  1. Image and Video Recognition: Deep learning algorithms classify and analyze visual data in self-driving cars, security systems, and medical imaging.
  2. Generative Models: Deep learning algorithms create new content based on existing data, applied in image and video generation.
  3. Autonomous Vehicles: Deep learning algorithms analyze sensor data for decision-making in self-driving cars.
  4. Image Classification: Deep learning recognizes objects and scenes in images.
  5. Speech Recognition: Deep learning transcribes spoken words into text for voice-controlled interfaces and dictation software.
  6. Recommender Systems: Deep learning algorithms make personalized recommendations based on user behavior and preferences.
  7. Fraud Detection: Deep learning algorithms detect fraud patterns in financial transactions.
  8. Game-playing AI: Deep learning algorithms power game-playing AI that competes at superhuman levels.
  9. Time Series Forecasting: Deep learning forecasts future values in time series data, such as stock prices and weather patterns.

Machine Learning (ML) and neural networks

If we look at what AI consists of, we see that computer programs can learn and reason like humans. Part of AI is Machine Learning, or ML. ML is a set of mathematical algorithms that can learn WITHOUT being explicitly programmed by humans. Finally, inside machine learning we have deep learning, which uses artificial neural networks to adapt and learn from vast datasets. This learning on enormous datasets is called training.

The concept of artificial neural networks naturally originated as a representation of the functioning of neurons in the brain, referred to as ‘connectionism.’ This approach utilized interconnected circuits to emulate intelligent behavior, illustrated in 1943 through a primary electrical circuit by neurophysiologist Warren McCulloch and mathematician Walter Pitts. Donald Hebb expanded upon this concept in his 1949 book, “The Organization of Behaviour,” suggesting that neural pathways strengthen with each repeated use, particularly between neurons that tend to activate simultaneously. This marked the initial steps in the extensive endeavor to quantify the intricate processes of the brain.

A neural network emulates densely interconnected brain cells in a computer, learning autonomously without explicit programming. Unlike a human brain, neural networks are software simulations on ordinary computers, utilizing traditional transistors and logic gates to mimic billions of interconnected cells. This simulation contrasts with the structure of a human brain, as no attempt has been made to build a computer mirroring the brain’s densely parallel arrangement. The analogy between a neural network and a human brain resembles the difference between a computer weather model and actual weather elements. Neural networks are essentially collections of algebraic variables and equations within a computer, devoid of meaning to the computer itself but significant to programmers.

Artificial neurons (Perceptron)

The brain has approximately 10 billion interconnected neurons, each linked to around 10,000 others. Neurons receive electrochemical inputs at dendrites, and if the cumulative signal surpasses a certain threshold, the neuron fires an electrochemical signal along the axon, transmitting it to connected neurons. The brain’s complexity arises from these interconnected electrochemical transmitting neurons, where each neuron functions as a simple processing unit, performing a weighted sum of inputs and firing a binary signal if the total information surpasses a threshold. Inspired by this model, artificial neural networks have shown proficiency in tasks challenging for traditional computers, like image recognition and predictions based on past knowledge. However, they have yet to approach the intricacy of the brain.

The Perceptron serves as a mathematical emulation of a biological neuron. Where the physical counterpart's dendrites receive electrical signals, the Perceptron represents these signals numerically. The modulation of electrical signals at synapses, mimicked in the Perceptron, involves multiplying each input value by a weight. As in the biological process, a Perceptron generates an output signal only when the total strength of the input signals surpasses a set threshold. This is simulated by calculating the weighted sum of inputs and applying a step function to determine the output. The output is then transmitted to other perceptrons, akin to the interconnected nature of biological neural networks.
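
A minimal host-side sketch of that Perceptron computation (the function name, weights, and threshold handling are illustrative; this is plain C++ that also compiles as CUDA host code):

```cuda
#include <cstddef>
#include <vector>

// Perceptron: weighted sum of inputs followed by a step (threshold) function.
// Each input is modulated by a weight (like a synapse), and the neuron "fires"
// (outputs 1) only when the combined signal crosses the threshold.
int perceptron(const std::vector<double>& inputs,
               const std::vector<double>& weights,
               double threshold) {
    double sum = 0.0;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        sum += inputs[i] * weights[i];   // weight each incoming signal
    }
    return (sum >= threshold) ? 1 : 0;   // step activation: fire or stay silent
}
```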

A biological neuron
An artificial neuron (Perceptron)
Biological neurons compared to Artificial neurons

In artificial neural networks, a neuron functions as a mathematical entity mirroring the behavior of biological neurons. Typically, it calculates the weighted average of its inputs, passing this sum through a nonlinear activation function like the sigmoid. The neuron’s output then serves as input for neurons in another layer, perpetuating a similar computation (weighted sum and activation function). This computation entails multiplying a vector of input/activation states by a weight matrix and processing the resulting vector through the activation function.
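
Written out, the single-neuron computation described above is (using the sigmoid as the example activation):

$$a = \sigma\!\left(\sum_i w_i x_i + b\right), \qquad \sigma(t) = \frac{1}{1+e^{-t}}, \qquad \text{and for a whole layer:}\quad \mathbf{a} = \sigma(W\mathbf{x} + \mathbf{b})$$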

Mathematical representation of a Neural Network Neuron.


AlexNet, a convolutional neural network (CNN) architecture, was crafted by Alex Krizhevsky in partnership with Ilya Sutskever and Geoffrey Hinton, Krizhevsky’s Ph.D. advisor at the University of Toronto. In 2012, Krizhevsky and Sutskever harnessed the capabilities of two GeForce NVIDIA GPU cards to create an influential visual recognition network, transforming the landscape of neural network research. Before this, neural networks were predominantly trained on CPUs, and the shift to GPUs marked a pivotal moment, paving the path for the advancement of sophisticated AI models.

Nvidia is credited with creating the first “true” GPU in 1999, marking a significant leap from earlier attempts in the 1980s to enhance output and video imagery. Nvidia’s research led to a thousandfold increase in computational speeds over a decade, with a notable impact in 2009 when it supported the “big bang of deep learning.” GPUs, crucial in machine learning, boast about 200 times more processors than CPUs, enabling parallel processing and significantly accelerating tasks like neural network training. Initially designed for gaming, GPUs have become instrumental in deep learning applications due to their similar processing capabilities and large on-board RAM.

Neural network overview

Neural networks are a transformative force in machine learning, driving advancements in speech recognition, image processing, and medical diagnosis. The rapid evolution of this technology has led to robust systems mimicking brain-like information processing. Their impact spans various industries, from healthcare to finance to marketing, addressing complex challenges innovatively. Despite these strides, we’ve only begun to unlock the full potential of neural networks. Diverse in structure, neural networks come in various forms, each uniquely tailored to address specific problems and data types, offering real-world applications across domains.

Neural networks, a subset of deep learning, emulate the human brain’s structure and function. Comprising interconnected neurons, they process and transmit information, learning patterns from extensive datasets. Often considered a black box due to their opaque workings, neural networks take various inputs (images, text, numerical data) and generate outputs, making predictions or classifications. Proficient at pattern recognition, neural networks excel in solving complex problems with large datasets, from stock market predictions to medical image analysis and weather forecasting.

Neural networks learn and improve through training, adjusting weights and biases to minimize prediction errors. In this process, neurons in the input layer represent different data aspects, such as age and gender, in a movie preference prediction scenario. Weights signify the importance of each input, with higher weights indicating greater significance. Biases, akin to weights, are added before the activation function, acting as default values that adjust predictions based on overall data tendencies. For instance, if the Network recognizes gender-based preferences, it may add a positive bias for predicting movie preferences in women, reflecting learned patterns from the data.

A neural network typically comprises three layers: the input layer for data input, the hidden layer(s) for computation, and the output layer for predictions. The hidden layer(s) perform most calculations, with each neuron connected to those in the preceding layer. During training, weights and biases of these connections are adjusted to enhance network performance. The configuration of hidden layers and neurons varies based on the problem’s complexity and available data. Deep neural networks, featuring multiple hidden layers, excel in complex tasks like image recognition and natural language processing.

Boolean functions

The Perceptron, or neuron, can implement Boolean logic functions such as AND, OR, NAND, and NOR using weighted inputs and a threshold; XOR and XNOR are not linearly separable and require combining more than one perceptron.

We can build our neural networks using these functions in each neural node.
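
For example, a single step-function neuron with hand-picked weights can act as an AND or OR gate (the weights and thresholds below are illustrative), and chaining neurons gives the non-linearly-separable gates such as XOR:

```cuda
#include <cstddef>
#include <vector>

// Step-function neuron: fires 1 if the weighted sum of inputs crosses the threshold.
static int fire(const std::vector<double>& in, const std::vector<double>& w, double threshold) {
    double sum = 0.0;
    for (std::size_t i = 0; i < in.size(); ++i) sum += in[i] * w[i];
    return (sum >= threshold) ? 1 : 0;
}

// Boolean gates from single neurons with hand-picked (illustrative) weights.
int gate_and(int a, int b)  { return fire({double(a), double(b)}, {1.0, 1.0}, 1.5); }
int gate_or(int a, int b)   { return fire({double(a), double(b)}, {1.0, 1.0}, 0.5); }
int gate_nand(int a, int b) { return 1 - gate_and(a, b); }

// XOR is not linearly separable, so it needs a second layer of neurons:
// XOR(a, b) = AND(OR(a, b), NAND(a, b))
int gate_xor(int a, int b)  { return gate_and(gate_or(a, b), gate_nand(a, b)); }
```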

Types of Neural Networks

In graphical representations of artificial neural networks, neurons are symbolized as circles, acting as “containers” for mathematical functions. Layers, comprised of one or more neurons, are often organized in vertical lines. In simpler feedforward neural networks, input layer neurons process numeric data representations, employing activation functions and passing outputs to subsequent layers. Hidden layer neurons, receiving inputs from previous layers, perform similar functions and transmit results to successive layers. Depending on the Network’s purpose, output layer neurons process information and generate outcomes ranging from binary classifications to numeric predictions. Neurons can belong to input (red circles), hidden (blue circles), or output (green circles) layers. Each computer or cluster may be depicted as a single neuron in graphical representations in complex systems.

If we look at neural networks, many varieties can perform many functions required for deep learning. They all use the same basic building blocks based on the Perceptron model.

Feedforward Neural Networks consist of an input layer, one or more hidden layers, and an output layer, facilitating data flow in a forward direction. Commonly used for tasks like image and speech recognition, natural language processing, and predictive modeling, they adjust neuron weights and biases during training to minimize error. The Perceptron, one of the earliest neural networks, is a single-layer network suitable for linearly separable problems. The Multilayer Perceptron (MLP) is a feedforward network with multiple perceptron layers prevalent in image recognition and natural language processing.

Recurrent Neural Networks (RNNs) process sequential data with recurrent neurons, excelling in language translation and text generation tasks. Long Short-Term Memory (LSTM) neural networks, a type of RNN, handle long-term dependencies and are effective in natural language processing. Radial Basis Function (RBF) Neural Networks transform inputs using radial basis functions, commonly applied in pattern recognition and image identification.

Convolutional Neural Networks (CNNs) process grid-like data, such as images, with convolutional, pooling, and fully connected layers. CNNs are widely used for image and video recognition tasks. Autoencoder Neural Networks perform unsupervised learning for data compression and feature extraction, often used in image denoising and anomaly detection.

Sequence to Sequence (Seq2Seq) Models, with an encoder and decoder, are utilized in machine translation, summarization, and conversation systems. Modular Neural Networks (MNNs) allow multiple networks to collaborate on complex problems, enhancing flexibility and robustness in applications like computer vision and speech recognition.

Artificial neural networks function as weighted directed graphs, where neurons serve as nodes, and connections between neuron outputs and inputs are depicted as directed edges with weights. The Network receives input signals, typically patterns or vectors representing images, and mathematically assigns them using notations like x(n) for each input (n).

In the example below, our brain (a biological neural network) gets input from our eyes of an object (a cat), and our neural network processes it to understand that we saw a cat. In the computerized example, we use a convolutional neural network to look at CAT scans of the brain. Using the convolution and pooling layers and, subsequently, the neural network, we can determine whether the CAT scan image shows an infarction, hemorrhage, or tumor, significantly reducing the time to make a medical diagnosis.

In the following example, we can see our neural network with an input layer (George Washington), many hidden layers identifying portions of the image pixels, and finally, the output layer saying that the image is George.

Summary

We now have a basic understanding of AI, ML, deep learning, and the basics of artificial neural networks. Please continue to my second part of this multi-part blog on high-performance networks for AI/ML infrastructures.

Note some of this content was created or modified using AI tools. Can you tell what is AI created?

Cisco Prioritizes Greater Data Center Sustainability

The escalating demand for computing power in today’s digital age has led to an unprecedented proliferation of data centers, but this surge comes at a substantial cost. Data-center servers and switches, notorious for their energy appetite and heat emissions, collectively account for up to 2 percent of the total electricity consumption in the United States. In financial terms, energy consumption can represent a staggering 30 percent of data center costs. This alarming environmental and economic impact has ignited a compelling need for transformative solutions in the realm of data center sustainability.

Recognizing the urgency of addressing this challenge, organizations are increasingly focusing on optimizing energy usage within their data centers. The imperative is not only to mitigate the environmental footprint but also to realize substantial cost savings. At the forefront of this movement toward greater sustainability is the Cisco Nexus Dashboard, positioned as a pivotal tool in the journey towards more energy-efficient data centers.

The primary objective of the Cisco Nexus Dashboard is to empower organizations to usher in a new era of energy efficiency within their data centers. Through advanced analytics and forecasting capabilities, the platform enables customers to gain precise insights into energy consumption, emissions, and associated costs. This proactive approach allows data center operators to make informed decisions that align with their sustainability goals and financial objectives.

The significance of this endeavor extends beyond mere energy conservation. The Cisco Nexus Dashboard serves as a strategic enabler, facilitating organizations in meeting the ever-expanding operational requirements of their data center applications. This harmonious balance between sustainability and operational efficiency is crucial for organizations striving to remain at the forefront of technological innovation while upholding a commitment to environmental responsibility.

In essence, the Cisco Nexus Dashboard is not merely a management tool but a transformative force driving the evolution of data centers toward a future of heightened sustainability. By providing a comprehensive view of energy consumption metrics, emissions forecasts, and cost implications, the platform empowers organizations to forge a path towards data center operations that are not only economically viable but also environmentally conscientious. Through this holistic approach, the Cisco Nexus Dashboard becomes a cornerstone in the construction of a more sustainable and resilient digital infrastructure for forward-looking organizations.

Cisco/WWT NEXUS Dashboard Webinar on Cisco TV

Cisco TV has released the joint Cisco/World Wide Technology webinar recording on NEXUS Dashboard that was presented live on Sept 21st, 2022. Please join Jim Mathers | Sr. Product Manager, Cisco Cloud Networking Team, Matthias Wessendorf | Principal Engineer, Cisco Cloud Networking Team, and me, Mike Witte | Principal Solutions Architect, WWT, as we discuss NEXUS Dashboard real-world use cases and walk through a live demo.

In this webinar, you will learn how Cisco NEXUS Dashboard and NEXUS Insights work, see real-world use cases, and watch a live demo of how NEXUS Dashboard and Insights can drastically reduce your Mean Time to Innocence (MTTI) from a network standpoint so that troubleshooting efforts can be focused where they belong during an outage.

Ask the customer, Cisco Nexus Dashboard

Cisco Insider Series for Cloud 

Join us for a special webinar featuring a real-world story for Cisco Nexus Dashboard!

World Wide Technology (WWT) is an IT service management company. It employs approximately 7,000 people worldwide and its key services include cloud, consulting, digital consulting, security transformation, and supply chain management. As a leading IoT company, WWT provides the services needed to support IT innovation and “make a new world happen”.

During this session, learn about:

  • The operational challenges World Wide Technology faced and how they adopted a single network platform for all data centers.
  • Reduced time spent on network operations to enable innovation, and accelerated fabric deployment from days to minutes.

Don't miss out on this opportunity to learn more from our experts on how to simplify your hybrid cloud network operations with our intuitive and centralized management console, Cisco Nexus Dashboard, which can rapidly assure, change, scale, and troubleshoot. And most importantly, join us to learn about our customer's journey.

Jim will talk about what the Cisco NEXUS Dashboard platform brings to customers and what applications run on it to share the correlated database and reduce MTTI, as well as futures from the BU.

I will go in depth and explain how Cisco NEXUS Dashboard and Insights operate under the covers.

I will also show a real-life customer scenario and how it changed their troubleshooting methodology, reducing Mean Time To Repair (MTTR) and avoiding the need to pull in multiple groups such as the VMWare team, the applications team, and the security team to find the resolution. This customer went from involving a dozen people and 8 hours to troubleshoot an issue to a quick 10-minute resolution, using the power of a correlated DB to become proactive instead of reactive.

We will discuss WWT's offerings for customers that want to kick the tires on Cisco NEXUS Dashboard and Insights. WWT has created a detailed 2-part whitepaper on installing and operating the product. Part 1 of the whitepaper details standing up the NEXUS Dashboard cluster on either physical appliances or, for a POC, on VMWare or OpenStack VMs; by going the VM route, there is no upfront cost to kick the Cisco NEXUS Dashboard tires. Part 2 of the whitepaper details installing and onboarding Insights, with 20 real-world use cases to determine whether the solution gives the expected visibility before buying it via a try-and-buy. WWT also offers SKU-based services to help the customer with the try-and-buy, as well as professional services to install NEXUS Dashboard in production after the trial.

We then transition to Matthias for a real-world live demo of various use cases in production. We see very simple troubleshooting steps for determining connectivity, starting with a natural language search where we can simply ask whether Endpoint A can talk to Endpoint B. This shows the connectivity from the endpoints, how they can communicate, and any contract involved.

We can then drill in and see the connectivity, what path the traffic takes through which switches, and any packet drops or contract drops.

We will also see pre-change analysis, which can be run on the JSON used during a change control window to show how it will affect the fabric before the actual change is made.

If you need articles, white papers, or labs on NEXUS Dashboard and Insights, as well as ACI, DCNM, or other data center technologies, simply go to the World Wide Technology website, create an account, and gain access to our labs and articles.

Also, feel free to contact me via the methods below: Michael.Witte@wwt.com

LinkedIn

Twitter

IT Blog

The future of hybrid work


As the Pandemic moves on to its 3rd year and most of the workforce has settled in working remotely, I think it’s a good time to look back and look forward to what’s next.

When businesses and large enterprises were forced to close their corporate and branch doors to their workers to prevent the spread of Covid-19, everyone was completely taken off guard. Every corporation has a disaster recovery protocol, with run books and mitigation processes for data center resiliency, application resiliency, and data resiliency from things like ransomware. The one thing no one could have anticipated was personnel resiliency. Most corporations had VPN solutions for temporary use during snowstorms, severe weather, or fires, but no one could have possibly thought that we would be going into year 3 with no end in sight as the virus mutates.

At some point, we will go back to work in our branches and corporate headquarters, but many workers may choose a hybrid work arrangement. Hybrid work could mean fully remote, or it could mean a few days in the office. The ability of our workers to be resilient and productive from remote locations surprised most management. Remote work, in general, was frowned upon by most large corporations; typically, you had to be in your seat at 8:00 AM and could leave at 5:00 PM. For most, that meant the day started at 6:30 AM with an hour-plus commute, then finding parking to get to your desk by 8 AM, then arriving home after a giant everyday traffic jam at 6:30 PM. It was the norm for an 8-hour workday to become 12 hours once commuting was factored in, and that was simply the status quo.


When Covid hit, and we were all sent home with our laptops or workstations, IT scrambled to buy bigger VPN concentrators and more licensing for the VPN gateways, upgrade the hub routers and links, and build VDI solutions; it was a very crazy 2020 building this all out. Unfortunately, this was built as a temporary solution; most thought it would last 3-6 months, but it keeps dragging on. And guess what? Business went on as usual; people weren't sneaking around going to the movies or shopping; in fact, they were more productive. Companies were closing buildings and selling them off, and the workforce was now mainly remote and functioning perfectly. Many of us worked in PJs, we saw our spouses and children more, and frankly, most got a better work-life balance.


The issue now is that we need to re-think where hybrid work goes in the next 5-10-15 years. As boomers start to retire and younger people who are used to remote learning start working remotely, they won’t want to go back and sit in a branch doing a 9-5. They can find another more flexible job, which is a good part of the “Great Resignation” we are seeing. Companies that can be more flexible have a higher retention rate and honestly happier employees. Those who don’t force people back may find themselves looking for new hires.


Just as companies must re-think hybrid work, network architects must create the correct architectures to securely support workers wherever they may be connected to their applications, whether they are in the public or private cloud or sitting in a CoLo. The major challenge for these solutions is to provide a secure method to protect our remote workers and offer End-to-end visibility from user to application.


Many companies provide excellent SASE models; there are too many to cover in this quick post. They offer good segmentation and good security, but most lack good visibility from end users to their app and the underlying infrastructure, wherever it may lie. And therein lies the issue: VISIBILITY FROM USER TO APP. We can pull telemetry from each domain; the trick is how we stitch it all together. That is what Full Stack Observability (FSO) is all about: a single pane of glass showing end-to-end telemetry from user to application, wherever they may reside. And oh, we also need business analytics, because that end user may be a customer buying something off our website. So the user could be a remote worker or a customer, and the applications could be what you provide for employees, customers, or possibly both (the WWT Platform is a great example where both employees and customers consume applications such as labs and content).

So now let's discuss what's really important: the design and architectures required for the new “Hybrid Workforce.” First, organizations must embrace the SASE model over traditional VPNs and gateways to secure their end-users. Gartner had a report back in July of 2019 stating, “Digitalization, work from anywhere and cloud-based computing has accelerated cloud-delivered SASE offerings to enable anywhere, anytime access from any device. Security and risk management leaders should build a migration plan from legacy perimeter and hardware-based offerings to a SASE model. You can’t just flip a switch to adopt SASE and enable anywhere, anytime access from any device.”


This was very prophetic given that within 9 months companies were scrambling to create their homemade versions of SASE for their remote employees due to the Covid-19 pandemic. As defined by Gartner, the SASE category consists of four main characteristics: identity-driven (user and resource identity), cloud-native architecture (elasticity, adaptability, self-healing, and self-maintenance), support for all edges (data centers, branch offices, cloud resources, and mobile users), and global distribution (the SASE cloud must be globally distributed). What Gartner and others are missing is telemetry and visibility that is vendor-agnostic and can be easily fed into a data lake for Full Stack Observability (FSO). There are products out there now, such as Thousand Eyes from Cisco, that can provide end-user visibility into the multitude of SASE offerings. Some SASE models terminate in the public cloud, and products like Thousand Eyes can give visibility from the user/customer to the app if it's in the public cloud, but what if the app is back in the corporate data center? We then need connectivity from the public cloud SASE hub back to the corporate data center(s). After that, the visibility for FSO breaks down, and so does Gartner's "globally distributed" characteristic for SASE.

In general, utilizing a Colo that has very low latency to the public cloud AND is globally diverse and close to the customer's data center(s) is a better overall architecture. The question now remains: do I build my end-to-end connectivity and FSO model out of pieces and parts from various vendors, or look for one vendor to supply this type of offering? First, designing, implementing, automating, and providing day 2 support for a build-your-own would be very difficult unless you were a massive corporation with unlimited talented individuals in big data, AIOps, automation, networking, etc. Not going to happen.

When the SASE vendors all come in to see the CIO and directors, they see this:

1917 Sears Modern Homes kit house (Flickr)

And your poor staff gets this when trying to build an end-to-end secure cloud on-ramp WITH visibility:

Foursquare kit house from the Aladdin Company (Everyday Old House)

We have had multiple discussions on this; in fact, I participated in a Cisco Champions Radio podcast a few weeks back that I highly recommend listening to. There were, of course, the Cisco "experts" as well as those of us on the front lines discussing what is really going on in the trenches.

https://smarturl.it/CCRS9E


I see many SASE offerings and designs, and most if not all terminate in the public cloud. The challenge there is that you pay for data in and out of the public cloud, so costs could add up very quickly.

I have given SD-WAN, SD-ACCESS, SD-DC, and cloud connectivity much thought over the years as we tried to do end-to-end segmentation. Now, end-to-end visibility is more important, but we must still secure our data and our users. An architectural shift has to happen where the traditional DC is no longer the hub. Instead, SASE in a Colo such as Equinix would provide a better landing spot for SD-WAN and remote SASE users. You could even have customers enter via this Colo, allowing end-user visibility for everyone. Since Equinix and many other Colos offer 2ms RTT to all major cloud providers, all users (branch, remote, customers) can access SaaS, PaaS, customer offerings, and older monolithic applications securely and with complete visibility. Below is what I propose, and I want to see OEMs offer ways to do this.

Right now, looking into my crystal ball, I see three OEMs furthest along this path, some further than others. Cisco leads by far in this type of model. Thousand Eyes can see from the end user to the application but cannot see inside the public cloud, into the CNF data center infrastructure, or into other remote data centers hosting older monolithic applications. So while we can see traffic to the SASE endpoint, we lose complete network visibility.

AppDynamics can integrate with the application and Thousand Eyes, so it can tell whether there is a network latency issue or an app issue, but that's it. It cannot tell where the issue is, only that there is latency to the application.

AppD can show latency between the tiers of the application, but is it the underlying network or a VM running at 100% CPU?

NEXUS Dashboard has AppD integration to determine whether the network fabric is dropping packets or whether the problem is the vNIC of a VM whose CPU is at 100%. However, we still do not have a single pane of glass.

Cisco is the first OEM I've seen take a holistic "data lake" approach to network visibility from every source and apply AI and ML techniques. They are close on cloud integration and a single pane of glass, so stay tuned on this one. I should be getting some EFT code soon.

VMWare has a good SASE model with VeloCloud, security, and visibility. They also offer great analytics and security from NSX, NSX Intelligence, and Cloud Watch; however, they do not offer an application-based tool to look at the actual application like AppD can. They also do not offer any way to look at the underlying infrastructure, just the Geneve overlay that NSX runs over, so it is completely ships in the night. They offer a "data lake" type scenario with NSX Intelligence but no near-term plans for importing SASE or Cloud Watch data for the single pane of glass. VMWare also has an excellent security portfolio with Carbon Black, Cloud Watch, NSX Intelligence, and Workspace One. Frankly, you could lock the users out and use Workspace One to have them log in that way for their apps. But oops, no user experience visibility…

Arista offers no real SASE solution and typically partners with VMWare to create an end-to-end customer solution. Arista has also come out with its Network Data Lake (NetDL) for the data center and campus, but it is very early. Again, there is no integration with VMWare's NSX or Cloud Watch platforms.


So to sum things up: we have all been doing "hybrid work" for the 3rd year now, and most companies have been as successful as ever, if not better off, thanks to selling real estate, personnel not traveling, and, honestly, business continuing as usual.

Customers and employees are happy working from home, and some customers have had record profits. So as "hybrid work" becomes the new norm, we need to look to SASE and new architectures to provide security AND visibility. I believe users and customers can land in the same SASE geolocation, and I believe SASE should not land in the cloud but rather in a Colo like Equinix, so you don't incur the dreaded bandwidth cost in and out. Imagine how much money thousands of people logging into an AWS-hosted SASE that backhauls to a private cloud solution would cost. With the SASE model landing in a Colo such as Equinix, we have security for remote users at home and for customers, and because traffic over the Equinix backend is not charged the same way as traffic coming directly into the public cloud, there are significant cost savings. Hybrid work is here to stay, and we need to re-think and re-architect. Using a Colo to land all remote customers, remote users, and now branch and campus gives us a single point of security and visibility. Once the remote user, branch user, and customer connect to the SASE device in the Colo, they can go to the cloud for SaaS and PaaS, or back to the traditional DC for the monolithic apps.

The question remains: who leads the charge to integrate all this telemetry and open it up? We are at a crucial time now. WWT and I are bi-partisan; we do what's best for the customers. As we see it, Cisco has a considerable lead in providing analytics and adding them to a giant data lake that correlates data between multiple domains. The goal for Cisco is to integrate Thousand Eyes into NEXUS Dashboard; we would then have end-user analytics from Thousand Eyes, user application performance data from AppD, and NEXUS Dashboard to correlate all the data. We would eventually have a single source of truth for visibility via Thousand Eyes, AppD, and NEXUS Dashboard.

The best thing about working at WWT is that we provide solutions for the customer to be 100% successful. In difficult times, it's my team's responsibility to vet all of these end-to-end solutions from a visibility aspect. I have no skin in the game; I'm just reporting back what OEMs are doing, what we need to re-architect, and where hybrid work will be in 5 years. One final thing to note is that network engineers must expand our horizons. As applications and users start to require different types of support, we as engineers must support them. So our skill sets must grow: learn automation and programmability to apply to network devices, pick up cloud skills in AWS, Azure, and GCP, learn AI and ML, and finally, step out of your silo. If you are a data center network architect, learn SASE and cloud connectivity. If you are a WAN guy, learn about SASE and data center networking. An engineer with these skills is worth more money in the long run to support the new hybrid workforce; these skills are in high demand.

The Year in Review.

I always like to reflect on what we have accomplished, what I could do better next year, and the next big thing to work on and learn. It’s been a very crazy year for me with lots of ups and downs, so buckle your belts.

First, we were buttoning up the build-out of our ACI multisite fabrics in Equinix to host workloads and provide an SD-WAN landing space. We have two ACI multisite environments: one for production workloads and the training class, and one for EFT and development testing. We have an EAST and a WEST site plus an Equinix site connected to the cloud via the cAPIC. We wanted this in place so NEXUS Dashboard would have an application running in Equinix and onboarded into AppD; the new infrastructure then provides third-party telemetry between AppD and NEXUS Dashboard.

We can then overwhelm the servers with synthetic traffic, logins, and such, and we can see latency errors and network issues and correlate them on NEXUS Dashboard. Slick demo. We also have Thousand Eyes hooked into this, and it gives us branch and remote visibility into the application running on ACI in Equinix. We are eagerly waiting on newer code so we can integrate Thousand Eyes into NEXUS Dashboard to give that single-pane-of-glass view end to end.

When we get the new code, we can create compelling demos of end-to-end visibility and policy.

I will have to go back a little further, about 15 months to Oct 2020, when we started working hard on the first NEXUS Dashboard EFT and the NIR and NAE apps to run on the new NEXUS Dashboard. One cool thing was that my quote was used all over the Cisco site promoting NEXUS Dashboard.

We went through the first NEXUS Dashboard EFT, which meant I needed to re-image the boxes and go through all the factory setup steps. The lesson learned here: screenshot each step and write down some highlights of what did or did not work. I was able to use those screenshots later in the year in a WWT Platform article on setting up NEXUS Dashboard from scratch. I usually do that, but sometimes I get lazy, and then you are stuck. Also, look into SnagIt from TechSmith, a great screen capture and annotation tool. I do a tremendous amount of documentation, lab builds, and articles, and it saves me a ton of time.

I spent the rest of 2021 evangelizing NEXUS Dashboard, creating self-service labs and EFTs, and writing articles and blogs. Because of all of that effort, WWT has become the premier resource for NEXUS Dashboard and day-2 operations, and I am very proud of that accomplishment. I also hosted a webinar with Cisco on NEXUS Dashboard in April 2021 with 600 attendees, plus a Cisco Champions Podcast on NEXUS Dashboard. So call me Mr. Dashboard!


Trouble ahead

A Big Storm is Brewing

November 9th, 2020, is a day I'll always remember. I played golf that Sunday, and as I finished up, my wife called. She told me the doctor from my physical had called and said my hemoglobin, red blood cells, and iron were super low and I needed to come in tomorrow. Hmm, it's unsettling to have your doctor call on a Sunday. However, I took it as good news of a sort: I had been battling respiratory issues for the prior 2 years, they had found blood clots, and I had been on blood thinners for the last 18 months. Within 2 weeks of being on Eliquis, I felt like I had aged 25 years, or like someone had put Kryptonite in my pocket. I could barely walk upstairs or mow the lawn; even golf was tough, huffing and puffing just walking up to the green. Plus, I gained 50 lbs. in 18 months. I knew Eliquis could have some weird side effects such as massive weight gain and low iron, which in turn causes RBC and hemoglobin problems. Doctors thought I was crazy and making it all up, no joke.

When I went to the internist, she felt I had an internal bleed, but I was convinced the Eliquis had caused my issues all along. A colonoscopy and endoscopy got scheduled for 11/22/2020 (another day of remembrance). I woke up to see my wife and the doctor with somber faces. Yes, you have colon cancer. It turns out I had probably had it for 5 years; the cancer would throw clots to the lungs, I would have breathing issues, and then they would go away. This went on for years. Finally, they saw the clots on a CT scan and put me on Eliquis, which caused the tumor to bleed, hence my low hemoglobin, iron, and RBC, turning me into Superman with Kryptonite in his pocket.

The following 2 weeks before surgery were a blur; I had to get all kinds of legal stuff in place like wills, power of attorney, etc. I saw my life flash before me, but it took two weeks. It gave me a lot of time to reflect on my life, family success, and work success. Lying in the OR, I felt good about where I was in life, no regrets, and I had no idea if I was going to wake up after the surgery on December 4th, 2020 (another day of remembrance). I had a magician for a surgeon: 2 feet of colon removed, no evidence of any cancer spread. I had to go through 6 months of chemo, which was hell and still affects me. I'm happy to report that I am 1 year clear of cancer as of two weeks ago.

I try to keep personal things out of my blogs, but this was a significant time in my life. It was the lowest point of my life, it was depressing, and I felt I would never be normal again. WWT was great; I was able to take 6 weeks of paid FMLA to recover. Within 2 weeks I was feeling good, and over the winter break I was back and building the virtual NEXUS Dashboard labs on the sly.

I threw all of my focus at work instead of fixating on cancer; it was the one outlet that offered me great support, and I got positive feedback from all my accomplishments. So the takeaway from this long story is that when you feel down, look to something you excel at and excel in it. Doing this gets you accolades, makes you feel like you accomplished something, and helps you out of the funk. Even while I was hooked up to my chemo port for 6 hours, I was working away, since it kept me focused on work and staying positive.

The next big task for last year was Arista. WWT has always had a very close alliance with Cisco, we are Cisco's most significant partner by a wide margin, and we are very hesitant to work with other OEMs because of that excellent working relationship. That said, we have a few huge customers who combine Cisco and Arista VXLAN pods. So that means we sell a lot of Arista even though we don't directly go out and sell it over a Cisco solution.

In February 2021, I was tasked with being the Arista guy for the GSD team: learn it and build some customer-facing labs. Our team also needed to pass the ACE 3 and ACE 5 exams to become an Elite partner, gain extra margin on the Arista sales we already do, and, frankly, be able to talk with customers who truly want to go Arista instead of Cisco. In the VAR space, it is challenging to placate both the OEMs and the customers simultaneously.

I was doing chemo and super sick at this time, so I threw myself into learning Arista and creating a lab environment that my co-workers could use to practice Arista VXLAN configurations. The lab allowed us all to pass the ACE 3 exam. One of my co-workers, Justin Van Sheik, created an ACE-5 (automation) lab and passed that exam, so now we are an Arista premier partner, entitled to the extra margin, and able to help the customers who want to go down the Arista path instead of Cisco for their own reasons. So we build out Arista labs for our customers that want to use Arista, and we build out Cisco labs for customers that want Cisco. It's really as simple as that. We must listen to what our customers want and pivot when they ask us for help.


The great thing about WWT is that we put the customers first. Always. WWT's policy is 100% customer satisfaction, not 100% OEM satisfaction. This is where, in my opinion, the most important thing between a VAR and a customer comes into play: TRUST. However, there must also be a high level of trust between the VAR and the OEM, so WWT has to walk a very thin tightrope over the Grand Canyon. Trust is very hard to obtain, but once you have it with a customer or OEM, everything you say is believed. Betray that trust, though, and it is very hard to get back.


So as we close this out, we did a fantastic Cisco Champions Radio podcast on hybrid work that comes out next Tuesday, February 8th. Just go to YouTube and search for Cisco Champions. We are all fully aware that Covid is not going away, and the workforce itself has reset. Every month, millions of workers in the US and tens of millions across the globe reassess their lives and how they work. I can relate, having recovered from cancer and taken a hard look at what I want to do with my career. Most companies had large workforces pivot overnight to working from home. The workforce has forever changed: from the traditional campus, branch, and DC model to consumers of applications wherever their laptops are, connecting to applications wherever those applications reside.


I am looking forward to connecting SASE models into public and private cloud and to how Colos play a huge part in this. We are restructuring WWT to help customers with this, but we see the OEMs are a little slow to provide that end-to-end solution and visibility.

Finally, I am incredibly proud to announce my promotion to Principal Solutions Architect back in December. The PSA role is the second-highest engineering level at WWT, and you have to be voted in to get there. I look forward to the next year of providing hybrid solutions that connect users and their applications wherever they reside, and of providing telemetry and analytics via ML and AI with our AIOps team.

I also want to thank everyone who supported me through my challenges, and everyone who collaborated with me to make this an incredible year!


The Impact Sprinkler

My Work Life

I'm sure a lot of us feel this way at work: we get bombarded all day with things coming out of left field. Can you jump on a customer call? This doesn't work, can you take a look? Oh, check out this new thing. On and on and on. Every day we act like impact sprinklers, working our way through the day watering the lawn. Then all of a sudden it's 4 PM and you need to get back to what you were doing at 8 AM. Tick, tick, tick, tick, then Brrrrrrr, back to the start. Some days you can get a few hours in, but there is always something. It's great to feel needed and to help out, but it's a curse as well.

So anyway, this week was a short one, but I worked on spinning up new ACI 4.2 simulators so we can finish creating the new ACI 4.2 lab guide. Then we ran into connectivity issues on the VLAN the simulators were on, so I had to troubleshoot some ToR FEX issues. Then I went through some Docs-as-Code training (yes, the crusty old guy is trying to learn some coding). Then I worked with the automation team to figure out what to change for the new 4.2 ACI class baseline. Then I worked on our presentation for next Friday, a customer highlight on how our team helped the field close a big ACI deal. I also worked on my ACI multisite simulator lab as well as the fabrics that are getting moved to Equinix.

Of course, there were multiple calls from the field asking ACI licensing and sizing questions, as well as some ACI best-practice conversations. I also started upgrading the vCenter for the multisite simulator we are building to 6.7. I ran into a really weird issue going from vSphere 6.0 to 6.7. The install of the 6.7 VCSA went fine; however, when importing, it kept erroring out about duplicate network names in a folder. I had 2 ACI-created DVSs, one for each of the multisites, with EPGs (port groups) from the same stretched tenant. OK, I figured it didn't like that, so I got rid of those and tried a ton of other things. It kept calling out a name, and then it dawned on me.

Notice that the network folder is called Management. Also notice that the DVUplinks port group is called Management-DVUplinks-23 and there is a port group called Management. So yep, you guessed it: I had to change the "Management" in the name of the DVUplinks port group as well as the port group itself. Once I did that, the upgrade went fine.
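
If you'd rather script the rename than click through the vSphere client like I did, here's a rough pyVmomi sketch of the same fix. The vCenter host, credentials, and new name are placeholders, so check it against your own environment before running anything:

```python
# Rough sketch: rename the colliding "Management" port group via pyVmomi.
# Host, credentials, and target name are placeholders for illustration only.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab vCenter with a self-signed cert
si = SmartConnect(host="vcenter.lab.local",
                  user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
content = si.RetrieveContent()

# Walk every distributed port group and rename any whose name collides with
# the "Management" network folder name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
for pg in view.view:
    if pg.name == "Management":
        print(f"Renaming port group {pg.name} -> Management-PG")
        pg.Rename_Task("Management-PG")  # standard vSphere rename on a managed entity

Disconnect(si)
```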

So it's 10 PM and I finally got this working; time to shut off the sprinkler and head for an adult beverage!

In all seriousness, hopefully someone sees this; I banged my head on this one for a while.

Last day of support and migrations

My role over the last dozen years or so has been to help guide customers to make the right decisions on architecture and product selection to meet their business needs. After meeting with hundreds of companies over the years, a big trend I see over and over is what is known as the Ostrich Effect. Look it up, it's a real term. IT organizations think they have 5 years to replace the hardware, so they stick their heads in the sand for a while and ignore it.

What happens is that when the manufacturer of a product, or of the software an organization runs its critical business applications on, announces End of Sale (EOS), staff and directors think they have plenty of time. EOS typically starts the clock toward that product, be it hardware or software, reaching Last Day of Support (LDOS). There are usually 4-5 years between the two dates, and this is where most companies get the "Ostrich Effect" and put it out of their minds, figuring they have 5 years to replace the gear. But EOS and LDOS are two completely different things with very different ramifications. Last Day of Support means just that: no more tech support, no RMAs, no software updates for bug fixes or security vulnerabilities. The biggest issue, one IT director told me, the thing that keeps him up at night, is the software updates. I can only speak to Cisco's recommendations, and they recommend being off LDOS equipment 12-18 months before the actual LDOS date. Why? Because they typically stop writing updates, bug fixes, and security patches a year or more before the actual LDOS date. So now your runway is no longer 5 years, but 3 1/2 to 4 years.

Organizations should not have any end-of-support devices running critical business applications, but I see it over and over again. If it's only a couple of devices connected to end users, or some wireless access points, it's not that huge of a risk. However, if your core networking devices, half of your access layer, or your SAN storage is going end of support, you really need to pay attention.

The most critical thing most customers miss is the time it takes to migrate to new gear, especially if they are moving from a traditional 3-tier architecture to a spine-leaf, SDN-controller infrastructure.

Let's look at an example of a data center migration timeline. This would be an organization with 6,000 workloads that is moving to a spine-leaf, SDN-based architecture.

This is a customer that has a ton of Cisco 5548s that are End of Sale, and the hardware LDOS date is 2022. Just designing, procuring, and building the new infrastructure will take a good 9-12 months to complete and test with pre-production workloads. Then move groups are created and workloads migrated. You cannot move all 6,000 workloads in a weekend, so this must be carefully planned with application owners, hardware owners, etc. Breaking this up into 6 move groups of roughly 1,000 workloads each, one group every 3 months, means migrating on the order of 80-100 workloads every single week for about a year and a half. Think about that coordination: for well over a year you move workloads each and every week, all while doing your everyday job. What we see is that we hit the point where software updates stop (approximately a year before LDOS) with only half the workloads migrated.
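
If you want to sanity-check your own runway, the math is easy to reproduce. A quick sketch using the assumptions from the example above:

```python
# Quick migration-runway math using the example's assumptions
workloads = 6000
move_groups = 6
months_between_groups = 3      # one move group roughly every quarter
design_build_months = 12       # design, procure, build, test pre-production

migration_months = move_groups * months_between_groups   # ~18 months of moves
total_months = design_build_months + migration_months    # ~30 months end to end
weeks_of_moves = migration_months * 52 / 12
per_week = workloads / weeks_of_moves                     # ~77 workloads per week

print(f"Total runway needed: ~{total_months} months")
print(f"Sustained pace:      ~{per_week:.0f} workloads migrated every single week")
```

Compare that total runway against the date your vendor actually stops shipping software updates, not the LDOS date on the notice, and the problem becomes obvious.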

Usually this is the aha moment when they pull their heads out of the sand and realize they have an issue and have run out of time to act.

I encourage anyone reading this to look at what has been announced End of Sale in your own or your customers' critical infrastructure and create a migration chart like the one above. I think you will find that the time between EOS and LDOS is a lot shorter than you think.

2019 Equinix multisite build out

In November 2019 we got the gear for Equinix to extend Multisite from our on-prem Multisite demo and dev environments out to Equinix. This will then allow us to build out new AWS and Azure connectivity. Since Equinix is a lights-out site and you have to pay for smart hands, it was decided to build out the environments in the ATC using the Equinix IP space: connect to the existing prod demo and dev multisites, and connect the UCS, storage, and vCenters using the IP space allocated for Equinix. Once built, we simply ship the gear, have our remote guys go out and rack and stack it, then flip the routing to point to Equinix instead of the ATC. As of this post, the ACI portion is up; we still need to configure the UCS and storage connectivity and then bring up the vCenters for the two different multisites.

2019 Service Engine and MSITE simulator

One of the ideas I had in July was to port over our ACI class so that all the labs could be done using the simulator. These would be self-service labs that can be spun up on demand, with lecture material as well as the labs, so students could go through our ACI training in a self-study, 2-3 hour module approach. The second piece was that students would build out 2 separate sites using the ACI 4.2 simulator, then build out the Multi-Site Orchestrator (MSO) and run through a series of labs to stand up the MSO, import the sites, and deploy tenants, templates, and schemas from the MSO to the virtual simulators. The simulators are control-plane only, so a student cannot pass traffic or ping, but they can take the lab guide and build out a full working multisite environment using the labs and discussion material. I did the Proof of Concept (POC) using our vCloud Director environment, and things worked fine. I built out all of the lab modules, and as we went through QA testing we found timeouts in the ACI GUI and other errors. Right now we are working with the developers to debug the issues; once we figure them out, I'll be able to release 4-5 new labs based on this. The simulated environment will look like the image below. One important thing: the connectivity to the NEXUS 7710 for the border leaf and IPNs does not exist in the simulator. In the lab, students will create the simulator side, and screenshots of the live environment will be used to show the configuration and verification that would be performed in a real install.
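
To give a taste of what the simulator labs exercise, here is a minimal sketch of hitting an ACI simulator's REST API to log in and push a tenant. In the class the policy is deployed through MSO, and the APIC address, credentials, and tenant name below are placeholders for whatever the lab guide assigns; treat it as an illustration, not lab-guide content:

```python
# Minimal ACI simulator REST sketch: log in and push a tenant.
# The APIC URL, credentials, and tenant name are placeholders. The simulator is
# control-plane only, so this configures objects but cannot forward traffic.
import requests

requests.packages.urllib3.disable_warnings()  # simulator uses a self-signed cert

APIC = "https://apic-sim.lab.local"           # placeholder simulator address
session = requests.Session()
session.verify = False

# Authenticate; the APIC returns a session cookie that the Session keeps for us.
login = {"aaaUser": {"attributes": {"name": "admin", "pwd": "changeme"}}}
resp = session.post(f"{APIC}/api/aaaLogin.json", json=login)
resp.raise_for_status()

# Create (or update) a tenant under the policy universe.
tenant = {"fvTenant": {"attributes": {"name": "Lab-Tenant"}}}
resp = session.post(f"{APIC}/api/mo/uni.json", json=tenant)
resp.raise_for_status()
print("Tenant pushed:", resp.status_code)
```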

We also obtained an EFT version of the Service Engine to house NAE and NIR; it took a bit to get it racked and connected. Once we tried loading the EFT code, we ran into issues getting it to load. After some serious debugging, we found the UCS boxes had the wrong certificate on the TPM module, so the software wouldn't load. Cisco is working with us remotely after the holidays to get the new certs installed so we can test out the Service Engine, the new NAE code, and the new NIR code. I'll be documenting that as we go along.