
Kevin Tubbs, PhD, SVP Strategic Solutions Group at Penguin Computing – Interview Series


Kevin Tubbs, PhD, is the Senior Vice President of the Strategic Solutions Group at Penguin Computing. Penguin Computing custom designs agnostic, end-to-end (hardware/software/cloud/services) solutions to solve the complex scientific, analytical and engineering problems facing today's Fortune 500 companies, startups, academic institutions, and federal organizations.

What initially attracted you to the field of computer science?

My mom and dad bought me a computer when I was very young, and I've always had an interest in and a knack for computers and tinkering. Through my education I consistently gravitated towards STEM fields, and that led me to want to be involved in a more applied field. My background is in physics and High Performance Computing (HPC). Having a love for computers early on allowed me to keep computer science at the forefront of any other science, math or engineering interest that I've had, which has led me to where I am today.

Penguin Computing works closely with the Open Compute Project (OCP) – what is that precisely?

Since the start of the Open Compute Project (OCP) movement, Penguin Computing has been an early adopter, supporter and major contributor to the effort to bring the benefits of OCP to High Performance Computing (HPC) and artificial intelligence (AI).

The focus of OCP is bringing together a global community of developers to create a full ecosystem of infrastructure technology reimagined to be more efficient, flexible and scalable. Penguin Computing joined OCP because of the Open technologies and the idea of a community. What we’ve done over time is ensure that the heritage and technologies from traditional HPC and emerging trends in AI and Analytics can scale efficiently – Penguin Computing drives those things into OCP.

One of the benefits of OCP is that it lowers total cost of ownership (TCO) – lower capital expenses, thanks to the removal of all vanity elements, and lower operating expenses due to front-of-rack serviceability, shared power and other design changes – which makes OCP-based technology perfect for scale out.

Penguin Computing has several OCP products including the Penguin Computing Tundra Extreme Scale Platform and Penguin Computing Tundra AP. The Tundra platforms are also compatible with HPC and AI workloads.

Tundra AP, the latest generation of our highly dense Tundra supercomputing platform, combines the processing power of Intel® Xeon® Scalable 9200 series processors with Penguin Computing’s Relion XO1122eAP Server in an OCP form factor that delivers a high density of CPU cores per rack.

When it comes to big data, to optimize performance levels users need to remove bottlenecks that slow down their access to data. How does Penguin Computing approach this problem?

Penguin Computing has leveraged our ability to use Open technologies and move fast with current trends – one of which is big data or the growth of data and data driven workloads. In response to that, we’ve built out our Strategic Solutions Group to address this problem head on.

In addressing the problem, we've found that the majority of workloads, even from traditional technical compute, are all motivated to be more data driven. As a result, Penguin Computing designs complete end-to-end solutions by trying to understand the user's workload. In order to create a workload optimized end-to-end solution, we focus on the workload optimized software layer, which includes orchestration and workload delivery. Essentially, we need to understand how the user will make use of the infrastructure.

Next, we try to focus on workload optimized compute infrastructure. There are varying levels of data and IO challenges which put a lot of pressure on the compute part. For example, different workloads require different combinations of accelerated compute infrastructure, from CPUs and GPUs to memory bandwidth and networking, that allow the data to flow through and be computed on.

Finally, we need to figure out what types of solutions will allow us to deliver that data. We look at workload optimized data infrastructures to understand how the workload interacts with the data, what the capacity requirements are and what the IO patterns look like. Once we have that information, it helps us design a workload optimized system.

Once we have all the information, we leverage our internal expertise at Penguin Computing to architect a design and a complete solution. Beyond designing it from a performance perspective, we also need to understand where it will be deployed (on premises, cloud, edge, a combination of these, etc.). That is Penguin Computing's approach to delivering an optimized solution for data driven workloads.

Could you discuss the importance of using a GPU instead of a CPU for deep learning?

One of the biggest trends I've seen with regard to the importance of GPUs for Deep Learning (DL) was the move to general-purpose computing on GPUs (GPGPU), treating the GPU as a data parallel piece of hardware that allowed us to massively increase the number of compute cores you can deliver to solve a parallel computing problem. This has been going on for the last ten years.

I participated in the early stages of GPGPU programming when I was in graduate school and early on in my career. That jump in compute density, where a GPU provides a lot of dense compute and analytics cores on a single device and lets you fit more into a given server space, combined with the ability to repurpose something originally meant for graphics into a compute engine, was a real eye-opening trend in the HPC and, eventually, AI communities.

However, a lot of that relied on converting and optimizing code to run on GPUs instead of CPUs. As we did all of that work, we were waiting for the concept of the killer app – the application or use case that really takes off or is enabled by a GPU. For the GPGPU community, DL was that killer application which galvanized efforts and development in accelerating HPC and AI workloads.

Over time, there was a resurgence of AI and machine learning (ML), and DL came into play. We realized that training a neural network using DL actually mapped very well to the underlying design of a GPU. I believe once those two things converged, you had the ability to do the kinds of DL that were not previously possible on CPU processors, which had ultimately limited our ability to do AI both at scale and in practice.

Once GPUs came into play, they re-energized the research and development community around AI and DL, because before that you just didn't have the level of compute to do it efficiently and it wasn't democratized. The GPU allows you to deliver denser compute that, at its core, is designed well for DL, and it brought hardware architecture solutions to a level that made them accessible to more researchers and scientists. I believe that is one of the big reasons GPUs are better for DL.
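
To make the data parallel point concrete, here is a minimal sketch, assuming PyTorch and (optionally) a CUDA-capable GPU are available; it is not Penguin Computing's stack, just an illustration of how the same dense linear algebra that dominates DL training runs on a CPU versus a GPU.

```python
# Minimal sketch (illustrative only, not Penguin Computing's stack):
# time the same dense matrix multiply, the core data-parallel operation
# behind DL training, on CPU and, if available, on GPU. Assumes PyTorch.
import time
import torch

def time_matmul(device: str, size: int = 4096, repeats: int = 10) -> float:
    """Return the average seconds per dense matmul on the given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    _ = a @ b  # warm-up so one-time setup cost is not measured
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU kernels to finish
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per matmul")
```

The gap between the two timings is the compute density difference described above.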

What are some of the GPU-accelerated computing solutions that are offered by Penguin Computing?

Penguin Computing is currently focused on end-to-end solutions being worked on by our Strategic Solutions Group, particularly Penguin Computing's AI and Analytics Practice. Within this practice, we're focused on three high-level approaches to GPU-accelerated solutions.

First, we offer a reference architecture for edge analytics, where we're looking to design solutions that fit in non-traditional data centers (out at the edge or near the edge). This can include telco edge data centers, retail facilities, gas stations and more. These are all inference-based AI solutions. Some solutions are geared towards video analytics for contact tracing and gesture recognition to determine if someone is washing their hands or wearing a mask. These are complete solutions that include GPU-accelerated hardware fine-tuned for non-traditional or edge deployments, as well as the software stacks that enable researchers and end-users to use them effectively.

The next class of Penguin Computing solutions is built for data center and core AI training and inferencing reference architectures. You could think of these sitting inside a large-scale data center or in the cloud (Penguin Computing Cloud), where some of our customers are doing large-scale training using thousands of GPUs to accelerate DL. We look at how we deliver complete solutions and reference architectures that support all of these software workloads and containerization, from GPU design and layout all the way through the data infrastructure requirements that support it.

The third class of reference architectures in this practice is a combination of the previous two. What we're looking at in our third reference architecture family is how to create the data fabrics, pathways and workflows that enable continuous learning, so that you can run inferencing using our edge GPU-accelerated solutions, push that data to a private or public cloud, continue to train on it, and, as the training models are updated, push them back out to inferencing. This way we have an iterative cycle of continuous learning and AI models.
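
As a rough illustration of that cycle, here is a hypothetical sketch; the function names and stub bodies are placeholders invented for this example, not Penguin Computing APIs.

```python
# Hypothetical sketch of the edge-to-cloud continuous-learning cycle described
# above. All functions are illustrative stubs, not Penguin Computing APIs.
from typing import Any, List

def collect_edge_batch(model: Any) -> List[dict]:
    """Run inferencing at the edge and capture new samples (stub)."""
    return [{"input": None, "prediction": None}]

def upload_to_cloud(batch: List[dict]) -> List[dict]:
    """Push captured data to a private or public cloud store (stub)."""
    return batch

def retrain(model: Any, dataset: List[dict]) -> Any:
    """Continue training the model on the accumulated data (stub)."""
    return model

def deploy_to_edge(model: Any) -> None:
    """Push the updated model back out to the edge for inferencing (stub)."""

def continuous_learning_cycle(model: Any, iterations: int = 3) -> Any:
    """Iterate: edge inference -> cloud training -> updated model back to edge."""
    for _ in range(iterations):
        batch = collect_edge_batch(model)   # inferencing at the edge
        dataset = upload_to_cloud(batch)    # data pushed to the cloud
        model = retrain(model, dataset)     # training continues in the cloud
        deploy_to_edge(model)               # updated model returns to the edge
    return model

updated_model = continuous_learning_cycle(model=None)
```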

Penguin Computing recently deployed a new supercomputer for LLNL in partnership with Intel and CoolIT. Could you tell us about this supercomputer and what it was designed for?

The Magma Supercomputer, deployed at LLNL, was procured through the Commodity Technology Systems (CTS-1) contract with the National Nuclear Security Administration (NNSA) and is one of the first deployments of Intel Xeon Platinum 9200 series processors, with complete direct liquid cooling from CoolIT Systems and the Omni-Path interconnect.

Funded through NNSA’s Advanced Simulation & Computing (ASC) program, Magma will support NNSA’s Life Extension Program and efforts critical to ensuring the safety, security and reliability of the nation’s nuclear weapons in the absence of underground testing.

The Magma Supercomputer is an HPC system enhanced by artificial intelligence, a converged platform that allows AI to accelerate HPC modeling. Magma ranked #80 on the June 2020 Top500 list, breaking into the top 100.

Under the CTS-1 contract, Penguin Computing has delivered more than 22 petaflops of computing capability to support the ASC program at the NNSA Tri-Labs of Lawrence Livermore, Los Alamos and Sandia National Laboratories.

What are some of the different ways Penguin Computing is supporting the fight against COVID-19?

In June 2020, Penguin Computing officially partnered with AMD to deliver HPC capabilities to researchers at three top universities in the U.S. – New York University (NYU), Massachusetts Institute of Technology (MIT) and Rice University – to help in the fight against COVID-19.

Penguin Computing partnered directly with AMD’s COVID-19 HPC Fund to provide research institutions with significant computing resources to accelerate medical research on COVID-19 and other diseases. Penguin Computing and AMD are collaborating to deliver a constellation of on-premises and cloud-based HPC solutions to NYU, MIT and Rice University to help elevate the research capabilities of hundreds of scientists who will ultimately contribute to a greater understanding of the novel coronavirus.

Powered by the latest 2nd Generation AMD EPYC processors and Radeon Instinct MI50 GPU accelerators, the systems donated to the universities are each expected to provide over one petaflop of compute performance. An additional four petaflops of compute capacity will be made available to researchers through our HPC cloud service, Penguin Computing® On-Demand™ (POD). Combined, the donated systems will provide researchers with more than seven petaflops of GPU accelerated compute power that can be applied to fight COVID-19.

The recipient universities are expected to utilize the new compute capacity across a range of pandemic-related workloads including genomics, vaccine development, transmission science and modeling.

Anything else you’d like to share about Penguin Computing?

For more than two decades, Penguin Computing has been delivering custom, innovative, and open solutions to the high performance and technical computing world. Penguin Computing solutions give organizations the agility and freedom they need to leverage the latest technologies in their compute environments. Organizations can focus their resources on delivering products and ideas to market in record time instead of on the underlying technologies. Penguin Computing's broad range of solutions for AI/ML/Analytics, HPC, DataOps, and cloud-native technologies can be customized and combined not only to fit current needs but also to rapidly adapt to future needs and technology changes. Penguin Computing Professional and Managed Services help with integrating, implementing, and managing solutions. Penguin Computing Hosting Services can help with the "where" of the compute environment by giving organizations ownership options and the flexibility to run on-premises, on public or dedicated cloud, hosted, or as-a-service.

Thank you for the great interview; readers who wish to learn more should visit Penguin Computing.

A founding partner of unite.AI & a member of the Forbes Technology Council, Antoine is a futurist who is passionate about the future of AI & robotics.

He is also the Founder of Securities.io, a website that focuses on investing in disruptive technology.