HPC in Modern Data Applications — An Introduction

Data-driven computing coupled with artificial intelligence (AI) is the crown jewel of modern computing. From the 1980s to around 2010, high-performance computing (HPC) was limited mostly to scientific applications like large-scale simulations running on supercomputers in national labs. The hardware was expensive and the applications were limited, keeping HPC out of mainstream software development and deployment.

HPC is a branch of the broader field of distributed computing that specializes in applications demanding large amounts of computation and resources to produce answers. In the last decade, AI, a broad category that includes deep learning, has emerged as a class of compute-intensive applications, expanding the pool of applications suited to HPC. Unlike many earlier HPC workloads, such as simulations, AI applications depend on processing large datasets.

Today, the term HPC is often used without explanation in various contexts. This article aims to explore what HPC is and how it relates to modern applications. We will begin by discussing general distributed computing and then delve into high-performance computing and its connection to modern data applications.

Distributed Computing

Distributed computing is a branch of computer science that focuses on using multiple computers to serve users. These computers work together by passing messages between them over computer networks. Distributed computing is ubiquitous in modern applications, and virtually every computer user relies on some form of it in their everyday activities.

The most famous use of distributed computing is the World Wide Web. All our favorite web applications like search, streaming services, and social media networks run as distributed applications, with hundreds of thousands of computers storing the data, processing it, and serving the users connected to them through the internet.

Distributed computing architectures

Client-server architecture is the most widely used pattern in distributed computing. Many clients connect to a set of servers and send requests to them. The servers do the computations and reply to the clients. This architecture is extended to a three-tier architecture, where an additional data layer behind the server stores and serves the data.
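
To make the client-server pattern concrete, here is a minimal sketch in Python using the standard socket module. The host, port, and echo behavior are illustrative assumptions rather than part of any system discussed in this article; in a real deployment the server and client would run on different machines.

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9000  # illustrative address, not a real service


def run_server():
    """A minimal server: accepts one client, does a trivial computation, and replies."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen()
        conn, _addr = srv.accept()              # wait for a client to connect
        with conn:
            request = conn.recv(1024)           # read the client's request
            conn.sendall(b"echo: " + request)   # reply to the client


def run_client():
    """A client: connects, sends a request, and reads the server's reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"hello")
        print(cli.recv(1024).decode())          # prints "echo: hello"


if __name__ == "__main__":
    threading.Thread(target=run_server, daemon=True).start()
    time.sleep(0.5)                             # give the server a moment to start
    run_client()
```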

Modern application architectures extend this further into n-tier distributed systems, where many layers, connected both hierarchically and horizontally, serve user requests. An n-tier architecture consists of many distributed systems working together in different layers to create complex applications. A few examples of the distributed systems used in modern applications are

  1. Web services
  2. Streaming and messaging systems
  3. Data storage and processing systems
  4. Machine learning systems

These distributed systems can follow many architectural patterns depending on their processing requirements.

Why do we need distributed systems?

In the World Wide Web case, it is apparent that we need distributed systems as applications need to serve users distributed across the globe. But what about distributed systems inside an organization? Let's look at some of the reasons we need distributed systems.

  • Fault tolerance — Distributed systems allow applications to keep working amidst failures. If one computer fails, another is available to take its place.
  • High availability — When applications use multiple computers, they can be upgraded and maintained while reducing the downtime of the overall application.
  • Scalability — To do more computation, either because we have more data to process or because we receive more requests, we need multiple computers.
  • Efficiency — A single computer is limited in resources like memory, network, and disk I/O. Some applications need more resources than a single computer can provide. Using multiple computers can lower latency and increase throughput, improving the efficiency of applications.

Loosely coupled vs. tightly coupled systems

In client-server systems, the server does not require all clients to be connected simultaneously to operate. Clients can connect and disconnect as needed. Such systems are termed loosely coupled distributed systems.

Conversely, in a data processing application utilizing multiple computers, all computers may need to remain connected throughout the computation. If a computer disconnects while participating in a computation, it may cause the entire computation to fail. These systems are referred to as tightly coupled distributed systems.

Compared to loosely coupled applications where computers are largely independent, tightly coupled distributed applications require more careful design, deployment, and execution to ensure effective and efficient operation.
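
As a sketch of what tight coupling looks like in code, the example below uses mpi4py, one common Python binding for MPI (the library choice and the run command are assumptions for illustration). Every process must reach the allreduce call; if any participant fails before it does, the collective operation, and typically the whole job, fails.

```python
# Typically launched with an MPI runtime, e.g.: mpirun -n 4 python allreduce_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's id within the job
size = comm.Get_size()      # total number of participating processes

local_value = rank + 1      # stand-in for a locally computed partial result

# A collective operation: every rank must participate, or the computation fails.
total = comm.allreduce(local_value, op=MPI.SUM)

if rank == 0:
    print(f"sum across {size} ranks = {total}")
```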

High-performance computing (HPC)

High-performance computing, or HPC for short, is a branch of computing that focuses specifically on tightly coupled distributed systems and applications that demand large amounts of computation. Here is how IBM defines HPC:

HPC is technology that uses clusters of powerful processors, working in parallel, to process massive multi-dimensional datasets (big data) and solve complex problems at extremely high speeds. HPC systems typically perform at speeds more than one million times faster than the fastest commodity desktop, laptop or server systems.

HPC applications are developed to take advantage of HPC clusters by working closely with the hardware available in these systems. An HPC cluster will have several computers called nodes connected by a high-performance network. Typically these nodes are connected to a storage system that can serve the data needs of the applications. The compute nodes might be pure CPU-based or might be equipped with hardware accelerators like GPUs.
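
As a rough illustration of how an application might discover the resources an HPC cluster gives it, the sketch below reads Slurm environment variables and counts GPUs with PyTorch. Both the scheduler (Slurm) and the GPU library (PyTorch) are assumptions here; other clusters expose the same information differently.

```python
import os

# Slurm (a common HPC scheduler, assumed here) exposes the job layout through
# environment variables on every allocated node.
num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
tasks_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1"))

# Accelerator discovery, assuming PyTorch is installed; zero means CPU-only nodes.
try:
    import torch
    gpus_per_node = torch.cuda.device_count()
except ImportError:
    gpus_per_node = 0

print(f"nodes={num_nodes} tasks/node={tasks_per_node} gpus/node={gpus_per_node}")
```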

For a long time, the most well-known HPC clusters were supercomputers. Now, with the availability of hardware that works at incredible speeds, it is easier to build high-performance computing clusters for everyday applications. With such ubiquitous access to HPC systems, application developers need to rethink their designs to get the best out of high-performance clusters.

Applications that work with data, whether for storage or processing, have become some of the most important in modern times with the big data revolution and the AI boom. These applications are particularly well suited to leverage high-performance systems due to their compute and storage requirements.

Data-intensive applications

Data-intensive applications work with massive amounts of data. Massive is a relative term. For some use cases, it can mean petabytes of data while for others it can be gigabytes or less.

Data-intensive applications can be compute-bound or I/O-bound. In both cases, we need powerful computers to process the data in a reasonable amount of time.

We rely on software frameworks designed for specific use cases to develop, deploy, and execute data-intensive applications. There are a few broad categories of data-intensive applications, and many frameworks have been developed to support different variations of them. These application categories are

  1. Online data ingestion and processing
  2. Storage and retrieval systems
  3. Data processing systems
  4. Machine learning training
  5. Machine learning inference

A data-intensive application runs on top of a distributed system or a framework designed to execute it. At runtime, the application becomes part of the framework and the framework becomes part of the application, so we can use the terms framework (system) and application interchangeably in the context of data-intensive applications.

Distributed data applications have the following characteristics.

  • They are written using an application programming interface (API) supported by the underlying framework. Popular APIs include SQL, dataframes, tensors, and arrays; a small dataframe sketch follows this list.
  • At runtime, these applications require substantial computing resources, including memory, CPUs, GPUs, disks, and networking capabilities.
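
For instance, a data-intensive application written against a dataframe API might look like the following pandas sketch. Pandas is just one example of such an API, and the file name and columns are invented for illustration.

```python
import pandas as pd

# Hypothetical input file and columns, used only to illustrate the API style.
events = pd.read_parquet("events.parquet")

# Declarative, framework-level operations; the framework decides how to execute them.
daily_totals = (
    events[events["status"] == "ok"]
    .groupby("day")["bytes"]
    .sum()
    .reset_index()
)
print(daily_totals.head())
```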

Online Data Ingestion and Processing

Data on the move needs to be analyzed and acted upon before it reaches permanent storage. These applications need low-latency processing and high availability and are supported by frameworks like the ones listed below; a small consumer sketch follows the list.

  1. Message brokers
  2. Streaming processing systems
  3. Microservices
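
Here is a minimal streaming-consumer sketch using kafka-python, one of several Python clients for the Kafka message broker (the client library, topic name, broker address, and event fields are all assumptions for illustration).

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",                              # hypothetical topic
    bootstrap_servers=["localhost:9092"],       # hypothetical broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Process events as they arrive, before anything reaches permanent storage.
for message in consumer:
    event = message.value
    if event.get("type") == "purchase":
        print(f"purchase of {event.get('amount')} at offset {message.offset}")
```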

Storage and retrieval systems

Storage systems provide the infrastructure for data processing systems. They store raw data, structured data, and everything in between. Depending on the data size and read/write requirements, we might need anywhere from a few computers to thousands of computers to store the data.

  1. Distributed storage systems — File systems like HDFS and object storage
  2. File storage formats for structured and unstructured data (illustrated in the sketch after this list)
  3. Data querying languages and APIs (SQL, Dataframes)
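
As a small example of the storage-format layer, the sketch below writes and reads a columnar Parquet file with pyarrow. The library choice and the table contents are assumptions used only for illustration.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a tiny in-memory table; real systems hold far larger datasets.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "JP"],
    "spend":   [12.5, 7.0, 30.2],
})

pq.write_table(table, "users.parquet")  # columnar, compressed on-disk format

# Column pruning: read back only the columns a query actually needs.
spend_only = pq.read_table("users.parquet", columns=["user_id", "spend"])
print(spend_only.to_pandas())
```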

Storage systems have strict requirements for fault tolerance and availability. They need to be highly available and function without data loss even if some of the computers holding the data fail.

Data Processing Systems

Data processing systems convert data into knowledge and intelligence. Depending on the use case, they need complex distributed systems to serve user requests. Unlike storage systems, these systems do not need strict fault tolerance or data guarantees. In most cases, if a computer fails while a computation is happening, the computation can be restarted; this only incurs a latency increase and no data loss. The performance of processing systems is measured in throughput and latency.
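
To give a flavor of such a system, here is a sketch using PySpark, one popular data processing framework (the framework choice, input path, and columns are assumptions for illustration). If a worker fails mid-job, Spark can recompute the lost partitions and the query still completes, at the cost of extra latency.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-rollup").getOrCreate()

# Hypothetical dataset of orders, partitioned across the cluster.
orders = spark.read.parquet("hdfs:///data/orders")

# A distributed aggregation, executed in parallel across many workers.
revenue_by_country = (
    orders.where(F.col("status") == "completed")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
)

revenue_by_country.show()
spark.stop()
```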

Data-intensive applications are diverse and encompass a wide variety of techniques. To truly understand how they work and scale, we need a deeper understanding of the underlying distributed systems.

Machine Learning Training

Modern large language models have billions or even trillions of parameters and require days or months of training on extensive clusters. Because they rely on a large number of accelerators exchanging data between training iterations, high-performance networks and computing infrastructure must work together seamlessly to keep those accelerators continuously busy.
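
Below is a minimal sketch of data-parallel training with PyTorch DistributedDataParallel, one common approach (the toy model, synthetic data, backend, and launch command are assumptions for illustration). Gradients are averaged across all workers after every backward pass, which is exactly the kind of collective communication that high-performance networks accelerate.

```python
# Typically launched with torchrun, e.g.: torchrun --nproc_per_node=8 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # NCCL assumed for GPU clusters
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()           # toy model standing in for a large network
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for step in range(100):
    inputs = torch.randn(32, 1024, device="cuda")    # synthetic batch for illustration
    loss = ddp_model(inputs).pow(2).mean()
    loss.backward()                                  # gradients are all-reduced across ranks here
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```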

Machine Learning Inference

Machine learning inference is also a compute-intensive task. Most inference can be done on a single accelerator, but bigger models need to be distributed across multiple accelerators, and that distribution demands high-performance networking.
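
As a sketch, large models are often spread across accelerators by placing different layers on different devices. The toy pipeline below does this by hand with PyTorch; it is a simplification of real tensor- or pipeline-parallel inference, and the layer sizes and two-GPU setup are assumptions.

```python
import torch

# Assume two GPUs; each holds half of the (toy) model's layers.
part1 = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU()).to("cuda:0")
part2 = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to("cuda:1")

@torch.no_grad()
def infer(batch: torch.Tensor) -> torch.Tensor:
    hidden = part1(batch.to("cuda:0"))
    # Activations cross the interconnect here; on multi-node deployments this is
    # where high-performance networking matters.
    return part2(hidden.to("cuda:1"))

output = infer(torch.randn(8, 4096))
print(output.shape)
```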

Why HPC?

A few factors are compelling data applications to move closer to the hardware than ever before:

  1. With the end of Moore’s Law for CPUs, we can no longer depend on hardware performance to exponentially increase and boost our application performance. It’s imperative to optimize the utilization of existing hardware to be cost-effective and meet performance demands.
  2. Additionally, hardware such as GPU clusters is costly, and not maximizing its performance wastes resources and money. To get the best performance, we need to rely on hardware such as high-performance networks, fast data storage, and accelerators like GPUs.

How does HPC Fit in?

For machine learning training and inference, HPC plays a crucial role as these are compute-intensive applications that require coordination among a large number of computing units, such as CPUs, GPUs, or custom hardware.

Data processing applications sit at the intersection of HPC and I/O-driven applications. While they can leverage HPC techniques, the gains are not as substantial as for ML applications because a significant portion of their runtime is spent on I/O operations. Nonetheless, the ability to utilize high-performance networks and storage can greatly benefit these applications.

Storage systems usually do not require much processing power. Instead, they need to support HPC hardware so that applications can fully utilize them. Online data ingestion and processing share similarities with storage systems, as they operate in an I/O-driven manner, focusing on handling data in real time.

Overall, certain applications require HPC with no real alternative. In cases where the benefits are only partial, resource limitations and costs can influence decisions about HPC usage. Nevertheless, designing applications to take advantage of HPC is generally beneficial across the board.

Summary

HPC is transitioning from purely scientific use to broader applications like machine learning and data processing. Traditional distributed computing applications such as data processing are being reworked to utilize HPC. HPC is becoming a huge part of modern applications like machine learning, data processing, and storage systems, and it is increasingly important to architect these applications to take advantage of high-performance hardware.
