Machine Learning on a Supercomputer

Dominic
Feb 27, 2022 · 9 min read


I am a corporate student at HPE, and as such I had the opportunity to work on a supercomputer (a.k.a. High Performance Computer) during my last internship at the Center of Excellence. There I could apply some knowledge from my study course Data Science by exploring possibilities to combine Machine Learning (ML) and High Performance Computing (HPC).

I discovered that it is quite hard to find information for newbies like me on the topics of HPC and ML. When my Artificial Intelligence professor suggested that we could write a blog post as part of the lecture, I already had this post in mind. Furthermore, I want to share my experience and some helpful resources with you.

“Innovation is taking two things that exist and putting them together in a new way.”

Tom Freston

In the spirit of this quote we will dive into two different subjects and then investigate how to combine them to open up new possibilities. We will start with an introduction to High Performance Computing and then dive into the Machine Learning world. I will keep it high-level, as this is supposed to be a starting point for your journey in ML/HPC. I want to spark your interest in further research into these topics.

This article will cover the following:

  1. Introduction to Supercomputers
  2. Architecture of Supercomputers
  3. Introduction to Machine Learning
  4. How to combine Supercomputers and Machine Learning

1. Introduction to Supercomputers

The history of supercomputers begins in 1964 with the CDC 6600, a computer that was almost three times faster than the previously fastest computer [1] [2]. Programs for these powerful machines were written in lower-level languages like Fortran, COBOL and BASIC, and later in C [3]. To this day, high-level languages like Python and Java are mostly avoided in high performance computing. The reason is that they abstract away the details of the computer for ease of use and leave developers with fewer options to use the underlying system efficiently [4]. This usually leads to a decrease in performance. However, as we will see later, Python is a high-level language with highly optimized Machine Learning libraries, which is why it is nevertheless used for Machine Learning on supercomputers.

Supercomputers are used to advance complex fields like fluid dynamics, drug discovery, climate simulation, astronomy and DNA sequencing. These disciplines usually require large amounts of data and computing power to simulate scenarios, estimate solutions or solve massive systems of simultaneous equations.

Since the invention of the computer, performance has improved consistently, enabling individuals, researchers and organizations to run ever more sophisticated software. You might assume that a supercomputer has one giant circuit with thousands of processors. Indeed, modern supercomputers have thousands of processors; however, they are not integrated into one circuit. There are several constraints on large circuit systems, such as the ability to remove heat and supply power, limits on how small transistors can be made, and the speed of light [5]. As a consequence, supercomputers are actually networks of individual compute nodes that each have their own circuitry.

2. Architecture of Supercomputers

Figure 1: Architecture of a Supercomputer (High Performance Computer)

Figure 1 shows the abstract architecture of a supercomputer.

  • Compute Nodes: On the top left you can see that one compute node has two CPUs (Central Processing Units) and memory on its circuit. Optionally, modern High Performance Computers have up to 8 GPUs (Graphics Processing Units) connected to a node. GPUs function as co-processors to the CPU; they are traditionally used to render and display thousands of pixels of video footage simultaneously. Yet their massively parallel computing cores and high bandwidth make them ideal for data-intensive tasks such as Machine Learning and simulations [6]. GPUs receive their instructions from the CPU; to integrate this logic into an application, parallel computing platforms like CUDA, OpenCL and OpenGL are used.
  • Login Nodes: On the left side of figure 1 you can see dedicated compute nodes, the login nodes, that are interconnected with the other nodes in the supercomputer. These nodes are accessible to the supercomputer's users over the internet or a network that the users are connected to. The other nodes are usually not directly accessible. To authenticate and establish a secure connection to the login nodes, SSH (Secure Shell) is used [7]. On the login nodes users can modify their code, transfer data and prepare a job; however, they are not used to run HPC applications. A job is similar to a shell script: it can contain environment variables, data sources, output destinations, parameters and instructions. Additionally, job scripts contain magic cookies that specify how to execute the script [8]. Script 1 shows an exemplary jobscript that specifies the following: the script is executed with BASH, it is submitted to the queue named “default_job_queue”, one “rome” node with 128 processes is selected, 128 GB of memory is requested, the wall time until the job is terminated is 24 hours, the job gets the name “my_job_name” and the output (including errors) of the job is written to a file named “logged_output_and_error.out”. A sketch of what such a jobscript could look like follows after this list. You can learn more about jobscripts here [9].
Script 1: Example magic cookies for the PBS batch system jobscript
  • Batch System / Job Scheduler: Users can submit their jobscript from the login nodes using the command of the batch system. There are several batch systems, including SLURM, PBS Pro, Cobalt and LSF. They offer different tools and handle submitted jobs slightly differently. Nonetheless, each of them has a command-line tool to submit a jobscript along with some parameters. These parameters are usually the same as the magic cookies; an example of this is shown below. The jobscript is read by the batch system and then queued. Once prior submissions in the queue have been processed and the specified hardware is available, the job scheduler claims the specified resources. If multiple nodes were requested, this group of nodes is now called a cluster. At this point the job is executed on that cluster.
qsub -q default_job_queue -N my_job_name pbs_job.sh 
  • Cluster: All compute nodes are connected with high-throughput interconnects such as InfiniBand, Slingshot or the Tofu interconnect. They are also connected to the storage nodes, which offer a parallel file system with high bandwidth to achieve state-of-the-art throughput. The fast connection is necessary to pass messages between nodes that compute parts of the same problem and to read and write files quickly.
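
To make the parameters described for Script 1 more concrete, here is a sketch of what such a PBS jobscript could look like. Treat it as an illustration rather than a copy-paste template: the exact directive syntax and resource names (for example the “rome” node type) depend on how your site is configured, and the application launch at the end is hypothetical.

#!/bin/bash
#PBS -q default_job_queue                              # submit to the queue named "default_job_queue"
#PBS -l select=1:node_type=rome:mpiprocs=128:mem=128gb # request one "rome" node with 128 processes and 128 GB of memory
#PBS -l walltime=24:00:00                              # terminate the job after 24 hours of wall time
#PBS -N my_job_name                                    # name of the job
#PBS -j oe                                             # merge error messages into the regular output stream
#PBS -o logged_output_and_error.out                    # write the output (including errors) to this file

cd "$PBS_O_WORKDIR"                  # run from the directory the job was submitted from
mpirun -np 128 ./my_hpc_application  # hypothetical application launch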

3. Introduction to Machine Learning

“[…] Machine Learning (ML) is a category of artificial intelligence that enables computers to think and learn on their own.”

Alzubi et al.

Machine Learning (ML) is a discipline that combines computer science and statistical methods to create algorithms that find patterns and connections in data. Importantly, ML algorithms are not explicitly programmed to find specific patterns; these patterns are learned from observations. Recommending products, identifying faces in pictures and communicating through a chatbot are some examples of ML.

Nowadays, immense amounts of data are collected, shared and processed by scientists, organizations and individuals. To gain information from diverse and complex data sources, it is impractical and in some cases impossible to write explicit instructions to extract the relevant information. The process of discovering previously unknown patterns in data is also referred to as Data Mining [10].

To ease your start into the field of ML, some terminology has to be clarified. For example, the relation between Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) is often confused. Therefore, the Venn diagram in figure 2 shows how these terms relate.

Figure 2: Venn diagram of AI, ML and DL
  • Artificial Intelligence (AI): AI is concerned with computer programs that exhibit intelligent behavior. This definition is quite vague; thus, the European Union published a report outlining traits that an AI must have [11]:
    - Perception of the environment
    - Information processing
    - Decision making
    - Achievement of specific goals
  • Machine Learning (ML): ML is a subset of AI. While AI includes explicit programming that emulates human intelligence, ML only includes data-driven algorithms that learn from observations. Statistical methods are used to learn a model of the data and its features. That model can later be used to extract information from previously unseen data [12]. For example, a self-driving car would stop at a red light even though it might not have seen this particular red light before. Instead, it has learned the features of a traffic light, and it has also learned the difference between a green light and a red light.
Figure 3: Deep Neural Network
  • Deep Learning (DL): DL is a subset of ML. DL uses Artificial Neural Networks (ANNs) that loosely mimic how a biological brain finds patterns in the world. A deep Neural Network (NN) is illustrated in figure 3. Such a deep NN is composed of multiple neuron layers. The first layer is the input layer, followed by multiple hidden layers, and finally there is an output layer. Each neuron receives input from neurons in the previous layer and sends an output to the next layer. The connections between the neurons have weights associated with them to emphasize some connections more than others. This is illustrated with the width of the connections in figure 3. A neuron sums all inputs it receives (plus a pre-existing bias, if it has one), applies an activation function and returns that activation. The activation is then multiplied by the weight of the connection before the next neuron receives it as an input. If you want to learn more about artificial neurons, I can recommend these sources [13] [14] [15]. A unique characteristic of DL compared to other ML algorithms is that DL can automatically learn and extract relevant features of the input data. For other ML algorithms, the scientist has to extract relevant features first before a model can be trained on them.
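
To make the forward pass through such a network more tangible, here is a minimal NumPy sketch of a tiny network with one hidden layer. The layer sizes, the random weights and the ReLU activation are arbitrary choices for illustration only.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # a common activation function

rng = np.random.default_rng(42)
x = rng.normal(size=4)  # input layer: 4 features

# one hidden layer with 3 neurons and an output layer with 2 neurons;
# the weight matrices play the role of the weighted connections in figure 3
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

hidden = relu(W1 @ x + b1)  # weighted sum of inputs plus bias, then activation
output = W2 @ hidden + b2   # the activations flow forward into the output layer
print(output)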

4. How to combine Supercomputers and Machine Learning

In section 2 we discovered that supercomputers are actually composed of multiple compute nodes in a fast interconnection network. Depending on the ML algorithm, it might be wise to scale it beyond the boundaries of one compute node. Just like high performance applications, large ML applications require massive computational power and bandwidth. However, the algorithm must be divisible into smaller tasks that can each be deployed on a different node.

So-called ensembles combine the predictions of multiple ML algorithms to increase the quality of the overall prediction [16]. There are sequential and parallel ensemble methods, but only parallel ensembles can be efficiently distributed among different nodes: with sequential ensembles, most nodes would idle while waiting for the input from the previous node. A small example of a parallel ensemble is shown below.
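
As a small illustration, the sketch below uses a bagging ensemble from scikit-learn (assuming scikit-learn is available). Its member models are trained independently of each other, which is exactly why this kind of parallel ensemble distributes well, whereas boosting-style ensembles build their members one after another.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# a synthetic classification dataset just for demonstration
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 32 decision trees trained independently of each other; n_jobs=-1 uses all local
# cores, and the same idea extends to training the members on different nodes
ensemble = BaggingClassifier(n_estimators=32, n_jobs=-1, random_state=0)
ensemble.fit(X_train, y_train)
print("test accuracy:", ensemble.score(X_test, y_test))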

There are two options to perform distributed ML on a supercomputer. These options are illustrated in figure 4 and described below.

Figure 4: Model and Data Parallelism [17]
  1. Model Parallelism: A deep neural network is distributed among nodes because the NN is too large to be computed on a single compute node. The distributed parts of the NN need to communicate their results frequently. This leads to communication overhead and increased complexity for the developers.
  2. Data Parallelism: Replicas of the same ML model are trained on disjoint subsets of a large dataset. The models communicate and synchronize their learning efforts through a server. “Learning effort” is a simplification; in reality the replicas exchange model parameters and gradients. If you want to learn more about the learning process of ML algorithms, read this [18]. A low-level sketch of data parallelism follows below.
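
To illustrate the idea behind data parallelism at a low level, here is a minimal sketch using mpi4py and plain NumPy (both assumed to be available on the system). Every rank trains the same linear model on its own data shard, and the gradients are averaged across all ranks in every step; note that this sketch synchronizes with an allreduce operation instead of a dedicated parameter server.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# every rank generates (or, in practice, loads) its own disjoint shard of the data
rng = np.random.default_rng(seed=rank)
X = rng.normal(size=(1000, 10))
true_w = np.arange(10.0)
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(10)  # identical initial model replica on every rank
for step in range(200):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)      # local gradient on this rank's shard
    mean_grad = np.empty_like(grad)
    comm.Allreduce(grad, mean_grad, op=MPI.SUM)  # sum the gradients of all ranks
    w -= 0.1 * (mean_grad / size)                # every rank applies the same averaged update

if rank == 0:
    print("learned weights:", np.round(w, 2))

Launched with, for example, mpirun -np 4 python data_parallel_sketch.py (assuming the sketch is saved under that name), each of the four processes could in principle run on a different compute node.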

There are frameworks that abstract away the complexity of the communication logic for distributed ML applications. This enables ML training that is orders of magnitude faster. However, supercomputers are subject to rigorous security restrictions, and software might have to be provided by the system administrators. Therefore, consulting the administrators should be the first step. Two such communication frameworks are Horovod, invented by Uber, and SmartSim, invented by CrayLabs, which was acquired by HPE (my employer) in 2019.

Horovod is focused on distributing ML workloads across GPUs on different nodes. SmartSim additionally enables communication between classical HPC applications written in low-level languages and ML code in Python [19]. The Horovod documentation [20] doesn’t offer a demo application for beginners; however, this post [21] explains the setup with a demo application. The SmartSim documentation [22] also offers great tutorials for writing a beginner-friendly distributed ML application. To give you an impression of what Horovod usage looks like, a minimal sketch follows below.
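
The sketch below shows the typical ingredients of data-parallel training with Horovod and PyTorch, assuming Horovod was built with PyTorch support on your system; the data loading and the training loop itself are omitted, so treat it as a starting point rather than a complete program.

import torch
import horovod.torch as hvd

hvd.init()  # one process per GPU, usually started by horovodrun or the batch system
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # bind this process to "its" GPU on the node

model = torch.nn.Linear(10, 1)
if torch.cuda.is_available():
    model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# wrap the optimizer so gradients are averaged across all workers after each backward pass
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# make sure every worker starts from the same initial parameters and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# build a DataLoader with a DistributedSampler so each worker sees a different data shard,
# then run an ordinary PyTorch training loop; launch the script with, for example:
#   horovodrun -np 8 -H node1:4,node2:4 python train.py

Scaling the learning rate with hvd.size() follows a common rule of thumb for data-parallel training, because the effective batch size grows with the number of workers.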

I hope you enjoyed this post. Feel free to write a comment or follow me. This is my first post on Medium, so if I made any mistakes, please let me know.
