HPC with a little help from my friends

Gábor Samu
IBM Data Science in Practice
Feb 18, 2021
[Image: one person helping another on a climb at the top of rocks, with the sun low on the horizon]

Have you ever ordered a meal only to be totally overwhelmed by a mountain of food? Or gone down to the hardware store and purchased something after eyeballing a measurement? Or heard the proverbial tale about the huge fish that got away — after almost sinking the boat.

Like it or not, humans aren’t always great at estimating. In the case of the restaurant, we can always ask for a doggy bag to bring home the leftovers. For the hardware store, the item can be returned. And as for the fishing tale, well, it comes down to credibility. In all of these cases there is potential waste: a waste of food if we end up forgetting about the extra portion we put into the refrigerator, a waste of materials if it was not cut to the right size, and a wasted opportunity to cook that big fish. And when it comes to high performance computing (HPC), things are no different. Organizations today use HPC as a competitive advantage, helping them bring better quality products to market faster. For this, large investments are made in HPC infrastructure, either on-prem or in the cloud. And it’s crucial that organizations extract the maximum performance from their HPC by eliminating waste of computing resources.

Users of HPC are often some of the brightest minds in their field. Yet they may not be good judges of how much memory their jobs need. Modern HPC environments often seem to have limitless resources such as memory, cores and storage. But that all comes at a cost. Knowing this, users usually overestimate the memory needed for their jobs. This is not done with malicious intent; rather, it’s about asking for enough to be on the “safe side”. After all, nobody wants to re-run a job after it has failed due to insufficient memory. HPC job schedulers dutifully look for servers to satisfy the user’s request. From the users’ perspective all is well, but when we peel back the onion, we sometimes observe that the job uses much less memory than requested.

On the surface, this may not seem like a big deal. But imagine the scenario where a costly large-memory cloud instance was used when a less costly resource would have been sufficient. Users may also be unaware that their job held up other jobs which genuinely needed the large amount of memory requested. So users could unknowingly be impacting the jobs of their peers, which have been patiently waiting in the queue. The end result is waste: wasted compute cycles and, in the case of cloud, wasted spend on more costly resources. Multiplied across many users, this can have a significant effect on overall throughput in an HPC cluster and on cloud expenditures. Poor utilization of HPC resources can decrease the ROI on infrastructure, negatively impact time to market and ultimately hurt the bottom line of an organization. With detailed monitoring and oversight, HPC admins can spot these patterns and work to educate users. But this is time consuming, and it’s all too easy for users to fall back into bad habits.

The forest from the trees

We’ve all heard the expression that data is power, and HPC job schedulers provide a wealth of data. But going through this data takes skill and time, and trying to identify patterns can be like looking for a needle in a haystack. In the extreme case, HPC clusters can comprise thousands of servers, serve hundreds of users and process millions of jobs per day, so identifying cases where the requested memory greatly exceeds the memory a job actually used takes real investigative skill. Admins are always in need of tools to help make their job of managing such large-scale environments as efficient as possible. What if you could apply some intelligence using historical job data to help guide the scheduling of jobs, for example by identifying patterns of overestimated memory requests and applying a corrective action? Or in simpler terms, a helping hand for your HPC environment.
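As a back-of-the-envelope illustration (and not a Spectrum LSF feature), here is a minimal Python sketch of the kind of pattern an admin might hunt for in historical job records. The CSV file and its column names are assumptions for the example, not the actual LSF accounting format.

```python
# Minimal sketch: flag jobs whose requested memory greatly exceeds what they used.
# The CSV layout (job_id, user, queue, mem_requested_mb, mem_used_mb) is a
# hypothetical export of job accounting data, not the Spectrum LSF format.
import csv
from collections import defaultdict

OVERESTIMATE_FACTOR = 4  # flag jobs that asked for 4x (or more) the memory they used

waste_by_user = defaultdict(float)

with open("job_history.csv", newline="") as f:
    for row in csv.DictReader(f):
        requested = float(row["mem_requested_mb"])
        used = float(row["mem_used_mb"])
        if used > 0 and requested / used >= OVERESTIMATE_FACTOR:
            waste_by_user[row["user"]] += requested - used

# Report the users whose jobs habitually tie up memory they never touch.
for user, wasted_mb in sorted(waste_by_user.items(), key=lambda kv: -kv[1]):
    print(f"{user}: ~{wasted_mb / 1024:.1f} GB requested but unused across flagged jobs")
```

Even a crude report like this surfaces the habit, but doing it by hand, at scale, for every user and queue is exactly the tedious work admins don’t have time for.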

The future is here

With over 28 years of experience in HPC resource and workload management, IBM has released new capabilities that use AI to optimize performance and resource usage in IBM Spectrum LSF environments. IBM Spectrum LSF Predictor uses IBM AutoAI to create and train AI models that predict memory use and job runtimes. This capability enables organizations to optimize HPC utilization by rightsizing workloads, and better utilization in turn improves time to results.

IBM Spectrum LSF Predictor features a GUI-driven wizard that walks you through creating a prediction, from selecting the data sources through to training and deploying the model. It supports predictions of both job memory requirements and job runtimes. Historical job data from Spectrum LSF is loaded into AutoAI automatically, and the user can select which job attributes should be considered by the prediction. These attributes include things such as the submitting user name, queue and job group, just to name a few. Finally, it makes it easy to compare model results, letting the user determine which candidate model is the best one to deploy in the Spectrum LSF environment.
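To give a feel for the underlying idea, here is a rough Python sketch that uses scikit-learn as a stand-in for what AutoAI automates. The data file, feature names and model choice are illustrative assumptions, not the Predictor’s actual pipeline.

```python
# Rough sketch of the idea: learn peak memory usage from job attributes.
# scikit-learn stands in here for what IBM AutoAI automates; the column names
# (user, queue, job_group, mem_used_mb) are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

jobs = pd.read_csv("job_history.csv")
features = ["user", "queue", "job_group"]   # categorical job attributes
target = "mem_used_mb"                      # what we want to predict

X_train, X_test, y_train, y_test = train_test_split(
    jobs[features], jobs[target], test_size=0.2, random_state=0
)

model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), features)]
    )),
    ("regress", GradientBoostingRegressor()),
])
model.fit(X_train, y_train)

# A holdout score gives a first impression of how well job attributes explain memory use.
print("R^2 on held-out jobs:", model.score(X_test, y_test))
```

The point of the Predictor’s wizard is that you don’t write any of this yourself: candidate models are built, trained and compared for you, and you simply pick the one to deploy.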

[Image: screenshots of IBM Spectrum LSF]

The AI models created by Spectrum LSF Predictor can be used in three different ways. Passive mode records the requested and predicted resources. Advisory mode informs users of the accuracy and effectiveness of their memory and runtime requests. Active mode modifies the memory and runtime requests to match the predictions.
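Purely to illustrate how the three modes differ in behaviour, here is a conceptual sketch; it is not Spectrum LSF configuration or API, and the function and values are made up for the example.

```python
# Conceptual sketch only: how a memory prediction could be applied in each mode.
# This is not Spectrum LSF configuration or API, just an illustration of the behaviour.

def apply_prediction(requested_mb: float, predicted_mb: float, mode: str) -> float:
    """Return the memory request the scheduler would actually use."""
    if mode == "passive":
        # Record both values for later analysis; the request is untouched.
        print(f"recorded: requested={requested_mb} MB, predicted={predicted_mb} MB")
        return requested_mb
    if mode == "advisory":
        # Tell the user how far off their estimate was, but still honour it.
        print(f"advice: you asked for {requested_mb} MB, "
              f"~{predicted_mb} MB is likely enough")
        return requested_mb
    if mode == "active":
        # Replace the request with the prediction so the job is right-sized.
        return predicted_mb
    raise ValueError(f"unknown mode: {mode}")

print(apply_prediction(65536, 6144, "active"))  # job now scheduled with ~6 GB
```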

[Image: two boxes, one messily packed and the other well packed]

In life, we can always use a little help from our friends. Think of Spectrum LSF Predictor as a helping hand for your HPC environment, one that helps it make good decisions and keeps it on the straight and narrow. Learn more about IBM Spectrum LSF Predictor in this technical blog.

Gábor Samu
IBM Data Science in Practice

Senior Product Manager at IBM specializing in Spectrum Computing products. Over 20 years of experience in high performance computing technology. Retro computing fan.