ML Models on Petabytes of Data — You Need GPUs

Paul Lashmet
Product AI
Feb 3, 2022

Understanding how clients use a product requires analytics on billions of usage records, often petabytes of data. This post describes why GPUs are required to pull that off.

The Opportunity — Usage Data

A large-scale, client-facing application may have hundreds of users interacting with it, creating thousands of permutations of usage patterns that can easily generate hundreds of millions, if not billions, of complex, unstructured records every day. This acquired data is what describes usage: everything from detailed server-side API logs to comprehensive user-interaction traces that show the path a user traverses to reach a specific function.

All of this data, when mined for behavioral analytics, can provide deep and subtle insights into your users that make a real difference. For example, micro-segmentation across billions of records helps you understand how a very specific subset of clients uses the application; improve their experience, and perhaps a new stream of revenue is in the cards.
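As a rough sketch of what micro-segmentation over usage records might look like, the pandas-style snippet below aggregates per-client behavior and carves out one narrow segment. The file path, column names, and thresholds are hypothetical, not drawn from any real schema:

```python
import pandas as pd

# Hypothetical usage-log schema: client_id, feature_used, latency_ms, timestamp
logs = pd.read_parquet("usage_logs.parquet")  # assumed export of API / interaction logs

# Aggregate per-client behavior
per_client = (
    logs.groupby("client_id")
        .agg(events=("feature_used", "count"),
             distinct_features=("feature_used", "nunique"),
             p95_latency_ms=("latency_ms", lambda s: s.quantile(0.95)))
        .reset_index()
)

# Micro-segment: heavy users who touch many features but see high latency
segment = per_client[
    (per_client["events"] > 10_000)
    & (per_client["distinct_features"] > 25)
    & (per_client["p95_latency_ms"] > 500)
]
print(segment.head())
```

On billions of rows, this single-machine, CPU-bound style of analysis is exactly where the time problem described in the next section shows up.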

The Challenge — Time

The challenge in exploiting these opportunities is time. Artificial intelligence and machine learning (AI/ML) have advanced to the point that we can create a seemingly limitless set of algorithms, varying in type and complexity, to get the job done. The limitation then becomes how fast we can do it. Speed requires power, and that often comes down to an enormous amount of CPU capacity, which can be very expensive depending on the algorithms’ complexity and the time of day they run. Traditional Python libraries let us write and implement great algorithms, but they simply don’t scale: performing complex calculations on petabytes of data with CPUs alone is slow, inefficient, and often frustrating.

Imagine the Possibilities

Graphics Processing Units (GPUs) solve this challenge in an interesting way. While a large CPU might have 32 cores, a GPU can have thousands of cores optimized for exactly the kind of massively parallel processing this data demands. Using GPUs to boost processing power is not new; what is interesting, and what opens up more options, is the integration of software libraries that expose that power to data scientists. This is GPU-accelerated data analytics and ML.
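To make the core-count difference concrete, here is a minimal sketch using CuPy, a common Python interface to GPU arrays (chosen here as an illustrative stand-in; it assumes a CUDA-capable GPU is available). The same element-wise computation moves from NumPy on the CPU to the GPU with almost no code change:

```python
import numpy as np
import cupy as cp  # assumes CuPy is installed and a CUDA-capable GPU is present

n = 100_000_000

# CPU: NumPy runs this on a handful of cores
x_cpu = np.random.rand(n).astype(np.float32)
cpu_result = np.sqrt(x_cpu).sum()

# GPU: the same element-wise work is spread across thousands of GPU cores
x_gpu = cp.random.rand(n, dtype=cp.float32)
gpu_result = cp.sqrt(x_gpu).sum()

print(float(cpu_result), float(gpu_result))
```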

For example, NVIDIA has developed a suite of open-source libraries (under the name RAPIDS) that brings GPU-accelerated, scalable data processing and machine learning to Python. That is the piece missing from the challenge above: you can now implement end-to-end data science and analytics pipelines entirely on GPUs.
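Below is a minimal sketch of what such a pipeline could look like with two RAPIDS components, cuDF for dataframes and cuML for modeling. The file path, column names, and cluster count are illustrative assumptions, not a prescribed recipe:

```python
import cudf
from cuml.cluster import KMeans

# Load usage records directly into GPU memory (hypothetical parquet export)
df = cudf.read_parquet("usage_logs.parquet")

# Feature engineering in familiar, pandas-like syntax, executed on the GPU
features = (
    df.groupby("client_id")
      .agg({"feature_used": "count", "latency_ms": "mean"})
      .reset_index()
      .rename(columns={"feature_used": "events", "latency_ms": "avg_latency_ms"})
)

# Cluster clients into behavioral segments without leaving the GPU
kmeans = KMeans(n_clusters=8, random_state=0)
features["segment"] = kmeans.fit_predict(features[["events", "avg_latency_ms"]])
print(features.head())
```

When a dataset exceeds a single GPU’s memory, RAPIDS pairs the same dataframe API with Dask (Dask-cuDF) to spread the work across multiple GPUs, which is where the scaling comes in.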

Processing hundreds of billions of records sitting in a data lake becomes realistic, with enterprise-scale workloads executed in minutes rather than hours or days. Imagine using a single server to process this data and, rather than waiting around for results, spending your time innovating. To make that a reality, you need GPUs.


Paul Lashmet is a business integration architect and financial services subject matter expert.