Big data analytics and FPGAs

Philippos Papaphilippou
dunnhumby Science blog
3 min read · Nov 22, 2021


Everyone wants their machine learning models to run as fast as possible. Faster queries lead to many benefits, including more efficient hardware utilisation, and a better user experience.

For the past few years, my research at the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems at Imperial College London has been in partnership with dunnhumby. My particular focus has been on distinct counting, a common metric that is surprisingly computationally expensive, with the aim of enabling train-of-thought analysis.

But how can we achieve maximum performance on big data analytics using today’s chips?

General-purpose computing using central processing units (CPUs) has proven to be a powerful tool for serial algorithms. However, for naturally parallel problems, including dunnhumby's customer analytics, specialised processing can sometimes yield considerably higher throughput.

One platform for specialised computing is the field-programmable gate array (FPGA), which provides more flexibility than CPUs or even graphics processing units (GPUs), but is more costly to program and deploy. Each of these platforms has its own trade-offs.

In this article, I briefly discuss how my PhD used FPGAs as an acceleration platform in the context of big data analytics.

There are many platforms for processing big data (CPUs (left), FPGAs (right))

One advantage of FPGAs is that they enable the development of arbitrary logical circuits, which can translate into fine-grain parallelism. For instance, one of the main topics of my PhD was parallel sorting algorithms, which consist of parallel compare-and-swap units (sorters of 2 elements) as a building block.
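To illustrate the building block, here is a minimal software sketch of a compare-and-swap unit and a small sorting network built from it (the 5-comparator network for 4 elements; the function names are mine, for illustration only). On an FPGA, comparators that touch disjoint index pairs, such as the first two below, operate simultaneously in hardware rather than one after the other.

```python
def compare_and_swap(values, i, j):
    """The 2-element sorter: ensure values[i] <= values[j]."""
    if values[i] > values[j]:
        values[i], values[j] = values[j], values[i]

def sort4(values):
    """Sort 4 elements with a fixed network of 5 comparators.
    (0,1) and (2,3) touch disjoint indices, so in hardware they
    would run in parallel as a single stage."""
    for i, j in [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]:
        compare_and_swap(values, i, j)
    return values
```

Because the comparator sequence is fixed in advance and independent of the data, such networks map naturally onto pipelined circuits.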

My research introduced a fast, lightweight parallel sorting technique called FLiMS. It improved on the previous best approach by significantly reducing the pipeline length, leading to lower latency and more efficient use of the hardware.

FLiMS is a high-throughput parallel algorithm for merging two sorted lists. It can be applied recursively to accelerate merge sort, or modified to implement sort-merge joins. In one of our experiments on an embedded platform (Ultra96), the FPGA-based FLiMS sorter achieved a speedup of over 50 times compared to std::sort() running on the embedded ARM core, a promising result that is indicative of what can be achieved with higher-end components. This sorting approach extends naturally to distinct counting, including with a large number of group-bys: once the dataset is sorted, duplicate rows sit adjacent to one another and are easy to identify and remove.
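The two operations above, merging sorted runs and sort-based distinct counting, can be sketched in plain Python. This only mirrors the input/output behaviour: FLiMS itself is a hardware design that emits several elements per clock cycle, whereas this sequential version produces one at a time, and the helper names here are mine.

```python
def merge_sorted(a, b):
    """Merge two sorted lists into one sorted list (the operation
    FLiMS performs in parallel in hardware)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def count_distinct(rows):
    """Distinct counting via sorting: after a sort, equal rows are
    adjacent, so a single linear scan counts the unique ones."""
    rows = sorted(rows)  # stand-in for the FPGA merge sorter
    return sum(1 for k, r in enumerate(rows) if k == 0 or r != rows[k - 1])
```

The linear scan at the end is cheap; the expensive step is the sort itself, which is exactly what the FPGA accelerates.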

Click here to see the novel merging algorithm in action

I would like to thank dunnhumby for funding this PhD, and especially Chris Brooks and Rosie Prior for their involvement in the university partnership program, through which I was given the opportunity to explore a broader range of related emerging research topics.

As more cloud providers make FPGA instances available, a future of accelerated machine learning might be just around the corner.

Note: edited by dunnhumby
