Storage for AI in 2019

The joyous annual ritual of Making Predictions for the New Year

This blog is from the desk of WekaIO CTO Andy Watson.

As the CTO of WekaIO, I talk almost every day with people responsible for the infrastructure supporting ML (Machine Learning), typically because they are deploying (or soon will be) large numbers of GPUs. If you only have a few, you will be compute-bound, but once you have many GPUs the challenge becomes a Gordian Knot of I/O bottlenecks. Huge data sets (typically hundreds of terabytes, growing rapidly into the multi-petabyte range) are accessed randomly by thousands of devices, each sitting on 100-GbE or InfiniBand pipes.

At WekaIO, we love this.

As I look over the horizon at 2019 and into the near edge of 2020, it is easy to predict that these data sets (of images, videos, IoT sensor readings, and so on) will be even larger. I could predict that researchers training AI will harness ever-greater numbers of GPUs, each of them more powerful than the last, feasting insatiably on unprecedented quantities of data. But, I mean… Duh! What kind of insightful prediction is that?

What may not be obvious, though, is that there is a sea change sweeping across the AI field. DeepMind (a leader in ML, based in the UK and owned by Google/Alphabet) published a paper which is getting a lot of attention [“Relational inductive biases, deep learning, and graph networks”, https://arxiv.org/abs/1806.01261v3] and they are probably not the only smart people in AI who are coming to similar realizations. One takeaway for me is that the process of ML may become more freehand, allowing the software to affect selection criteria (via patterns of reasoning) for its learning pathways. From the perspective of infrastructure, my prediction for 2019 and beyond is that this will have consequences for the data set.

Within any large data set, most of us are familiar with the concept of the “working set”: the subset of data that is currently most active. Often the working set is dominated by the most recent data. With this sea change, it is about to become more difficult to anticipate a priori which of the many petabytes accumulated in a large ML data set will become members of any given working set. It may become appropriate to treat the whole thing as belonging to one or more working sets for various ML events in progress. Think about that: the whole thing!
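To make the working-set idea concrete, here is a small Python sketch. It is only a toy model: the file counts, the 5% "hot" region, and the 99% recency bias are numbers I made up for illustration, not measurements from any real ML workload. It compares how much of a data set gets touched when reads favor the newest files versus when reads land anywhere.

```python
import random

def working_set_fraction(num_files, num_reads, pick):
    """Fraction of the data set touched by num_reads accesses."""
    touched = {pick(num_files) for _ in range(num_reads)}
    return len(touched) / num_files

def recent_biased(num_files, hot_fraction=0.05, hot_probability=0.99):
    """Hypothetical 'classic' pattern: nearly all reads hit the newest files."""
    hot_count = max(1, int(num_files * hot_fraction))
    if random.random() < hot_probability:
        # Highest indices stand in for the most recently written files.
        return num_files - 1 - random.randrange(hot_count)
    return random.randrange(num_files)

def uniform(num_files):
    """Hypothetical 'new' pattern: any file is as likely as any other."""
    return random.randrange(num_files)

random.seed(0)
N, READS = 10_000, 50_000  # illustrative sizes only
classic = working_set_fraction(N, READS, recent_biased)
whole = working_set_fraction(N, READS, uniform)
print(f"recency-biased reads touched {classic:.0%} of the data set")
print(f"uniform reads touched {whole:.0%} of the data set")
```

With the same number of reads, the recency-biased pattern touches only a small slice of the files, while the uniform pattern touches nearly all of them. That second case is the one where a small fast tier in front of a big slow one stops helping.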

At WekaIO, we will love this.

By 2018, ML at scale had already become an I/O problem so intimidating that it demands the best from the WekaIO Matrix™ file system software. Conveniently for us, that sets the bar too high for competitive alternatives. Data sets overall were already growing exponentially; with this evolutionary step as we enter 2019, the bar moves even farther out of reach for every other file system, because the entire data set will often effectively be the working set, requiring the fastest tier of high-performance storage. For WekaIO, this will be a slow pitch up the middle because our software benefits from increasing scale. The larger a WekaIO cluster becomes, the more robust it will be in every dimension, including performance. So, while I’ve been reading about this latest trend in ML research, I’ve been smiling.

But there is more to this than just more data. (Although that is admittedly the lens through which a WekaIO file system advocate cannot avoid seeing this emerging trend.)

The larger body of actively-accessed data is also going to be interpreted differently. One concept I like to think about is called (in the DeepMind paper) “combinatorial generalization.” In a nutshell, it implies that instead of interpreting the data in one pass, multiple passes are taken together. The outcomes will result in the AI getting better. Smarter. That is encouraging.

And lastly, there is an interesting technique also mentioned in the DeepMind paper, and it is already taking hold if the NVIDIA GTC Conference in Washington, DC, is any indication. I saw several presenters there using an emerging technique in which numerical analysis yields graphically plotted results on a chart of some kind. In the next step, the chart itself is subjected to an ML process of image interpretation. Some researchers have been doing this with whole bodies of data, generating many charts and using the resulting images of those charts to train their AI. It turns out that this can produce results that nobody has yet figured out a better way to obtain. Although I don’t know whether this will affect the I/O workload, I thought it was clever and wanted to mention it here. How does it affect my prediction for 2019? I expect that we will start seeing a great deal of data plotted out and fed back into ML systems, and that the result will be new insights we never expected.
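For readers who want to picture the chart-as-image step, here is a minimal Python sketch. It is not what any of the presenters actually ran; the function name rasterize, the 32x16 grid, and the sine-wave input are all my own assumptions. It simply turns a numeric series into a small grid of pixels, which is the kind of array an image model would take as input. A real pipeline would more likely render charts with a plotting library and hand them to a trained image classifier.

```python
import math

def rasterize(series, width=32, height=16):
    """Plot a numeric series as a height x width grid of 0/1 pixels."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a flat series
    image = [[0] * width for _ in range(height)]
    for col in range(width):
        # Sample the series at this column's horizontal position.
        idx = col * (len(series) - 1) // max(1, width - 1)
        level = (series[idx] - lo) / span  # 0.0 at the minimum, 1.0 at the maximum
        row = (height - 1) - round(level * (height - 1))  # row 0 is the top
        image[row][col] = 1
    return image

# Illustrative input: one period of a sine wave standing in for real results.
signal = [math.sin(2 * math.pi * t / 64) for t in range(64)]
chart = rasterize(signal)

# Print the "chart" so a human can see what the image model would see.
for row in chart:
    print("".join("#" if px else "." for px in row))
```

Each column of the grid carries exactly one lit pixel, so the shape of the curve is preserved while the raw numbers are discarded. That is the trade the technique makes: the model learns from what the plot looks like rather than from the values themselves.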

And at WekaIO we love it, even if it doesn’t affect the I/O workload at all.