Running out of RAM in Tensorflow?

Using Deephaven Tables to Back Large Tensorflow Datasets

Deephaven Data Labs
4 min read · Sep 22, 2020

By Matthew Runyon


Deephaven can be used to store and manipulate large amounts of data quickly and efficiently. Some data sets can exceed hundreds of gigabytes, far more memory than a server has available. That can cause problems when analyzing your data with an external library such as Tensorflow.

In this article, we will cover how to use generators to feed a Deephaven table that is too big to fit in memory into Tensorflow.

What is a Generator?

A generator is a function that yields values one at a time and may produce an unbounded sequence. Generators are useful when memory is a concern and the values follow some pattern. You can think of a generator as a function that pauses every time it yields a value and resumes when the caller asks for the next one.
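For example, a minimal Python generator that produces an unbounded sequence of integers looks like this:

```python
def count_up(start=0):
    """Yield integers forever, starting from `start`."""
    n = start
    while True:
        yield n      # pause here until the caller asks for the next value
        n += 1

counter = count_up()
print(next(counter))  # 0
print(next(counter))  # 1 -- the function resumes right after the yield
```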

Conveniently, Tensorflow supports generators as a data source, in addition to loading data in memory or from a directory of files. The benefit of using a generator or a directory of files is that the data can exceed the amount of available memory without causing errors. Generators are a good choice here because we can use the Deephaven Query Language to quickly filter our data and then read it from the table. One important thing to note is that Tensorflow requires a generator to return both the features and the labels for each set of data, so the syntax when training a network is slightly different.
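As a rough sketch of what this looks like, a generator that yields (features, labels) pairs can be wrapped in a tf.data.Dataset; the toy NumPy data here is purely for illustration:

```python
import numpy as np
import tensorflow as tf

def toy_batches(num_batches=10, batch_size=32, num_features=4):
    """Yield (features, labels) pairs; each yielded pair is treated as one batch during training."""
    for _ in range(num_batches):
        x = np.random.rand(batch_size, num_features).astype(np.float32)
        y = np.random.randint(0, 2, size=batch_size).astype(np.float32)
        yield x, y

dataset = tf.data.Dataset.from_generator(
    toy_batches,
    output_types=(tf.float32, tf.float32),
    output_shapes=((None, 4), (None,)),
)
```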

Loading a Table in Chunks

In an earlier article, we covered how to use different libraries and data structures with Deephaven tables. Building on that, we can quickly convert a Deephaven table in Python to a Tensorflow tensor. The only problem is that the data we care about in this article is too big to fit in memory, so trying to load the entire table would simply produce an out-of-memory error. We could load the table one row at a time, but that would be extremely slow and would not make use of the memory we do have.
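As a reminder of that conversion, turning a whole table into a tensor only takes a couple of lines, which is exactly why it is tempting and exactly where memory runs out. The `table_to_dataframe` helper below is a placeholder; the exact call depends on your Deephaven Python API:

```python
import tensorflow as tf

# Placeholder: replace with your Deephaven API's table-to-pandas conversion.
df = table_to_dataframe(my_table)

# This materializes the entire table in memory before handing it to Tensorflow,
# which fails once the table no longer fits in RAM.
features = tf.convert_to_tensor(df.values, dtype=tf.float32)
```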

We can solve this by loading the table in chunks. Since generators can keep state between yields, we can pick a chunk size that we know fits in memory and loop until we run out of rows. This seems like a good solution except for one part: Tensorflow treats each yielded generator value as a batch, and large batch sizes hurt the generalization of a neural network.

Mixing Caching with Small Batch Sizes

Since we want a small batch size, we could simply set our chunk size to something like 32 rows. This would avoid the problems caused by training with large batches, but we are back to the inefficiency of loading tiny amounts of data from our table at a time.

Instead, we can use our generator to cache a large section of our table in RAM while yielding small batches from the cached rows. This way we get the speed benefit of loading millions of rows at a time without the drawback of training our neural network on giant batches.

Combining these ideas, and accounting for cases where our batch size and cache size do not align (we still want to return full batches until the very last one), we end up with something along the lines of the following code:
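A sketch of such a generator is shown below. It assumes a hypothetical helper `table_rows_to_numpy(table, start, end)` that converts a range of table rows to a NumPy array, and a `.size` attribute for the row count; the exact Deephaven calls depend on your API version:

```python
import numpy as np

def table_batch_generator(x_table, y_table, chunk_size=1_000_000, batch_size=32):
    """Yield (features, labels) batches from Deephaven tables.

    Caches chunk_size rows in RAM at a time and yields batch_size rows per step.
    table_rows_to_numpy is a placeholder for your Deephaven table-to-NumPy
    conversion; the row count lookup is likewise an assumption.
    """
    total_rows = x_table.size          # assumption: table exposes its row count
    start = 0
    x_rest = y_rest = None             # rows carried over between cached chunks

    while start < total_rows:
        end = min(start + chunk_size, total_rows)

        # Cache a large chunk of rows in memory.
        x_cache = table_rows_to_numpy(x_table, start, end)
        y_cache = table_rows_to_numpy(y_table, start, end)

        # Prepend leftovers from the previous chunk so batches stay full
        # even when chunk_size is not a multiple of batch_size.
        if x_rest is not None and len(x_rest) > 0:
            x_cache = np.concatenate([x_rest, x_cache])
            y_cache = np.concatenate([y_rest, y_cache])

        # Yield as many full batches as the cache holds.
        n_full = (len(x_cache) // batch_size) * batch_size
        for i in range(0, n_full, batch_size):
            yield x_cache[i:i + batch_size], y_cache[i:i + batch_size]

        x_rest, y_rest = x_cache[n_full:], y_cache[n_full:]
        start = end

    # Whatever is left becomes the final, possibly smaller, batch.
    if x_rest is not None and len(x_rest) > 0:
        yield x_rest, y_rest
```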

This may look a bit daunting, but it follows the logic described so far:

  • We cache a large number of rows (defaults to 1 million),
  • yield small batches from our generator (defaults to 32, which is the Tensorflow default),
  • and cache a new chunk after we yield all the rows in our current cache.

To use our generator, we need our feature values and labels in Deephaven tables, split into training, test, and validation sets. Below is an example of how we can use the generator with existing x_train_table and y_train_table variables in our Python environment. Remember, these variables are Deephaven table objects. We also assume a model has already been defined in Tensorflow and stored in the variable model.
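A sketch of that usage, assuming the generator above and a compiled Keras model stored in model; the dtypes and shapes are placeholders for your own feature and label columns:

```python
import tensorflow as tf

NUM_FEATURES = 10  # placeholder: the number of feature columns in x_train_table

# Wrap the generator in a tf.data.Dataset so it can be re-created each epoch.
train_dataset = tf.data.Dataset.from_generator(
    lambda: table_batch_generator(x_train_table, y_train_table),
    output_types=(tf.float32, tf.float32),
    output_shapes=((None, NUM_FEATURES), (None,)),
)

# model is the Tensorflow model defined earlier in our environment.
model.fit(train_dataset, epochs=5)
```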

Wrapping Up

Deephaven is a great platform for storing and manipulating large amounts of data — so much data that sometimes you can’t fit it all in memory! Tensorflow benefits from large sets of data, so using Deephaven as the source of a big data set for Tensorflow is a natural connection. With the ideas and code shown in this article, you can easily use your giant Deephaven data sets to train a neural network in Tensorflow while keeping the benefits and speed of the Deephaven query language.
