Remedial Data Science Engineering

Orion Delwaterman
Published in In the weeds · May 14, 2018

Data scientists, hard at work!

At Greenhouse, we have just launched our first customer-facing machine learning feature: Greenhouse Predicts. Greenhouse has always supplied meaningful data to our customers through various reports and charts, so we are quite comfortable working with data. But machine learning is new to us, and there turned out to be quite a few things we needed to learn along the way.

For GH Predicts, I was responsible for the tools and systems for deploying our machine learning predictive model. I thought this would be a simple API server that would just require me to learn a few libraries to interact with our model. As it turned out, there were a lot of things our team would have to contend with. This post is the first in a series I am writing about those hard lessons. We are calling the series “Remedial Data Science Engineering”, to give you a sense of what we were in for.

Part 1 — Real Memory Constraints

Setup

Typically, ML-enabled predictive services like this require two steps: training and predicting. First, we take a lot of data and run it through some algorithms to train a model. Second, we put that model into some API server, feed it data from customers, and it spits out predictions.

The training step is an “offline” process. We do this on a periodic basis and store the resulting model for later use. We are not overly concerned with performance, since the job runs infrequently and can take an arbitrary amount of time. We settled on doing a monthly training run and storing the resulting model as a pickle file, a Python file format for serializing generic objects. During the initial development phase, our data scientist was training a small model on his laptop in a Jupyter notebook, using a small subset of our total data.

The second step, predicting, is the “online” part. We decided to build a small JSON API service, Prediction API. Prediction API would boot up, pull the model pickle file from S3, and then take simple API requests to make predictions. We settled on building a simple Flask application that connected to S3.
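
To give a rough sense of the shape of that service, here is a minimal sketch (not our actual code: the route, request fields, bucket and key names are placeholders, and it assumes a scikit-learn-style model with a predict method):

import pickle

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)

# Bucket and key names are placeholders for this sketch
S3_BUCKET = "example-models-bucket"
MODEL_KEY = "model.pkl"


def load_model():
    # Pull the pickled model down from S3 at boot time
    body = boto3.resource("s3").Object(S3_BUCKET, MODEL_KEY).get()["Body"]
    return pickle.loads(body.read())


model = load_model()


@app.route("/predictions", methods=["POST"])
def predict():
    # Hypothetical request shape: {"features": [...]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict([features]).tolist()})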

Most of my software development experience is with web applications, so when I think about application constraints, I am usually focused on the speed of requests, the speed of database queries, and limiting the amount of data needed to fulfill requests. The Prediction API server didn’t connect to a database, so I focused on keeping request times small. It turns out the speed of predictions would be the least of my problems.

Not so smooth sailing

The problems we encountered began when we first started using production-size data. When I was first building the server, the pickle file for the model was sizable but manageable at 87 MB. When we started to use all of our production data to train the model, the pickle file jumped to 19 GB. Let that sink in for a second. Our Prediction API server needed to hold a 19 GB object in memory. The laptop I develop on only has 16 GB. This is a memory hog!

Our Infrastructure team uses Kubernetes, along with a lot of custom tooling, running on top of AWS servers with 32 GB of memory to deploy all of our applications. Our model was 19 GB, so in theory we should have been able to boot one server with this new model. The Kubernetes pod would take almost all the resources of a server, but it was possible, right? I configured the server to use all the memory possible, pointed it to the large model file on S3, booted it, and got… OOMed (Out of Memory). I took a big sigh and dug in.

The problem turned out to be with how we loaded the model. We were using boto3 to pull down the pickle file from S3, and then pickle to load the object (this is a simplified version of our code):
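
A minimal sketch of that approach, with placeholder bucket and key names:

import pickle

import boto3

s3 = boto3.resource("s3")
# Bucket and key names are placeholders
s3_object = s3.Object("example-models-bucket", "model.pkl")

# Read the entire pickle file into memory as one bytes object...
raw_data = s3_object.get()["Body"].read()

# ...then deserialize it into the model object
model = pickle.loads(raw_data)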

So, following the logic of the code, we create an S3 object, read down the entire contents of the file as binary data, store it in raw_data, and then use pickle.loads to build the object. That means that, for a moment in time, we were essentially storing the model twice: once as raw binary data and again as the actual object we were building. That’s 38 GB of memory; of course we were going to OOM the box.

Given the size of this data, we could no longer naively download it willy-nilly. Fortunately, the pickle library has a load method, which takes an IOStream-like thing to build an object (it relies on duck typing rather than checking the class of the object). Without diving too deep into the details, this allows pickle to read a chunk of bytes, build some part of the object, discard the bytes (allowing them to be garbage collected), and then repeat the process until all the parts of the object are loaded. We could stream the data from S3 and not have to store the whole thing twice over.
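
In other words, the difference is whether we hand pickle all the bytes up front or give it something it can read from as it goes (the file name here is just a placeholder):

import pickle

# pickle.loads needs the entire serialized object in memory up front
with open("model.pkl", "rb") as f:
    model = pickle.loads(f.read())

# pickle.load reads from any file-like object incrementally as it builds the object
with open("model.pkl", "rb") as f:
    model = pickle.load(f)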

Unfortunately, pickle.load requires an object which implements read and readline. The object returned by s3_object.get()['Body'], a StreamingBody from the botocore library, does not implement those methods (Maybe one day Amazon will fix this…maybe). So this was going to require a little more hacking.

I dug through botocore’s repo and discovered that StreamingBody uses a raw socket to pull down its data. Python has a pretty good IO library, including io.BufferedReader, which is ideal for reading such a large file. So with a little hacking we could transform botocore’s StreamingBody into an io.BufferedReader:
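
Roughly, the hack looks like this (a sketch only: it leans on botocore’s private _raw_stream attribute and assumes the underlying urllib3 response behaves like a raw binary stream, so it could break between library versions):

import io
import pickle

import boto3


def open_model_stream(bucket, key, buffer_size=16 * 1024 * 1024):
    # Get the StreamingBody for the S3 object...
    body = boto3.resource("s3").Object(bucket, key).get()["Body"]
    # ...and wrap its underlying raw stream in a buffered reader,
    # which gives us the read/readline interface pickle.load expects
    return io.BufferedReader(body._raw_stream, buffer_size=buffer_size)


# Stream the pickle straight into pickle.load instead of reading it all first
# (bucket and key names are placeholders)
model = pickle.load(open_model_stream("example-models-bucket", "model.pkl"))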

With this code in place, and confident in my success, I booted the server with the new code and got… OverflowError: signed integer is greater than maximum. Wait… what?

You had one job

What in the world was I doing that would cause an overflow error? Looking through the stack trace, I noticed this C-level call for reading bytes:

File "/Users/orion.delwaterman/.pyenv/versions/3.6.1/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 625, in read
v = self._sslobj.read(len, buffer)

I then took a quick peek at the Python source code to find the underlying C function, and saw this:

//https://github.com/python/cpython/blob/master/Modules/_ssl.c#L2361-L2364
static PyObject *_ssl__SSLSocket_read_impl(PySSLSocket *self, int len, int group_right_1, Py_buffer *buffer)

This is the C-level method for reading bytes from an SSL stream, and the second argument, len, is a signed integer. That means pickle was asking for more than 2³¹-1 bytes to load a single primitive object (this turned out to be a particularly large numpy array within our probabilistic model, weighing in at 5,342,077,759 bytes; yeah, that’s more than 5 billion bytes). The way the SSL module was written, it never anticipated a single read that large.

I didn’t want to patch Python just to get around this issue with SSL, so I decided to write an intermediate buffer. This buffer would implement the same read method but would check the size requested. If the request was larger than the maximum signed integer, it would split it into multiple chunks, download each chunk individually, and then join them back together at the end.

Yes…I wrote a buffer for a buffer:
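
Here is a sketch of what that looks like (the class name and constant are illustrative, not the exact code we shipped):

# The largest read the ssl module's signed 32-bit `len` parameter can take
MAX_SSL_READ = 2**31 - 1


class ChunkedReadBuffer:
    """Wraps a file-like object and splits oversized reads into SSL-safe chunks."""

    def __init__(self, raw):
        self._raw = raw  # anything with read() and readline(), e.g. a BufferedReader

    def read(self, size=-1):
        if size is None or size < 0:
            return self._raw.read()
        chunks = []
        remaining = size
        while remaining > 0:
            # Never ask the underlying stream for more than the SSL layer can handle
            chunk = self._raw.read(min(remaining, MAX_SSL_READ))
            if not chunk:
                break  # end of stream
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)

    def readline(self):
        return self._raw.readline()


# Usage with the buffered S3 stream from the previous snippet:
# model = pickle.load(ChunkedReadBuffer(open_model_stream(bucket, key)))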

This got the server booting, and we were able to start serving predictions using production models.

Now you know, and knowing is…

There would be more lessons to come while working on this project. We eventually found a way to shrink the model to a reasonable size, but that’s for another blog post. From this episode, I learned that you need production-size data as soon as possible. Most of the libraries and tools web developers tend to work with are optimized for “normal”-sized files. When working with large datasets, you are not always going to be able to rely on these tools. So be prepared to dig into them, learn how they operate, and hack around them when necessary.

Interested in working on these types of problems, or in learning more about the company? Check out our job board; we are hiring: https://grnh.se/jwi0xc
