Rich Data Science APIs at Rue La La

How we store and retrieve large volume Data Science data at Rue La La

Stephen Harrison
Rue Gilt Groupe Tech Blog
6 min read · Mar 23, 2018


How do we use DynamoDB for rich API data?

The main story

This article is about one aspect of a larger project we did at Rue La La. See the following for the big picture.

The upshot

We call DynamoDB directly from API Gateway to support our data science APIs, without an interstitial Lambda. Using a Lambda to map API requests and responses to a database can have undesirable performance characteristics, most notably the cold-start problem.

DynamoDB’s storage format exposes type information that isn’t suitable to pass on to applications consuming the Data Science API. The API Gateway response mapping is an unforgiving place to parse out this type information, so we end up storing JSON blobs and returning those directly. It’s brilliantly fast but a bit quirky.

We store sparse matrices of 20,000,000,000+ elements in DynamoDB for recommendation APIs. We’ll show you our solution.

Requests and responses

API Gateway uses Velocity templates to map requests and responses, and Velocity is unlikely to be anyone’s favorite programming environment. So if you go the direct-integration route, you have to write Velocity code to parse out the exposed DynamoDB type information.

The template context contains many details of the request, so you can use values from that context when you construct the mapped request and response. In addition to the context, a few simple utility functions are provided to manipulate JSON, URL, and base64 encodings.

Here’s a sample template to map a request in API Gateway to the DynamoDB GetItem REST endpoint. The URL endpoint in question is /v1/members/{member_id}/brands, which returns the list of brands we recommend for a member with a summary of the recommended styles in each. You can see how we get the name of the DynamoDB table from API Gateway stage variables and member_id from the request parameters, a path component in this case. member_id is defined as the HASH attribute on the DynamoDB table.

API Gateway request mapping
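A minimal sketch of such a request template. The stage variable name tableName is an assumption for illustration; ours may be named differently:

```
{
    "TableName": "$stageVariables.tableName",
    "Key": {
        "member_id": {
            "S": "$input.params('member_id')"
        }
    }
}
```

API Gateway renders this template against the incoming request and sends the result as the body of a DynamoDB GetItem call, so `$input.params('member_id')` pulls the path parameter straight into the key.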

The table name is maintained in a stage variable rather than being hard-coded. It might be overkill, but the thinking is that it lets us switch the table out more easily in an emergency: we could do that without redeploying, just by updating the API Gateway configuration via the console or the AWS CLI. We’ve never had to, but it’s good to know we can.

The response mapping has to do a couple of things with the reply from DynamoDB. Disappointingly, DynamoDB does not return an HTTP 4xx status when an item is not found: it always returns an HTTP 200 with an empty response instead. So we have to check whether the response is empty and return the valid JSON string [] ourselves. Furthermore, API Gateway does not let you map HTTP status codes dynamically either. You’ll see the Velocity for this below. Fair warning: it’s pretty nasty.

Storage layout

The data we want to store looks something like this:

Sample data science input for a simple API
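Something along these lines, with illustrative brand_id and affinity values (the style IDs are the ones referenced later in this article):

```
[
  {
    "brand_id": "2879",
    "affinity": 0.97,
    "overview_styles": ["100123", "100635", "100845"]
  },
  {
    "brand_id": "3041",
    "affinity": 0.85,
    "overview_styles": ["200311", "200492"]
  }
]
```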

It’s a list of brand details: a member’s affinity to each brand, with sample styles in that brand so we can display them on the site or in the mobile application.

However, you can’t store it that compactly in DynamoDB without some trickery. First, let’s see how DynamoDB stores non-trivial document structure: "L" for lists, "N" for numbers, "S" for strings, and so on.

Lots of type information to sift through in DynamoDB responses
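For the sample above, a GetItem response comes back with every value wrapped in its type descriptor, roughly like this (the member_id value is illustrative):

```
{
  "Item": {
    "member_id": { "S": "12345678" },
    "brands": {
      "L": [
        {
          "M": {
            "brand_id": { "S": "2879" },
            "affinity": { "N": "0.97" },
            "overview_styles": {
              "L": [
                { "S": "100123" },
                { "S": "100635" },
                { "S": "100845" }
              ]
            }
          }
        }
      ]
    }
  }
}
```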

You would need some significant JSONPath tomfoolery to extract just the list of styles we want, in this case ["100123","100635","100845"]. It’s possible, just a bit hard to get right and maintain.

Instead, let’s take advantage of the fact that we always return the same brand data and overview styles intact, without manipulation, at least within limited time windows. So we store the whole structure as a JSON-quoted string and just unescape it in the API Gateway response mapping template on the way out. That means the DynamoDB Item for the above looks like this:

DynamoDB storage format with a JSON literal
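With the whole brand list serialized into a single string attribute, the stored Item collapses to one S value (escaping shown literally; the member_id is illustrative):

```
{
  "Item": {
    "member_id": { "S": "12345678" },
    "brands": {
      "S": "[{\"brand_id\":\"2879\",\"affinity\":0.97,\"overview_styles\":[\"100123\",\"100635\",\"100845\"]}]"
    }
  }
}
```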

Now we can reference the brand-affinity string in the DynamoDB Item with the JSONPath $.Item.brands.S via the provided utility $input.json() in the API Gateway response mapping, and parse the escaped JSON with $util.parseJson().

This is how we currently do it in production. We feel the consistent speed of the direct integration outweighs the inscrutability of the resulting Velocity templates. Just look at this example.

API Gateway Velocity response mapping for the above JSON format: sometimes ugly just works
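A sketch of the shape of that response template; our production empty-item check is messier than this:

```
## DynamoDB returns HTTP 200 with an empty body for a missing item,
## so fall back to the valid JSON string [].
#set($item = $input.path('$.Item'))
#if("$item" == "")
[]
#else
## Unescape the stored JSON blob and return it verbatim.
$util.parseJson($input.json('$.Item.brands.S'))
#end
```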

So while it’s a bit of a hack, we have to admit it’s compact, fast, and above all it works. The only JSON processing we’re doing is via a fixed JSONPath in the API Gateway response mapping, so there’s very little manipulation involved.

API Gateway request and response mapping has many idiosyncrasies, but not working is not one of them.

Complex data

The example above returns an overview of brands we think a given member likes at a point in time. This synopsis contains just a few styles for each brand and is used to render graphics in the mobile and web applications. It’s small enough we can return all the top brands and their summary styles in one API call.

There’s a companion API call, which returns all the styles for a given member and brand. There are often thousands of styles on sale at any one time for a given brand, so returning all the brands and all their styles at once just to pull out the brands you want is impractical. The URL endpoint for the detail call is /v1/members/{member_id}/brands/{brand_id}.

The obvious approach was too expensive

Our original design for this API endpoint wrote items to DynamoDB with a HASH of member_id and RANGE of brand_id.

These tables had 100,000,000+ items, one for each member_id/brand_id combination. The lookup endpoint was still very fast. But while we could technically import that many rows daily, we ran into significant throttling bottlenecks waiting for DynamoDB write capacity to scale up that high. In addition, 100,000,000 writes a day costs more than $1,000 a month. We have dozens of APIs, so that approach raised some eyebrows.

A sparse matrix

Instead, we switched our representation to make the brand_ids separate columns in each member’s DynamoDB Item, in effect creating a sparse matrix of members and brands.

We were concerned that having tens of thousands of columns in a DynamoDB table—the vast majority of which were empty for each of millions of rows—would create some performance issues.

But our fears were completely unfounded. It works very well. Let’s take a look at how we organize the data in DynamoDB Items and how we modify the API Gateway response mapping template to pull this off.

Storing sparse data

We store the member_id as before, but now with columns for each brand_id.

Storing sparse columns works extremely well
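So an Item now looks roughly like this, with one string attribute per brand the member has an affinity for (brand and style IDs are illustrative):

```
{
  "member_id": { "S": "12345678" },
  "2879": { "S": "[\"100123\",\"100635\",\"100845\"]" },
  "3041": { "S": "[\"200311\",\"200492\"]" }
}
```

Attributes only exist on the Items that need them, so the "matrix" costs nothing for the empty cells.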

We’re still storing the styles for a member’s brand as a JSON blob for fast retrieval and to simplify the response mapping.

Enhanced request mapping

Finally, we enhanced the request template in API Gateway to pass a projection of the brand column in the GetItem DynamoDB request so the return result from DynamoDB is limited to the styles for a specific brand. This means we’re not transferring more data than we need from DynamoDB to the API Gateway. In addition, we don’t have to do anything fancy in the response mapping to filter brands. These mapping templates are already plenty complex enough, so that’s a welcome simplification.

Here’s the updated request mapping template. The response template remains essentially the same ugly mess as before.

Provide a projection for the request to DynamoDB
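A sketch of that template, assuming the same tableName stage variable as before. Because the brand_id attribute names start with digits, the projection has to go through ExpressionAttributeNames rather than appearing literally in the ProjectionExpression:

```
{
    "TableName": "$stageVariables.tableName",
    "Key": {
        "member_id": { "S": "$input.params('member_id')" }
    },
    "ProjectionExpression": "#brand",
    "ExpressionAttributeNames": {
        "#brand": "$input.params('brand_id')"
    }
}
```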

We could not measure any significant performance penalty making this more complex call to DynamoDB, but we saw dramatic cost improvements because our import jobs were considerably less resource hungry.

In summary

We can use DynamoDB to store a 20,000,000,000+ element sparse matrix. We can still use the API Gateway request and response mapping templates to retrieve the data and so we can still omit the Lambda between API Gateway and DynamoDB.

Did we miss something? Is there a better way? Please let us know.
