Large-Scale Data Science APIs at Rue La La

Cloud native and web scale

Stephen Harrison
Rue Gilt Groupe Tech Blog
11 min read · Mar 20, 2018


The Data Science team at Rue La La writes code that delivers a variety of recommendations and other data about our Members. We mostly run Apache Spark Scala code on the excellent Databricks platform and write results to AWS S3 buckets. So far, so good. Nothing out of the ordinary. But the best architecture for the last step, making large volumes of data available via an API to our web and mobile platforms, was not obvious to us. There are options: let’s explore them.

This story looks at the choices we considered, explains what worked and what didn’t, and shows the final design in detail. The result is a set of large-scale APIs that have good performance with boundless scalability. It’s mostly about how we call DynamoDB directly from API Gateway to avoid Lambdas, and why that’s a good idea.

Companion stories

There are more detailed descriptions of a couple of interesting things we did in this project in the posts below.

The upshot

At the highest level, we ended up with this architecture:

The Big Picture
  • DynamoDB works well for storing 100,000,000+ items of data that can change frequently.
  • AWS API Gateway works well with a direct integration to DynamoDB, albeit with a couple of idiosyncrasies.
  • Custom importers are required to load data into DynamoDB at scale, and AWS Batch proved an excellent way to run this code.
  • AWS CloudFormation served significantly better than the alternatives for bringing up repeatable stacks and CI/CD pipelines.

Background

Q3 2017 saw the migration of Rue La La’s production software from our data center to Amazon Web Services. This first phase focused on discovering our legacy systems and migrating them largely intact. We want to continue down this path, but in a more cloud-native direction; we’re not afraid to push the envelope in this next stage. Cloud native for us means getting as far as practical from running a commercial application server, on hardware we own, in a data center colocation we rent. There may be better definitions, but if you’re doing any of those things, you’re probably not cloud native.

For this project we started with the following:

  • The data is already available as output from our various data science jobs. We run Apache Spark Scala in Databricks and write data to large S3 files, each line of which is in the same JSON format (see the sketch after this list).
  • 10M+ items of API data are updated at least once a day from data science code running in Databricks notebooks.
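To make that concrete, here is a minimal sketch of what the output step of such a Databricks job looks like. The source table name, S3 bucket, and partition count are illustrative, not our actual values.

```scala
import org.apache.spark.sql.SparkSession

// Write the affinity DataFrame to S3 as JSON, one object per line, in a handful of
// large files. The table name and bucket/prefix below are placeholders.
val spark = SparkSession.builder.getOrCreate()

spark.table("member_brand_affinity")   // hypothetical table of Member/brand scores
  .repartition(16)                     // a few large output files rather than many small ones
  .write
  .mode("overwrite")
  .json("s3://example-data-science/brand-affinity/")
```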

We have many Data Science API endpoints delivering different kinds of data about Rue La La Members, our flash-sale Boutiques, and products (or styles).

The example we’ll look at here is the API to deliver the brands that have a strong affinity to a given Member. We calculate this affinity using Collaborative Filtering and other techniques.

Requirements

  • Our web and mobile applications need to look up and render personalized Boutiques, brands, and styles for each Member when they log in. This is effectively a blocking call: although we can smooth out rendering the rest of the page while we retrieve the data, the recommendation section is only actionable once the recommendations are displayed.
  • The format of the data returned by the Data Science APIs may differ from the data written by Apache Spark. And in turn, the data required to render pages or mobile app screens may be different again.
  • The API must be RESTful, secure enough, fast enough (average API call < 50ms, 95% < 150ms), and highly available (> 99.99%).

Options for storing data

We’ll meander around the architecture and technology, qualifying or discounting things as we go. This is probably not the exact path we took, but the result is the same. We’re just highlighting choices, failures, and reasoning.

API reads directly from S3

The data is already in Amazon’s S3. Can we deliver it directly from there via an API? We don’t have time to open a large file on S3 and skip to the records we need. But perhaps we can read one of millions of small files containing just the data we need. So what’s the quickest we can open, read, and unmarshal a file in S3 that contains a single JSON object? About 300ms in the best case, twice our total budget. The other thing to note is that while S3 claims 11 9’s durability, it’s only “designed for” 4 9’s availability. Practically speaking, we have witnessed several transient outages that rendered otherwise highly available sites like GitHub unreachable, reportedly in part due to S3 outages. We did not feel it prudent to base our architecture on S3 availability.
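For context, the timing we mention comes down to measuring a single GetObject plus a JSON parse, which is the work a read-straight-from-S3 API would repeat on every request. A sketch of that measurement (AWS SDK for Java and Jackson; point it at any small per-Member object):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.fasterxml.jackson.databind.ObjectMapper

object S3LatencyCheck {
  private val s3     = AmazonS3ClientBuilder.defaultClient()
  private val mapper = new ObjectMapper()

  // Time one fetch-and-parse of a small JSON object stored in S3, in milliseconds.
  def timeOneReadMillis(bucket: String, key: String): Long = {
    val start = System.nanoTime()
    val obj   = s3.getObject(bucket, key)
    try mapper.readTree(obj.getObjectContent) finally obj.close()
    (System.nanoTime() - start) / 1000000
  }
}
```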

We could do a few things to make this approach faster, including configuring API Gateway to cache results. But the services that deliver web pages and mobile data are already doing their own caching, and cascading caches always muddy the end-to-end TTL picture without careful initial implementation and strict control over maintaining TTL values. In any case, caching doesn’t help with the initial load from S3 and that takes too long.

Store the data in a database

Not such a radical idea. Amazon Web Services gives us a number of excellent options, ranging from running our own database on an EC2 instance, through managed relational services like AWS RDS running Oracle or PostgreSQL, to a managed NoSQL database, DynamoDB. We already have a JSON document structure from the output of the data science Databricks jobs, so document stores like MongoDB and DynamoDB look appealing. We went with the AWS managed service, DynamoDB, and are pretty happy with the choice.

There are things you can’t do in DynamoDB that might be surprising at first, like quickly truncate a table to delete all its items. But you can always drop the table and recreate it. In practice we found that the feature that lets you set an expiration TTL on each DynamoDB item works well. That, combined with the fact that the common case completely replaces a Member’s personalization data each day (and updates the TTL), avoids the need for explicit culling of old items. And no, you can’t set the expiration in the past to make DynamoDB delete items immediately: expired items are removed on a schedule we don’t control. Also, since updating the expiration TTL would cost a write anyway, you can’t come out ahead.
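To illustrate the TTL approach, here is a minimal sketch of writing one affinity item with an expiry attribute. The table name, attribute names, and three-day window are assumptions for the example; the only requirement is that the epoch-seconds attribute matches the TTL attribute configured on the table.

```scala
import java.time.Instant
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

object TtlExample {
  private val dynamo = AmazonDynamoDBClientBuilder.defaultClient()

  // Each day's import replaces a Member's items and pushes the expiry forward. Once
  // "expires_at" is configured as the table's TTL attribute, DynamoDB deletes expired
  // items on its own schedule.
  def putBrandAffinity(memberId: String, brandId: String, score: Double): Unit = {
    val expiresAt = Instant.now().plusSeconds(3 * 24 * 3600).getEpochSecond
    dynamo.putItem("member_brand_affinity",   // hypothetical table name
      Map(
        "member_id"  -> new AttributeValue(memberId),
        "brand_id"   -> new AttributeValue(brandId),
        "score"      -> new AttributeValue().withN(score.toString),
        "expires_at" -> new AttributeValue().withN(expiresAt.toString)
      ).asJava)
  }
}
```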

A note on updating existing DynamoDB items: the most efficient way to load data is the batchWriteItem API call. But that bulk call can’t update existing items, only replace them, and it can only write small batches of 25 items at a time.
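A sketch of what loading with batchWriteItem looks like from Scala with the AWS SDK for Java: items go out in chunks of 25, and anything DynamoDB reports back as unprocessed is retried. A production importer would also back off between retries and meter its overall write rate.

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, BatchWriteItemRequest, PutRequest, WriteRequest}
import scala.collection.JavaConverters._

object BatchLoader {
  private val dynamo = AmazonDynamoDBClientBuilder.defaultClient()

  // Write items in the 25-item chunks batchWriteItem accepts, retrying anything returned
  // as unprocessed (for example, when we briefly exceed provisioned write capacity).
  def load(table: String, items: Seq[Map[String, AttributeValue]]): Unit =
    items.grouped(25).foreach { chunk =>
      val writes  = chunk.map(item => new WriteRequest(new PutRequest(item.asJava))).asJava
      var pending = Map(table -> writes).asJava
      while (!pending.isEmpty) {
        val result = dynamo.batchWriteItem(new BatchWriteItemRequest(pending))
        pending = result.getUnprocessedItems   // a real importer backs off here
      }
    }
}
```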

Options for accessing data

Ten years ago there’d be no discussion here other than which application server we could get good licensing terms for. It would quickly converge on something like SOAP endpoints in a JBoss cluster, with hardware load balancers, and so on.

But modern APIs are RESTful, JSON has replaced XML everywhere, and serverless is a thing now. By our definition above, this feels cloud native. So let’s go with that.

We tried AWS Lambda, but…

We originally tried using AWS Lambda with API Gateway as the access method for DynamoDB data. Almost all the example code shows how to do it this way. We implemented the Lambda using AWS Labs’ Serverless Java Container runtime, the Spark Java variant (confusingly, a different Spark, not Apache Spark), because we really liked its brevity. This worked very well indeed. It had low end-to-end latency because the two key integrations in the pipeline, API Gateway → AWS Lambda and AWS Lambda → DynamoDB, were typically well under 10ms each, for a total latency of about 30–40ms.

There was just one issue with this design, and it turned into a deal-breaker: the “cold start problem.” The API delivered data fast almost all the time, but when it was slow, it was 1.5 seconds or more.

In short, Lambdas run in reusable Docker containers. There are two cases where a container cannot be reused to execute a Lambda: scaling out requires new containers, and containers appear to be recycled every 10–15 minutes in any case. There are techniques for keeping Lambdas warm by sending heartbeat requests, but they do not address the scale-out cold start. And if you think about it, there’s nothing to say a legitimate request isn’t the one that triggers scaling out a new Lambda instance and thus pays the high latency.

In early testing with Lambdas we ran into cold start frequently enough that we not only noticed it, but could often predict it. We believe that would have resulted in a significant degradation in some Rue La La Members’ site experience. That’s a deal-breaker.

Direct API Gateway integration

We’re already using API Gateway to integrate with the Lambda. So why not try a direct integration with DynamoDB?

This is the solution we ended up using. It’s a bit quirky, but really interesting. See this companion story about it.

Importing data

Data is written to S3 buckets by Databricks jobs running Apache Spark Scala. These are files of JSON data partitioned into chunks of 100MB or more. There are other, more compact storage formats, like Apache Parquet with Snappy compression, but they’re harder to read in a standalone program. We also considered and tried the AWS EMR connector for DynamoDB, but found it baked in too much knowledge about its environment, did not support the version of Apache Spark we’re using, and was fragile even when we got it working. Anecdotally, AWS’s Elastic MapReduce (EMR) does not appear to be receiving regular maintenance updates, and EMR Data Pipeline can be difficult to get right and operationalize.

In our architecture, an AWS Lambda receives S3 events for each new file and creates an AWS Batch job for the ones it cares about.

Lambda creates AWS Batch Jobs from S3 notifications
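The trigger Lambda itself is small. A sketch of the idea, with the S3 key prefix, job queue, and job definition names as placeholders:

```scala
import com.amazonaws.services.batch.AWSBatchClientBuilder
import com.amazonaws.services.batch.model.{ContainerOverrides, KeyValuePair, SubmitJobRequest}
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import scala.collection.JavaConverters._

class ImportTriggerHandler extends RequestHandler[S3Event, String] {
  private val batch = AWSBatchClientBuilder.defaultClient()

  override def handleRequest(event: S3Event, context: Context): String = {
    event.getRecords.asScala
      .map(r => (r.getS3.getBucket.getName, r.getS3.getObject.getKey))
      .filter { case (_, key) => key.startsWith("brand-affinity/") }  // only the files we care about
      .foreach { case (bucket, key) =>
        batch.submitJob(new SubmitJobRequest()
          .withJobName(s"import-${key.replaceAll("[^A-Za-z0-9_-]", "-")}")
          .withJobQueue("data-science-import")      // hypothetical queue name
          .withJobDefinition("dynamodb-importer")   // hypothetical job definition
          .withContainerOverrides(new ContainerOverrides().withEnvironment(
            new KeyValuePair().withName("SOURCE_BUCKET").withValue(bucket),
            new KeyValuePair().withName("SOURCE_KEY").withValue(key))))
      }
    "ok"
  }
}
```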

The batch jobs get queued and are allocated to existing AWS Batch clusters as resources become available. We configure clusters large enough to handle all import jobs simultaneously for a given environment, although that never happens in practice because data science jobs are naturally staggered. Still, it is good to know we can rerun everything at the same time in an emergency. Docker containers in AWS Batch start loading data from S3 to DynamoDB as soon as there is capacity in the cluster. AWS Batch compute clusters are simply ECS clusters managed by AWS Batch.

AWS Batch jobs load JSON data into DynamoDB

We iterated the importer architecture a couple of times before converging on a design we liked:

  • The importer Docker image maps raw JSON from the data science Databricks output into DynamoDB items. We are prepared to do some work here so that the API can deliver the data in exactly the right format; in practice this amounts to at least some field-name mapping and usually a small amount of data filtering and restructuring.
  • The Docker images run with 1GB of memory and only 1 CPU, since they are completely I/O bound, either waiting for data from S3 or for write capacity on DynamoDB. This matches the CPU/memory ratio of many EC2 instance sizes available for running an AWS Batch cluster, including those we get by default with the special AWS Batch instance type optimal.
  • The importer scales DynamoDB write capacity up immediately before the import and back down immediately after (see the sketch after this list).
  • A custom InputStream for S3 files avoids long-duration calls to the S3 API to retrieve data. Instead, it implements the standard InputStream interface by performing lots of smaller reads on demand. See the appendix of this companion story for details.
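The write-capacity scaling mentioned above reduces to a pair of updateTable calls around the import. A sketch of the shape of it; real code also waits for the table to return to ACTIVE after each change and respects DynamoDB's limits on how often capacity can be scaled down:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput

object WriteCapacity {
  private val dynamo = AmazonDynamoDBClientBuilder.defaultClient()

  // Raise write capacity for the duration of the import, then drop it back down,
  // even if the import itself fails.
  def withWriteCapacity[A](table: String, readUnits: Long, importWrites: Long, idleWrites: Long)(work: => A): A = {
    dynamo.updateTable(table, new ProvisionedThroughput(readUnits, importWrites))
    try work
    finally dynamo.updateTable(table, new ProvisionedThroughput(readUnits, idleWrites))
  }
}
```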

Current state

We have a number of Data Science APIs in production and we’re adding to them regularly as new initiatives at Rue leverage the growing Data Science team. The pattern we established has proven very repeatable.

From a non-functional perspective, we’re happy with API security using built-in API keys. Consistently high throughput and low latency match requirements with room to spare. API Gateway’s request and response mapping is a little finicky but very fast once it’s correct. The ability to express a projection in DynamoDB requests means we can store sparse matrices of, for example, Member/brand affinity without loss of performance.
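To make the projection point concrete, here is the SDK equivalent of what the API Gateway mapping expresses, assuming (for illustration) a table that stores one attribute per brand on each Member item; only the requested brands are read and returned:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, GetItemRequest}
import scala.collection.JavaConverters._

object ProjectionExample {
  private val dynamo = AmazonDynamoDBClientBuilder.defaultClient()

  // Fetch only the affinity scores for the brands a page actually needs, rather than
  // the Member's whole (sparse) row of brand scores.
  def brandScores(memberId: String, brands: Seq[String]): java.util.Map[String, AttributeValue] = {
    val names = brands.zipWithIndex.map { case (brand, i) => s"#b$i" -> brand }.toMap
    dynamo.getItem(new GetItemRequest()
      .withTableName("member_brand_affinity")   // hypothetical table name
      .withKey(Map("member_id" -> new AttributeValue(memberId)).asJava)
      .withProjectionExpression(names.keys.mkString(", "))
      .withExpressionAttributeNames(names.asJava)
    ).getItem
  }
}
```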

Production performance

Performance of API endpoints hovers around 18ms for simple cases, and up to 26ms for more complex DynamoDB projections. We assume HTTPS connection caching on the client if we’re to see client-side performance close to this. We see no issues handling thousands of concurrent requests, although read provisioning must be set in advance or allowed to auto-scale by ramping up load over a period of minutes.

A healthy Datadog API dashboard

Non-functional requirements

We looked at available options for securing access to our APIs. API Gateway endpoints in AWS are inherently public. We played with defining Cognito users for authorizing API access. But that was cumbersome to administer and we did not appreciate the extra 10ms round-trip. Issuing client certificates would have been very secure, but again hard to administer.

We ended up using API Keys, which have these nice features:

  • They’re simple to administer, including revoking access without a deployment.
  • They’re secure enough (keys are long, random strings).
  • Keys can be shared easily between related API Gateways in each QA and production environment via AWS API Gateway Usage Plans.
  • The REST API security is defined in Swagger to exactly match the HTTP header that API Gateway expects, x-api-key (see the client sketch after this list).
  • They are handled without measurable increased latency, we assume because valid keys are cached at the API Gateway.
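From a client's point of view, using the key is just one extra request header. A sketch with a placeholder URL and path:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object ApiKeyClient {
  // Call a Data Science API endpoint, passing the API Gateway key in the x-api-key header.
  def brandAffinity(memberId: String, apiKey: String): String = {
    val conn = new URL(s"https://api.example.com/v1/members/$memberId/brands")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("x-api-key", apiKey)
    try Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
    finally conn.disconnect()
  }
}
```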

In summary

Rue La La turned on detailed personalization for Members by writing data science jobs that include predicting a Member’s affinity to brands and categories. Data changes every day because we’re a flash-sale site and products are only on sale for a few days.

DynamoDB proved an excellent place to store this data: 10,000,000s of data points for Members, brands, and product categories changing every day.

We did not use an AWS Lambda to pull data from DynamoDB; integrating API Gateway with DynamoDB directly avoids Lambda cold starts, and we liked the result. We had to write a custom data importer that behaved predictably under load, where multithreading and failures were under tight control.

We like the result: It’s a repeatable pattern and pretty much trouble-free. Let us know your thoughts in the comments. What did we miss? How could it be better?
