Things I learned designing an API for millions of concurrent TV viewers

My team supports a dozen client apps on different platforms and a legacy API that was designed to be a stepping stone. Our long-term product requirements demand a full API replacement with breaking changes and a completely new set of client apps. Our launch date is only feasible if the new API is designed and developed in parallel with the new client applications. Coordinating that across a half dozen client application teams while specifications are in flux is a huge challenge, and when we flip the switch on day one we will be at pretty enormous scale, so it has to work.

Here’s what we learned.

API Versioning

Use subdomains instead of URL paths to distinguish between API versions, and always use CNAMEs. If your API lives at api.domain.com/v1/endpoint and v2 does not run on the identical stack, then making api.domain.com/v2/endpoint point to a separate server stack is a lot more difficult and a waste of your time. What we did instead was v2.api.domain.com/endpoint, which means we can always keep older versions running to support legacy users, and all of our API versions are decoupled.
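To make the idea concrete, here is a minimal sketch of subdomain-based versioning. The stack hostnames and the function names are assumptions made up for illustration, not our real infrastructure. In DNS, each version subdomain would simply be a CNAME to whichever stack serves it.

```typescript
// Hypothetical CNAME targets: one independent stack per API version.
const stacks: Record<string, string> = {
  v1: "stack-a.internal.example",
  v2: "stack-b.internal.example",
};

// Extract the version label from a host such as "v2.api.domain.com".
function versionFromHost(host: string): string | null {
  const match = /^(v\d+)\.api\./.exec(host.toLowerCase());
  return match?.[1] ?? null;
}

// Resolve the stack that serves a host, or null if the version is unknown.
function stackForHost(host: string): string | null {
  const version = versionFromHost(host);
  if (version === null) return null;
  return stacks[version] ?? null;
}

// Build an endpoint URL; note the path itself carries no version segment.
function endpointUrl(version: string, path: string): string {
  return `https://${version}.api.domain.com/${path.replace(/^\/+/, "")}`;
}
```

Because the version lives in the hostname, retiring or moving a version is a DNS change rather than a routing rule inside a shared stack.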

WebSockets

It turns out WebSockets are not only cool, they can actually scale a lot better for live event data than straight HTTP, where clients are constantly polling at short intervals. They do come with memory and networking costs, so they aren’t free, but the overhead saved by using a duplex stream of data instead of HTTP requests with fat headers is very helpful. We were already building on a Vert.x stack, which means we have cool things like WebSockets built in. What is really interesting is that, for the sake of flexibility and to mitigate risk, we are implementing an HTTP solution and a WebSocket solution in parallel.

In our case supporting WebSockets actually has a bigger implication for our client apps than for our servers. Our API, WebSockets or not, follows RESTful patterns. Using HTTP with optional WebSockets means that clients request a resource at a URI that will either respond with data once (HTTP) or send data intermittently as the resource data changes (WebSocket). All of our client apps must therefore have a callback that can be fired successfully at any point in time, asynchronously, whether it is handling a single HTTP response or a WebSocket push. Beyond that it is no different from a typical REST architecture.
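The client-side pattern can be sketched roughly as follows. This is an illustration under assumptions, not our actual client code: the Transport interface and both classes are hypothetical stand-ins, and they run in memory (synchronously) so the example works without a network. A real client would wrap an HTTP request and a WebSocket connection behind the same interface.

```typescript
type Update = { uri: string; data: unknown };
type Listener = (update: Update) => void;

// A transport delivers updates for a URI and returns a function that stops them.
interface Transport {
  open(uri: string, listener: Listener): () => void;
}

// Stand-in for plain HTTP: the listener fires exactly once.
class OneShotTransport implements Transport {
  private readonly load: (uri: string) => unknown;
  constructor(load: (uri: string) => unknown) {
    this.load = load;
  }
  open(uri: string, listener: Listener): () => void {
    listener({ uri, data: this.load(uri) });
    return () => {};
  }
}

// Stand-in for a WebSocket: the listener fires every time data is pushed.
class PushTransport implements Transport {
  private readonly listeners = new Map<string, Set<Listener>>();
  open(uri: string, listener: Listener): () => void {
    const set = this.listeners.get(uri) ?? new Set<Listener>();
    set.add(listener);
    this.listeners.set(uri, set);
    return () => {
      set.delete(listener);
    };
  }
  push(uri: string, data: unknown): void {
    const set = this.listeners.get(uri);
    if (!set) return;
    set.forEach((listener) => listener({ uri, data }));
  }
}

// App code registers one callback and does not care which transport is used.
function watch(transport: Transport, uri: string, listener: Listener): () => void {
  return transport.open(uri, listener);
}
```

The point of the design is that the same listener handles both cases, so switching a resource between HTTP and WebSockets does not touch the view code.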

Dead Simple Routing and JWT Header Authentication

JSON Web Tokens (JWT) are an amazing design concept that essentially lets every single client request identify all data linked to a client by embedding a token in the header of every request. This means that things like device type, which do not change within a session, can be tracked by reference to this token rather than as URL parameters. Ultimately this simplifies the API design and makes documentation even easier.
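As a rough sketch of what this buys you on the server, the snippet below reads a device type out of a bearer token instead of a URL parameter. The deviceType claim name and the browseKeyFor function are assumptions for illustration. Note that this only decodes the payload and deliberately skips signature verification, which a real server must do with a proper JWT library before trusting any claim.

```typescript
const B64URL = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Decode a base64url string to text (ASCII only, to keep the sketch short).
function base64UrlDecode(input: string): string {
  let bits = 0;
  let bitCount = 0;
  let out = "";
  for (const ch of input.replace(/=+$/, "")) {
    const value = B64URL.indexOf(ch);
    if (value < 0) throw new Error("invalid base64url character");
    bits = (bits << 6) | value;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      out += String.fromCharCode((bits >> bitCount) & 0xff);
      bits &= (1 << bitCount) - 1; // keep only the bits not yet emitted
    }
  }
  return out;
}

type Claims = Record<string, unknown>;

// Read claims from an "Authorization: Bearer <header>.<payload>.<signature>" value.
// WARNING: decode only. The signature is NOT verified in this sketch.
function claimsFromAuthHeader(header: string): Claims {
  const match = /^Bearer\s+([^.\s]+)\.([^.\s]+)\.([^.\s]*)$/.exec(header.trim());
  if (!match) throw new Error("expected 'Bearer <header>.<payload>.<signature>'");
  return JSON.parse(base64UrlDecode(match[2] ?? "")) as Claims;
}

// Hypothetical handler helper: the route stays the same for every device,
// and the device type comes from the token rather than from the URL.
function browseKeyFor(authHeader: string): string {
  const device = claimsFromAuthHeader(authHeader)["deviceType"];
  return `browse:${typeof device === "string" ? device : "unknown"}`;
}
```

With this shape, an endpoint such as /browse needs no device parameter in its documentation at all.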

Parallel Development Using a Mock API

We had to start client development before we had designed our new API, let alone begun developing the production V2 API, because along with new API support the client apps had new designs to implement.

In some situations that might not be a big deal. To get started, client developers could just hand-code some mock JSON for local development and hook up to the servers later when they are ready. In our case, however, as media applications with fully scoped browse, search, recommendations, and infinite scroll views, our client applications consume about 10k lines of JSON in a short user session. There was no way to hand-code enough data to test the concepts we were working on, particularly because our new API would deliver all of the client app layout information, navigation menus, content blocks, etc. through JSON.

So as a solution I kicked out a quick ExpressJS Node server with a simple REST API for a few of our planned endpoints where we toyed with the resource taxonomy and parameters. Then we needed to return some data and validate that data against our new UI designs to make sure the new data models covered all the bases needed for the designs.

Coming up with the sample data at scale was the hard part. We needed 25–30 rows of content per screen to test all the different content types that we had to model. Sourcing that by hand, in a way that made any sense at all, was looking very difficult, so I began systematically creating rules and methods to generate data. I would query the V1 API to get content fields that were similar enough, such as film titles, genres, and some of the existing image fields. For the other fields in the new content model I would either hard-code a default placeholder or use a nifty Node lorem ipsum generator to fill out strings of a certain length. For still other fields I would just duplicate matching field types from the old API to at least get some content in there. For example, thumbnail images in the new API would be pulled from background images in the old API, long descriptions would be trimmed to short descriptions, and so forth.
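The rules above can be sketched roughly like this. The field names on both models are assumptions, and the small filler function is a stand-in for the lorem ipsum package, so treat it as an illustration of the approach rather than the generator we actually ran.

```typescript
// Assumed V1 and V2 shapes; the real content models have many more fields.
interface V1Item {
  title: string;
  genre: string;
  backgroundImage: string;
  longDescription: string;
}

interface V2Item {
  title: string;
  genre: string;
  thumbnail: string;
  shortDescription: string;
  tagline: string;
  badge: string;
}

const LOREM = "lorem ipsum dolor sit amet consectetur adipiscing elit".split(" ");

// Filler text built from whole words, at most maxLength characters long.
function filler(maxLength: number): string {
  let text = "";
  for (let i = 0; ; i++) {
    const word = LOREM[i % LOREM.length] ?? "";
    const next = text === "" ? word : `${text} ${word}`;
    if (next.length > maxLength) return text;
    text = next;
  }
}

// Trim a long description down to a short one.
function truncate(text: string, maxLength: number): string {
  if (text.length <= maxLength) return text;
  return text.slice(0, maxLength - 3).replace(/\s+$/, "") + "...";
}

// Apply the three rules: copy similar fields, reuse matching field types,
// and fall back to generated or hard-coded placeholders.
function toMockV2(item: V1Item): V2Item {
  return {
    title: item.title,
    genre: item.genre,
    thumbnail: item.backgroundImage,
    shortDescription: truncate(item.longDescription, 80),
    tagline: filler(40),
    badge: "PLACEHOLDER",
  };
}

// Cycle through the source items until the screen has enough rows.
function mockScreen(items: V1Item[], rows: number): V2Item[] {
  const out: V2Item[] = [];
  for (let i = 0; items.length > 0 && i < rows; i++) {
    const source = items[i % items.length];
    if (source) out.push(toMockV2(source));
  }
  return out;
}
```

Calling mockScreen with a handful of V1 items and a row count of 25 gives a full screen of plausible content without any hand-written JSON.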

This was great, but now we had thousands of lines of JSON to look at and it became a real headache to analyze it with just our imaginations to guess whether it would work in our dozen front end applications.

Enter the mock client app. A few hours later I had bootstrapped a companion AngularJS app, built on the BAMF Stack, that queried the Express app and drew the data roughly in the manner needed for the designs. This immediately brought to our attention several places where logic could be streamlined and made more modular. It also led us to move many keys further up in the content model structure so that apps could perform fewer if/then checks and rely more on switches or directives.

Building a Kill Switch for Client App Builds

We’ve run into interesting situations caused by client apps being very difficult to roll back, particularly on iOS. So we came up with an API solution that lets us disable specific builds when there are issues and force users to upgrade.
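A minimal sketch of the check, assuming each client sends its platform and build number with a startup request. The rule shape, the status values, and the function name are made up for illustration and are not our actual API contract.

```typescript
// A rule disables an inclusive range of builds on one platform.
interface DisabledBuild {
  platform: string;
  fromBuild: number;
  toBuild: number;
  message: string;
}

type BuildCheck =
  | { status: "ok" }
  | { status: "upgrade-required"; message: string };

// Decide whether a client build may continue or must be upgraded.
function checkBuild(
  platform: string,
  build: number,
  disabled: DisabledBuild[],
): BuildCheck {
  for (const rule of disabled) {
    const inRange = build >= rule.fromBuild && build <= rule.toBuild;
    if (rule.platform === platform && inRange) {
      return { status: "upgrade-required", message: rule.message };
    }
  }
  return { status: "ok" };
}
```

Because the rules live on the server, a bad build can be switched off without waiting for an app store review, as long as every client checks the response before doing anything else.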

Store a Lot of Information on the Server

Coming from a web-first mentality this comes naturally, because by default you send everything from the server, so you don’t even think about how painless it is to update your code and fix bugs on the fly. But when you are building apps for Roku, Android TV, and iOS, for example, you have to be really intentional about this. What we decided to do was manage all of our client app layouts and static variables from the API. Given that these things don’t change terribly frequently, we did not have too much logic to build on the server but could still give ourselves a great deal of flexibility to change things across our apps without deploying new builds.
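Here is a rough sketch of what a server-managed layout could look like from the client's side. The block types and field names are assumptions invented for the example. The idea it illustrates is that each block carries a type key at the top level, so the client can dispatch with a single switch instead of nested checks.

```typescript
// Hypothetical layout blocks; each one is identified by a top-level "type" key.
type Block =
  | { type: "hero"; title: string; image: string }
  | { type: "carousel"; title: string; itemIds: string[] }
  | { type: "menu"; entries: string[] };

interface Layout {
  screen: string;
  blocks: Block[];
}

// One switch on the type key decides how each block is drawn.
// Strings stand in for real views to keep the sketch small.
function describeBlock(block: Block): string {
  switch (block.type) {
    case "hero":
      return `hero "${block.title}"`;
    case "carousel":
      return `carousel "${block.title}" with ${block.itemIds.length} items`;
    case "menu":
      return `menu [${block.entries.join(", ")}]`;
  }
}

// Render a whole screen from the layout the API returned.
function renderLayout(layout: Layout): string[] {
  return layout.blocks.map((block) => describeBlock(block));
}
```

Reordering the blocks, or adding another carousel, then becomes a change to the JSON the server returns rather than a new build of every app.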

Documentation with Swagger in YAML

We’ve just started this, and honestly if I go to make another API I might just start with a YAML API mocking tool like Apiary and get it out of the way as I build. Either way, from a design perspective I found that building sample apps for both the server side and the client side was extremely helpful for generating vast amounts of mock data and validating the output against a real-life use case. We’re using Swagger to build out standalone documentation for our half dozen developer teams, and I think they’ll be very pleased when the rest of them get onboarded later this year.