Black Mirror: Bandersnatch is an interactive, choose-your-own-adventure show that follows the mind-bending drama of a video game programmer in the 1980s. Netflix recently released this new episode of the tech dystopia series Black Mirror after spending a significant amount of time and energy building out new “state tracking” technology to incorporate interactivity into the experience.
In this post, I take a look at the underlying data structures and code that Netflix developed for this experience, exploring network calls and large JSON data structures to try and make sense of their madness.
Bandersnatch is a choose-your-own-adventure show about someone trying to make a choose-your-own-adventure video game based on a choose-your-own-adventure book. But the trick with choose-your-own-adventure stories is that they don’t follow linear storylines like traditional narratives, instead offering multiple story paths with different endings, all depending on the choices you make along the way. While non-linear storytelling has been part and parcel of the video game industry for decades, its implementation in a major streaming media platform is notable.
While watching Bandersnatch on Netflix, the viewer is presented with choices that pop up on the screen. Selecting one choice over another changes the video that you see as a new story segment unfolds.
Source (Credit: Netflix)
A number of industrious viewers on Reddit have attempted to map out the choices, branching storylines and endings for the Bandersnatch story, an example of which is shown below in a flowchart developed by Reddit user AppiusClaudius.
Source (Credit: AppiusClaudius)
The overall effect of providing choices and delivering a seamless video experience is impressive. So how did Netflix do it?
Interactive Entertainment at Netflix
Bandersnatch isn’t Netflix’s first foray into interactive media; it just happens to be the first one aimed at an adult audience. Last year, they premiered their first attempt at interactive media with an episode of the kids’ show Puss in Book: Trapped in an Epic Tale. And according to a Netflix help page, there are currently 5 interactive content titles with more on the way.
But it’s probably fair to say that Bandersnatch is Netflix’s most ambitious interactive title to date. So much so that they had to create new “state tracking” technology to handle the “millions of permutations of how you can play this story” according to Carla Engelbrecht, Netflix’s director of product innovation.
So what is this “state tracking” technology exactly and how does it work? To figure that out, I did what I always do when I’m curious about how websites do amazing things. I started digging into source code and network calls!
While watching Bandersnatch, I opened up the Chrome Developer Tools and just started watching all of the network calls. In particular, I was looking for anything that might have to do with interactivity or state-tracking functionality. Monitoring network calls while watching and making choices during the show, it looks like there are about 5 different request URLs that are regularly being called.
(Credit: Netflix / Jon Engelsman)
The heavy lifting of content delivery is done by GET requests to an nflxvideo.net endpoint, which return application/octet-stream binaries of content. These calls quickly and repeatedly return media data to make sure the viewer never experiences any video buffering.
For the most part, these calls look pretty standard compared to other Netflix content streams, so they likely don’t show anything unique that might help us figure out the interactivity.
Another set of network calls looks a bit more promising: the personalization calls. These POST requests submit a data-binary payload to a /personalization/cl2 endpoint. The data-binary payload consists of a JSON data structure which includes a lot of expected tracking information (user ID, session ID, etc.). But it also appears to include a ‘state’ object which might tell us something useful about interactivity.
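Sketched out, that payload might look something like the following. Every field name here is an illustrative guess; only the presence of a ‘state’ object alongside the tracking info is taken from the observed traffic.

```javascript
// Hypothetical shape of the personalization tracking payload.
// All field names are illustrative guesses; only the presence of a
// 'state' object (next to user/session tracking info) is observed.
const personalizationPayload = {
  userId: "user-123",       // assumed tracking identifier
  sessionId: "session-abc", // assumed session identifier
  state: {
    interactive: true,      // placeholder; the real contents are unknown
  },
};

// The payload travels as a serialized data-binary on the POST request.
const personalizationBody = JSON.stringify(personalizationPayload);
```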
MSL and Cadmium
There are two network calls that send POST requests to a Netflix Message Security Layer (MSL) API.
The data payloads of these calls seem to include token and signature information about the user, but it’s not clear what is in the data binaries being sent. However, seeing as similar calls are regularly sent while streaming non-interactive media, it’s likely that these calls aren’t involved in any interactive functionality.
I should note that I’m glossing over a lot of detail in these network requests, specifically with regard to request/response headers, full data payloads, etc. I’m just pulling out enough detail to get the gist of what’s happening, but I encourage you to explore some of them on your own.
Last but not least on our list of network calls is a set of POST requests to a Netflix API named Shakti.
This network request is the one we’re interested in here, so let’s explore it in a bit more detail.
From the little information available publicly, it seems that the Shakti API is another data fetching service. The name “Shakti” is mentioned briefly in a 2014 slide from Ryan Anklam, then Senior UI Engineer at Netflix, discussing new Node.js services at the company.
Source (Credit: Ryan Anklam)
It’s also mentioned in a GitHub repo of the same name from user HowardStark, though that appears to be more of an outside-in reverse engineering attempt than any official Netflix description of the service.
Ok, so we have somewhat of an idea of what the Shakti API might be, but what else is going on here? Well, looking back at the Shakti API calls, there appears to be a query parameter named
Shakti Call Paths
There’s one more important thing to note about the Shakti API network calls. While their URLs look similar, there appear to be at least 8 distinct types of Shakti API calls, distinguishable by their data-binaries. More specifically, by the values of a key named callPath in the data-binary JSON payload. Based on a cursory reading of the Falcor documentation, this payload is likely consumed by Falcor, where the values of callPath route to different internal services.
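To make that routing idea concrete, here’s a minimal sketch of dispatching on callPath. The service names and return values are hypothetical; only the callPath key itself comes from the observed payloads.

```javascript
// Hypothetical Falcor-style dispatch on callPath. The handler names and
// return values below are illustrative; only the callPath key is observed.
const services = {
  interactiveVideoMoments: "interactive-moments-service",
  logInteractiveStateSnapshots: "interactive-log-service", // assumed name
};

function routeCall(payload) {
  const [head] = payload.callPath; // first path element selects the service
  return services[head] || "unknown-service";
}
```

A payload like `{ callPath: ["interactiveVideoMoments"] }` would then route to whichever internal service serves up the interactive data.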
I’m going to skip over the first 5 Shakti callPaths, but they’re listed below for reference.
- Reno / Lolomo:
- Dynamic Messages:
But wait, what’s this! At least 3 of the 8 callPaths used in Shakti requests have the word “interactive“ in them.
This looks promising. Not only are we seeing the word “interactive” but there’s also a “state” in there as well. We’ll come back to the two “log” callPaths in a bit, but for now let’s check out the response of the interactiveVideoMoments callPath.
The interactiveVideoMoments callPath is 1 of 6 Shakti API requests that are called on page load, and it appears to only be called once. So it’s reasonable to assume this is some sort of initialization call.
But the most noticeable thing about the response data for the interactiveVideoMoments callPath is that it’s big. Really big. Prettified, the JSON structure is over 25,000 lines long.
In copying the response data below, I’ve cut out most of the actual content in order to show the structure of the JSON Graph.
Within this JSON Graph, there seem to be at least 4 components that define the interactive components of Bandersnatch:
- stateHistory: initialization of 62 state variables (59 boolean and 3 multivariate)
- momentsBySegment: a list of video segments by type (scenes, impressions, post plays, etc) describing state preconditions, new state data and choices (also defined by segments).
- preconditions: a list of precondition definitions for each segment, defined by simple to complex conditional logic using the state variables
- segmentGroups: a list of segment group IDs and the segments that make them up, including precondition requirements
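Pieced together, the overall response shape might look something like this heavily simplified stand-in. The four top-level keys match the components above, but everything nested inside them is an illustrative guess at the real 25,000-line structure.

```javascript
// Heavily simplified stand-in for the interactiveVideoMoments response.
// Top-level keys match the observed components; nested values are guesses.
const interactiveVideoMoments = {
  stateHistory: { p_sp: false, p_vh: false }, // 62 state variables in reality
  momentsBySegment: {
    "1A": [{ type: "scene:cs_bs", startMs: 0, endMs: 150000 }], // times made up
  },
  preconditions: {
    "3R": ["persistentState", "p_vh"], // expression format assumed
  },
  segmentGroups: {
    VisitHaynesChoice: ["3R"], // 6 segment IDs in the real data
  },
};
```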
Let’s look at some of these components in more detail from the perspective of the first choice in Bandersnatch, where Stefan’s dad asks if he wants Sugar Puffs or Frosties cereal.
Source (Credit: Netflix)
Let’s start with the stateHistory list. At face value, this list of state parameters doesn’t tell us much. But we know it’s a state initialization of some kind, and since we know that Netflix created new “state tracking” technology for this interactive feature, I’m guessing it’s an important component! We’ll come back to this.
Moments By Segment
Next, scanning through the momentsBySegment list we come across a segment 1A that seems related to this first choice in the show, the one between Sugar Puffs (choice 1E) and Frosties (choice 1D).
"text": "SUGAR PUFFS"
There’s a lot to unpack here, but there are three things that stand out. For one, there are multiple definitions for start and end times in milliseconds. These values seem to define specific sections of video that relate to both the segment preceding a choice and the different choices being made. It’s also notable that segment 1A has only one set of definitions of a type “scene:cs_bs”.
We can also see that a userState array of impressionData looks similar to the stateHistory list of parameters. Since this data seems to be independent of which choice is selected, I’m guessing this state data is updated before a choice is even made, maybe at some point while the preceding video segment is playing.
Let’s take a look at the details for one of the two choices, segment 1E (aka Sugar Puffs):
"text": "THOMPSON TWINS"
"text": "NOW 2"
From segment 1A, we know that segment 1E is the choice for Sugar Puffs. Unlike segment 1A, we see two sets of definitions: one for a type “scene:cs_bs” (same as segment 1A) and one for a type “notification:playbackImpression”. The definition for this second type includes a precondition on the state p_sp and a userState update of that same state parameter. I’ll go out on a limb and claim that the “sp” in p_sp stands for “Sugar Puffs”.
The definitions for the “scene:cs_bs” include start and end timings, as well as data on the next choice at the end of this segment, specifically the choice of music tapes between Thompson Twins (segment 1H) and Now 2 (segment 1G).
It’s interesting to note that the data for segment 1D (Frosties) looks very similar to segment 1E, including the data for the subsequent choice between segments 1H and 1G, with the notable difference being that the state parameter p_sp is set to false (i.e. not Sugar Puffs).
We’re starting to see a bit of a trend here, so let’s summarize what we’ve found so far. From what we’ve seen, segment definitions can include:
- Start and end times related to some kind of video sequence
- Preconditions of state parameters related to playback impressions (whatever those might be)
- Updates to state parameters
- Choices (either 1 or 2) that can occur at some point in a segment, and the IDs of the segments those choices lead to
- One or more types of definitions for each segment, including the following types: “notification:action”, “notification:playbackImpression”, “scene:cs_bs”, “scene:cs_bs_phone” and “scene:interstitialPostPlay_v2”
As for these segment definition types, there seem to be two notification types and three scene types. For the time being, I won’t go into detail on how these segment definitions differ.
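Putting those fields together, a single segment entry might look something like the sketch below. The segment IDs, choice labels and type strings come from the observed data, but the timings, key names and expression format are guesses.

```javascript
// Hypothetical momentsBySegment entry combining the fields listed above.
// IDs, labels and type strings come from the observed data; the timings,
// key names and precondition expression format are illustrative guesses.
const segment1E = [
  {
    type: "notification:playbackImpression",
    precondition: ["persistentState", "p_sp"], // gate on the Sugar Puffs flag
    impressionData: { persistent: { p_sp: true } }, // state update
  },
  {
    type: "scene:cs_bs",
    startTimeMs: 180000, // made-up offsets into the master video
    endTimeMs: 240000,
    choices: [
      { segmentId: "1H", text: "THOMPSON TWINS" },
      { segmentId: "1G", text: "NOW 2" },
    ],
  },
];
```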
Our next component type is a precondition. Let’s jump ahead a bit in the narrative of Bandersnatch to explore this new component. At one point in the sequence, Stefan visits Dr. Haynes’s office and the viewer is faced with a choice for Stefan of Biting Nails or Pulling Earlobe.
Scanning through the momentsBySegment list for this choice, we find that it’s defined as segment 3R. Then, looking for that segment ID in the list of preconditions, we find this definition.
Seems simple enough. This segment has a precondition on only one state variable, the state p_vh (vh = Visit Haynes?). However, it’s not clear what this precondition does exactly. Is it related to a playback? Or a scene?
Looking at the long list of preconditions, it seems that they can range from simple logic expressions involving only one state to more complex expressions involving many states. Although we don’t know exactly how these preconditions are used, it’s fair to assume that they are static definitions that act as a flow control of sorts for the branching segment structure, opening and closing narrative pathways depending on the different state values.
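To make that concrete, here’s a minimal evaluator for such preconditions, assuming they are nested prefix expressions over the state variables. The operator names are guesses based on the patterns visible in the JSON, not a confirmed grammar.

```javascript
// Minimal precondition evaluator, assuming nested prefix expressions like
// ["and", ["persistentState", "p_vh"], ["not", ["persistentState", "p_sp"]]].
// The operator names are assumptions based on the visible JSON patterns.
function evalPrecondition(expr, state) {
  const [op, ...args] = expr;
  switch (op) {
    case "persistentState":
      return Boolean(state[args[0]]); // look up a state variable
    case "not":
      return !evalPrecondition(args[0], state);
    case "and":
      return args.every((e) => evalPrecondition(e, state));
    case "or":
      return args.some((e) => evalPrecondition(e, state));
    default:
      throw new Error(`Unknown operator: ${op}`);
  }
}
```

With state `{ p_vh: true }`, a precondition like segment 3R’s `["persistentState", "p_vh"]` evaluates to true, opening that narrative pathway.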
Let’s stick with segment 3R for a second and take a look at our last set of components in our JSON Graph that seem important, segmentGroups. This component seems to be a way to organize the structure of how segments are connected. Some segmentGroups have a static set of segments making up a group, whereas others have dynamic definitions based on preconditions.
For example, segment 3R shows up in two different segmentGroups, VisitHaynesChoice and 3Q. The group VisitHaynesChoice is a collection of 6 different segment IDs.
And the segmentGroup 3Q is a collection of 2 segment IDs, where one of them (3S) appears to be included in the group only when a precondition labeled 3S_s3Q is met.
Looking at other segmentGroups, we see ones statically defined like VisitHaynesChoice, others dynamically defined with preconditions like 3Q, and even some that include other groups within a segmentGroup.
If all of this isn’t confusing enough, it should be pointed out that ID names can refer to segments, segmentGroups, preconditions or momentsBySegment entries, and the same ID can be used for just one of these or shared across several of them. So whereas the ID 3S_s3Q is just a precondition, the ID 3Q is both a segment and a segmentGroup.
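Here’s a sketch of how a group like 3Q might be resolved into playable segments, assuming dynamic entries carry a precondition reference that gates their inclusion. The data shapes here are guesses, not Netflix’s actual format.

```javascript
// Hypothetical resolution of a segmentGroup into concrete segment IDs.
// Entries are either plain IDs (static members) or objects carrying a
// precondition reference (dynamic members); both shapes are assumptions.
function resolveGroup(group, preconditions, state, evaluate) {
  return group
    .filter((entry) => {
      if (typeof entry === "string") return true; // static member
      const expr = preconditions[entry.precondition];
      return evaluate(expr, state); // dynamic member gated by its precondition
    })
    .map((entry) => (typeof entry === "string" ? entry : entry.segment));
}

// Group 3Q: segment 3S only appears when precondition 3S_s3Q holds.
const group3Q = ["3R", { segment: "3S", precondition: "3S_s3Q" }];
const groupPreconditions = { "3S_s3Q": ["persistentState", "p_vh"] };
const evaluate = (expr, state) => Boolean(state[expr[1]]); // stub evaluator
```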
Mapping It All Out
Mapping out the complete flow of all of these different segments, groups, preconditions and states is a daunting task. One way to try and make sense of all of this is to show how the logic and data structures above tie in to specific video segments. To poke at that from the browser console, we can grab a handle to the in-page video player. The call chain below matches the player API commonly used by scripts that drive Netflix’s player, so treat it as an approximation.

const videoPlayer = netflix.appContext.state.playerApp.getAPI().videoPlayer;
// Getting player id
const playerSessionId = videoPlayer.getAllPlayerSessionIds()[0];
const player = videoPlayer.getVideoPlayerBySessionId(playerSessionId);
Using our example of segment 3R from above, we can map out some of its related video clips using start and end times detailed in both the choicePoints and momentsBySegment components (shown in the image below).
(Credit: Jon Engelsman)
This shows how these data components (in JSON format) are defined and how they relate to specific points in the master video for this specific segment 3R. Looking at it another way, we can show how the state parameters and preconditions trigger different combinations of video segment types (scenes, playback impressions, etc) and their associated choices.
(Credit: Jon Engelsman)
From this, we can start to see how these components work together in mapping out the story flow. And how they’re used to jump around to different segments of the master video depending on which choices are made and the overall state of the interactive history, somehow all resulting in a seamless media experience.
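As a simplified sketch of that last step, a player could seek the master video to the chosen segment’s start time. The timings below are made up, and the real client presumably pre-buffers both branches so the jump stays seamless.

```javascript
// Hypothetical seek-to-segment step. Offsets are made up; the real client
// pre-buffers upcoming branches so the jump never stalls for buffering.
const segmentTimes = {
  "3R": { startMs: 1200000, endMs: 1450000 }, // illustrative offsets
};

function seekToSegment(player, segmentId) {
  const times = segmentTimes[segmentId];
  player.seek(times.startMs); // jump the master video to the segment start
  return times;
}

// Stub player standing in for the real video player object.
const stubPlayer = {
  position: 0,
  seek(ms) { this.position = ms; },
};
```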
OK, This Seems Complicated, Right?
Looking back at some of the flowcharts I’ve seen on Reddit, I have to wonder if they’re missing some narrative pathways, specifically because they seem to have only a rudimentary accounting of the 62 state parameters. A count of the items in each of the main components we’ve explored shows just how large and complex a structure Bandersnatch appears to be under the surface:
- Preconditions: 241
- momentsBySegment: 208
- segmentGroups: 111
- stateHistory: 62
With so many segments and state parameters, there’s the potential for a lot of narrative variability depending on the complexity of preconditions for the various video segments.
To handle this complexity, we know that Netflix had to create a new piece of software called Branch Manager to build out the non-linear narrative. Black Mirror showrunner Charlie Brooker describes the difficulty of mapping out Bandersnatch using just a flow chart:
“You couldn’t do this in a flow chart because it’s dynamic and tracking what state you are in and doing things accordingly,” Brooker explained. With the nifty tool, he could “input and deliver his evolving script directly to Netflix.”
Another interesting aspect of all of this is that Netflix somehow managed to implement these interactivity components within the context of their existing streaming infrastructure. They took a 5+ hour video, built a complex state/precondition/segmentation layer on top of it and then developed a process to jump back and forth to different points in the master video, all without any video buffering. The fact that they were able to do all of this relying on the same Shakti API services that they use for other Netflix content really speaks to the robustness and versatility of the services that they’re building.
Source Code and Akira
Ok, so we’ve seen how Netflix used their Shakti API to deliver initialization data that defines the entire interactive narrative and its video structure. But how does that all work together to actually manipulate the video that’s being watched and provide for a seamless media experience? Did they build something new to handle the video segmentation aspect of interactivity? To look into this, we need to check out some source code on the client side.
I couldn’t find any reference to this specific Netflix library online, so I’m not sure if it’s something new or something that Netflix has been using for a while. But looking at another (non-interactive) episode of Black Mirror, we see a similar Akira library being loaded via a similar request URL.
Although the URLs look similar, each includes a long character string (maybe a hash?) that differs slightly between the two requests.
Unminifying the non-interactive Akira library shows just under 83,000 lines of code, about 6,000 fewer lines than the interactive version. And a cursory comparison of these libraries shows that the interactive Akira library has references to the four interactive components we’ve explored (preconditions, stateHistory, momentsBySegment and segmentGroups), while the non-interactive Akira library does not.
So it appears that this Akira library is handling most (or all) of the client-side functionality of Netflix’s “state tracking” technology, in theory updating state values, evaluating preconditions and jumping between video segments, all based on the data components loaded by the Shakti API.
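Tying the earlier pieces together, that client-side loop might look roughly like this sketch: apply a choice’s state updates, then pick the next segment whose precondition passes. All names and data shapes here are assumptions carried over from the earlier sections, not Akira’s actual internals.

```javascript
// Hypothetical client-side step: apply a choice's state updates, then pick
// the next segment whose precondition (if any) passes. All shapes assumed.
function advance(choice, state, preconditions, evalExpr) {
  const nextState = { ...state, ...choice.stateUpdates }; // apply updates
  const candidates = choice.nextSegments.filter((id) => {
    const expr = preconditions[id];
    return !expr || evalExpr(expr, nextState); // unguarded segments pass
  });
  return { nextState, nextSegment: candidates[0] };
}

const evalExpr = (expr, state) => Boolean(state[expr[1]]); // stub evaluator
const sugarPuffsChoice = { stateUpdates: { p_sp: true }, nextSegments: ["1E"] };
```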
Source (Credit: Netflix)
UPDATE: It turns out that Netflix has been using the Akira client for a while now! I found a 2015 article about Netflix interactions where then Director of Engineering Operations Josh Evans provides the following comment:
“We created what we call our ‘Darwin’ user interface, moving from vertical to horizontal box shots, and we tuned our algorithms … a lot of innovation went in,” says Evans.
The same kind of interface is on the website, too, creating what Evans calls the ‘Akira’ user interface.
“All the information you need is at your fingertips,” he says. It sounds simple, but it’s built on advanced telemetry, real-time analytics and advanced machine learning.
So while it seems that Netflix has been using the Akira user interface for a while now, I still think it’s important to note that they serve up two different versions of the client library, depending on whether the title is interactive or not.
Through the Looking-Glass
It’s clear that Netflix put a lot of effort into developing this interactive technology for Bandersnatch. To do this, they seem to have built out a narrative-specific JSON Graph, more than 6,000 lines of new client-side code and even a new Branch Manager tool to lay out the complex narrative structure of the episode.
This write-up is an attempt to make sense of some of that work, to explore and wrap my head around the details that went into the new streaming technology behind Bandersnatch. It’ll be interesting to see what new interactive titles Netflix might develop next, and whether or not they will continue to build out the technology and concepts explored in this post.
Thanks for reading!
Originally published at engelsjk.com on December 30, 2018.