NetflixGraph Metadata Library: An Optimization Case Study
by Drew Koszewnik
Here at Netflix, we serve more than 30 million subscribers across over 40 countries. These users collectively generate billions of requests per day, most of which require metadata about our videos. Each call we receive can potentially cull through thousands of video metadata attributes. In order to minimize the latency at which this data can be accessed, we generally store most of it directly in RAM on the servers responsible for servicing live traffic.
We have two main applications that package and deliver this data to the servers which enable all of our user experiences, from playing movies on the thousands of devices we support to just checking out our selection on a cell phone:
- VMS, our Video Metadata Platform, and
- NetflixGraph, which contains data representable as a directed graph
VMS is responsible for packaging data about videos, such as synopses and titles, as well as data about video artwork and streams. NetflixGraph contains data about relationships between entities like videos, characters, and tags (e.g. gritty, quirky, funny). This data enables the highly personalized service our users enjoy. Remember when you let Netflix know that you enjoyed Exciting movies? Fantasy movies? Remember when you enjoyed watching Spider-Man? That’s why we decided to recommend Captain America to you. In the directed graph NetflixGraph represents, Captain America is connected to many of the things we’ve discovered explicitly and implicitly about you.
This article specifically details how we achieved a 90% reduction in the memory footprint of NetflixGraph. The results of this work will be open-sourced in the coming months.
We constantly need to be aware of the memory footprints on our servers at Netflix. NetflixGraph presented a great opportunity for experimentation with reduction of memory footprints. The lessons and techniques we learned from this exercise have had a positive impact on other applications within Netflix and, we hope, can have applications outside of Netflix as well.
The first step in the optimization of any specific resource is to become familiar with the biggest consumers of that resource. After all, it wouldn’t make much sense to shrink a data structure that consumes a negligible amount of memory; it wouldn’t be an optimal use of engineering time.
We started by creating a small test application which loaded only sample NetflixGraph data, then we took a heap dump of that running process. A histogram from this dump (shown below in Eclipse Memory Analyzer) shows us the types of objects which are the largest memory consumers:
From this histogram, we can clearly see that HashMapEntry objects and arrays of HashMapEntry objects are our largest consumers by far. In fact, these structural elements themselves consumed about 83% of our total memory footprint. Upon inspection of the code, the reason for this finding was not surprising. The relationships between objects in our directed graph were generally represented with HashMaps in which HashSets of “to” objects were keyed by “from” objects. For example, the set of genres to which a video belongs would have been represented with a HashMap<Video, HashSet<Genre>>. In this map, the Video object representing “Captain America” might have been the key for a Set containing the Genres “Action”, “Adventure”, “Comic Books & Superheroes”, and maybe, in typical Netflix fashion, the very specific “Action & Adventure Comic Book Movies Depicting Scenes from World War II”.
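For illustration, the map-of-sets layout described above might be sketched as follows. The class and method names (MapOfSetsGraph, connect, isConnected) and the minimal Video and Genre classes are hypothetical stand-ins, not the actual NetflixGraph code.

```java
import java.util.HashMap;
import java.util.HashSet;

final class MapOfSetsGraph {
    // Hypothetical stand-ins for the Video and Genre types named in the text.
    static final class Video {
        final String title;
        Video(String title) { this.title = title; }
    }

    static final class Genre {
        final String name;
        Genre(String name) { this.name = name; }
    }

    // One map entry per "from" node; each value is a HashSet,
    // which the JDK itself backs with another HashMap.
    final HashMap<Video, HashSet<Genre>> genresByVideo = new HashMap<>();

    void connect(Video from, Genre to) {
        genresByVideo.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // Membership test: a hash lookup in the map, then one in the set.
    boolean isConnected(Video from, Genre to) {
        HashSet<Genre> set = genresByVideo.get(from);
        return set != null && set.contains(to);
    }
}
```

In a layout like this, every key and every set element is held by its own hash-table entry object, which is consistent with entry objects dominating the histogram.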
Solution: Compact Encoded Data Representation
We knew that we could hold the same data in a more memory-efficient way. We created a library to represent directed-graph data, which we could then overlay with the specific schema we needed.
We start by numbering each of our unique objects from 1 to n. The value that each object gets assigned we refer to as an “ordinal”. Once each object is numbered, we need a data structure which will maintain the relationships between ordinals. Let’s take an example: the figure below represents a few nodes in our graph which have been assigned ordinals and which are connected to each other by some property.
Internal to the graph data structure, we refer to each object only by its assigned ordinal. This way, we avoid using expensive 64-bit object references. Instead, the objects to which another object is connected can be represented by just a list of integers (ordinals). In the above diagram, we can see that the node which was assigned the ordinal “2” is connected to nodes 3, 5, and 7. These connections are of course fully representable by just the list of integers [ 3, 5, 7 ].
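One way to picture ordinal assignment is the small sketch below. It is only an illustration under assumptions: the OrdinalMap class and its methods are hypothetical names, and the article does not say how the library actually stores the mapping between objects and ordinals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: numbers unique objects 1..n and maps both ways.
final class OrdinalMap<T> {
    private final Map<T, Integer> ordinalByObject = new HashMap<>();
    private final List<T> objectByOrdinal = new ArrayList<>();

    OrdinalMap() {
        objectByOrdinal.add(null); // slot 0 is unused so that ordinals start at 1
    }

    // Returns the existing ordinal for an object, or assigns the next one.
    int assign(T object) {
        Integer existing = ordinalByObject.get(object);
        if (existing != null) {
            return existing;
        }
        int ordinal = objectByOrdinal.size();
        ordinalByObject.put(object, ordinal);
        objectByOrdinal.add(object);
        return ordinal;
    }

    // Reverse lookup: ordinal back to the object it was assigned to.
    T get(int ordinal) {
        return objectByOrdinal.get(ordinal);
    }
}
```

Only this one mapping needs object references; the connections themselves can then be stored as plain integers.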
Our data structure maintains two arrays. One is an integer array, and the other is a byte array. Each object’s connections are encoded into the byte array as delta-encoded variable-byte integers (more on this in the next paragraph). The integer array contains offsets into the byte array, such that the connections for the object represented by some ordinal are encoded starting at the byte indicated by offsetArray[ordinal].
Variable-byte encoding is a way to represent integers in a variable number of bytes, whereby smaller values are represented in fewer bytes. An excellent explanation is available on Wikipedia. Because smaller values can be represented in fewer bytes, we benefit significantly if we can represent our connected ordinals with smaller values. If we sort the ordinals for some connection set in ascending order, we might represent each connection not by its actual value, but by the difference between its value and the previous value in the set. For example, if we had some ordinals [1, 2, 3, 5, 7, 11, 13], we would represent this with the values [1, 1, 1, 2, 2, 4, 2].
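The combination of delta encoding and variable-byte encoding can be sketched as below. This is a minimal illustration, not the library's code: the DeltaVarInt class is a hypothetical name, and the byte layout (7 data bits per byte, low-order group first, high bit meaning "more bytes follow") is an assumption, since the article does not specify one.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class DeltaVarInt {
    // Writes a non-negative int, 7 bits at a time, low-order group first.
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes ascending ordinals as the differences between neighbours.
    static byte[] encode(int[] sortedOrdinals) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int ordinal : sortedOrdinals) {
            writeVarInt(out, ordinal - previous);
            previous = ordinal;
        }
        return out.toByteArray();
    }

    // Reverses encode(): a running sum of the decoded deltas.
    static int[] decode(byte[] bytes) {
        List<Integer> ordinals = new ArrayList<>();
        int pos = 0;
        int previous = 0;
        while (pos < bytes.length) {
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            ordinals.add(previous);
        }
        int[] result = new int[ordinals.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ordinals.get(i);
        }
        return result;
    }
}
```

With this sketch, the ordinals [1, 2, 3, 5, 7, 11, 13] encode to the seven bytes [1, 1, 1, 2, 2, 4, 2], matching the example above, while a delta of 128 or more spills into a second byte.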
Of course, there’s more to our data than just nodes and connections. Each node is typed (e.g. Video, Genre, Character), and each type has a different set of properties. For example, a video may belong to several genres, which is one type of connection. But it also may depict several characters, which is a different type of connection.
In order to represent these different types of nodes, and different properties for each node type, we define a schema. The schema tells us about each of our node types. For each node type, it also enumerates the properties over which a set of connections can be retrieved.
When all connections for a node are encoded, the connection grouping for each of its properties is appended to the byte array in the order in which the properties appear in the schema. Each group of integers representing these connections is preceded by a single integer, indicating the total number of bytes used to encode that property. In our earlier example, since each of the values [1, 1, 1, 2, 2, 4, 2] is representable with a single byte, this grouping would be preceded by the encoded value “7”, indicating that seven bytes are used to represent the connections for this property. This allows us to iteratively read how many bytes are in a given property, then skip that many bytes if we are not interested in that property.
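A rough sketch of this length-prefixed layout for a single node follows. NodeEncoder and encodeNode are hypothetical names, the schema is reduced to the order of the outer array, and the variable-byte layout (7 data bits per byte, low-order group first) is an assumption rather than the library's documented format.

```java
import java.io.ByteArrayOutputStream;

final class NodeEncoder {
    // Assumed variable-byte layout: 7 data bits per byte, high bit = "more follows".
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // connectionsByProperty[i] holds the ascending ordinals for the i-th
    // property of this node's type, in schema order.
    static byte[] encodeNode(int[][] connectionsByProperty) {
        ByteArrayOutputStream node = new ByteArrayOutputStream();
        for (int[] sortedOrdinals : connectionsByProperty) {
            // Delta-encode this property's connections into a scratch buffer.
            ByteArrayOutputStream group = new ByteArrayOutputStream();
            int previous = 0;
            for (int ordinal : sortedOrdinals) {
                writeVarInt(group, ordinal - previous);
                previous = ordinal;
            }
            // Prefix the group with its size in bytes, then append the group.
            byte[] groupBytes = group.toByteArray();
            writeVarInt(node, groupBytes.length);
            node.write(groupBytes, 0, groupBytes.length);
        }
        return node.toByteArray();
    }
}
```

For a node whose first property connects to [1, 2, 3, 5, 7, 11, 13] and whose second connects to [4], this produces [7, 1, 1, 1, 2, 2, 4, 2, 1, 4]: the prefix 7, seven delta bytes, the prefix 1, and one delta byte.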
At runtime, when we need to find the set of connections over some property for a given object, we go through the following steps:
1. find the object’s ordinal.
2. look up the pointer into our byte array for this object.
3. find the first property for the node type of this object in our schema.
4. while the current property is not the property we’re interested in:
   4a. read how many bytes are used to represent this property.
   4b. increment our pointer by the value discovered in (4a).
   4c. move to the next property in the schema.
5. iteratively decode values from the byte array, each time adding the current value to the previous value.
6. look up the connected objects by the returned ordinals.
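The middle steps above (from the offset lookup through decoding) might look roughly like the sketch below. It rests on several assumptions: CompactGraph and connections are hypothetical names, the property is identified by its position in the schema for the node's type, and the variable-byte layout is 7 data bits per byte with the low-order group first. Mapping between objects and ordinals (the first and last steps) is left out.

```java
import java.util.Arrays;

final class CompactGraph {
    private final int[] offsetArray; // offsetArray[ordinal] = first byte of that node
    private final byte[] data;       // length-prefixed, delta-encoded connections

    CompactGraph(int[] offsetArray, byte[] data) {
        this.offsetArray = offsetArray;
        this.data = data;
    }

    // Returns the connected ordinals for one property of one node.
    int[] connections(int ordinal, int propertyIndex) {
        int[] cursor = { offsetArray[ordinal] };

        // Skip every property that precedes the one we want.
        for (int p = 0; p < propertyIndex; p++) {
            int bytesToSkip = readVarInt(cursor);
            cursor[0] += bytesToSkip;
        }

        // Read the byte length of the wanted property, then decode its deltas.
        int length = readVarInt(cursor);
        int end = cursor[0] + length;
        int[] result = new int[length]; // each ordinal takes at least one byte
        int count = 0;
        int previous = 0;
        while (cursor[0] < end) {
            previous += readVarInt(cursor);
            result[count++] = previous;
        }
        return Arrays.copyOf(result, count);
    }

    // Reads one variable-byte integer and advances the cursor past it.
    private int readVarInt(int[] cursor) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = data[cursor[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

As a usage example, with data = [0, 0, 3, 3, 2, 2, 1, 4] and offsetArray = [0, 0, 2] (index 0 unused), node 2 starts at byte 2, so connections(2, 0) yields [3, 5, 7] and connections(2, 1) yields [4].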
When we dropped this new data structure in the existing NetflixGraph library, our memory footprint was reduced by 90%. A histogram of our test application from above, loading the exact same set of data, now looks like the following:
When to consider this solution
There is a potential disadvantage to this approach. In addition to memory footprint on our edge servers, another thing we constantly need to be cognizant of is CPU utilization. When we represented connections as HashSets, determining whether an object is connected to another object was an O(1) operation. To answer the same question with the new data structure, we must iterate over all values in the set, which is an O(n) operation.
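To make the cost concrete, a membership test over one encoded connection group could be sketched as below. ConnectionScan is a hypothetical name, the input is assumed to be the delta-encoded bytes of a single property without its length prefix, and the early exit is an addition of this sketch that relies on the ordinals being sorted in ascending order.

```java
final class ConnectionScan {
    // Linear scan: decode deltas one by one until the target is found or passed.
    static boolean contains(byte[] group, int targetOrdinal) {
        int pos = 0;
        int current = 0;
        while (pos < group.length) {
            // Decode one variable-byte delta (assumed layout: low-order group first).
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = group[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);

            current += delta;
            if (current == targetOrdinal) {
                return true;
            }
            if (current > targetOrdinal) {
                return false; // ascending order: the target cannot appear later
            }
        }
        return false;
    }
}
```

The early exit trims the average case, but the worst case still touches every encoded value, so the check remains O(n).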
Luckily, the vast majority of our access patterns for this data are full iterations over the sets, which are no slower now than they were in the previous implementation. In addition, the engineers for each of the teams responsible for maintaining our edge services are extremely vigilant, and our resource utilization is heavily scrutinized with sophisticated tools on each new deployment.
This article has discussed one of the approaches we took for compacting directed graph data in the context of one of our more recognizable data sets: deeply personalized genres. Any application with data sets which lend themselves to representation as directed graphs can benefit from this specific optimization. We will be open-sourcing the memory-optimized graph component of this library in the coming months. Stay tuned!
By the way, if you’re interested in working with the amazing group of engineers who solve the scalability problems Netflix faces on a daily basis, we are looking for both a software engineer and an automation engineer on the Product Infrastructure team. At Netflix, you’ll be working with some of the most talented engineering teammates in the world. Visit http://jobs.netflix.com to get started!
Originally published at techblog.netflix.com on January 18, 2013.