#NetflixAndChill: How Netflix Scales with Node.js and Containers

It wasn’t too long ago that Netflix was just a DVD company. When Kim Trott, director of UI platform engineering at Netflix, started about nine years ago, the company was just beginning to stream videos and had a catalog of only 50 titles. Now it streams content around the globe and produces its own original programming.

As the company continues to scale, the UI Platform team is focused on making the teams that build the Netflix application as productive and efficient as possible. Given Netflix’s scale, they’ve devised novel ways to make this happen.

In this enterprise interview (you can find the full video here), Mikeal Rogers, community manager at the Node.js Foundation, talks with Yunong Xiao, platform architect at Netflix, and Kim Trott about how Netflix is progressing with the containerization of its edge service layer (which lets the Netflix team access data from the hundreds of back-end microservices that exist within Netflix), other Node.js projects at Netflix, and how Netflix is giving back to the technology through working groups.

Mikeal: Hey, welcome to Enterprise Conversations with the Node.js Foundation. I’m Mikeal Rogers, the community manager. With us today are Kim Trott and Yunong Xiao from Netflix. Say hi.

Kim: Hi.

Yunong: Hey everyone.

Mikeal: Why don’t you tell us a little bit about yourself and your roles and team and what you all do at Netflix?

Kim: The team that I manage at Netflix is called the UI Platform team. We’re within the UI engineering part of the organization. Our team’s charter is really to help all the teams focused on building the Netflix application, to make them more productive and efficient. This can cover a wide range of things. It could be building libraries that are shared across all of the teams that make it easier to do data access or client-side logging. We write those libraries in JavaScript and Objective-C and Java, whatever target platforms we have within the teams. Then we also work on the Node.js platform here at Netflix and build out things that make it easier to run Node.js applications in production for UI-focused teams.

Yunong: To segue nicely, my role at Netflix is to work on the Node.js platform. I help out with running Node at scale here. Right now we’re in the middle of a big re-architecture effort, which we’ll get into later, around data access at Netflix. My predominant role here is to get Node.js up and running at scale and to make sure it’s performant, debuggable, and works well for us.

Kim: We also do consulting, helping teams with different problems that they have, coming up with solutions for those problems, and helping them integrate new technologies that we’re adopting. For example, when we’re working on a new localization problem, like string management, we would help all of the teams adopt that new service and pave the way for them to make sure it is running smoothly before the product-focused teams adopt it.

Mikeal: How long have you been at Netflix?

Kim: I’ve been at Netflix for 9 years in July. When I started, it was actually right when we launched the first streaming ever, so you could only watch on a Windows Media Player on a Windows machine. I think the catalog was maybe 50 titles, so very, very early days. We were primarily a DVD company at that point in time, so I’ve seen the evolution of Netflix going from DVD to streaming to now being our own content producer.

Yunong: I’ve been here for almost 2 years now. When I started, we had just had the 50-million-subscriber party, which I just missed out on, but we’re fast approaching 100 million subscribers soon, so we’ll have a bigger party.

Kim: Hope so.

Yunong: Previously, I had worked at Joyent for a few years, working on Node.js and distributed systems, and then before that, I was at AWS, also working on distributed systems.

Mikeal: Cool, very cool. You gave a talk at Node.js Interactive (Kim), which was really well received. I think it’s our number one video, actually. In the talk, you mentioned that you’re doing some containerization of the Edge Services over at Netflix. Could you give us an update on how that’s moving?

Kim: Yeah, so since we spoke at Node Interactive, we’ve made a lot of progress on the project, and we’re actually about to run a systems test and shadow traffic. We’re going to start with shadow traffic and then run a full systems test where we put customer traffic (real production traffic) through the new Node container layer to prove out the whole stack and flush out any problems, anything around scaling or memory. That’s really exciting. It’s been a lot of hard work over the last few months leading up to this point, so the team’s really excited. Yunong can share some of the interesting technology innovations and ideas that we’ve put into this project that we’re really excited about.

Yunong: I’ll just give you a little background on this project. Today, when any device or client tries to access Netflix, it has to go through what’s called Edge Services: a set of endpoint scripts, written in Groovy, that run on a monolithic JVM-based system and let clients access data from the hundreds of backend microservices that exist at Netflix. That’s been working really well for us, but we are hitting some vertical scaling concerns. We thought, “Hey, this is a great opportunity to leverage Node (which uses JavaScript, a language a lot of our client engineers are really well versed in, and most of our UIs are in JavaScript) and Docker to be able to horizontally scale all of these data access scripts out.”

This has been a really exciting project. One of the things where I think we’re taking a pretty unique approach is the way that we’re routing all of these data access scripts. Traditionally, today, everything runs inside of one monolith and the routing is handled entirely inside of the JVM. So, we are breaking out all these hundreds of different data access scripts into their own individual Node.js apps, which means that we have to be able to manage them more discretely.

One of the approaches we’ve taken is actually using semver as a way of routing to each of these individual apps. Now we version each app by semantic versioning and then we’re able to route based on that version. An individual client can ask for a minor version or a patch version or a major version, and just like the way it works with NPM packages, we’re able to route that request to the latest minor version or the latest major version of a particular Node app inside of a container. This is really novel because it helps us easily upgrade everything on our backend, just like how NPM works with Node packages, while still supporting the same clients on the front end. A lot of our clients use an app store model, which means that they’re out there for a very long time and don’t ever really get updated, but we still want the flexibility to be able to update all of our data access applications running Node inside containers. That’s one of the novel things that we’ve done and are working on right now.
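The semver-style routing Yunong describes can be sketched roughly like this. This is a minimal illustration, not Netflix’s implementation: it resolves a client’s requested version pattern (where “x” means “latest”) against the versions currently deployed, the same way NPM resolves a range to the newest matching package.

```javascript
// Minimal sketch of semver-based routing (illustrative only):
// given the versions of an app that are deployed, resolve a client's
// requested pattern like "1.x.x" (latest 1.*), "1.2.x" (latest 1.2 patch),
// or an exact "1.2.3" to the newest deployed version that matches.

function parse(v) {
  const [major, minor, patch] = v.split('.').map(Number);
  return { major, minor, patch };
}

function resolve(deployed, requested) {
  const want = requested.split('.');
  // Keep only deployed versions whose parts match the request,
  // treating "x" as a wildcard for that position.
  const candidates = deployed.filter((v) => {
    const have = v.split('.');
    return want.every((part, i) => part === 'x' || part === have[i]);
  });
  // Route to the highest matching version (sorted descending).
  candidates.sort((a, b) => {
    const pa = parse(a);
    const pb = parse(b);
    return (pb.major - pa.major) || (pb.minor - pa.minor) || (pb.patch - pa.patch);
  });
  return candidates[0] || null;
}

const deployed = ['1.0.0', '1.2.3', '1.3.0', '2.0.1'];
console.log(resolve(deployed, '1.x.x')); // → '1.3.0'
console.log(resolve(deployed, '1.2.x')); // → '1.2.3'
console.log(resolve(deployed, '3.x.x')); // → null (nothing deployed matches)
```

In a routing tier, the resolved version would then select which container instance receives the request, which is how a patch release can be picked up by old clients without any client-side change.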

I’ve given a talk at Container Camp with a lot more details about this implementation, but we think that’s really novel. Along with this, obviously we’re also building out a registry, just like the NPM registry, that is an index for all of these apps so that we can figure out which semver version of which app points to which Docker version. This has been a very large cross-cutting project at Netflix involving not just our team, but many teams: the Edge Services team, the tooling team… We have a team that’s working on all of the Docker container-based infrastructure as well. It’s a really exciting time to be at Netflix.
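At its simplest, the kind of registry Yunong mentions is an index from an app name and semver version to the Docker image that serves it. The sketch below is hypothetical — the app names and image tags are made up for illustration:

```javascript
// Hypothetical registry index: app name → deployed semver version → Docker image.
// All names and tags here are illustrative, not Netflix's actual registry.
const registry = {
  'homepage-data': {
    '1.2.3': 'example-registry/homepage-data:build-101',
    '1.3.0': 'example-registry/homepage-data:build-117',
  },
};

// Look up which Docker image serves a given app at a given resolved version.
function imageFor(app, version) {
  const versions = registry[app];
  return versions ? (versions[version] || null) : null;
}

console.log(imageFor('homepage-data', '1.3.0')); // → 'example-registry/homepage-data:build-117'
console.log(imageFor('homepage-data', '9.9.9')); // → null (version not deployed)
```

The routing tier would first resolve a semver pattern to a concrete version, then consult an index like this to find the container image to dispatch to.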

Kim: I think that’s really going to help us track the versions running in production. When there’s a need for, say, a security patch or some critical update, we can use the patch-version semver semantics to target that and get that fix out there without needing to rev any client-side code. That’s important to us, because every time we push a new client, it can create churn in terms of performance and other aspects, and sometimes means going through the certification process for the app store model. We can move much more quickly, just updating on the server side, without having to impact the client side.

Mikeal: Brilliant. How has this impacted the developer productivity at Netflix? You mentioned that this allows you to move a lot quicker and not have to go through the whole app store process. You also mentioned a little something about this in your talk, I think, as well.

Kim: Yeah, so we are not at the stage yet of widespread adoption of this. As I mentioned, we’re about to run the shadow traffic next week, which is essentially taking customer traffic coming through and then shadowing it against the new Node application stack and making sure that it scales out. I think we have yet to realize some of the developer productivity benefits, but we do have folks who are using it in the early stages to build out the systems test that we’re running, and so far I think the feedback on that’s been really positive.

The developer productivity really comes from breaking down the monolith into smaller pieces, so that it’s now much more manageable to run locally on your machine. Through the containerization, we can effectively guarantee that what you’re running locally will very closely mirror what you run in production. That’s really beneficial. Because of the way Node works, we can attach debuggers, set breakpoints, and step through code.

In the past, if you wanted to debug these Groovy scripts, you would make some code changes, upload them to the Edge layer, run them, see that they break, make some more changes, and upload again. There was a REPL, but it was hard to pull your code out of your script and run it in the REPL. It was a really challenging debugging workflow for developers. A lot of print statements and really awful things like that.

I think making it very easy to run locally is going to be the huge boon to developer productivity: you can iterate much more rapidly because there’s nothing you have to deploy to the cloud to see your changes and test things out.

Yunong: Yeah, so what used to take you, say, tens of minutes to test, you can now literally run locally. I think a real testament to this project is that all of our engineers working on the clients are asking us, “When do we get to use this instead of the current legacy stack?” That’s a really good testament to where we think this project will get us.

Kim: Yeah, so we’re hoping to start with the first adopter at really big scale in production in Q3, using Q2 to stabilize and harden everything and prove it out through the shadow traffic and some systems tests, where we only really run it long enough to get some learning. Then teams will start adopting it in Q3.

There’s folks who knock on our door regularly saying, “When is this coming? It can’t come soon enough.” I think there’s a lot of built up demand and excitement. We’re trying to move as fast as we can, but obviously we have a lot of customers in production that we don’t want to negatively impact, so we have to move carefully as well.

Mikeal: Other than running shadow traffic and finishing all of this out, is there anything else in the future road map over at Netflix with Node.js?

Kim: There’s a lot more stuff that is coming. I think one of the things that we’re really excited about is after moving past building out this stack, to start working on some of the tooling and performance related stuff, so better tools for postmortem debugging is something that we’re really passionate about and really want to be involved in the working groups and help contribute back to the community, so that we can build better tools that everyone can leverage.

Yunong: We’re also sort of working … Again, I think one of the reasons why Node.js is so popular is actually the fact that it’s got a really solid suite of tools that’ll let you debug when your process goes faulty, and so that’s something that we’re actively working on and contributing to. Like Kim said, we’re on a lot of these working groups and we’d love to see these tools get better and advance in the next few months, and to contribute some of these tools as well on our own time.

The other thing that we’re working on, which I guess is part of the containerization of the Edge Services project, is something called ReactiveSocket, which is a new multiplexed, duplex protocol that we built a version of in JavaScript and on Node, and that’s something that is open source today that we’re using as part of our Edge re-architecture. That’s something that we’re pretty excited to work with as well, as it has huge performance improvements over the already-existing HTTP stack. Those are some of the things that we’re working on.

Kim: Yeah, and I think with ReactiveSocket, it unlocks some new use cases for us. Today the scripts run at the same tier as the API service layer, which does this aggregation of the hundreds of microservices at Netflix. Because we’re separating that out, we now make remote calls to the Edge layer (we’re now calling it the remote service layer) to get all of that aggregated metadata back to the Node layer, where we do translation, aggregation, and UI- and client-specific logic. Breaking that apart, there’s been a little concern that we’re introducing another hop into the overall process, and what are the implications of that going to be? Is it going to affect performance? Are we going to have higher error rates or latencies or problems with that?

So, one of the interesting things that we get to leverage with ReactiveSocket is a new interaction model using channels, where we can keep a connection open and start sending data back and forth. If we go to the Edge for some data and then realize while processing that data that we need to go get more data, we can do that very efficiently because we can leave that connection open, and on the Edge side they can even keep around some of their in-memory caches, so we can make these communications really efficient, even though we’re breaking apart those tiers. That’s a really exciting aspect of ReactiveSocket that we haven’t fully tapped into yet.

Mikeal: Wow, cool. That’s all that I got for you today. This was great talking with you, and I really appreciate you taking the time to talk with the Foundation.