How We Build Code at Netflix
How does Netflix build code before it’s deployed to the cloud? While pieces of this story have been told in the past, we decided it was time we shared more details. In this post, we describe the tools and techniques used to go from source code to a deployed service serving movies and TV shows to more than 75 million global Netflix members.
The above diagram expands on a previous post announcing Spinnaker, our global continuous delivery platform. There are a number of steps that need to happen before a line of code makes it way into Spinnaker:
- Code is built and tested locally using Nebula
- Changes are committed to a central git repository
- A Jenkins job executes Nebula, which builds, tests, and packages the application for deployment
- Builds are “baked” into Amazon Machine Images
- Spinnaker pipelines are used to deploy and promote the code change
The rest of this post will explore the tools and processes used at each of these stages, as well as why we took this approach. We will close by sharing some of the challenges we are actively addressing. You can expect this to be the first of many posts detailing the tools and challenges of building and deploying code at Netflix.
Culture, Cloud, and Microservices
Before we dive into how we build code at Netflix, it’s important to highlight a few key elements that drive and shape the solutions we use: our culture, the cloud, and microservices.
The Netflix culture of freedom and responsibility empowers engineers to craft solutions using whatever tools they feel are best suited to the task. In our experience, for a tool to be widely accepted, it must be compelling, add tremendous value, and reduce the overall cognitive load for the majority of Netflix engineers. Teams have the freedom to implement alternative solutions, but they also take on additional responsibility for maintaining these solutions. Tools offered by centralized teams at Netflix are considered to be part of a “paved road”. Our focus today is solely on the paved road supported by Engineering Tools.
In addition, in 2008 Netflix began migrating our streaming service to AWS and converting our monolithic, datacenter-based Java application to cloud-based Java microservices. Our microservice architecture allows teams at Netflix to be loosely coupled, building and pushing changes at a speed they are comfortable with.
Naturally, the first step to deploying an application or service is building. We created Nebula, an opinionated set of plugins for the Gradle build system, to help with the heavy lifting around building applications. Gradle provides first-class support for building, testing, and packaging Java applications, which covers the majority of our code. Gradle was chosen because it was easy to write testable plugins, while reducing the size of a project’s build file. Nebula extends the robust build automation functionality provided by Gradle with a suite of open source plugins for dependency management, release management, packaging, and much more.
The above ‘build.gradle’ file represents the build definition for a simple Java application at Netflix. This project’s build declares a few Java dependencies as well as applying 4 Gradle plugins, 3 of which are either a part of Nebula or are internal configurations applied to Nebula plugins. The ‘nebula’ plugin is an internal-only Gradle plugin that provides convention and configuration necessary for integration with our infrastructure. The ‘nebula.dependency-lock’ plugin allows the project to generate a .lock file of the resolved dependency graph that can be versioned, enabling build repeatability. The ‘netflix.ospackage-tomcat’ plugin and the ospackage block will be touched on below.
With Nebula, we provide reusable and consistent build functionality, with the goal of reducing boilerplate in each application’s build file. A future techblog post will dive deeper into Nebula and the various features we’ve open sourced. For now, you can check out the Nebula website.
Once a line of code has been built and tested locally using Nebula, it is ready for continuous integration and deployment. The first step is to push the updated source code to a git repository. Teams are free to find a git workflow that works for them.
Once the change is committed, a Jenkins job is triggered. Our use of Jenkins for continuous integration has evolved over the years. We started with a single massive Jenkins master in our datacenter and have evolved to running 25 Jenkins masters in AWS. Jenkins is used throughout Netflix for a variety of automation tasks above just simple continuous integration.
A Jenkins job is configured to invoke Nebula to build, test and package the application code. If the repository being built is a library, Nebula will publish the .jar to our artifact repository. If the repository is an application, then the Nebula ospackage plugin will be executed. Using the Nebula ospackage (short for “operating system package”) plugin, an application’s build artifact will be bundled into either a Debian or RPM package, whose contents are defined via a simple Gradle-based DSL. Nebula will then publish the Debian file to a package repository where it will be available for the next stage of the process, “baking”.
Our deployment strategy is centered around the Immutable Server pattern. Live modification of instances is strongly discouraged in order to reduce configuration drift and ensure deployments are repeatable from source. Every deployment at Netflix begins with the creation of a new Amazon Machine Image, or AMI. To generate AMIs from source, we created “the Bakery”.
The Bakery exposes an API that facilitates the creation of AMIs globally. The Bakery API service then schedules the actual bake job on worker nodes that use Aminator to create the image. To trigger a bake, the user declares the package to be installed, as well the foundation image onto which the package is installed. That foundation image, or Base AMI, provides a Linux environment customized with the common conventions, tools, and services required for seamless integration with the greater Netflix ecosystem.
When a Jenkins job is successful, it typically triggers a Spinnaker pipeline. Spinnaker pipelines can be triggered by a Jenkins job or by a git commit. Spinnaker will read the operating system package generated by Nebula, and call the Bakery API to trigger a bake.
Once a bake is complete, Spinnaker makes the resultant AMI available for deployment to tens, hundreds, or thousands of instances. The same AMI is usable across multiple environments as Spinnaker exposes a runtime context to the instance which allows applications to self-configure at runtime. A successful bake will trigger the next stage of the Spinnaker pipeline, a deploy to the test environment. From here, teams will typically exercise the deployment using a battery of automated integration tests. The specifics of an application’s deployment pipeline becomes fairly custom from this point on. Teams will use Spinnaker to manage multi-region deployments, canary releases, red/black deployments and much more. Suffice to say that Spinnaker pipelines provide teams with immense flexibility to control how they deploy code.
The Road Ahead
Taken together, these tools enable a high degree of efficiency and automation. For example, it takes just 16 minutes to move our cloud resiliency and maintenance service, Janitor Monkey, from code check-in to a multi-region deployment.
That said, we are always looking to improve the developer experience and are constantly challenging ourselves to do it better, faster, and while making it easier.
One challenge we are actively addressing is how we manage binary dependencies at Netflix. Nebula provides tools focused on making Java dependency management easier. For instance, the Nebula dependency-lock plugin allows applications to resolve their complete binary dependency graph and produce a .lock file which can be versioned. The Nebula resolution rules plugin allows us to publish organization-wide dependency rules that impact all Nebula builds. These tools help make binary dependency management easier, but still fall short of reducing the pain to an acceptable level.
Another challenge we are working to address is bake time. It wasn’t long ago that 16-minutes from commit to deployment was a dream, but as other parts of the system have gotten faster, this now feels like an impediment to rapid innovation. From the Simian Army example deployment above, the bake process took 7 minutes or 44% of the total bake and deploy time. We have found the biggest drivers of bake time to be installing packages (including dependency resolution) and the AWS snapshot process itself.
Containers provide an interesting potential solution to the last two challenges and we are exploring how containers can help improve our current build, bake, and deploy experience. If we can provide a local container-based environment that closely mimics that of our cloud environments, we potentially reduce the amount of baking required during the development and test cycles, improving developer productivity and accelerating the overall development process. A container that can be deployed locally just as it would be in production without modification reduces cognitive load and allows our engineers to focus on solving problems and innovating rather than trying to determine if a bug is due to environmental differences.
You can expect future posts providing updates on how we are addressing these challenges. If these challenges sound exciting to you, come join the Engineering Tools team. You can check out our open jobs and apply today!
The journey to the cloud at Netflix began in 2008, and after seven years of diligent effort, we have finally completed…media.netflix.com
linguistic excellence, a great team environment, and cutting-edge technologymedium.com
Originally published at techblog.netflix.com on March 9, 2016.