Netflix at AWS re:Invent 2015

Ever since AWS started the re:Invent conference, Netflix has participated every year. This year is no exception, and we’re planning to present at 8 different sessions. The topics span availability, engineering velocity, security, real-time analytics, big data, operations, cost management, and efficiency, all at web scale.

In the past, our sessions have received a lot of interest, so we wanted to share the schedule in advance, and provide a summary of the topics and how they might be relevant to you and your company. Please join us at re:Invent if you’re attending. We have linked the slides and videos to this same post below.

ISM301 — Engineering Global Operations in the Cloud
Wednesday, Oct 7, 11:00AM — Palazzo N
Josh Evans, Director of Operations Engineering

Abstract: Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, Operations Engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault-injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we’ll explore these disciplines in depth and how they integrate and create competitive advantages.

ISM309 — Efficient Innovation — High Velocity Cost Management at Netflix
Wednesday, Oct 7, 2:45PM — Palazzo C
Andrew Park, Manager FPNA

Abstract: At many high growth companies, staying at the bleeding edge of innovation and maintaining the highest level of availability often sideline financial efficiency goals. This problem is exacerbated in a micro-service environment where decentralized engineering teams can spin up thousands of instances at a moment’s notice, with no governing body tracking financial or operational budgets. But instead of allowing costs to spin out of control causing senior leaders to have a “knee-jerk” reaction to rein in costs, there are proactive and reactive initiatives one can pursue to replace high velocity cost with efficient innovation. Primarily, these initiatives revolve around developing a positive cost-conscious culture and assigning the responsibility of efficiency to the appropriate business owners.

At Netflix, our Finance and Operations Engineering teams bear the responsibility of ensuring that the rate of innovation is not only fast but also efficient. In this presentation, we’ll cover the building blocks of AWS cost management and discuss the best practices used at Netflix.

BDT318 — Netflix Keystone — How Netflix handles Data Streams up to 8 Million events per second
Wednesday, Oct 7, 2:45PM — San Polo 3501B
Peter Bakas, Director of Event and Data Pipelines

Abstract: In this talk, we will provide an overview of Keystone, Netflix’s new data pipeline. We will cover our migration from Suro to Keystone, including the reasons behind the transition and the challenge of achieving zero loss across the more than 400 billion events we process daily. We will discuss in detail how we deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events and 17 GB per second at peak.

DVO203 — A Day in the Life of a Netflix Engineer using 37% of the Internet
Wednesday, Oct 7, 4:15PM — Venetian H
Dave Hahn, Senior Systems Engineer & AWS Liaison

Abstract: Netflix is a large and ever-changing ecosystem made up of:

  • hundreds of production changes every hour
  • thousands of micro services
  • tens of thousands of instances
  • millions of concurrent customers
  • billions of metrics every minute

And I’m the guy with the pager.

An in-the-trenches look at what operating at Netflix scale in the cloud is really like. How Netflix views the velocity of innovation, expected failures, high availability, engineer responsibility, and obsessing over the quality of the customer experience. Why Freedom & Responsibility is key, why trust is required, and why chaos is your friend.

SPOT302 — Availability: The New Kind of Innovator’s Dilemma
Wednesday, Oct 7, 4:15PM — Marcello 4501B
Coburn Watson, Director of Reliability and Performance Engineering

Abstract: Successful companies, while focusing on their current customers’ needs, often fail to embrace disruptive technologies and business models. This phenomenon, known as the “Innovator’s Dilemma,” eventually leads to many companies’ downfall and is especially relevant in the fast-paced world of online services. In order to protect its leading position and grow its share of the highly competitive global digital streaming market, Netflix has to continuously increase the pace of innovation by constantly refining recommendation algorithms and adding new product features, while maintaining a high level of service uptime. The Netflix streaming platform consists of hundreds of microservices that are constantly evolving, and even the smallest production change may cause a cascading failure that can bring the entire service down. We face a new kind of Innovator’s Dilemma, where product changes may not only disrupt the business model but also cause production outages that deny customers service access. This talk will describe various architectural, operational and organizational changes adopted by Netflix in order to reconcile rapid innovation with service availability.

BDT207 — Real-Time Analytics In Service of Self-Healing Ecosystems
Wednesday, Oct 7, 4:15PM — Lido 3001B
Roy Rapoport, Manager of Insight Engineering
Chris Sanden, Senior Analytics Engineer

Abstract: Netflix strives to provide an amazing experience to each member. To accomplish this, we need to maintain very high availability across our systems. However, at a certain scale humans can no longer scale their ability to monitor the status of all systems, making it critical for us to build tools and platforms that can automatically monitor our production environments and make intelligent real-time operational decisions to remedy the problems they identify.

In this talk, we’ll discuss how Netflix uses data mining and machine learning techniques to automate decisions in real time with the goal of supporting operational availability, reliability, and consistency. We’ll review how we got to the current state, the lessons we learned, and the future of Real-Time Analytics at Netflix.

While Netflix’s scale is larger than that of most other companies, we believe the approaches and technologies we intend to discuss are highly relevant to other production environments. Audience members will come away with actionable ideas that can be implemented in, and will benefit, most other environments.

BDT303 — Running Spark and Presto in Netflix Big Data Platform
Thursday, Oct 8, 11:00AM — Palazzo F
Eva Tse, Director of Engineering — Big Data Platform
Daniel Weeks, Engineering Manager — Big Data Platform

Abstract: In this talk, we will discuss how Spark and Presto complement our big data platform stack, which started with Hadoop, and the use cases they address. We will also discuss how we run Spark and Presto on top of the EMR infrastructure: specifically, how we use S3 as our data warehouse (DW) and how we leverage EMR as a generic data processing cluster management framework.

SEC310 — Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud
Thursday, Oct 8, 11:00AM — Marcello 4501B
Jason Chan, Director of Cloud Security

Abstract: Developers and auditors are often at odds. The agile, fast-moving environments that developers enjoy will typically give auditors heartburn. The more controlled and stable environments that auditors prefer to demonstrate and maintain compliance are traditionally not friendly to developers or innovation. We’ll walk through how Netflix moved its PCI and SOX environments to the cloud and how we were able to leverage the benefits of the cloud and agile development to satisfy both auditors and developers. Topics covered will include shared responsibility, using compartmentalization and microservices for scope control, immutable infrastructure, and continuous security testing.

We also have a booth on the show floor where the speakers and other Netflix engineers will hold office hours. We hope you will join us for these talks and stop by our booth to say hello!

— by Ruslan Meshenberg and Josh Evans

Originally published on October 2, 2015.