On March 8th, 2011, I was fortunate to be able to deliver 10 minutes of the keynote address for the Cloud Connect conference in Santa Clara, California. Here are some of the points I made during the talk.
We started this cloud re-architecture effort in 2008 in the aftermath of an outage of our DVD shipping software in August of that year. An unfortunate confluence of events caused our systems to go down. We had singleton vertically scaled databases for both our website and the nascent Netflix streaming functionality. We knew those two systems were equally vulnerable. We had to re-architect for high availability and move to a service oriented architecture spread across redundant data centers.
In August of 2008, there were already web-based startups that were not building data centers because they were building in the cloud. Some of those startups will grow to be as big as Netflix, so Netflix gave serious consideration to building for the cloud during this re-architecture effort.
Why AWS (Amazon Web Services)?
Our definition of cloud is a public, shared, and multi-tenant cloud. AWS is the market leader and has been able to create a continuous and virtuous cycle. Large AWS customers demand (and receive) continuous improvements from AWS. Those improvements, in turn, attract more large customers and the cycle then repeats itself. Netflix has benefited nicely from jumping on and riding that virtuous cycle.
We went to the cloud looking for high availability. We found availability, and we are happy that we found a lot of new agility as well. Our software developers and our business gained that agility by eliminating a lot of complexity.
Essential vs. Accidental Complexity: No Silver Bullet
In 1986, Dr. Fred Brooks of the University of North Carolina at Chapel Hill wrote his famous paper entitled ‘No Silver Bullet’. The paper touches on many things, but the one most relevant to this post is the contrast Brooks draws between essential complexity and accidental complexity. Essential complexity is caused by the problem to be solved, and nothing can remove it. A vital example of essential complexity at Netflix is our personalized movie recommendation system. Accidental complexity relates to problems that we create on our own and which can be fixed. One example of accidental complexity that Brooks wrote about was coding large scale systems in assembly language because adequate high level languages were not yet viable. That accidental complexity had largely been retired by the time Brooks wrote the paper in 1986.
Accidental complexity is generational. Every new application domain repeats the cycle: an early phase of accidental complexity that is eventually retired. In the mid 1990’s I was writing code that parsed raw HTTP request headers. Everyone had to do that in order to write the early dynamic web applications that many of us worked on in those days.
Building and running data centers is the accidental complexity of the 2011 generation. If you are building a data center that hosts fewer than several tens of thousands of machines, then you are inviting complexity, centralized control, and process that you don’t need for your business. At Netflix, recurring issues of data center space, equipment upgrades, power and cooling fire drills, and data center moves were all accidental complexities that distracted us from software development on our essential complexities.
Running data centers also requires an accurate capacity forecast so that the equipment needed to add capacity is racked, stacked, and tested before it is needed. For Netflix, an accurate capacity forecast requires an accurate business forecast. Netflix’s good fortune has made this difficult. We started 2010 with just over 12 million subscribers and finished the year with over 20 million subscribers, far above what we predicted at the beginning of 2010. The newly added load put us at risk of running out of data center capacity. At the same time we were re-architecting for the cloud. We moved over 80% of our customer transactions, mostly for movie discovery and streaming, to the AWS cloud. The elasticity of the cloud enabled us to absorb that growth with little pain. The move to the cloud also allowed us to eliminate a lot of the centralized process required to run data centers.
Killing Process: Freedom and Responsibility
You may want to take a look at the Netflix Culture Deck, found at jobs.netflix.com. It talks about how we love killing process and a lot about our value of Freedom and Responsibility. Here are two relevant sentences from the culture deck:
- Our model is to increase employee freedom as we grow, rather than limit it.
- Responsible people thrive on freedom and are worthy of freedom.
Implementing Freedom and Responsibility in our service oriented cloud architecture means the following things:
- Each engineering team owns their own deployment. They push changes and re-architect when they need to without seeking widespread alignment and without a sign-off process.
- Software developers own capacity procurement. In the cloud, adding CPU and storage is a matter of simple API calls.
- We don’t have a single point of control over cloud spending. We’ve had a few bugs that consumed extra resources, but we also had those when we had a more centralized process for adding capacity to our data center.
Centralized process and control were needed in the past to help manage the complexity of operating our own data centers. We eliminated a lot of that complexity by moving to the cloud, and these three practices of operating in the cloud at Netflix have delivered tremendous new agility as our business and engineering teams continue to grow.
Availability and Agility
We moved to the cloud looking for availability. We have also found tremendous agility by eliminating complexity, process, and control. There was a steep learning curve and there were moments of doubt along the way, but the end result is that Netflix software developers now have a lot more freedom to innovate and evolve our architectures rapidly as the business continues its rapid growth. We continue to seek great talent to add to our engineering teams. I hope you’ll take a look at our open positions at jobs.netflix.com.
VP Engineering, Systems & ECommerce
"No Silver Bullet - Essence and Accident in Software Engineering", en.wikipedia.org
Originally published at techblog.netflix.com on March 8, 2011.