In my last post I talked about some of the reasons we chose AWS as our computing platform. We’re about one year into our transition to AWS from our own data centers. We’ve learned a lot so far, and I thought it might be helpful to share with you some of the mistakes we’ve made and some of the lessons we’ve learned.
1. Dorothy, you’re not in Kansas anymore.
If you’re used to designing and deploying applications in your own data centers, you need to be prepared to unlearn a lot of what you know. Seek to understand and embrace the differences of operating in a cloud environment.
Many examples come to mind, such as hardware reliability. In our own data centers, session-based memory management was a fine approach, because any single hardware instance failure was rare. Managing state in volatile memory was reasonable, because it was rare that we would have to migrate from one instance to another. I knew to expect higher rates of individual instance failure in AWS, but I hadn’t thought through some of these sorts of implications.
Another example: in the Netflix data centers, we have a high-capacity, super-fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We’ve had to be much more structured about “over the wire” interactions, even as we’ve transitioned to a more highly distributed architecture.
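To make the "chatty API" point concrete, here is a minimal sketch that counts round trips for a per-item call pattern versus a single batched call. The `RemoteTitleService` class, its methods, and the title data are hypothetical stand-ins invented for illustration, not a real Netflix or AWS interface.

```python
class RemoteTitleService:
    """Hypothetical remote service; each method call stands for one network round trip."""

    def __init__(self):
        self.round_trips = 0
        self._titles = {1: "Title One", 2: "Title Two", 3: "Title Three"}

    def get_title(self, title_id):
        # Chatty style: one round trip per item.
        self.round_trips += 1
        return self._titles[title_id]

    def get_titles(self, title_ids):
        # Batched style: one round trip for the whole set.
        self.round_trips += 1
        return {title_id: self._titles[title_id] for title_id in title_ids}


wanted = [1, 2, 3]

chatty = RemoteTitleService()
chatty_result = {title_id: chatty.get_title(title_id) for title_id in wanted}

batched = RemoteTitleService()
batched_result = batched.get_titles(wanted)

print("chatty round trips:", chatty.round_trips)
print("batched round trips:", batched.round_trips)
```

Both patterns return the same data, but when each round trip has variable latency, the chatty version pays that variance once per item while the batched version pays it once.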
2. Co-tenancy is hard.
When designing customer-facing software for a cloud environment, it is all about managing down the expected overall response latency. AWS is built around a model of sharing resources: hardware, network, storage, etc. Co-tenancy can introduce variance in throughput at any level of the stack. You’ve got to either be willing to abandon any specific subtask that takes too long, or manage your resources within AWS to avoid co-tenancy where you must.
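One way to picture "being willing to abandon a subtask" is a per-call time budget. The sketch below is an illustration under stated assumptions, not a description of how Netflix implements this: `call_with_budget`, the two example subtasks, and the 0.2 second budget are all made up for the example.

```python
import concurrent.futures
import time


def call_with_budget(pool, task, budget_seconds, default):
    """Run task on the pool; if it exceeds the budget, give up and return default."""
    future = pool.submit(task)
    try:
        return future.result(timeout=budget_seconds)
    except concurrent.futures.TimeoutError:
        # Best effort only: an already-running thread is not interrupted,
        # but the caller stops waiting for it.
        future.cancel()
        return default


def fast_subtask():
    return "fast result"


def slow_subtask():
    # Simulates a dependency slowed down by a noisy neighbor.
    time.sleep(0.6)
    return "slow result"


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    fast = call_with_budget(pool, fast_subtask, 0.2, None)
    slow = call_with_budget(pool, slow_subtask, 0.2, None)

print("fast:", fast)
print("slow:", slow)
```

The caller gets an answer within its budget either way. The abandoned work still runs to completion in the background here, so a real system would also need to bound how much abandoned work can pile up.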
Your best bet is to build your systems to expect and accommodate failure at any level, which introduces the next lesson.
3. The best way to avoid failure is to fail constantly.
We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
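The degrade-but-still-respond idea can be sketched as a simple fallback. Everything named here (`fetch_personalized`, `titles_for_home_page`, `RecommendationsUnavailable`, the placeholder titles, and the `service_up` flag used to simulate an outage) is hypothetical and only meant to illustrate the pattern.

```python
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


class RecommendationsUnavailable(Exception):
    """Raised when the (hypothetical) recommendations system cannot answer."""


def fetch_personalized(customer_id, service_up):
    # Stand-in for a remote call; service_up simulates an outage.
    if not service_up:
        raise RecommendationsUnavailable("recommendations system is down")
    return [f"Pick {n} for customer {customer_id}" for n in (1, 2, 3)]


def titles_for_home_page(customer_id, service_up=True):
    """Always return something to show, personalized if possible."""
    try:
        titles = fetch_personalized(customer_id, service_up)
        return {"titles": titles, "personalized": True}
    except RecommendationsUnavailable:
        # Degrade quality rather than fail the whole response.
        return {"titles": list(POPULAR_TITLES), "personalized": False}


print(titles_for_home_page(42, service_up=True))
print(titles_for_home_page(42, service_up=False))
```

The key design choice is that the fallback depends on nothing remote: the popular list is local data, so the degraded path cannot itself fail on the same outage.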
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most — in the event of an unexpected outage.
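As a toy illustration of the idea (not the actual Chaos Monkey, whose implementation is not described here), the sketch below randomly terminates simulated instances and then checks whether enough survive. The `Instance` class, `unleash_monkey`, `service_is_healthy`, the kill probability, and the health threshold are all assumptions made up for the example; no real AWS calls are involved.

```python
import random


class Instance:
    """Simulated server instance; stands in for a real cloud instance."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.alive = True

    def terminate(self):
        self.alive = False


def unleash_monkey(instances, rng, kill_probability=0.2):
    """Terminate each live instance independently with the given probability.

    Returns the ids of the instances that were killed in this pass.
    """
    killed = []
    for instance in instances:
        if instance.alive and rng.random() < kill_probability:
            instance.terminate()
            killed.append(instance.instance_id)
    return killed


def service_is_healthy(instances, min_alive):
    """The service counts as healthy if at least min_alive instances survive."""
    return sum(instance.alive for instance in instances) >= min_alive


fleet = [Instance(f"i-{n}") for n in range(10)]
killed = unleash_monkey(fleet, random.Random(), kill_probability=0.2)
print("killed:", killed)
print("healthy:", service_is_healthy(fleet, min_alive=5))
```

Running a pass like this on a schedule, rather than once, is what turns instance failure from a rare surprise into a routinely exercised code path.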
4. Learn with real scale, not toy models.
Before we committed ourselves to AWS, we spent time researching the platform and building test systems within it. We tried hard to simulate realistic traffic patterns against these research projects.
This was critical in helping us select AWS, but not as helpful as we expected in thinking through our architecture. Early in our production build-out, we built a simple repeater and started copying full customer request traffic to our AWS systems. That is what really taught us where our bottlenecks were, and some design choices that had seemed wise on the whiteboard turned out to be foolish at scale.
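A repeater of this kind can be pictured roughly as follows. This is a simplified sketch under assumptions, not the repeater we built: `make_repeater`, the handler functions, and the dict-shaped requests are hypothetical, and the shadow call is made synchronously here for brevity where a real repeater would most likely send it asynchronously.

```python
import copy


def make_repeater(primary, shadow, shadow_errors):
    """Wrap a primary handler so every request is also copied to a shadow system.

    Only the primary response is returned to the customer; shadow failures
    are recorded in shadow_errors and never surface to the caller.
    """

    def handle(request):
        response = primary(request)
        try:
            # Send an independent copy so the shadow cannot mutate the original.
            shadow(copy.deepcopy(request))
        except Exception as exc:
            shadow_errors.append((request, exc))
        return response

    return handle


def data_center_handler(request):
    return {"status": 200, "path": request["path"]}


def cloud_handler_under_test(request):
    # Simulates a bottleneck that only shows up for some real traffic.
    if request["path"] == "/search":
        raise RuntimeError("shadow system overloaded")
    return {"status": 200, "path": request["path"]}


errors = []
handle = make_repeater(data_center_handler, cloud_handler_under_test, errors)

for path in ("/home", "/search", "/play"):
    print(path, handle({"path": path}))
print("shadow failures:", len(errors))
```

The customer-facing path is unaffected by anything the shadow system does, which is what makes it safe to learn from real traffic before the new system is trusted to serve it.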
We continue to research new technologies within AWS, but today we’re doing it at full scale with real data. If we’re thinking about new NoSQL options, for example, we’ll pick a real data store and port it at full scale to the options we want to learn about.
5. Commit yourself.
When I look back at what the team has accomplished this year in our AWS migration, I’m truly amazed. But it didn’t always feel this good. AWS is only a few years old, and building at a high scale within it is a pioneering enterprise today. There were some dark days as we struggled with the sheer size of the task we’d taken on, and some of the differences between how AWS operates vs. our own data centers.
As you run into the hurdles, have the grit and the conviction to fight through them. Our CEO, Reed Hastings, has not only been fully on board with this migration, he is the person who motivated it! His commitment, and the commitment of the technology leaders across the company, helped us push through to success when we could have chosen to retreat instead.
AWS is a tremendous suite of services, getting better all the time, and some big technology companies are running successfully there today. You can too! We hope some of our mistakes and the lessons we’ve learned can help you do it well.
— John Ciancutti.
Originally published at techblog.netflix.com on December 16, 2010.