My favorite sessions from AWS re:Invent 2018

A small curated list of some of my favorite sessions from re:Invent 2018, related to things I like: scalability, resiliency, chaos engineering, and (global) architecture.

Some of these talks are great for their content, some for their speaker, and some are just great. All offer interesting things to take home, think about, and learn from. As for the 1000 other sessions from re:Invent, no discrimination intended: I just can’t watch every talk :) So, if you think I missed a really great or important session, please do write a comment! Thanks.


Closing Loops & Opening Minds: How to Take Control of Systems, Big & Small (ARC337)

Whether it’s distributing configurations and customer settings, launching instances, or responding to surges in load, having a great control plane is key to the success of any system or service. Come hear about the techniques we use to build stable and scalable control planes at Amazon. We dive deep into the designs that power the most reliable systems at AWS. We share hard-earned operational lessons and explain academic control theory in easy-to-apply patterns and principles that are immediately useful in your own designs.


Chaos Engineering and Scalability at Audible.com (ARC308)

At Audible, we have invested in chaos engineering. In this session, we describe the experiment frameworks and some of the testing we’ve done on AWS, including using serverless technologies. We also discuss the scalability testing that we performed in order to gain full confidence in our entire system.


Globalizing Player Accounts at Riot Games While Maintaining Availability (ARC314)

The Player Accounts team at Riot Games needed to consolidate the player account infrastructure and provide a single, global accounts system for the League of Legends player base. To do this, they migrated hundreds of millions of player accounts into a consolidated, globally replicated composite database cluster in AWS. This provided higher fault tolerance and lower latency access to account data. In this talk, we discuss this effort to migrate eight disparate database clusters into AWS as a single composite database cluster replicated in four different AWS regions, provisioned with Terraform, and managed and operated with Ansible.


How AWS Minimizes the Blast Radius of Failures (ARC338)

At AWS, we obsess over operational excellence. We have a deep understanding of system availability, informed by over a decade of experience operating the cloud and our roots of operating Amazon.com for nearly a quarter-century. One thing we’ve learned is that failures come in many forms, some expected, and some unexpected. It’s vital to build from the ground up and embrace failure. A core consideration is how to minimize the “blast radius” of any failures. In this talk, we discuss a range of blast radius reduction design techniques that we employ, including cell-based architecture, shuffle-sharding, availability zone independence, and region isolation. We also discuss how blast radius reduction infuses our operational practices.
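
To make shuffle-sharding a bit more concrete, here is a minimal sketch of the general idea, not the implementation shown in the talk. Each customer is deterministically assigned a small subset of workers, so a failure triggered by one customer only touches that customer's shard. The function name, worker names, and shard size are all hypothetical.

```python
import hashlib
import math
import random
from typing import List


def shuffle_shard(customer_id: str, workers: List[str], shard_size: int) -> List[str]:
    """Pick a stable pseudo-random subset of workers for one customer."""
    # Derive the seed from a stable hash so the same customer always
    # maps to the same shard, across processes and restarts.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(workers, shard_size))


# Hypothetical fleet of 8 workers, with 2 workers per customer shard.
workers = [f"worker-{i}" for i in range(8)]
shard_a = shuffle_shard("customer-a", workers, 2)
shard_b = shuffle_shard("customer-b", workers, 2)

# Number of distinct 2-worker shards that can be drawn from 8 workers.
possible_shards = math.comb(len(workers), 2)

print(shard_a, shard_b, possible_shards)
```

With many possible shards, two customers rarely share exactly the same set of workers, which is what keeps the blast radius of a single bad workload small.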


How Intuit TurboTax Ran Entirely on AWS for 2017 Taxes (ARC307)

In this session, Intuit presents how they prepared TurboTax to take the production load, and how they gained the confidence to run their 2017 peak activity entirely on AWS. They discuss resiliency testing, game days, operational run books, working with AWS Support, and how each of these activities impacted their confidence in their reliability and availability.


Applying Principles of Chaos Engineering to Serverless (DVC305)

Chaos engineering focuses on improving system resilience through controlled experiments, exposing the inherent chaos and failure modes in our system before they manifest in production and impact users. However, many of the publicized tools and articles focus on killing Amazon EC2 instances.
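
Since there are no instances to kill in a serverless setup, one common approach is to inject failures or latency into the function itself. The following is a minimal sketch of that idea, assuming a Lambda-style `handler(event, context)` signature. The `inject_failure` decorator and its parameters are hypothetical and are not taken from the talk or from any library.

```python
import random
import time
from functools import wraps


def inject_failure(error_rate=0.0, delay_seconds=0.0, rng=random.random):
    """Wrap a handler so it sometimes fails or slows down (hypothetical helper)."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event, context):
            # Optional added latency, to simulate a slow dependency.
            if delay_seconds > 0:
                time.sleep(delay_seconds)
            # Fail a fraction of invocations, controlled by error_rate.
            if rng() < error_rate:
                raise RuntimeError("injected failure (chaos experiment)")
            return handler(event, context)
        return wrapper
    return decorator


# Injection is disabled here (error_rate=0.0); raise it for an experiment.
@inject_failure(error_rate=0.0)
def handler(event, context):
    return {"statusCode": 200, "body": "ok"}


response = handler({}, None)
print(response)
```

In a real experiment, the error rate and delay would typically come from configuration that can be changed, and quickly turned off, without redeploying the function.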


Netflix: Iterating on Stateful Services in the Cloud (DAT406)

While stateless services are suitable for many architectures, stateful services are also useful and sometimes overlooked. In this session, we hear from Netflix about the unique challenges of upgrading stateful services in the cloud, architectural advice to make iterating on stateful services easy, and concrete tools and infrastructure you can use on AWS to make upgrading easy.


Breaking Containers: Chaos Engineering for Modern Applications on AWS (CON310)

You may have heard of the buzzwords “chaos engineering” and “containers.” But what do they have to do with each other? In this session, we introduce chaos engineering and share a live demo of how to practice chaos engineering principles on AWS. We walk through chaos engineering practices, tools, and success metrics you can use to inject failures in order to make your systems more reliable.


-Adrian