Unicorn Rentals Reflection

Unicorn Rentals is a well-established Gameday experience offered by AWS. It follows a simulated day in the life of an engineer at a comically dysfunctional start-up, suffering on its cloud journey. It’s a great day that gives insight into some of the pitfalls of developing and operating an application at scale.

Tom Werner
Version 1
6 min read · Aug 3, 2023

--

A fellow Version 1 engineer and I took part recently, and they detailed their experiences in "An AWS GameDay experience. Version 1's AWS DevOps share their AWS."

What I want to explore here are some architectural takeaways from the day, especially in the context of the more 'traditional' re-platformed applications that many of our enterprise customers are trying to deploy to the cloud (unsurprisingly, not every customer is deploying the next big bespoke application).

One of AWS CTO Werner Vogels' most famous quotes is "Everything fails, all the time". It is a simple statement, but one that is easy to overlook and forget. The core of the Unicorn Rentals game day is propping up a poorly maintained, non-optimised application amid traffic spikes, network disruption and special surprises as your team progresses.

AWS' cloud operation is robust and uptime is excellent, but failures do happen. The London region overheated during the summer 2022 heatwave. Kinesis suffered a prolonged outage in us-east-1 in 2020 that rippled through dependent services. There are also failures you buy into: cheaper EC2 instances and general-purpose storage are burstable, and can degrade under production workloads without you realising. Given that everything will fail, my two takeaways are: know your application, and be prepared for the inevitable.

Know your application

Building a resilient application is hard. It requires a deep understanding of both the application and the platform that hosts it. Within the context of AWS, thankfully the platform is well documented and mature frameworks exist for architecting upon it. But you also need to understand your application at a low enough level to see how it can be effectively decomposed. If you can break a large product apart, critical areas can have additional HA techniques applied, and parts that are prone to failure can be contained within a smaller blast radius.

Decomposing an application and applying the right techniques to improve resilience relies on understanding some basic systems concepts, for example:

  • synchronous vs asynchronous interaction models: this affects whether requests can be queued, or can only be accelerated/cached (see the sketch after this list);
  • the trade-off between shared-access object storage and block storage: what are the performance implications?
  • the difference between horizontal and vertical scaling: applications generally favour one of the two, so how does this limit their scaling potential?
  • the concepts of coupling and monoliths: what components make up the application, and can parts be externalised for performance, scalability or cost reasons?
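
To make the first point concrete, here is a minimal sketch of decoupling a synchronous call into an asynchronous one using SQS. The queue URL and payload are hypothetical, and a worker elsewhere is assumed to consume the queue:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue, assumed to already exist; a separate worker drains it.
QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/rental-requests"

def handle_rental_request(request: dict) -> dict:
    # Instead of calling the downstream service synchronously and blocking
    # the caller, enqueue the work and acknowledge immediately.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(request),
    )
    return {"status": "accepted"}  # the caller polls or is notified later
```

A synchronous dependency can only be propped up by making it faster or caching its responses; a queued one can absorb a traffic spike and drain it at its own pace.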

Refactor your architecture for cloud

Once you understand these principles you can adapt your deployment for the cloud, breaking it into smaller components and using AWS services to take on the roles of compute, database, storage and networking, as appropriate. Many of the COTS products that I see customers trying to migrate to the cloud follow a pretty predictable pattern:

  • A monolithic application and web server (likely Java)
  • A relational database, e.g. MySQL or PostgreSQL
  • Objects stored on the application server’s disk (or a separate drive if you’re lucky)

How would you adapt this for the cloud? Everything fails, right? At the beginning of the day, the Unicorn Rentals stack certainly hasn't been adapted for the cloud: it's just a single application running on EC2.

First, let’s get our data off the machine that is likely to buckle under load, or is a prime candidate for accidental deletion.

  • Our objects can go to S3 (if supported) or EFS. Both offer excellent durability and good vendor support (see the sketch after this list).
  • Move the locally managed database instance to Amazon RDS. This is again a no-brainer: RDS is my go-to, set-and-forget solution for databases, and one I recommend to customers large and small. If you have very high demands on your database (durability, latency or scalability), consider Amazon Aurora; the technology behind it is ground-breaking.
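
As a minimal sketch of the first bullet, assuming the application can be pointed at S3 and using a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; in practice this would come from application config.
BUCKET = "unicorn-rentals-assets"

def store_object(local_path: str, key: str) -> None:
    # Write uploads to S3 instead of the application server's disk,
    # so losing the instance no longer means losing the data.
    s3.upload_file(local_path, BUCKET, key)

store_object("/tmp/unicorn.png", "images/unicorn.png")
```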

Now the data is safe, how can we make the application handle peaks in demand? This is where things can get tricky. If your COTS vendor is kind, they may have containerised the software for you already. If so, it's time to pick a container platform; the industry trend is towards container deployments, so this is likely to be a low-friction, well-maintained choice. If not, you'll likely be building a container in sheep's clothing, otherwise known as a baked AMI deployment (sketched below). Having an AMI or container to deploy is the foundation of deploying your application in a scalable, immutable, agile manner.
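
A minimal sketch of baking an AMI from an already-configured instance with boto3; the instance ID and image name are hypothetical, and real pipelines typically use tooling such as Packer or EC2 Image Builder rather than hand-rolled calls:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical instance that has been configured with the application.
response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="unicorn-rentals-v42",
    Description="Baked image with the application pre-installed",
)

# Wait until the image is ready to be referenced from a launch template.
ec2.get_waiter("image_available").wait(ImageIds=[response["ImageId"]])
print("AMI ready:", response["ImageId"])
```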

Why? Whether you're using EKS or ECS for containers, or an Auto Scaling group for your baked AMI, you're building your application in a way that at best can scale horizontally with demand, and at the very least is less brittle to redeploy.
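
For the baked-AMI route, here is a sketch of putting that image behind an Auto Scaling group; the AMI ID, subnet IDs and group sizes are all hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# Launch template referencing the baked AMI (hypothetical ID).
ec2.create_launch_template(
    LaunchTemplateName="unicorn-rentals",
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",
        "InstanceType": "t3.small",
    },
)

# Auto Scaling group spread across two subnets (hypothetical IDs),
# so losing one instance or one AZ is survivable.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="unicorn-rentals",
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    LaunchTemplate={"LaunchTemplateName": "unicorn-rentals", "Version": "$Latest"},
    VPCZoneIdentifier="subnet-0aaa0000,subnet-0bbb1111",
)
```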

What do I mean by brittle to deploy? The industry is still moving away from snowflake deployments that strike fear into all. A brittle deployment is typically manual, unpredictable and slow to adapt when an urgent fix is required. Having a robust deployment method for your application will pay you back many times over. New software release? Bug found in the current deployment? Installation corrupt? Terminate the deployment and let it come back, attaching itself to externalised data stores in S3, EFS or another EBS volume. This is the key benefit of adopting cloud-native techniques: fast, hands-off, reliable deployments should be the goal.
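
With the Auto Scaling group sketched above, "terminate it and let it come back" can be as simple as an instance refresh; a sketch, reusing the hypothetical group name:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Roll every instance in the group, replacing each with a fresh copy
# launched from the current launch template version.
autoscaling.start_instance_refresh(
    AutoScalingGroupName="unicorn-rentals",
    Preferences={"MinHealthyPercentage": 50},
)
```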

If your service supports HA (and you've paid for it!), your next stepping stone is horizontal scaling in line with demand. By following the industry trend towards immutable deployments, you can leverage the autoscaling capabilities built into Kubernetes, ECS or EC2 Auto Scaling.
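
With EC2 Auto Scaling, a target-tracking policy is a low-effort way to do this; a sketch against the hypothetical group above, with the 50% CPU target as an illustrative value:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU across the group at roughly 50%, adding or removing
# instances automatically as demand changes.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="unicorn-rentals",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```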

By breaking an application apart you have isolated each piece, allowing each to be treated as a separate failure domain. Depending on the component, the blast radius can then be reduced. Will Larson has a great write-up on how to [model failure domains](https://lethain.com/fault-domains/).

Who, what, when?

You have a scalable, robust deployment, but are you prepared for when things don't go to plan? Would you even know things had gone wrong before your users told you?

The next key part is observability and planning for failure. Too many people think Prometheus and Grafana when they hear monitoring, but AWS has CloudWatch ready to go, with very little friction to get meaningful data. It is my go-to, especially on day one of operating an environment.

A dashboard with the basics of CPU, memory, burst balances, disk and network should be your first port of call. During the Unicorn Rentals game day, it is very easy to start reacting to the scenario without even knowing where the issues may be hiding. Observability is king: without it you are guessing in the dark, blindfolded.
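
A sketch of bootstrapping such a dashboard with boto3; the dashboard name, region and Auto Scaling group are the hypothetical ones used above, and a real dashboard would add memory, burst-balance, disk and network widgets alongside CPU:

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

# One CPU widget for the hypothetical Auto Scaling group; further
# widgets for memory, burst balance, disk and network follow the
# same structure.
dashboard = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "ASG average CPU",
                "metrics": [
                    ["AWS/EC2", "CPUUtilization",
                     "AutoScalingGroupName", "unicorn-rentals"]
                ],
                "stat": "Average",
                "period": 300,
                "region": "eu-west-2",
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="unicorn-rentals-basics",
    DashboardBody=json.dumps(dashboard),
)
```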

Metrics are one part, but logs are the other piece of the puzzle. Again, CloudWatch Logs is ready to go out of the box: with minimal configuration, the CloudWatch agent will aggregate your logs, which can then be searched easily with CloudWatch Logs Insights.
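
A sketch of running a Logs Insights query via boto3, searching the last hour for errors; the log group name is hypothetical:

```python
import time
import boto3

logs = boto3.client("logs")

# Search the last hour of application logs for errors
# (hypothetical log group name).
query = logs.start_query(
    logGroupName="/unicorn-rentals/application",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| sort @timestamp desc | limit 20"
    ),
)

# Poll until the query completes, then print the matching lines.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print(row)
```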

Red Alert!

How do you plan for failure? It may be meta, but you should run a game day with your own team. Gain insight from Development, Operations, Architecture, Management and Service Management about how they see what's been built. Unicorn Rentals is a well-designed simulation; the room had representatives from every department, and everyone left with takeaways from the day.

Running your own game day? It isn't easy, and will likely take several iterations to run effectively. What could you use as the theme for a game day?

  • Consider what-if scenarios with a tabletop exercise,
  • For those maturing their adoption of AWS, look into AWS Fault Injection Simulator (https://aws.amazon.com/fis/). Want to know what happens if you lose an EC2 instance, or a disk slows to a snail's pace? This will give you the answer (see the sketch after this list),
  • You may have automated deployments, but what needs to happen in service management to expedite a change?
  • Practice for real: break your staging environment and treat it as an incident. How effectively can you restore it?
  • How quickly and reliably can you test fixes and deploy them to each environment?
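
A sketch of kicking off a pre-built FIS experiment with boto3; the experiment template (say, one that terminates a random instance in the Auto Scaling group) is assumed to already exist, and its ID is hypothetical:

```python
import boto3

fis = boto3.client("fis")

# Start a pre-defined experiment (hypothetical template ID), for example
# one that terminates a random instance in the Auto Scaling group.
experiment = fis.start_experiment(
    experimentTemplateId="EXT123AbCdEfGhIj",
)
print("Experiment started:", experiment["experiment"]["id"])
```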

My takeaways were likely different from those of participants from other disciplines within our product team. But the most important takeaway for anyone? Sometimes the only way to find your blind spots is to step outside your usual context. Remind yourself how stressful a live incident is, or see why shipping a 'small change' isn't always easy. It will make you, and your product, better.

About the Author:
Tom is an AWS Architect from the Version 1 AWS Practice, working with clients on their cloud journeys.

Version 1 has an ever-growing AWS Practice, is an AWS Premier Partner and recipient of AWS Migration Partner of the Year 2022.
