By Paul Campbell, founder of Tito.
Tito is a web application for selling tickets online that focuses strongly on user experience for both event organizers and their attendees. The core of a great user experience is trust. For a web application, that starts with reliability. In this blog post, I’ll explain how we’ve built a zero-downtime infrastructure to maximize reliability and customer trust.
The Nightmare Scenario
Last autumn, I was visiting my in-laws when I got a notification that Tito, the online-ticket software service that I run, was down. The VM that the load balancer was running on had failed, and all requests to the site, the app, and the service were failing.
My face reddened, and I excused myself from my wife’s family. I panicked. I started screaming expletives. I stayed up all night rebuilding servers and configuring everything from scratch. All weekend I was miserable from lack of sleep and blood-pressure-related stress. I cursed every member of the DevOps team that I would fire on Monday morning.
Relax, It’s Not Real
Actually, I don’t have a DevOps team: it’s just me. And as a matter of fact, I didn’t panic, and I didn’t worry at all. I just switched the DNS to point incoming traffic to the hot failover load balancer, and five minutes later, it was as though nothing had ever happened.
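The failover step itself is just a DNS change. As a rough sketch (not our actual configuration; the hosted zone ID, record name, and IP address below are placeholders), repointing a record at the standby load balancer with the AWS CLI looks something like this:

```shell
# Hypothetical sketch: repoint an A record at the standby load balancer.
# Zone ID, record name, and IP are placeholders, not real values.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "example.com.",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.20"}]
      }
    }]
  }'
```

A short TTL on the record is what makes the "five minutes later" recovery possible: clients pick up the new address quickly instead of caching the dead one.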
The Tito web service is a bootstrapper’s nightmare: Tito requires a flexible feature set to handle the complex set of activities involved in organizing an event. It needs to be highly available because it powers millions of dollars’ worth of ticket sales for our customers each month. And when an unexpected surge of traffic arrives for an event with in-demand tickets, Tito needs to absorb the load as though the spike were part of a normal day’s work.
Flexibility, Availability, Scalability
As an early-stage bootstrapped company, we don’t have a lot of money to burn on systems administration. However, we need to be able to trust that our application will handle the demands of our customers.
So, three things are crucial to Tito’s technology stack:
- Flexible feature set
- High availability
- Fast scalability with little to no notice
The app itself is built using Ruby on Rails, backed by a MySQL database. Low-level app caching is done using a Redis database, and front-end caching uses a Memcached cluster. Worker servers handle tasks such as order processing, export generation, and sending emails.
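To make the caching layers concrete, here is a minimal sketch of how a Rails app might wire them up. This is an illustration, not our actual configuration; the endpoint hostnames are placeholders.

```ruby
# Hypothetical sketch of the two cache layers described above.
# Endpoint hostnames are placeholders, not Tito's real config.

# config/environments/production.rb -- front-end/fragment caching via Memcached
config.cache_store = :mem_cache_store, "my-cluster.cfg.use1.cache.amazonaws.com:11211"

# config/initializers/redis.rb -- low-level app caching via Redis
REDIS = Redis.new(host: "my-redis.abc123.use1.cache.amazonaws.com", port: 6379)
```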
It’s a fairly classic stack, and it gives us a platform on which we can build those flexible features that our customers love.
Optimize for Trust
We experimented with a variety of setups over the first few years of building the product. Even though Tito is still in a very early stage, when I went to build a new stack last year I had one goal: to build and maintain customer trust (and selfishly, to maximize my ability to sleep at night).
Optimizing for trust means one thing: planning for failure. My ultimate goal for the stack was to apply a simple rule: no service can be running that doesn’t have another service running that can take over in the event of a failure. Not the database, not the load balancers, not the app, not the workers, not the caches.
In the past, building a highly available stack like this would have been so challenging that it would have been a full-time job in itself. Thankfully, with the launch of Multi-AZ auto-failover support for Redis last October, the last piece of the puzzle fell into place.
The “Trust” Stack on AWS
Today, every piece of the Tito stack runs at least two instances:
- MySQL database, running on Amazon RDS with a Multi-AZ auto hot-failover
- Amazon ElastiCache Redis and Memcached, both running Multi-AZ with auto hot-failover
- Two HAProxy load balancer servers, running Multi-AZ
- Two app servers running Multi-AZ at all times, with additional servers added as demand increases during the week
- Two worker servers running Multi-AZ at all times, with additional servers added as demand increases
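As an illustration of the load balancer layer, a minimal HAProxy configuration for two app servers with health checks might look like the following. This is a sketch with placeholder addresses, not our production config:

```
# Hypothetical haproxy.cfg fragment: two Multi-AZ app servers behind one
# frontend, with health checks so a failed instance is dropped automatically.
frontend www
    bind *:80
    default_backend app

backend app
    balance roundrobin
    option httpchk GET /health
    server app1 10.0.1.10:80 check
    server app2 10.0.2.10:80 check
```

With `check` enabled on each server line, HAProxy stops routing to an instance that fails its health check, which is what lets us add and remove app servers as demand changes.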
We orchestrate all of this through AWS OpsWorks, and we use Amazon SES to send thousands of emails every day.
This setup has helped us to achieve 99.995% uptime in the last three months, process over $10 million in ticket sales for our customers, and establish our business as a reliable, trustworthy service.
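To put that uptime figure in perspective, 99.995% over three months is a downtime budget of only a few minutes. A quick back-of-the-envelope calculation:

```ruby
# Back-of-the-envelope downtime budget implied by an uptime percentage.
def downtime_budget_minutes(uptime_pct, days)
  total_minutes = days * 24 * 60
  total_minutes * (1.0 - uptime_pct / 100.0)
end

# 99.995% over roughly three months (90 days) leaves about 6.5 minutes.
puts downtime_budget_minutes(99.995, 90).round(2)
```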
Flexible features, high availability, and fast scalability. Using out-of-the-box AWS features, we’ve been able to achieve all three in a very meaningful way. But the real story is just how straightforward AWS has made it to get here.
Just Check the Box
RDS provides high availability for MySQL as a single-checkbox opt-in, with automatic hot failover. ElastiCache supports a single-checkbox opt-in for multi-node Memcached, with auto-discovery requiring only an app restart in the event of catastrophic failure. ElastiCache for Redis, like RDS, supports a single-checkbox opt-in for Multi-AZ caching with hot failover.
Selecting three check boxes got us a full, highly available data store and cache. Not only that, but RDS allows us to upgrade to a higher-capacity database with only about 60 seconds of downtime.
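For readers who prefer the command line, the rough CLI equivalents of those check boxes look something like the following sketch. The identifiers, instance classes, and credentials are placeholders:

```shell
# Hypothetical CLI equivalents of the console check boxes; all values
# below are placeholders, not Tito's real settings.

# Multi-AZ MySQL on RDS (the console's single "Multi-AZ deployment" check box):
aws rds create-db-instance \
  --db-instance-identifier my-db \
  --engine mysql \
  --db-instance-class db.m3.large \
  --allocated-storage 100 \
  --master-username admin \
  --master-user-password 'CHANGE_ME' \
  --multi-az

# Moving to a higher-capacity instance class later (a brief failover,
# not a rebuild from scratch):
aws rds modify-db-instance \
  --db-instance-identifier my-db \
  --db-instance-class db.m3.xlarge \
  --apply-immediately
```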
At the app level, once a cluster is configured on OpsWorks, we can start instances and complete a high-availability stack just by clicking the Start Instance button in the console. If I need to schedule a fleet of new instances for a known spike in traffic, I can simply create “timed instances” that spin up automatically during the hours that I check off in the schedule.
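Those console schedules have CLI equivalents as well. A hedged sketch, with placeholder stack, layer, and instance IDs:

```shell
# Hypothetical sketch of scheduling a "timed instance" in OpsWorks.
# STACK_ID, LAYER_ID, and INSTANCE_ID are placeholders.

# Create a time-based instance that OpsWorks starts and stops on a schedule:
aws opsworks create-instance \
  --stack-id STACK_ID \
  --layer-ids LAYER_ID \
  --instance-type m3.medium \
  --auto-scaling-type timer

# Run the instance during a known traffic spike (e.g. Monday 9:00-12:00 UTC):
aws opsworks set-time-based-auto-scaling \
  --instance-id INSTANCE_ID \
  --auto-scaling-schedule '{"monday": {"9": "on", "10": "on", "11": "on"}}'
```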
The Dream Scenario
And that’s more or less that. Tito today is an app with a flexible feature set, facilitating millions of dollars of transactions every month for a happy customer base. It’s backed by a solid, highly available infrastructure that can handle heavy loads with no notice, and even bigger loads with relatively short notice. All of this orchestration and infrastructure is achieved mainly through a series of opt-in check boxes and controls in a simple web console.
I promise you, I’ve been sleeping very well at night.