Pets and Cattle — Infrastructure for Software as a Service (SaaS)

Ever wonder how the big SaaS companies run their web infrastructure? How do they maintain the levels of reliability and security we’ve come to expect? Most of us take these things for granted when using email or a search engine, or when sharing updates across our social networks.

At ACL Services, we deal with customer data, so confidentiality, integrity, and availability are extremely important to us. We use as many best practices as possible to protect our customers’ intellectual property and to ensure it’s available whenever they need access.

In this blog post, we’ll introduce an important architectural trade-off we’ve made, affectionately known as the “Pets versus Cattle” decision. In essence, we treat our web application servers (Linux servers) as if they’re nameless farm animals that come and go on a regular basis, rather than loving household pets we grow and nurture over the years. With this approach, we’ve improved the quality of our solution, reduced the amount of manual labour, and enhanced our level of customer service.

They’re all Animals… What’s the Difference?

It should be obvious that we treat household pets differently from farm animals, but how does this relate to servers?

Pets — You name them and love them

In the traditional world of web infrastructure, Linux servers were lovingly handcrafted by the IT/Operations group, and manually configured with all the necessary packages. The software on these servers would be patched or upgraded from time to time, often in response to application needs. It was quite common for these servers to be up and running for years, possibly until the hardware failed.

The similarity to pets is that each Linux server has a name (such as fido.acl.com or kitty.acl.com), and servers are lovingly nurtured with software updates. Except for scheduled maintenance, the servers are kept running 24/7/365. If you install a package, or create a file on the disk, it’ll almost certainly still be there two years from now. For anybody with a computer at home, this is exactly what you’d expect to see.

Cattle — There are many of them, and you don’t notice when they come and go

In contrast, cattle don’t usually have names, you don’t get emotionally attached to them, and you might not even notice if they disappeared from the farm. From your perspective, they’re hard-working beasts that get the job done, but they’re easy to replace.

In the world of Linux servers, especially in cloud-centric environments, it’s very easy to create new servers and even easier to destroy them. They don’t have cute and fluffy names, but instead have auto-generated IDs. We don’t care if they fail, since it only takes a few minutes to recreate them. If you need more servers during a busy part of your day, just run a few scripts and you’ll have them. Once your customer traffic has died down, just terminate the extra servers. It’s as easy as that.

Well… it’s easy if you’re using cloud technology such as AWS (Amazon Web Services), but it’s far less practical if you maintain your own data centre of physical machines. The Pets versus Cattle discussion is mainly relevant in “Elastic Compute” environments such as Amazon EC2.

Cattle are the Future… At least for SaaS

At ACL, we believe that the Cattle approach is the right solution for us, and we’re no longer using the Pets approach except in countries where AWS isn’t available. There are numerous benefits in dynamically creating and destroying Linux servers:

Automation is mandatory — no more handcrafting

We’ve all had those experiences where servers were constructed years ago, and nobody really remembers what’s on them. Administrators periodically logged on to make configuration changes, but nobody remembers what those changes were. Then comes the day when the server fails… How do you reproduce the failed machine? Where do you even get started? Hopefully it wasn’t a critical part of your infrastructure, because it’s no longer usable.

With the Cattle approach, we’re forced to automate the entire provisioning of the server. Given a “stock” Linux server (say, CentOS Linux), our scripts must recreate the entire installation, and do so in only a few minutes. For bonus points, we even version-control the configuration so we know exactly what changed, why it changed, and when. Gone are our fears of not knowing how to rebuild our servers.
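
To make the idea of scripted provisioning a little more concrete, here is a minimal sketch in Python. It is not ACL’s actual tooling: the ServerSpec type, the build_provision_script function, and the package and service names are all hypothetical, and the generated commands simply assume a stock CentOS image with yum and systemctl available. The point is that the desired state of a server lives in a small piece of data that can be version-controlled, and the same input always produces the same script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Desired state for one class of server (hypothetical, for illustration)."""
    role: str
    packages: tuple = ()
    services: tuple = ()


def build_provision_script(spec: ServerSpec) -> list:
    """Turn a spec into an ordered list of shell commands for a stock CentOS image."""
    commands = ["yum -y update"]
    # Sort the names so the same spec always yields exactly the same script.
    for package in sorted(spec.packages):
        commands.append(f"yum -y install {package}")
    for service in sorted(spec.services):
        commands.append(f"systemctl enable --now {service}")
    return commands


# Assumed example: a web server that needs nginx and git.
web_spec = ServerSpec(role="web", packages=("nginx", "git"), services=("nginx",))
script = build_provision_script(web_spec)
for line in script:
    print(line)
```

Because the spec is plain data, a change to it shows up as a readable diff in version control, which is where the “what changed, why, and when” history would come from.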

Persistent data is backed up

With the Pets model of running servers, it was always tempting to store valuable data on the local disk, relying on nightly backups to keep the data safe. We’ve all had those scary moments when we realize our software was storing data in the wrong directory, and guess what… it was never backed up! A hard-drive failure could obliterate a year’s worth of accumulated data.

With the Cattle model, we create and destroy servers on a regular basis. If valuable data were stored on local disk, we’d find out very quickly, because it would vanish along with the server. No worries about losing customer data: we’d see the problem in our test environment.

Auto-scaling to address ever-changing traffic patterns

As with other SaaS companies, ACL receives web traffic that varies throughout the day. For example, our North American web servers experience higher loads between 6 a.m. and 5 p.m. PST, but significantly less outside those hours. There are also times when web traffic increases significantly as a result of marketing campaigns or webinars.

With the Pets model, we’d need to have enough Linux servers ready to handle the worst-case web traffic, which is simply a waste of capacity on an average day. With the Cattle model, we spin up new servers during the day and shut them down at night, significantly reducing our IT budget and therefore the cost we pass on to customers.
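
The arithmetic behind that saving can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the traffic figures, the capacity of 100 requests per second per server, the 25% headroom, and the function name servers_needed are invented, not measurements from our environment.

```python
import math


def servers_needed(requests_per_sec: float,
                   capacity_per_server: float,
                   min_servers: int = 2,
                   headroom: float = 0.25) -> int:
    """Return how many servers to run for a given load (illustrative only)."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Leave some spare capacity so a sudden burst does not overload the fleet.
    target = requests_per_sec * (1 + headroom)
    # Never drop below a minimum, so one failed server cannot take us offline.
    return max(min_servers, math.ceil(target / capacity_per_server))


# Hypothetical load in requests per second at a few hours of the day.
hourly_load = {3: 40, 9: 900, 13: 1200, 22: 150}
plan = {hour: servers_needed(load, capacity_per_server=100)
        for hour, load in hourly_load.items()}
print(plan)
```

In this made-up example the Pets model would have to keep the peak number of servers running around the clock, while the Cattle model only runs that many during the busiest hour.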

Recovery from Security Violations

Many security incidents involve an intruder gaining access to a server, then installing malicious software. This software may go undetected for weeks while covertly stealing user passwords, or using the server as a starting point for further attacks. With the Pets model, this software could be hard to detect, causing endless amounts of damage.

With the Cattle model, servers are destroyed and recreated frequently, often within 24 hours. If a server is compromised, it won’t stay that way for long.
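
As a rough sketch of how such a recycling rule might be expressed, the Python below picks out the servers that have reached a maximum age. The server IDs, the fleet dictionary, and the servers_to_replace function are hypothetical; in a real environment the launch times would come from the cloud provider rather than being hard-coded.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)


def servers_to_replace(launch_times: dict, now: datetime,
                       max_age: timedelta = MAX_AGE) -> list:
    """Return the IDs of servers that have been running for max_age or longer."""
    return sorted(server_id
                  for server_id, launched in launch_times.items()
                  if now - launched >= max_age)


# Made-up fleet: server ID mapped to launch time.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fleet = {
    "i-0a1": now - timedelta(hours=30),
    "i-0b2": now - timedelta(hours=3),
    "i-0c3": now - timedelta(hours=24),
}
stale = servers_to_replace(fleet, now)
print(stale)
```

Whatever an intruder installed on a stale server disappears when that server is terminated and a fresh one is built from the provisioning scripts.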

Disaster Recovery is practiced all the time

Like most SaaS companies, ACL practices DR (Disaster Recovery) on a regular basis, to ensure we can recover our full environment if our servers are somehow lost. In the past, it was painful to recreate our set-up, involving many days of manual work. However, with the Cattle model, it’s now trivial to create a new set of servers configured with the appropriate software. Type a few commands, go grab a coffee, and the work is done for you.

In fact, recreating the server environment is so simple, we do it every time we deploy a new revision of software, ensuring we start with a fresh and up-to-date server environment each time.

Blue-Green Deployment for Zero downtime

As we’ll discuss in a later blog post, ACL uses a technique known as “Blue-Green Deployment” to roll out new versions of software. This allows each new release to be fully tested and approved before it’s made available to customers.

No longer do we risk deploying buggy code that must be rolled back quickly before customers are negatively impacted.
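
For readers who haven’t met the idea before, here is a toy Python model of the general blue-green technique. It is not a description of ACL’s implementation: the BlueGreenRouter class and its method names are invented for illustration. Two environments exist side by side, only one receives customer traffic, and a new release goes to the idle one until it has been approved.

```python
class BlueGreenRouter:
    """Toy model: two environments, only one of which serves customers."""

    def __init__(self, live_version: str):
        self.environments = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that is not serving customers."""
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        """Install a new release where customers cannot see it yet."""
        self.environments[self.idle] = version

    def promote(self, checks_passed: bool) -> bool:
        """Switch traffic to the idle environment only if it was approved."""
        if not checks_passed or self.environments[self.idle] is None:
            return False
        self.live = self.idle
        return True

    def serving(self) -> str:
        """Version that customers currently see."""
        return self.environments[self.live]


router = BlueGreenRouter("v1")
router.deploy_to_idle("v2")
before = router.serving()           # customers are still on v1 during testing
router.promote(checks_passed=True)  # traffic switches in a single step
after = router.serving()
print(before, after)
```

In this model the previous release stays in place on the now-idle environment, so switching back is the same single step as switching forward.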

More detail about this technique will be given at a later time…

And the Disadvantage is…

The use of the Cattle approach has proven quite successful at ACL. However, there’s one obvious downside — it’s a lot more complicated than the Pets approach. Creating the automation necessary to dynamically provision new servers can be challenging from a software development perspective. It’s not just a matter of clicking buttons on the Amazon Web Services console, but instead requires in-depth knowledge of the AWS programmable APIs.

Wait… is that really a disadvantage? Perhaps not, since at ACL we love writing high-quality automation to get things done, and we strongly dislike the idea of manual and repetitive tasks.

More about these topics in a later blog post, but hopefully we’ve made it obvious that the Cattle approach is our preferred technique for deploying server infrastructure.