… but there are many things I hold to be true that make it less so.

The lessons never stop in the world of distributed programming. As a system scales and evolves, the number of possible failure modes and scenarios grow proportionally.

There are many lessons that I have learned over the years of navigating the torrent waters of large-scale, service-oriented architectures. Much of the advice I offer and decisions I make on a daily basis are gut instincts based on such experiences. But the reasons are often not obvious, and can often seem counter-intuitive or limiting.

First, read Jeff Hodges’ article on distributed programming: http://bryanpendleton.blogspot.mx/2013/01/jeff-hodges-on-distributed-systems.html

Any advice below assumes a distributed system that: * Is service-oriented * Is spread across multiple regions * Strives for 99.9%+ reliability * Hosts user data * Has at least linear organic growth

Design for partial failure everywhere. In the world of computing, anything can and everything will fail at least once. If your iPhone has given you zero problems and your Mac has been solid since you opened it then consider yourself lucky. Yet there’s a very good reason the Genius Bar is almost always busy, those devices fail all the time. And your commodity hardware will too.

Servers, load-balancers, firewalls, NICs, disks, memory, racks, cabling, power that run your applications, you name it, are not infallible.

Make everything idempotent. Idempotence is essentially a mathematical property that means applying the same operation multiple times will not change the result. When making a request, it’s often the case that the caller does not know if a failed request simply was lost but still succeeded (perhaps partially) or failed altogether. Without idempotence, retrying would be impossible. Imagine the case where a machine in a cluster completely fails while processing a request, the client would be left with a disconnect yet still want a result, so it retries to another machine.

Maintain a Test Universe. This is critical. You want a silo’d universe that completely mimicks your production world. Developers must be able to work in a sandbox that does not touch production-critical data all while having the confidence that their code will work in production. Developing against production data is a great way to corrupt or lose user data, which is the worst possible thing you could do. Do this early, and make sure that every feature, service and application you build works in the developer sandbox, and ideally, on the same hardware that runs your application in production. That way, an application that is deployed into test can then be promoted to production much more easily and safely.

Capacity planning is hard. Do it anyway. And err on the side of gross overestimation. Ideally, we would all build for 10x and design for 100x. This effectively means, you’ve load-tested your system to 10x its expected peak capacity and you know that it can scale horizontally to 100x by simply adding more physical hardware.

At Square, this was very important when we built out the infrastructure for supporting all of Starbucks card-processing. We tested our primary datacenters up to 10x

If it can be stateless, it should be. This typically applies to frontends or middle-tier layer services. State here refers to storing some data in a data storage layer. Jeff Hodges’ mentions this too, and I fully support it. Frontends can apply to either services that render web pages or serve data requests. You always want to minimize the number of places that store state, because when it scales it will be much harder to maintain consistency of state across those instances.

Imagine you’re happily running instances of PostgreSQL on the West Coast, but are scaling into Europe and need trans-Atlantic redundancy so your users don’t need to suffer 200+ ms to scale to East Coast… Nope. Think again. Synchronous replication is just as slow. Asynchronous replication will work, but introduces eventual consistency. There are solutions, but no silver bullets. Reducing where state is stored is the easiest way to manage this complexity so you don’t need to solve it too many times.

Reconsider that expensive flash storage. Expensive and powerful flash storage (such as Fusion-io) can be amazingly addictive. It’s a silver bullet for not-easily-partitionable data storage services. But, there’s a ceiling, and approaching it can turn into a crisis for your organization. This is essentially “vertical scaling.” If your application is successful, you will approach the ceiling and be left with “code red” projects to find clever ways to increase your runway all while re-architecting critical systems to horizontally scale.

Deploy early and often. Don’t wait a week for a deploy. In active development environments, a single application may not change, but the frameworks and code around it sure will. A service owner cannot know when their service may stop working as a result of an indirect change outside of their application. Deploying often is the only way to combat adverse side effects, because each deploy reduces the problem space for debugging any new problem.

Don’t copy user data. At least not user data that you intend to be served back. By “user data” i mean “data created on behalf of the user taking some action,” such as filling out a form or paying for a cup of coffee. A good rule of thumb is to think about how easy it is for the user to completely delete their data, should they choose to, accurately and comprehensively. When you copy it around, this becomes very difficult. Imagine that they own this data, so it’s important to treat it with the respect it deserves. It’s best to keep a canonical source for serving, all other sources can and should be treated as semantical caches or replicas for analysis or non-visible purposes.

Above is just a few lessons I’ve learned over the years.