Enterprise Architecture == Legacy Architecture: Volume 1

This thing we call enterprise architecture is fatally flawed. Its flaws have been canonized, preventing it from evolving the way a technology should. Enterprise architecture is so broken and antiquated in today’s world that it will take me many blog rants to cover it sufficiently.

Today I would like to focus on disaster recovery (DR), aka reliability, HA, resiliency, or whatever the latest, hippest jargon is for having your stuff continue to run.

The unspoken truth, known to all good technologists, is that DR doesn’t work. Things always fail. And when things fail, they don’t fail the way you expected.

Not only does DR not work, it is also very expensive. I once worked on a project, deemed “business critical” by a financial institution, where nearly 90% of the cost went to DR and availability.

Therein lies the crux of the problem: infrastructure always fails. It is an inevitability.

How did we find ourselves calling inevitable situations “disasters”? Why do we put so much effort into the unattainable goal of creating faultless infrastructure?

I believe the cause is the terrible state of enterprise software development. Enterprise software tends to be a complex, stateful monster that assumes that the infrastructure it runs on is stable. I think this approach is of great financial benefit to large traditional tech companies, and that they have had a lot to do with promoting this line of thinking.

While several factors led to the status quo being overturned, the one that brought them all together was cloud computing. In the early 2000s, the cloud computing model did not fit very well into the way technology was handled by most large businesses. The first AWS instances were fairly under-powered and ephemeral: they did not come back on reboot. They did not support a number of features most corporate data centers require, like complex backup, DR, and network segregation. In the minds of most corporate tech types, they were not to be trusted or relied upon.

But… cloud instances were cheap and fast, and required no capital outlay. You could pay for your usage on a corporate card and ask for forgiveness later. You could get your servers in seconds, at a time when the normal lead time to get servers for new projects was measured in weeks or months. Physical servers required considerable engineering and the involvement of multiple IT teams, like facilities, networks, storage, backup, and security.

In the end, if something is cheap enough, people will work around its shortcomings and adjust their thinking. Economics trumps traditional thinking, and that is exactly what happened with cloud instances.

To cope with infrastructure that could not be relied upon, applications had to change. They had to become stateless, or close to it. Large, complex applications became sets of microservices that could scale horizontally and be accessed through a simple API. Services were distributed across providers and across geographies. Application designers stopped assuming that the infrastructure was stable and started handling failure themselves. Failure was the normal state, not the exception.
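To make that concrete, here is a minimal sketch of what “design for failure” looks like from the client side: a stateless call that treats any single replica dying as routine. The endpoints and the service are hypothetical stand-ins, not anything real; the point is the shape of the code.

```python
import random
import urllib.error
import urllib.request

# Hypothetical endpoints for the same stateless service, replicated
# across regions. None of these URLs is real; they stand in for
# whatever your service discovery hands you.
REPLICAS = [
    "https://us-east.example.com/v1/quote",
    "https://eu-west.example.com/v1/quote",
    "https://ap-south.example.com/v1/quote",
]

def fetch_quote(timeout=2.0):
    """Try replicas in random order, treating any single failure as routine."""
    for url in random.sample(REPLICAS, len(REPLICAS)):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            continue  # this replica is down; expected, move on to the next
    raise RuntimeError("all replicas unavailable")
```

The important part is the except clause: a dead replica is not an incident, it is a branch the code takes every day.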

When you take this to the extreme, you get Netflix’s Chaos Monkey. If you really want to keep your app available, you must accept failure as the normal state. And to engineer for failure, you need to force failure to become common.
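For the skeptical, the whole idea fits in a few lines. This is a toy illustration of the concept, not Netflix’s actual code; the instance inventory and the kill switch are simulated stand-ins for real cloud API calls.

```python
import random

# Toy illustration of the idea, not Netflix's actual code. The instance
# inventory and the kill switch below are simulated stand-ins for real
# cloud API calls.
instances = {"i-001", "i-002", "i-003", "i-004"}

def terminate_instance(instance_id):
    print(f"chaos: terminating {instance_id}")
    instances.discard(instance_id)

def unleash_monkey(probability=0.25):
    """On each scheduled run, maybe kill one random instance.

    Run continuously in production, this makes failure routine, so the
    recovery path gets exercised every day instead of only in a crisis.
    """
    if instances and random.random() < probability:
        terminate_instance(random.choice(sorted(instances)))

if __name__ == "__main__":
    for _ in range(10):  # pretend this loop is a cron schedule
        unleash_monkey()
```

If your app survives this running in production, you have availability. If it doesn’t, you have found out on a Tuesday afternoon instead of during a disaster.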

Why am I writing now about something that seems so well understood? Because I still see the remnants of legacy thinking about enterprise architecture when I talk to people. I hear the same assumptions from vendor-certified architects, pushing the same status quo. I see these bad ideas baked into RFPs for cloud computing projects. I see people wanting SAN-backed cloud storage. I’ve been asked questions about the RAID level of object storage. I see smart companies with good products adding bad features in order to “check all the boxes” in their responses to bad RFPs.

I write about this now because I desperately want us to move into the future, and I want the future not to suck.