Defensive design

RDX
8 min read · May 27, 2018


We all love the possibility of doing something new. Greenfield projects, big rewrites, new frameworks/libraries, etc. But as scope creeps in, it slowly becomes the New Old Thing again. The cycle repeats.

I pick tools and approaches that fundamentally don’t allow me to make the same old mistakes. It is a form of Defensive design.

Here is a list of things that have aided me in that journey so far. None of them is perfect; there is something to love and hate in each. I’m not selling unicorns here. Rather the opposite: it’s the limitations that are the features.

Protocol buffers

Duh, how is this a defensive design?

Defense: Against fat models

You know, those fat mutable models with a deep nested structure, with scattered domain logic thrown in everywhere. This is where I’ve seen the ugly sides of OOP. I had a chance to rewrite certain parts of our application. One way was to start from fresh domain classes that start out slim. But there was no guarantee that someone wouldn’t add a domain logic method in there. All it takes is just one or two methods, and then it becomes the new norm. Everyone follows suit. Mocks for data classes slowly creep in because they also happen to contain domain logic. And voila, we have a new old thing again.

Modeling them as protobufs sealed their fate. I can only use them as data holders now, and they just cannot accumulate scattered domain logic again. The immutability of these objects is a further plus. There is no more possibility of adding custom domain methods to values, and then stubbing/mocking random stuff for testing.

Many languages offer a similar defense using data/case classes (as in Python, Scala, Kotlin, etc). But they don’t prevent you from adding methods or domain logic. This is where I get a stronger defense from the data-only, sealed and heavily introspectable nature of protobufs. So much of the old architectural style was killed by using protobuf that I started using them as a de-facto replacement for data classes everywhere — messages in background queues, NoSQL documents, even POJOs.
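As a sketch, here is what such a data-only schema looks like (message and field names are hypothetical, not from any real system):

```proto
syntax = "proto3";

package shop;

// A pure data holder: no methods, no domain logic can live here.
message Order {
  string id           = 1;
  repeated Item items = 2;
}

message Item {
  string sku   = 1;
  int64  cents = 2; // money as integer cents; proto has no decimal type
}
```

Whatever the generated language, the result is a sealed value type: you can read and write fields, but there is nowhere to hang a `calculateDiscount()` method.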

Defense: Against half-baked contract management

XSD and JSON-Schema also try to define schemas for objects. JavaScript, for example, is dynamically typed, which led JSON-Schema to describe types as a form of validation: property “x” should be of type “number”. The natural extension of that was that the “number” should be between “0 and 10”, and so on. This took JSON-Schema from schema definition to data validation, which is a totally different problem domain! Protobuf’s limitation-as-a-feature is that it only lets us do pure schema management, with no validation.
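For example, a small JSON-Schema fragment where the type declaration slides straight into range validation (property names illustrative):

```json
{
  "type": "object",
  "properties": {
    "x": {
      "type": "number",
      "minimum": 0,
      "maximum": 10
    }
  }
}
```

The `"type"` keyword is schema; `"minimum"`/`"maximum"` are already data validation, and the slope from there only gets steeper.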

Data validation has (1) a constraint definition problem that needs a Turing-complete language — dependent validations such as two dates should not be the same, two time windows should not overlap, a.b.c should be present if x.y.z is true, and so on. Trying to represent these as “configuration” is wrong. It’s code. It has (2) a data explosion problem — try validating whether a given day is a holiday or a weekend, or whether a zipcode is valid (not just the length, but whether the actual value is semantically correct). To do it right, you need to add all the known zipcodes/holidays in the country to the schema file. Lastly, it’s also (3) a network/IO problem: any place where a 3rd party system has to be called to complete the validation, e.g. whether a given entity ID or phone number exists, whether an association is present, etc.

JSON-Schema and XSD try to validate stuff, but they cannot cover it fully, because validation has computational/data/network complexity that just cannot be described as metadata. And this ends up in half-baked contracts. You assume that if your payload passes the schema, it’s a valid payload. It could be anything but!

While trying hard to solve validation, some fundamental stuff gets overlooked. Does anyone know the scale/precision of a number in JSON? How to properly represent a BigDecimal, Date, DateTime or UUID? How to do first-class artifact management? Namespacing, referencing, embedding from different runtime schemas, backwards compatibility, preserving unknown fields, …?

Using Protocol buffers limited me to just defining a schema. No half-baked validation logic. No serialization/deserialization issues. Guaranteed interoperable data types. It does one thing and does it well. These schemas are then easily shared everywhere, irrespective of protocol or stack.

Golang

Defense: Against over-engineering

There is much to love/hate about Golang, which we won’t get into here. Let’s just look at the defensive capabilities.

You cannot run into an Over-engineered Abstract Universe with fancy pants class names and onion skin layers that keep peeling and peeling. You won’t find 16 design patterns for every 32 lines. You won’t find enterprise fizzbuzz.

Golang just doesn’t give your software a way to become a legacy monster. People will have to work really hard to conceal code behind layers. If anything, it’s the opposite — code will be thrown right in your face.

It’s a very useful tool when rewriting systems, safe in the knowledge that the same old legacy cruft cannot hit you back. Of course, you’ll definitely have withdrawal symptoms for a long time. When the brain and workplace are tuned to be rewarded for doing things “the complicated way”, there will be a heavy dopamine crash when writing the first few plain for loops. I wish more languages took the Golang approach and added defensive limitations in new versions, rather than bloated features.

Rust

Defense: Towards programming correctness

Rust is a specialized form of defense for programming correctness. To be frank, I have not dabbled in Rust enough to be able to write much about it. But the fact that a huge swath of runtime errors is caught at compile time, and that your program almost always runs without issues in production if it compiles, is one of the most sophisticated forms of defense that I’ve ever come across.

Rust looks quite defenseless against over-engineering, and it’s easy to get lost in long type signatures and the possibility of abstract universes. But the community seems to be doing a great job so far, and the language is getting better and better with each release.

Message queues

Defense: Towards resilient tasks

We’re not going to talk about Event sourcing, Reactive programming and Stream processing. Let’s go straight to the core defense: The fact that you can do things with resilience built-in.

I’ve seen distributed cron jobs being processed resiliently, simply by having a domain-logic free “cron-as-a-service” where consumers create a schedule, and it merely publishes each cron trigger as a message on a queue. I’ve scaled up many instances of consumers with a guarantee of the cron message being processed by only one of them at a time, with retries in case of failures.

I have had huge success with converting all background tasks to be queue-driven, running lightweight APIs that receive requests and just self-post a message to themselves to reliably process it. In a previous application, every RESTful POST request internally sent a message to the application itself and responded with 202 Accepted. I’ve written some GitHub apps and integrations where, whenever I need to do anything using the GitHub API, I make the application send a message to itself: “Please label this issue as X”. And it will keep retrying until that happens. You can also control concurrency and guarantee staying within rate limits just by increasing/decreasing the number of workers to fit. I went on to model critical operations as a single “Job” resource — just send all operations to POST /jobs, with every job having full-fledged status monitoring and ownership, all modeled just as a message.

Overall, plain vanilla Message queues are a powerful reliability defense in general, without even getting into Event sourcing and Reactive programming.

Ephemeral Infrastructure

Defense: Against snowflake infrastructure

If you are stuck in a snowflake infrastructure company, all you need to do is just “ask for autoscaling”. A huge correction will automatically follow.

With elasticity, new instances can be booted up anytime, and the booted resource has to auto-provision itself within a few minutes. This automatically pushes the bar up for configuration management and infrastructure automation. Similarly, what goes up has to come down. This means you cannot do old-style “stop and drain traffic manually” with your operations team anymore. You have no choice but to implement graceful shutdown, connection draining and proper health checks.
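In Go, for instance, graceful shutdown with connection draining is built into the standard library. A minimal sketch (in production the shutdown would be triggered by SIGTERM from the scheduler, not a sleep):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// serveAndDrain starts an HTTP server, then shuts it down gracefully:
// Shutdown stops accepting new connections and waits for in-flight
// requests to finish before returning.
func serveAndDrain(addr string) error {
	srv := &http.Server{
		Addr: addr,
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}),
	}

	go srv.ListenAndServe() // error handling elided in this sketch

	// Stand-in for waiting on a SIGTERM signal.
	time.Sleep(100 * time.Millisecond)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return srv.Shutdown(ctx) // connection draining happens here
}

func main() {
	if err := serveAndDrain("127.0.0.1:0"); err != nil {
		panic(err)
	}
	fmt.Println("drained cleanly")
}
```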

Ephemeral infrastructure is a broad term. Immutable infrastructure (disk images or containers) is just one part of it, and takes things one step further. You’ll end up implementing even more practices like 12factor, rolling updates and removing silos.

By choosing ephemeral infrastructure, the engineering teams just don’t have an option to make the same old snowflake mistakes.

Chaos Engineering

Defense: Towards resilient infrastructure

At my present company, we constantly kill pods/nodes and trigger re-provisioning of resources. We use a customized form of Chaos engineering purely targeting kubernetes.

Having automated tools that trigger various types of infrastructure, network and application errors and failures is an effective early defense, even before deploying your first service.

Service meshes

Defense: Towards fault tolerant integrations

The world is a distributed system. Today it’s hard to write an application that’s completely standalone without leveraging a bunch of external APIs. Cloud resources, backing services, third party authentication, notifications and much more.

So far, I’ve made it a standard practice to envelop every call that goes out of my application with the following, irrespective of whether it’s needed, or whether it’s my own service or a third party on the other side: timeout, retry with backoff, circuit breaker, metrics per status, a log line with metadata, my own metadata (client, session, request identifiers, date/time, version), etc. Because you know Murphy’s law: if anything can go wrong, it will.

But the days of requiring a library to do all these for each call are slowly winding down, and these tasks are also becoming plug-and-play services in the form of sidecars or reverse proxies that you can pipe requests through.

In the context of defensive design, I’m slowly exploring service meshes such as istio/envoy, linkerd, kong, etc, using them for resilience and keeping the service discovery and configuration management parts aside.

Function-as-a-Service

Defense: Against bloated architecture

Warning: There are library/dependency management, business continuity (will my business continue to work in reduced mode after reaching artificial quotas/limits, or do I just shut store for rest of the month?), latency, caching, state management, vendor lock-in, lack of open standards, metrics, orchestration, standardized config management, unified communication gateways and a bunch of other open concerns with FaaS right now. This is an experimental item in this list.

The fact that I don’t have to worry about infrastructure at all is a huge defense. No 12factor or health checks needed. No complex infrastructure orchestration. Every service can keep up with major languages and environments as they evolve, without needing to be rebuilt. This gives operational teams a different level of continuous delivery.

You don’t have to wade through standard boilerplate, go through gazillion ways to configure and bootstrap things, and do an expensive redeploy to make small changes to domain logic. One function can do only one thing, which makes internal contributions and cross-team collaboration easier.

I’m still eagerly waiting for the next few years to see how/if this field evolves holistically. After two years, if we’re still only talking about idle billing, then so much for defense.

Conclusion

Those are some of my favorite Defensive design tools/approaches. I’d love to hear what yours are.


RDX

Simpleton, CI/CD Platformer at goeuro.com, ex-ThoughtWorker, Full stack engineer, Hack Java/Node/Frontend/Ruby/Docker/OSS, Practice XP/KISS/Lean