Tactical Observability

Aubrey Stearn
7 min read · Jul 1, 2022


An image of Gizmo from the movie Gremlins, with the title text changed from "Gremlins" to "Environments".

Did you make the mistake of feeding your environments after midnight? Don’t worry, you’re not the only one. The only problem is that the environments have taken over and are holding your agility hostage!

Don’t despair: I’m going to tell you how to rapidly get some visibility into those environments with minimal effort (even if you didn’t build them, or can’t rebuild them anymore). Think of these as your own set of replicating good gremlins.

How did we get here?

Statement 1: We have customers.

Statement 2: Change is a risk and might affect our customers.

Statement 3: We shall de-risk change by creating an environment.

That’s probably how it went, maybe add an architect or two and some erroneous statements about regulation that no one can really put their finger on.

We just set the nature of the game!

By creating environments, we’ve set the rules of the game. Here’s what tends to manifest, in some form or other, when you choose this path.

You have two tiers of operation: production & anything that isn’t production.

You can operate at a lower standard in any environment that isn’t production; we’re now investing in dual-class behaviour.

We need a ton of testing to get into production, but not to get into non-production (or we hold it to a lower standard).

You have no choice but to create some massive monster of an E2E test suite that has its own magic team to keep it alive, further decoupling ownership & vital feedback from engineers.

Of course, I’m massively generalising, but it takes some serious grit to maintain an environment that mirrors production and is treated the same way. And more importantly, that doesn’t stand in the way of value creation and realisation.

The road not taken

The other choice was to go straight to production! OMG SCARY FACE EMOJI!

Let’s look at the rules of the game:

The risk is clear: we push code and we could hurt our users. The rules say we have to push code straight to production. This gives us four places to innovate: our codebase, our local machine, our CI/CD pipeline and the production environment itself.

I love this path because we’re 100% focused on value creation, and the path to value realisation, rather than duplicating our efforts. Every environment is a mouth to feed, one that will need a little nibble of the value as it moves through it.

On the road not taken, we would have used super high test coverage and decoupled capabilities to maintain agility between teams (I’ll talk more about this in my next article).

We’d have employed synthetics to test code before real customers did, it would have been magnificent.

Problem-solving is constrained to the value creation and realisation path, and not distributed and shared among other citizens.

Highlander gif reminding people there can be only one!

STFU Aubrey! How can you help with all my environments?

Tactical observability! Let me explain what this is.

We’re going to write a bit of code to check something; it could be as simple as working or not working, or it could be more complex. Then we’re going to make that code agnostic to the environment it will run in, use Terraform to craft its JSON payloads, and pop it on a CloudWatch schedule. Then we’re going to pop all of this into a monorepo, so anyone can build once and deploy to any environment.

The longer version

It’s as simple as prioritising the data points that would make a difference to environmental stability and democratising them for consumption by folks who can use them to do good.

So let’s say you have some 3rd party APIs in your environment, and you’ve got a bunch of AWS EC2 instances. Some folks will instrument those EC2 instances; cool, but it’s only going to help us in certain scenarios, and in others it will cost us money to grab useless data.

On the other hand, for those 3rd party APIs, let’s hypothesise that one of them is transiently failing, but we don’t know which one. This is where tactical observability comes in.

The first thing we want to do is test the API, so we write a lambda, maybe in Node.js & TypeScript, to hit that endpoint the way production does.

Diagram depicting AWS CloudWatch Events invoking a lambda with a JSON payload, the lambda calling a 3rd party service and pushing a success or failure value into AWS CloudWatch custom metrics, which in turn is consumed by a custom dashboard.

We won’t hard-code any information about the URL of the service; let’s presume it changes per environment. What we will do is build a lambda that is invoked with a JSON payload containing the URL or the credentials (better to keep these in Secrets Manager if you can and pass the correct ARN), and that records the success of the call as a 1 or 0 (success or failure) custom metric in CloudWatch.
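As a minimal sketch, assuming the AWS SDK v2 that ships with the Node 14 runtime, that lambda might look something like this (the namespace, metric name and payload shape are all illustrative, not a fixed contract):

```typescript
// Hedged sketch of the synthetic check lambda. The namespace, metric name and
// payload shape are illustrative; your Terraform decides what actually gets sent.
import { CloudWatch } from 'aws-sdk';
import * as https from 'https';

interface CheckEvent {
  url: string; // injected per environment via the event rule's JSON payload
}

const cloudwatch = new CloudWatch();

// Hit the endpoint and resolve true/false rather than throwing.
const probe = (url: string): Promise<boolean> =>
  new Promise((resolve) => {
    const req = https.get(url, (res) => {
      res.resume(); // drain the body, we only care about the status code
      resolve((res.statusCode ?? 500) < 400);
    });
    req.on('error', () => resolve(false));
    req.setTimeout(5000, () => {
      req.destroy();
      resolve(false);
    });
  });

export const handler = async (event: CheckEvent): Promise<void> => {
  const ok = await probe(event.url);

  // A 1 or 0 lands in CloudWatch as a custom metric we can graph and alarm on.
  await cloudwatch
    .putMetricData({
      Namespace: 'TacticalObservability',
      MetricData: [{ MetricName: 'ThirdPartyApiUp', Value: ok ? 1 : 0, Unit: 'Count' }],
    })
    .promise();
};
```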

This represents our Atomic Unit, time for some composition.

The next thing we’re going to build is a Terraform module; that module will need to provision these resources:

aws_lambda_function to represent our zipped lambda.

aws_lambda_alias to create a named alias (the instance we’ll invoke) of our lambda.

aws_iam_role to create the lambda’s execution role, with permission to push custom metrics into CloudWatch.

aws_lambda_permission to allow the CloudWatch event rule to invoke our lambda.

Plus an aws_cloudwatch_event_rule and aws_cloudwatch_event_target pointing at our lambda, with a schedule expression saying run every minute and a JSON payload we’re yet to define.

This is our first unit of composition
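As a hedged sketch, assuming the lambda is a web-packed single file that we zip up at plan time, the module might look something like this (every name, path and variable here is illustrative):

```hcl
variable "name" { type = string }
variable "environment" { type = string }
variable "dist_file" { type = string } # the web-packed single-file lambda build
variable "target_url" { type = string }

# Zip the committed dist so a change in it cascades through the plan.
data "archive_file" "lambda" {
  type        = "zip"
  source_file = var.dist_file
  output_path = "${path.module}/${var.name}.zip"
}

data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "synthetic" {
  name               = "${var.name}-role"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}

# Just enough permission to push custom metrics (logging permissions omitted).
resource "aws_iam_role_policy" "put_metrics" {
  role = aws_iam_role.synthetic.id
  policy = jsonencode({
    Version   = "2012-10-17"
    Statement = [{ Effect = "Allow", Action = "cloudwatch:PutMetricData", Resource = "*" }]
  })
}

resource "aws_lambda_function" "synthetic" {
  function_name    = var.name
  role             = aws_iam_role.synthetic.arn
  handler          = "index.handler"
  runtime          = "nodejs14.x"
  filename         = data.archive_file.lambda.output_path
  source_code_hash = data.archive_file.lambda.output_base64sha256
  publish          = true
}

resource "aws_lambda_alias" "live" {
  name             = "live"
  function_name    = aws_lambda_function.synthetic.function_name
  function_version = aws_lambda_function.synthetic.version
}

resource "aws_cloudwatch_event_rule" "every_minute" {
  name                = "${var.name}-every-minute"
  schedule_expression = "rate(1 minute)"
}

# The input here is the JSON payload the lambda above receives.
resource "aws_cloudwatch_event_target" "invoke" {
  rule  = aws_cloudwatch_event_rule.every_minute.name
  arn   = aws_lambda_alias.live.arn
  input = jsonencode({ url = var.target_url, environment = var.environment })
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.synthetic.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.every_minute.arn
  qualifier     = aws_lambda_alias.live.name
}
```

Constructing the event target’s input with jsonencode is the whole trick: the payload is built in the one place that already knows the environment’s URLs and secrets.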

Levelling UP!

Thus far we’ve built a lambda, got a meaningful result, and put that into cloudwatch so we can graph it out.

Then we pushed the provisioning into Terraform so we can replicate that setup anywhere. By using Terraform to construct the JSON payload, we’re doing it from the place where we’re most likely to know things like URLs, environment details, parameters & secrets.

Example:

One Terraform module, composed of aws_lambda_function, aws_lambda_alias, aws_lambda_permission, aws_iam_role, aws_cloudwatch_event_rule & aws_cloudwatch_event_target; this first module contains all the infrastructure required to provision the synthetic and record custom metrics. Another Terraform module, called apex, uses a composition that includes the first module; the diagram shows the apex module cascading the environment name to its child modules.
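A hedged sketch of that composition (module paths, names and URLs are all made up) might look like this:

```hcl
# apex module: push the environment details in once, fan them out to every synthetic.
variable "environment" { type = string }
variable "third_party_url" { type = string } # differs per environment

module "third_party_api_check" {
  source      = "../synthetic-check" # the atomic module sketched above
  name        = "third-party-api-${var.environment}"
  environment = var.environment
  dist_file   = "${path.module}/../../dist/third-party-api-check/index.js"
  target_url  = var.third_party_url
}

# Commissioning a new synthetic is just another module block here;
# every environment picks it up on its next plan & apply.
```

Each environment then needs only a single instance of the apex module:

```hcl
module "observability" {
  source          = "./modules/apex"
  environment     = "staging"
  third_party_url = "https://staging-api.example.com/health"
}
```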

End Game Boss

Building on that first atomic unit and wrapping it up so we can replicate it in any environment was super smart. We could do with industrialising this a little more; think about the conduit: we want to add more lambdas and roll them out to all the environments quickly, so commissioning new lambdas is a core focus.

By using Yarn 3 workspaces and optionally NX, we can keep the infra and code in a single repository; NX lets us use the “affected” command to only build the packages in the monorepo that have changed.

We use this with a matrix build in GitHub Actions to deploy.

A GitHub actions workflow showing build & test as a combined step, and then a matrix build-image and matrix-deploy-image step to parallelise GitHub action steps.
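Treat this as a rough sketch of the workflow shape; the package layout, the environment list and the hand-off to Terraform Cloud are assumptions rather than the exact pipeline:

```yaml
name: synthetics
on:
  push:
    branches: [main]

jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with: { fetch-depth: 0 }          # nx affected needs history to diff against
      - uses: actions/setup-node@v3
        with: { node-version: 14 }
      - run: yarn install --immutable
      - run: yarn nx affected --target=build --base=origin/main~1  # only changed packages
      - run: yarn nx affected --target=test --base=origin/main~1

  deploy:
    needs: build-test
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]  # one parallel leg per environment
    steps:
      - uses: actions/checkout@v3
      # commit the built dist back to main / kick off the Terraform Cloud run
      # for ${{ matrix.environment }} here
```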

Bonus Level

CloudWatch dashboarding is still a bit of a faff with Terraform, but we could centralise these results to a dedicated monitoring account (I love this pattern, I’ll talk about it more soon.)

Update our lambdas to submit a dimension for the environment; again, the value is passed into the lambda from the JSON payload, so you ultimately end up setting the environment name in Terraform, where you’re most likely to know it.
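On the lambda side, the dimension is a small addition to the putMetricData call. And if you do centralise into a dedicated monitoring account, one hedged option (the role, field names and namespace below are all made up) is to assume a role in that account before publishing:

```typescript
// Hedged sketch: tag the metric with the environment, and publish it via a role
// assumed in the dedicated monitoring account. All names and ARNs are illustrative.
import { CloudWatch, STS } from 'aws-sdk';

interface CheckEvent {
  url: string;
  environment: string;       // set once in Terraform, passed through the JSON payload
  monitoringRoleArn: string; // role in the central monitoring account
}

const cloudwatchInMonitoringAccount = async (roleArn: string): Promise<CloudWatch> => {
  const { Credentials } = await new STS()
    .assumeRole({ RoleArn: roleArn, RoleSessionName: 'tactical-observability' })
    .promise();

  return new CloudWatch({
    accessKeyId: Credentials!.AccessKeyId,
    secretAccessKey: Credentials!.SecretAccessKey,
    sessionToken: Credentials!.SessionToken,
  });
};

export const publishResult = async (event: CheckEvent, ok: boolean): Promise<void> => {
  const cloudwatch = await cloudwatchInMonitoringAccount(event.monitoringRoleArn);

  await cloudwatch
    .putMetricData({
      Namespace: 'TacticalObservability',
      MetricData: [{
        MetricName: 'ThirdPartyApiUp',
        Value: ok ? 1 : 0,
        Unit: 'Count',
        Dimensions: [{ Name: 'Environment', Value: event.environment }],
      }],
    })
    .promise();
};
```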

We could even use dashboard sharing to embed this new dashboard into teams or a shared website so there is zero friction in getting to custom metric results.

End Credits

In retrospect, there were a couple of ways we could build this solution.

We could use Dockerfiles for lambdas, but using zips and committing them back to the repo on build feels like a good choice. I’m assuming you’re using Terraform Cloud, so you have that lock-step of plan & apply on the infra side to roll lambdas.

We could do a cross-account lambda-to-ECR call; we’d end up with fewer images, but it doesn’t really help us on the deploy side, because a lambda points to an image digest (SHA), not a tag, so rolling the :latest tag might not work quite as you’d think.

We could push the deploy creds into GitHub secrets and consume them in some fancy recursive loop.

When I think about this code, its nature, and the choreography, this is not code that is likely to change anytime soon; a lambda is written once, and it publishes metrics until you don’t need them anymore.

My preference is to have GitHub Actions commit the built dist for the lambdas back to the main branch when we build. Make sure they are web-packed down to one file (we’re still ES5 and Node 14 with lambda, so this makes more sense than a new bundler).

Our main branch will always have the latest build committed and ready to go. To deploy, we just jump into Terraform Cloud and run Plan and Apply; Terraform detects a change in that source file, which we then archive, causing a cascade through our resource graph until we end up with a deployment plan.

TL;DR

Build some lambdas, test anything, and push the results into AWS CloudWatch custom metrics in some meaningful form.

Package all that up into a terraform module, create an instance of that module per environment.

Package up all those modules into an apex module and push in the environment details once.

Commission one new account for dedicated monitoring, and push the custom metrics to that account instead of each individual account.

Enjoy one dashboard, with metrics for every environment, that takes a single additional instance of the terraform module to scale to a new environment.

Tactical & horizontally scaling observability.

Don’t be rigid

There are no hard and fast rules to this stuff, be tactical, focus on customer value, and on positive behavioural shifts. My proposed solution will net you immediate feedback on environmental issues for almost anything you want to surface and will scale quickly across n number of environments, regardless of how they were built.

It’s tactical, though, and this stuff belongs with the capability we’re observing, so if you do something similar to what I’ve outlined, don’t lose sight of fixing that tech debt and getting this back in the hands of the engineers.
