Service Mash

Jun 20, 2024

Once, I was a tech lead in a five-person team that owned 22 services, of which 12 were critical and highly loaded. We had dev, staging, and two production environments for each service. We also owned a few services for a deprecated product that no one had touched for years. Good times!

Look, Simba. Everything the light touches is our kingdom.

The team was amazing, but we had to be organized and intentional about navigating our infrastructure. This is how we survived while delivering five nines!

At first, operating your infrastructure is easy, as everybody is initiated into tribal knowledge. But the longer it goes, the more cryptic and chaotic it becomes.

Old services will become a complete mystery. Newcomers will bombard other engineers with questions. No one will remember the link to the dashboards for that underused region. No one will know how to clean up the cache in that deprecated gateway. Everybody will be surprised when shutting down a seemingly useless role brings the core infrastructure down in flames because something critical depended on it.

We need a way to keep all information about each service in one place. That makes it easy to explore your infrastructure, onboard newcomers, quickly find guidance for incident mitigation, and so on.

Zero Context

The best way to approach communication and documentation is to assume zero context. Once, I got a work email that read something like this:

Your HWO in the HPS is missing an AHID. It can’t be fulfilled.
regards,
John

That’s it: not a single link or explanation. It was the first time I had heard of HWO, HPS, AHID, or John. Can you guess what it was about?

Always assume your audience might not know things that are obvious to you. Add a link to documentation, briefly explain your intentions and the end goal, and expand abbreviations.

Essentials

Let’s go over the things essential to know about a service!

Often, it is really hard to find out what is upstream from your service.

General topics are:

Summary
Why is this service needed? Is it active? Is it deprecated? Draw a high-level component diagram to explain its main flow.
Routing
Routing to a service, especially a public one, can be complicated. It can involve a domain name provider, TLS certificate, load balancer, target group, service mesh configuration, reverse proxy port, etc.
Upstream and Downstream Services
Dependencies (downstream) can be easily located, but dependants (upstream) are also important, as they will suffer from contract changes or issues. A good list of upstream and downstream services explains why each connection exists and how traffic is routed to that resource.
Links and Resources
We need to know the repository, build, deployment, documentation, threat model, etc.
Dashboards and Logs
This also includes telemetry, crash reports, and business intelligence.
Runbook
Is this service stateless? Can a node be replaced? Does it have any admin endpoints? What are typical issues, and how can they be solved?
Contacts
Which team owns it, what is the on-call rotation, and who is the most knowledgeable engineer?

Simple Wiki Page

As we will see, there are multiple ways to store this information, but just making a wiki page per service will have an immense positive effect on many engineering aspects.

After a few years of experimentation, I recommend creating a page template that covers the topics above, filling it out retroactively for existing services, and making it part of the Definition of Done for new ones.
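
A minimal sketch of what such a template could contain, assuming the topics from the Essentials list. It is written as YAML for compactness; on a wiki, each top-level key would become a heading. Every name, URL, and value below is a hypothetical placeholder.

```yaml
# Hypothetical service page template; replace every placeholder value.
service: my-service
summary:
  purpose: Why the service exists, in one or two sentences
  status: active              # active or deprecated
  diagram: https://wiki.example.com/my-service/diagram
routing:
  domain: my-service.example.com
  tls_certificate: Where the certificate is managed and when it expires
  load_balancer: Name of the load balancer and target group
upstream:                     # dependants, i.e. services that call this one
  - name: checkout
    reason: Why it calls this service
    route: How its traffic reaches this service
downstream:                   # dependencies, i.e. services this one calls
  - name: orders-db
    reason: Why this service needs it
    route: How this service reaches it
links:
  repository: https://git.example.com/team-a/my-service
  build: https://ci.example.com/my-service
  deployment: https://deploy.example.com/my-service
  documentation: https://wiki.example.com/my-service
  threat_model: https://wiki.example.com/my-service/threat-model
dashboards_and_logs:
  metrics: https://dashboards.example.com/my-service
  logs: https://logs.example.com/my-service
runbook:
  stateless: true
  node_replaceable: true
  admin_endpoints: []
  typical_issues:
    - symptom: What the alert or the user report looks like
      fix: Steps that usually resolve it
contacts:
  team: team-a
  on_call: Link to the rotation schedule
  expert: Name of the most knowledgeable engineer
```

Keeping the keys identical across services makes the pages easy to scan during an incident, because the runbook and the dashboards are always in the same place.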

Architecture Diagrams

Multiple methodologies and tools let you model a complex architecture in a way that captures most of the aspects mentioned above. The obvious benefits are visualization and the ability to present the same model in different ways.

Service definitions are just one dimension of architecture modeling, which is outside the scope of this article. Nonetheless, I recommend adopting the C4 model and considering IcePanel, a neat and powerful tool.

Software Catalog

Software developers love having as much metadata stored declaratively in a repository as possible. The open-source software catalog Backstage by Spotify does exactly that!

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: my-service
  description: Description of my service
  tags:
    - javascript
    - backend
spec:
  type: service
  lifecycle: production
  # Ownership is declared once, in spec, as a reference to a group or user entity
  owner: team-a
  system: my-system

What is even more tempting is that parts of the Backstage configuration can be used directly as the actual service configuration, removing the need to duplicate information.
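
The same descriptor can also carry several of the essentials listed earlier. The sketch below assumes hypothetical entity names and URLs; `metadata.links`, `spec.dependsOn`, and the `backstage.io/techdocs-ref` annotation are part of Backstage's descriptor format, but treat the exact values as placeholders rather than a recommended setup.

```yaml
# Sketch of a fuller catalog-info.yaml; all names and URLs are hypothetical.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: my-service
  description: Description of my service
  annotations:
    backstage.io/techdocs-ref: dir:.   # documentation lives next to the code
  links:
    - url: https://dashboards.example.com/my-service
      title: Dashboards
      icon: dashboard
    - url: https://wiki.example.com/my-service/runbook
      title: Runbook
      icon: docs
spec:
  type: service
  lifecycle: production
  owner: team-a
  system: my-system
  dependsOn:                           # downstream dependencies
    - component:default/orders-api
    - resource:default/orders-db
```

Only the dependencies are declared here. As far as I know, the catalog derives the reverse relation from `dependsOn`, so dependants can be found from the other side without maintaining a second list by hand.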

Service Mesh

While the previous options were meant to capture and visualize metadata, the service mesh (Istio, for example) owns the transport, monitoring, and security controls. It is not easy to adopt, as it requires careful tuning of the control plane (service discovery and routing rules) and introducing a new data plane (the actual proxies that route data between roles), but it provides many benefits.
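
To make the routing rules and security controls a bit more concrete, here is a minimal Istio sketch. It assumes a hypothetical `my-service` with pods labeled `version: v1` and `version: v2`, and a hypothetical `my-namespace`. The API versions shown are the `v1beta1` ones; newer Istio releases also serve `v1`.

```yaml
# Hypothetical example: define two versions of my-service as subsets.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
# Routing rule: send 90% of requests to v1 and 10% to v2.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 90
        - destination:
            host: my-service
            subset: v2
          weight: 10
---
# Security control: require mutual TLS for workloads in my-namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace
spec:
  mtls:
    mode: STRICT
```

The control plane distributes these rules, and the sidecar proxies in the data plane enforce them, so the traffic split and the encryption need no changes in the services themselves.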

While most service mesh implementations provide visualization, they are usually not meant to replace service catalogs. Often, they are inaccessible to some of the engineers who would benefit from service metadata.

P.S. The mysterious email I got from John was just a notification that the computer mouse I had ordered through an internal portal was not on the approved hardware list, so the order couldn’t be fulfilled.

That is a real story, not a “The interviewer came in; he was the dog” story.