How’s “You Build It, You Run It” working for you? Presenting the Pit Stop, to reduce blind spots and improve cross-functional impact!

Victor Mignot
Published in swile-engineering
Jan 10, 2023 · 7 min read

tl;dr: keeping collaboration efficient between the Platform Team and the Feature Teams can be challenging in a “You Build It, You Run It” organization. Here is one of the rituals Swile implemented to make sure apps stay up to standard on the run side of their lifecycle.

YBIYRI

At Swile we are strongly “You build it, you run it” oriented. The run includes security, on-call, hot-fixes, performance… This philosophy helps us increase the number of Feature Teams while keeping the Platform Team at a sustainable size, as the responsibility for a large part of the application lifecycle is distributed inside each team.

The Platform Team builds, well… the Platform (landing zone, common services…)! So they handle the run for it.
The Feature Teams build the microservices, so they handle the run for those. Simple.

However, there is no wall here, thanks to a common goal. The Platform Team shares the same objective as the rest of engineering: we want to make a better Swile experience for our customers. This means that our job does not stop at delivering an awesome platform (which we also do).
➡️ The road to success is a shared responsibility (something that goes without saying, but it is good to be reminded of from time to time).

The Platform Team contributes to this success by providing out-of-the-box tools and opinionated “generic” configurations, for example to bootstrap new Kubernetes microservices (a so-called “blueprint”). By using those, Feature Teams can go faster and benefit from “premium” support on our end, whereas if they decide to go with exotic tools, we will not be able to bring the same kind of expertise (but we’ll try).
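
To make the idea concrete, here is a purely hypothetical sketch of the kind of opinionated configuration such a blueprint could expose to a Feature Team. The field names below are illustrative only, not Swile’s actual blueprint schema:

```yaml
# Hypothetical blueprint values for bootstrapping a new Kubernetes microservice.
# Every field name here is illustrative, not Swile's real schema.
service:
  name: my-new-service
  team: feature-team-a
  language: node
deployment:
  replicas: 2
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }
observability:
  metrics: true        # expose and scrape /metrics out of the box
  dashboards: default  # ship a pre-built dashboard
ci:
  autoDeploy: true     # pushes to the main branch deploy automatically
```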

One key success factor of this model is seasoning autonomy with empowerment (and support).

“Build or run?” / “Why don’t we have both?”

Teaching how to fish & being a team player

Give a man a fish, he’ll eat for a day; teach a man to fish and he’ll eat for a lifetime…

… is the motto behind our empowerment principle in the Platform Team. You cannot ask someone to handle the run without proper training (and support!) on how to do so. When a team needs it, we make ourselves available to help them understand what is under the hood and how the platform works.

Furthermore, when a new engineer joins Swile, they are enrolled in a series of training and mentoring sessions to understand the platform stack, architecture, patterns… This onboarding aims to ensure every engineer knows how to interact with and extend the platform.
Knowledge is shared on a regular basis within teams of course, but also during several (optional) rituals:

  • Weekly tech meetings where prepared subjects, covering a wide range of areas, are shared with the community
  • Guilds focused on specific areas (a language, a product…)
  • Specific team rituals that each team implements as they choose

Finally, bringing support to teams whenever they ask for it is our second priority (just after “fixing a critical bug”). We do not “fire and forget” knowledge. Being told how to fish a few years ago does not mean you will remember how, on the day you starve, if you have not practiced since.

Every one of those actions is a contribution to improving the Swile experience for our end users, fully embracing our customer-focused culture.

It’s the little things…

So I guess we have it all figured out, right?

The thing is, when I said our objectives were aligned, I might have overstated it a little. At a global scale, it is true. But let’s look a bit closer.
Feature Teams have a stakeholder from the Product side. Delivering top-notch features is their day-to-day priority. And if the run is done on a best-effort basis, the related metrics will degrade progressively (lower performance, more bugs (critical and non-critical), architecture debt…).

Moreover, you have self-organized teams, meaning there are as many ways of running apps as there are such teams, with varying degrees of maturity. Having a trustworthy view of each application’s “operability state” can be challenging (not to say impossible) when you reach a certain number of services.

Is the app:
  • late on dependencies?
  • not correctly leveraging the platform’s capabilities?
  • lacking observability?
  • serving some requests with random latency spikes?
  • crashing often (but maybe not often enough to raise alarms)?
…these are just a few examples of bad smells that can easily be overlooked.

Also, when we hear from the Feature Teams, it is mostly via support requests, which is not ideal.

Another problem brought by “You Build It, You Run It” is the difficulty of acting on every microservice running on your platform. For example, you might want to ensure each application uses a new configuration, say a Kubernetes annotation that brings more security (a generic sketch of such a change follows the list below). Here, the Pareto principle strikes once again: 20% of the effort will cover 80% of the services… but the remaining 20% of the services will require the biggest effort!
There are various reasons for that, some of them being:

  • lack of time (= more pressing priorities)
  • tech debt that makes the change difficult or risky
  • little understanding of the value of the change (e.g., when asking a team to update a Kubernetes resource’s manifest in order to stay up to date with the Kubernetes API)
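
To make that kind of request concrete, here is a generic sketch, not a Swile-specific requirement: a Deployment whose apiVersion has been moved to apps/v1 to keep up with the Kubernetes API, plus an AppArmor annotation on the pod template as one example of a security-hardening change. The service name and image are hypothetical:

```yaml
# Generic illustration of the kind of change Platform may ask every team to apply.
apiVersion: apps/v1                 # e.g. migrated from the removed extensions/v1beta1
kind: Deployment
metadata:
  name: example-api                 # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
      annotations:
        # Security-hardening annotation: run the "api" container under the
        # container runtime's default AppArmor profile.
        container.apparmor.security.beta.kubernetes.io/api: runtime/default
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0
```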

You could argue that if we were blindly following “You Build It, You Run It”, then, well, it is their problem. If, platform-wise, you have done all that you could (education, communication during Guild meetings, packaging…) and they do not want to do their part, then they cannot complain if their service crashes in a few weeks…
Another, more proactive way is to implement safeguards: you are not up to date on a component? Well, the platform is closed for you until you are. If done properly, this solution can help a lot: instead of you chasing non-compliant apps, teams will fix them or come to you for help.

…but we said previously that the quality of our top-notch product is also something the Platform Team wants, no matter what it takes. Leaving people behind and hoping that they will catch up is not the best way to achieve our common goal. Blocking people at the platform level is a good way to create two silos arguing with each other (and a good way to promote shadow IT!).

Presenting the Pit Stop

So we came up with a solution.

Starting point:
Everybody wants the same thing: making the best product.
Feature Teams want to improve their run, but they cannot invest too much time in learning and improving that part, because they have a product roadmap to implement. They need an efficient and contextualized way to learn and to fill the gaps.

Simply put, a Pit Stop is:

  • A day dedicated to improving an application’s operability (Observability, Reliability and Efficiency)
  • Putting together a feature team and a couple of SREs

Any team can ask for a Pit Stop. We plan it a few weeks ahead so everybody can clear their agenda and their head.
A few days before the Pit Stop, we hold a one-hour call to list the subjects that the Feature Team would like to cover. This meeting helps to reduce the scope (sometimes the Platform Team’s area of expertise is misunderstood), but it is also needed to prepare the subjects on each side. Solutions often need some discussion with peers to validate the trade-offs, and as the Pit Stop itself is only one day, we do not want to lose time there.

A racetrack with caption representing the steps of our “Pit Stop”: start, app architecture presentation, …
A day with the Platform Team

Once the day starts, we begin with a presentation of the application: what is its value for the company? What is its architecture? Which platform capabilities does it use? Using a product like excalidraw.com to recreate a whiteboard feel works well.

Observability is key: first, look at the metrics and the dashboards. What do we want to improve today? What does your gut say isn’t working well, and what formal metrics would help you evaluate it?

Then comes the deep dive. To be more efficient, we tend to break out into two groups, to cover more subjects and avoid having “lurkers” in the call: everyone has to be a doer today! At the end of the day, we schedule ad-hoc meetings to finish what has been started, write task tickets, and post a short summary of the findings and related PRs for the other teams to look at.

Some examples of what has been achieved during those Pit Stops (two of them are sketched after the list):

  • Divided CI time (from push to auto-deploy) by 5
  • Found better custom metrics for some Horizontal Pod Autoscalers
  • Designed an improved logging strategy and the steps needed to implement it
  • Refined the liveness & readiness probes
  • Investigated “JavaScript heap out of memory” errors
  • …
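
As an illustration of the probe and autoscaling items above, here is roughly what such changes look like in Kubernetes manifests. These are generic sketches with hypothetical names and values, not the actual configurations produced during a Pit Stop:

```yaml
# Sketch 1: refined liveness & readiness probes on a container (illustrative values).
containers:
  - name: api
    image: registry.example.com/example-api:1.0.0
    livenessProbe:
      httpGet:
        path: /healthz            # hypothetical health endpoint
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3         # restart only after repeated consecutive failures
    readinessProbe:
      httpGet:
        path: /ready              # hypothetical readiness endpoint
        port: 3000
      periodSeconds: 5            # stop routing traffic quickly when the app is not ready
---
# Sketch 2: an HPA scaling on a custom per-pod metric instead of raw CPU
# (this assumes a metrics adapter exposes the custom metric; names are hypothetical).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```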

Looking back

After running it with a few teams and iterating on the process to make it smoother, we are happy with the results.

  • The name “Pit Stop” stuck and has been applied to a similar ritual for other teams (with Velocity’s Pit Stop, to enhance your developer experience!).
  • The feedback from the teams is good, and the Platform Team feels it provides efficient help by pairing.
  • The Platform Team has been able to take a close look at the applications and the challenges they face. It is a good way to build knowledge but also empathy. Furthermore, it has been really helpful for surfacing new subjects to add to the Platform backlog.
  • We took time to remove various “rocks in our shoes”.

Whatever words you put behind it (“You Build It, You Run It”, DevOps, SRE, …), ultimately you have to use whatever means necessary to achieve your goal: making your app awesome. And everyone’s first job is to work towards that goal!
