It’s important and they aren’t doing it.

I had an interesting conversation the other day with someone on PlatoHQ. He was an early engineer with his company, and grew with it — from a junior software engineer straight out of school, to now being a Director managing multiple teams. He lamented that some of the teams under his purview seemed to lack accountability.

“How does this manifest?” I asked. They push bad code to master — in fact, they push code that breaks the build right before heading out for a long weekend! They don’t seem to care about things being deployable, or that their colleagues might spend hours with git bisect to figure out what they did and why.

That’s certainly not a healthy situation, and it’s not one unique to this individual or this company. Here are some thoughts that I hope were helpful to him, and might be helpful to you, should you find yourself in similar straits.

Ask these questions

Here’s a handy design pattern, if you will, for these kinds of debugging sessions, when a person or people are just totally failing to do something really important and it drives you up the wall.

First, reframe, if possible: go from “these people suck, what’s up with them?” (which may very well be what you are feeling in the heat of the moment) to “these people mean well, and yet they are not doing the thing I want them to do. Why?”

Easy for me to say; I’m not the one having to explain to my VP that we are delaying that feature release yet again because the build is broken again, and then having to hear her lecture me about running unit tests, like I don’t know. This is important, though. Work on yourself; get your mind to accept the possibility that there is another story.

Once you’ve gotten yourself together and are ready to debug, ask the following three questions:

  1. Do they know that this is important?
  2. Do they know why it is important? Is it important to you, or is it important to them? (Bonus self-awareness points if you realize it’s actually not important, and tell them you’ve seen the light).
  3. Do they have the right tools and environment to achieve this important thing?

Once you have the answers, a solution will likely start to emerge.

Actually, I’ll save you the trouble. Nine times out of ten, it’s about improving communication between individuals and between teams, shortening feedback cycles, giving people greater ownership and autonomy, and giving them awesome tools to get their work done. Easy-peasy. You can probably skip the rest of this post.

I’m on a tear, though. Here’s a thousand words on “why the hell is the build broken again.”

Debugging the broken build culture

First things first, let’s do some sanity checks. Do they know that CI should be green? Yes. Is this CI thing new, could it be they don’t know what the error messages mean or how to check status? No. Do they know why it’s important that CI should be green? Yes. No easy solution.

Do they actually not give a shit at all? If you really do have engineers who don’t care at all about the quality of their work, and have to be reprimanded and policed just to get the project to compile, you need to fire them, and hire ones who will take pride in what they do. Chances are, though, that’s not what’s going on. Most people prefer to do a good job, rather than a bad job. It feels bad to do a bad job.

So we come to the harder structural stuff. Something is causing the engineers to ignore that bad feeling, and leave a mess for others to fix without even a “So sorry, I’m in Tahoe with barely any reception, please revert my push, I owe you chocolate/booze/a nice thank you note.”

I suspect one or both of the following are happening:

  • The team lacks a sense of ownership
  • The tooling is so abysmal that having master be green is not a reasonable expectation

Let’s take them one at a time.

The word you are looking for is “ownership”

Chances are, the team in question does not feel true ownership of the code being built; and it’s hard to make someone feel accountable for a thing they don’t own, and perhaps feel is out of their control.

One aspect of ownership is knowing what’s happening and why. How many people contribute to the codebase that’s being tested? More than a handful, and the sense of ownership is dispersed. (Codebase, for the sake of this discussion, refers to the scope of the failing build — this could be a git repo, a submodule, a build tool target, or however it is you partition things.) If you have 50 people regularly pushing code into one application, and the build breaks multiple times a day, everyone just stops feeling like it’s their problem. You might get requests for DevOps or a QA team to “own the build.” Don’t do it. Engineers should own their code. They will if they feel it’s theirs. Fix your organization so they can. This might mean fixing your build structure or repo setup; it might also mean untangling the crazy interdependencies between teams that cause code to be slung all over. It might even mean redefining teams.

Another aspect of ownership is knowing you are the decision maker, and therefore responsible for the outcome. Is someone in a managerial role — team lead, manager, perhaps even you, the early engineer who is now the director but knows the ins and outs of the key codebases and has opinions — constantly hovering over every pull request, diving into the details, critiquing choices of variable names, and generally controlling things? This sends a strong message: the hoverer is the owner, and the programmer is an extra set of typing hands. Knock it off, or get whoever is doing this to knock it off. Oversight in moderation.

Anything else going on with ownership? Address this, and you’ll get your accountability. Assuming the tools are there.

Invest in your tools

So, the team is in control of their domain, they are rulers of their little universe, they take pride in the thing they build… and the CI is still red at the end of the day. Seriously, what gives?!

Let’s talk tools.

How do they find out the thing is broken? E-mail notifications that go to whoever set up the CI job and left the team (or better yet, the company!) months ago? If it’s important, it should fail loudly. Slack notification into the team channel (not into some generic “build status” channel). Monitors on walls. An animatronic parrot coming to life in the middle of the team area. PagerDuty even, if it’s actually critical.
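To make “fail loudly” concrete, here’s a minimal sketch of the kind of script a CI job could call when the build goes red, posting straight into the team’s channel via a Slack incoming webhook. The webhook itself and the SLACK_WEBHOOK_URL, BUILD_URL, and GIT_COMMIT environment variables are assumptions about your setup, not something your CI hands you for free:

```python
# notify_slack.py -- post a loud build-failure message into the team channel.
# Assumes a Slack incoming webhook; SLACK_WEBHOOK_URL, BUILD_URL, and GIT_COMMIT
# are placeholder environment variables your CI job would need to set.
import json
import os
import urllib.request


def notify_failure() -> None:
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # the team's channel, not a generic one
    build_url = os.environ.get("BUILD_URL", "unknown build")
    commit = os.environ.get("GIT_COMMIT", "unknown")[:8]

    payload = {"text": f":rotating_light: master is red at {commit}: {build_url}"}
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # Slack answers with a plain "ok" on success


if __name__ == "__main__":
    notify_failure()
```

The animatronic parrot is left as an exercise for the reader; the point is that the failure shows up where the team already lives, within minutes of the push.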

Next, let’s look at them tests.

If CI on master is the only thing running them, guess what, you’re going to get broken builds.

Can people run the complete test suite in their dev environment? How long does it take? Allocate time to make tests fast. Then make them faster. If tests are so hard to set up that people don’t bother, or take so long that people don’t wait around for the results, it’s to be expected that the tests will get skipped.
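One cheap way to get tests running before code ever reaches CI is a local git hook. Here’s a rough sketch of a pre-push hook in Python; the “make test-fast” target is a placeholder for whatever quick smoke subset you maintain, and you’d drop this into .git/hooks/pre-push and make it executable:

```python
#!/usr/bin/env python3
# .git/hooks/pre-push -- run the fast subset of tests before anything leaves the laptop.
# "make test-fast" is a placeholder for whatever quick smoke suite you maintain.
import subprocess
import sys


def main() -> int:
    print("pre-push: running the fast test subset...")
    result = subprocess.run(["make", "test-fast"])
    if result.returncode != 0:
        print("pre-push: tests failed, push aborted. Fix (or run the full suite) before pushing.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

People can still push with --no-verify, and that’s fine; the goal is to make running the tests the path of least resistance, not to police anyone.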

Build hooks into your code review system to run all tests in CI when a pull request is made or updated. Also run pre-merge checks, since the master branch drifts during the review process. You’ll need a bigger CI cluster; it’s worth it. (Lots of hosted autoscaling ones now! They’re cool!)
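The pre-merge part matters because a branch that was green when the pull request went up can still break once it meets the master that exists now. Here’s a sketch of the idea: build a throwaway merge against current master and run the suite on that. The branch names and the “make test” entry point are placeholders, and many code review and CI setups offer an equivalent merge ref out of the box, so check before rolling your own:

```python
# premerge_check.py -- verify a PR branch still passes against *current* master,
# not the master it was branched from. "origin/master", the "premerge-check"
# scratch branch, and "make test" are placeholders for your own setup.
import subprocess
import sys


def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)


def premerge_check(pr_branch: str) -> int:
    run("git", "fetch", "origin")
    # Build a throwaway merge of the PR branch onto whatever master is right now.
    run("git", "checkout", "-B", "premerge-check", "origin/master")
    merge = subprocess.run(["git", "merge", "--no-ff", "--no-edit", pr_branch])
    if merge.returncode != 0:
        print(f"pre-merge: {pr_branch} no longer merges cleanly with current master")
        return 1
    # Run the full suite against the merged state.
    return subprocess.run(["make", "test"]).returncode


if __name__ == "__main__":
    sys.exit(premerge_check(sys.argv[1]))
```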

Make sure the master branch is defended. I know it’s a PITA. Invest in automation and templates for your CI scripts if you have a lot of repos and separate little CIs. Or bite the bullet and do the monorepo thing (good luck and god bless).
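If your repos happen to live on GitHub, “defended” can be as literal as branch protection that refuses merges until the checks are green. Here’s a hedged sketch of scripting that across a pile of repos with the REST API; the org name, repo list, “ci/tests” check name, and GITHUB_TOKEN handling are all placeholders for your own setup:

```python
# protect_master.py -- require green status checks before anything merges to master.
# Uses GitHub's branch-protection REST endpoint; GITHUB_TOKEN, the org name, the
# repo list, and the "ci/tests" check name are placeholders for your own setup.
import json
import os
import urllib.request

TOKEN = os.environ["GITHUB_TOKEN"]
ORG = "your-org"
REPOS = ["service-a", "service-b"]


def protect(repo: str) -> None:
    url = f"https://api.github.com/repos/{ORG}/{repo}/branches/master/protection"
    payload = {
        "required_status_checks": {"strict": True, "contexts": ["ci/tests"]},
        "enforce_admins": True,  # no "just this once" exceptions for admins either
        "required_pull_request_reviews": {"required_approving_review_count": 1},
        "restrictions": None,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"token {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
    )
    urllib.request.urlopen(request)
    print(f"protected master on {ORG}/{repo}")


if __name__ == "__main__":
    for repo in REPOS:
        protect(repo)
```

GitLab and Bitbucket have their own protected-branch settings; whatever the host, the point is that “please don’t push broken code to master” becomes something the tooling enforces rather than something you keep saying.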

Are some tests flaky, do they “just fail sometimes”? That’s a killer. Happens for all kinds of reasons — shared state in the test runner that’s not properly cleaned out, random seeds, timeouts… root these things out. Don’t let “just run it again” be an excuse; maybe don’t get the sprint derailed to fix that sort of thing, but set the expectation that fixing flakiness is a priority for the next one, and turn off or even delete the flaky test (it’s fine, the code is in git). Build should be green. Don’t let incrementing timeouts be an acceptable solution, either — this just makes the build take longer before you find out it’s busted, and if you don’t know why things time out now, they will time out later, too.
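One way to make “turn it off until it’s fixed” concrete without deleting anything is to quarantine flaky tests behind an explicit marker, so they stop reddening master but stay visible as debt. A sketch using pytest; the “flaky” marker name and the RUN_FLAKY flag are conventions I’m making up here:

```python
# conftest.py -- quarantine known-flaky tests instead of letting them redden master.
# The "flaky" marker name and the RUN_FLAKY env flag are conventions you'd pick yourself.
import os

import pytest


def pytest_configure(config):
    # Register the marker so pytest doesn't warn about an unknown mark.
    config.addinivalue_line("markers", "flaky: known-flaky test, quarantined until fixed")


def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_FLAKY") == "1":
        return  # a separate, non-blocking job can still run the quarantined tests
    skip_flaky = pytest.mark.skip(reason="quarantined flaky test; see tracking ticket")
    for item in items:
        if "flaky" in item.keywords:
            item.add_marker(skip_flaky)


# In the test suite, quarantining is then just:
#
#   @pytest.mark.flaky
#   def test_thing_that_sometimes_times_out():
#       ...
```

A dedicated non-blocking job can set RUN_FLAKY=1 to keep exercising the quarantined tests, so “fix it next sprint” has an actual list to work from.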

Ok so maybe unit tests run in a blink, and there is no flakiness, and branches don’t get merged till they are green, and… still. Do you perchance have separate “long-running” integration or perf tests that only run on master, and on cron, at that? That’s a tough one. You can make it better — incremental tests based on small targets and explicit dependencies, for example; shared, immutable pre-generated input data; other tricks. Some stuff is still going to take a long time. Best I’ve got so far is to get it scheduled such that everyone knows when to expect it to fail — and that time shouldn’t be midnight. At least move the problem from “oh crap, build’s broken and everyone went home” to “haha, here is our regularly scheduled 10am integration build alarm.” Other ideas? Please post a comment.
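“Incremental tests based on small targets and explicit dependencies” is easiest with a build tool that already knows the dependency graph (Bazel, Pants, and friends), but the core idea is small enough to sketch: map what changed to the long-running suites that care, and only run those. The directory-to-suite mapping below is entirely made up; real dependency metadata would come from your build tool:

```python
# select_integration_tests.py -- only run the long-running suites a change can affect.
# The DEPENDENTS mapping is a made-up stand-in for real target/dependency metadata
# that a build tool (Bazel, Pants, etc.) would give you.
import subprocess
import sys

DEPENDENTS = {
    "ingest/": ["tests/integration/test_ingest_pipeline.py"],
    "billing/": ["tests/integration/test_invoices.py"],
    "common/": ["tests/integration"],  # shared code: run everything
}


def changed_paths(base: str = "origin/master") -> list[str]:
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return diff.stdout.splitlines()


def select_suites(paths: list[str]) -> set[str]:
    suites: set[str] = set()
    for path in paths:
        for prefix, targets in DEPENDENTS.items():
            if path.startswith(prefix):
                suites.update(targets)
    return suites


if __name__ == "__main__":
    suites = select_suites(changed_paths())
    if not suites:
        print("no affected integration suites; nothing to run")
        sys.exit(0)
    sys.exit(subprocess.run(["pytest", *sorted(suites)]).returncode)
```

It won’t catch everything (hence the “common/” catch-all), but it moves the nightly from “run the world” toward “run what this change can actually break.”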

If this all sounds like a ton of new work, yeah, it is. This is what you should spend that QA or DevOps headcount on. Making tools better, making your engineers better. My friend Peter Seibel wrote 5000 words on the subject, and you should read that, since if you are still reading this, you obviously have the patience — but tl;dr: dedicated engineering effectiveness / developer productivity engineers are good, once you reach 100 people or so. IMO, if you add general CI/CD tooling responsibilities to that, they start paying off much earlier.

Summing up

People want to do the right thing. If they aren’t doing the right thing, either they don’t know it’s right, or it’s too hard to do. Tell them what you want, and why; merely telling them to “do the right thing” is unlikely to be effective. Consider that maybe they are right and you are wrong. If you are right, figure out why they are doing the wrong thing, and remove the obstacles on the desired path.
