How To Maintain Quality At Scale

Brian Culler
Salesloft Engineering
Sep 27, 2017 · 8 min read


If you are working at a fast-growing startup, maintaining a defect-free product over time is just plain hard. It takes determination and intentionality — but the good news is, it isn’t actually overly complex. Brian Culler, our Director of Engineering, explains how we do it here at SalesLoft.

Typical Lifecycle Of A Startup — And How It Goes Wrong

When you look at the different phases your product will go through, there almost always seems to be a higher priority than fixing some bugs. At the very early stages, when you are just trying to find product-market fit, every second has to be spent getting MVP features out the door as quickly as possible to see what sticks. Then, if you are lucky enough to gain traction, you start to become inundated with new feature requests, except now they are coming from both prospects evaluating you and existing customers needing more functionality.

Once things get serious, then come the scaling issues — decoupling your MVP app into services, moving to a scalable architecture, and refactoring non-performant components to handle the ever-increasing user load.

And then if you’re truly lucky and you’ve stumbled into an emerging space of software that is solving a new problem, most likely someone else has stumbled into it along with you. This means you have 1 or 2 main competitors who are trying to eat your lunch. It’s a land grab, so now you have to keep up with feature parity across a fast-moving innovative space.

You move fast, you break a few things, but so what? Sales keep rolling in and most of it works. You're probably taking care of any showstoppers; the rest of the edge-case bugs get written up (maybe) and put in a backlog somewhere. We’ll get to them someday, when things cool off and we have time. Right now, we’ve got three major features that have to get out the door and we need to rebuild our SSO system to handle larger customers.

Your Customer Support team probably starts out small and lean, but then inexplicably they are constantly overworked and have to keep hiring. The support tickets keep growing. Bugs that get a ton of tickets manage to get prioritized onto your very crowded roadmap, but only if people are really complaining.

The defect backlog grows. Every new problem affects more and more people since you have more and more customers. Your support team starts bringing it up weekly — the issues you never have time to fix keep getting more and more tickets. And before you know it, it’s just too late. You’ve dug yourself into a hole you’ll never get out of. You don’t have a bug backlog, you have a bug database. There’s no feasible way the executive team will ever swallow a 3-month hiatus to right the ship, and in the meantime, every new feature coming out is throwing regressions all over the place.

Your customers get unhappy. They leave negative NPS reviews stating “so buggy, nothing ever gets fixed!”. It gets harder to keep good developers since the codebase has gotten so finicky. Everything is on fire, all the time. You start to lose market share. Churn rates go up. You don’t understand how your competitors are able to move faster than you, even though they have a smaller team.

Sound awful? It is. And it will kill your company.

Here’s How To Make Sure That Doesn’t Happen.

It isn’t complicated. It is hard work though. You have to, at all costs, with every fiber of your being, make sure that your known defect backlog never ever ever gets out of control. If you don’t watch it, if you don’t put processes in place to manage it, it 100% will grow. You have to be intentional about it.

At SalesLoft we have managed to do just that so far. We have zero known customer-reported defects in our backlog that aren’t prioritized or being worked on. Every company is different, so tweak the variables below to fit your product and your company. You have to take your space into consideration (are you a bank? Or are you writing an emoji-sharing social app?). What will your customers tolerate? How often do you release? Do your users log in once a day, or are their experiences more transactional? All of this should go into your decision-making, but here’s how we do it:

Stay Close To Support.

Your customer support team members are your eyes and ears. Sometimes they are going to know something is wrong faster than you will. They’ll know what the real pain points are. They’ll know which tickets they answer the same way over and over. Getting your Support team hooked directly into your Engineering team is a crucial first step.

During my first week here, while I was just getting in and figuring out how everything worked, I sat down with the engineers to get a feel for the state of the application. Where were the bottlenecks? What was getting the most attention? What was the highest priority?

And finally — how many outstanding bugs do we have? The answer: “Well, none. Not that we know of anyway. We fix things pretty fast.” Nice.

I wheeled my chair over into the next room where Support sat. I threw the same question to the group at large — “Hey everyone, are there any existing bugs you all are aware of? Engineering says there aren’t any.”

The laughter could be heard 3 floors away.

Turns out the process to escalate a defect from Support to Engineering just didn’t exist. The best it got was that if something really broke, someone would bang on the glass between the rooms and shout it out. Besides that, Support would maybe mention broken things to an engineer while getting coffee or in a random meeting.

You have to get Support talking to Engineering, and you should use software to do that. Whatever ticketing system Support uses (we’re on Zendesk), get it hooked up via a push integration into your process management system (JIRA, Pivotal, etc.). Teach the Support team how to write up defects: include the affected component, reproduction steps, and the number of customers affected. Let them set a severity level. Make sure your integration supports two-way updates, so that when the issue moves through your process management system, all the linked support tickets get automatically updated with status and resolution. Ours even lets us comment on all linked support tickets right from our process management system (JIRA, in our case). This part needs to be as easy and seamless to use as possible. You don’t want to create overhead just from the tracking and management of your defect list.
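To make the shape of that integration a little more concrete, here is a minimal sketch of the push half: receiving a webhook from the ticketing system and opening an issue in JIRA. This is not our actual implementation. The project key, the payload fields, and the severity-to-priority mapping are all placeholders, and it assumes you have configured a Zendesk trigger or webhook to POST a small JSON payload with the fields shown.

```python
# Hypothetical sketch: receive a Zendesk webhook and open a JIRA issue.
# Project key, URLs, credentials, and the severity-to-priority mapping
# are placeholders; adapt them to your own Zendesk/JIRA setup.
import os

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

JIRA_BASE_URL = os.environ["JIRA_BASE_URL"]   # e.g. https://yourcompany.atlassian.net
JIRA_AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])
JIRA_PROJECT_KEY = "SUP"                      # placeholder project key

# Map the severity Support selects to a JIRA priority name.
SEVERITY_TO_PRIORITY = {"sev1": "Highest", "sev2": "High", "sev3": "Medium"}

@app.route("/zendesk-webhook", methods=["POST"])
def create_defect():
    # Payload shape is whatever you define in your Zendesk trigger/webhook.
    ticket = request.get_json(force=True)
    severity = ticket.get("severity", "sev3")

    fields = {
        "project": {"key": JIRA_PROJECT_KEY},
        "issuetype": {"name": "Bug"},
        "summary": f"[ZD-{ticket['id']}] {ticket['subject']}",
        "description": (
            f"{ticket['description']}\n\n"
            f"Affected component: {ticket.get('component', 'unknown')}\n"
            f"Customers affected: {ticket.get('customers_affected', 'unknown')}\n"
            f"Zendesk ticket: {ticket['id']}"
        ),
        "priority": {"name": SEVERITY_TO_PRIORITY[severity]},
    }

    # Standard JIRA REST issue-creation endpoint.
    resp = requests.post(
        f"{JIRA_BASE_URL}/rest/api/2/issue",
        json={"fields": fields},
        auth=JIRA_AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return jsonify({"jira_key": resp.json()["key"]}), 201
```

The reverse direction (pushing status changes and comments from JIRA back onto the linked tickets) works the same way with webhooks going the other way; in practice, an off-the-shelf Zendesk-to-JIRA integration gives you both directions without writing any of this yourself.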

Define Your Service Level Agreement.

An SLA, in this case, is just a fancy way of saying “we promise to do certain things in a certain amount of time.” We have an SLA with our own Support team that we sat down and decided upon together. Engineering promised to fix all defects within a certain amount of time. We came up with 3 different severity levels — these may be different from what your company or product needs, but at a high level:

A Severity 1 basically means the application is down. It’s all hands on deck and we fix it immediately.

Severity 2 means something is really dang bad — you can’t import data into the system, or phone call recordings aren’t saving. We still respond to Sev 2s almost immediately, but usually it’s just one person on a team who drops what they’re doing to start a fix. We set a max of 2 days to get it fixed, though in reality fixes usually go out within a couple of hours.

Severity 3s are the vast majority of defects that get reported to us: something is technically broken, but it’s not a showstopper and you can work around it somehow.
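If it helps to see the rules written down, here is a tiny sketch (purely illustrative, not how we actually track this) of the severity-to-deadline mapping as data, plus a breach check. The timeframes mirror ours: immediate for Sev 1, 2 days for Sev 2, and 14 days for Sev 3.

```python
# Purely illustrative: the SLA windows as data, with a simple breach check.
# The severity numbers and timeframes mirror the ones described in this post.
from datetime import datetime, timedelta, timezone
from typing import Optional

SLA_MAX_AGE = {
    1: timedelta(days=0),   # Sev 1: application down, fix immediately
    2: timedelta(days=2),   # Sev 2: one person drops feature work, 2-day max
    3: timedelta(days=14),  # Sev 3: everything else, 14-day max
}

def sla_breached(severity: int, reported_at: datetime,
                 now: Optional[datetime] = None) -> bool:
    """Return True if a defect of this severity has outlived its SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - reported_at > SLA_MAX_AGE[severity]

# Example: a Sev 3 defect reported 15 days ago is past its window.
reported = datetime.now(timezone.utc) - timedelta(days=15)
assert sla_breached(3, reported)
```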

The critical key to all three is that they are time-bound. Immediate, 2 days, or 14 days: the exact numbers matter less than having them at all, but without that time-based forcing function, none of this will work. Which then leads to the final piece of the solution…

Get Buy-in And Stick To It!

When you are coming up with your SLA, get everyone involved in the room. Start out with the thesis that quality is important, get the team to agree that it is, and get everyone to commit to following the agreement. Get your Product team on board, your Support team, everyone. It doesn’t matter how great your process is if no one takes it seriously.

And finally — take it very very seriously. If you have a 14-day max, and a defect hits 14 days old, then literally go find the engineer it is assigned to and have them drop whatever feature work they are doing and jump on the defect. This takes discipline. The idea isn’t to let every defect hit 14 days either, but it’s a long enough time to make sure the team has the flexibility to make priority decisions. Have a huge feature about to go out? Fine, push off the defect for a few days. Just know that when it starts getting to 10 or 11 days old, you’d sure better be looking at it.

We actually have an Engineering Metric for this, called “Aging Defects,” which tracks on a weekly basis how many customer-reported bugs are older than 14 days. Our goal is for that number to be zero, every week, week after week. That’s how important this is.
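A metric like that is straightforward to automate. Here is a hedged sketch that counts aging defects via JIRA's issue-search API and a JQL filter; the project key and the “customer-reported” label are placeholders for however you tag defects that came in through Support.

```python
# Hypothetical sketch: count open customer-reported defects older than 14 days.
# The project key and "customer-reported" label are placeholders.
import os

import requests

JIRA_BASE_URL = os.environ["JIRA_BASE_URL"]
JIRA_AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])

AGING_DEFECTS_JQL = (
    'project = SUP AND issuetype = Bug AND labels = "customer-reported" '
    "AND statusCategory != Done AND created <= -14d"
)

def aging_defect_count() -> int:
    """Number of unresolved customer-reported defects past the 14-day SLA."""
    resp = requests.get(
        f"{JIRA_BASE_URL}/rest/api/2/search",
        params={"jql": AGING_DEFECTS_JQL, "maxResults": 0},  # only the total is needed
        auth=JIRA_AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["total"]

if __name__ == "__main__":
    print(f"Aging defects this week: {aging_defect_count()} (goal: 0)")
```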

Once you let it get away from you, it is so incredibly difficult to ever get it back. If you do this right and take quality seriously, what you’re left with is a frustration-free platform for your customers, a trustworthy codebase for your product development team, and orders of magnitude fewer tickets coming into your Support team. In the end, taking the time to fix things that break — when they break — actually lets you move faster in the long run.
