Why is everything broken?

Adam Berlinsky-Schine
9 min read · Sep 8, 2017

If you don’t live and breathe tech, you probably don’t realize how much time is spent preventing and fixing bugs versus the amount of time spent building new features. Junior engineers may also not realize this, which is one reason that engineers are bad at estimation.

In my experience, first-time entrepreneurs without an engineering background are often dissatisfied with the first engineer or dev shop they hire. In some cases they might be right; there are some mediocre engineers and agencies out there, and without having the engineering skills to evaluate them upfront, you’re more likely to end up with one of them. But often it’s because of mismatched expectations. All apps have bugs, and you’re more likely to see them as the first user of a new product.

Debugging obscure problems is among the most challenging things that engineers do, and we do a lot of it. Even engineers who are not officially QA engineers typically spend more than half their time testing rather than writing new code. It’s exceedingly rare for an engineer to write a large chunk of code without testing it and find that it just works. Most engineers write a small bit of code, test it, fix problems, then iterate.

Let’s discuss some of the reasons that apps break and how to help your engineers improve the product’s quality.

Google and Facebook always seem to work. Why is our app always broken?

First of all, even big companies have more problems than you probably realize. If a million Facebook users aren’t able to log in for a few minutes, there’s a small chance that you would be one of the unlucky few who even notice the problem. If a thousand of your app’s users can’t log in to the product that is your life’s work, you won’t sleep until it’s fixed. You’re not on Facebook 24/7 (we all sleep sometimes, right?) so you won’t catch most temporary issues, but when your app is down in the middle of the night, you will know about it. Bugs might only affect one feature used by a subset of users; if it’s your app then you’ll stress out about it, but if it’s Facebook then it might affect a feature you don’t use.

Second, there is always a tradeoff between speed of development and quality. Startups tend to prioritize speed, while large companies prioritize quality. Of course everyone wants both, but there’s simply no denying that the more effort spent testing, the fewer bugs will slip to production. It might not seem like big companies are moving slowly, but remember that they have thousands of engineers working on many features in parallel. Even if they seem to release new features every few days, each of those features had likely been under development for months.

Third, large companies have more resources, in terms of both manpower and money. They've spent years developing tools to assist in testing and deploying code. They have large QA teams who test all the code that gets released to production. They have teams of brilliant infrastructure engineers whose full-time job is to ensure smooth rollouts and to quickly diagnose and fix issues when they do happen.

This used to work. Why doesn’t it work anymore?

One of the most frustrating things for engineers and non-engineers alike is when something that has worked for a long time suddenly breaks. Bugs related to recently implemented or changed features make sense, but sometimes bugs crop up in portions of the product that were not intentionally changed. Why does this happen?

One reason is refactoring. In short, refactoring is the process of updating existing code in order to better support new use cases. For example, let’s say you’re working on the hottest new animal husbandry app. So far it only supports dogs and chickens, but the thorough market analysis you performed on the internet has convinced you that there is a huge untapped market for cats. The developer implementing this feature could copy and paste the code for, say, the desired characteristics of each animal from dogs, remove “bark volume,” add “purr frequency,” and keep the rest of the attributes intact. But instead of copying and pasting (thereby creating multiple copies of nearly identical code), she may (and should!) refactor the code to create extensible “mammal” code that can be used for both cats and dogs, reducing the time and complexity of adding support for other mammals in the future. (For another explanation of refactoring, complete with another silly analogy, see my article on technical debt.) Engineers do this all the time, at both small and large scales. While widely considered good hygiene for a codebase, refactoring does run the risk of breaking existing use cases.
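To make that concrete, here is a minimal sketch of the “after” state. The class and attribute names (Mammal, bark_volume, purr_frequency, and so on) are invented for illustration; the point is that the shared attributes move into one place instead of being copy-pasted per species.

```
# Before: Cat was copy-pasted from Dog, so name/weight/diet existed in
# two nearly identical copies that could drift apart over time.
#
# After: the shared attributes live in one extensible Mammal class, and
# each species only adds what is unique to it.

class Mammal:
    def __init__(self, name, weight_kg, diet):
        self.name = name
        self.weight_kg = weight_kg
        self.diet = diet

class Dog(Mammal):
    def __init__(self, name, weight_kg, diet, bark_volume):
        super().__init__(name, weight_kg, diet)
        self.bark_volume = bark_volume  # dog-specific

class Cat(Mammal):
    def __init__(self, name, weight_kg, diet, purr_frequency):
        super().__init__(name, weight_kg, diet)
        self.purr_frequency = purr_frequency  # cat-specific
```

The payoff is that the next mammal is one small subclass instead of another near-copy. The risk is exactly the one described above: Dog now depends on Mammal, so a change made “for cats” can quietly break dogs.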

Other causes of things breaking seemingly at random include infrastructure problems and problems with third-party services. If your database is having problems, you can bet your application doesn’t work very well. If a not-quite-as-critical component like a cache or a CDN is having problems, your app may be in a partially working state. Modern startups also tend to rely on a lot of third-party services: Facebook’s API, Google Maps, chat services, analytics solutions, and many, many more. They can have outages too, which will impact your app. Third parties might also intentionally change things that your app relies on (Facebook has been notorious for this, putting whole companies out of business), although they usually give you fair warning.

Finally, some things that work at a small scale don’t work at a large scale. Some code may work for 100 users but not for 100,000 users. This is why engineers worry a lot about scaling. A web server or, worse, a database may reach a tipping point and no longer be able to serve the new level of usage without an upgrade or optimization. There often isn’t a quick fix when you reach that point, so engineers try to anticipate future needs and ensure they always have some headroom. But some bottlenecks may be unanticipated.
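Here’s a hypothetical sketch of what that can look like in practice. The function names and the db.query helper are stand-ins (the SQL is Postgres-flavored), but the pattern, often called the “N+1 query” problem, is one of the most common ways code that works at a small scale falls over at a large one.

```
# Hypothetical sketch of an "N+1 query" pattern. `db.query` stands in for
# whatever database client the app uses; assume it returns a list of rows.

def order_counts_slow(db, user_ids):
    # One round trip per user: 100 users means 100 queries (fine);
    # 100,000 users means 100,000 queries (the database tips over).
    counts = {}
    for user_id in user_ids:
        rows = db.query(
            "SELECT COUNT(*) FROM orders WHERE user_id = %s", (user_id,)
        )
        counts[user_id] = rows[0][0]
    return counts

def order_counts_batched(db, user_ids):
    # One query in total, letting the database do the grouping.
    rows = db.query(
        "SELECT user_id, COUNT(*) FROM orders"
        " WHERE user_id = ANY(%s) GROUP BY user_id",
        (list(user_ids),),
    )
    return {user_id: count for user_id, count in rows}
```

Both versions return the same answer, which is why the slow one sails through testing on a developer’s laptop and only becomes a “bug” once real traffic arrives.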

It works for me. Why are some people complaining?

Everything might seem hunky dory to you, but as your company starts to acquire more and more customers, you’ll find your team spending more and more time fielding complaints from customers about things not working for them. Why does this happen?

Some reasons are straightforward. There’s a new Android phone on the market that you don’t have in-house for testing purposes, and it turns out there’s a bug that only affects that device. Users are on shaky Internet connections. It works in the four most popular web browsers, but some of your users are using Opera for some reason. People choose to interact with your app in a way that you didn’t anticipate.

Other reasons are more complicated to diagnose. Some esoteric bugs may have escaped QA because they happen so rarely. A bug that has a one in a thousand chance of happening becomes very likely to turn up when you have tens or hundreds of thousands of users. Some particularly annoying bugs lead to accounts or other data getting into a bad state, meaning users will still experience problems even when the bug is fixed until the data is corrected.
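To put a rough number on it: if a bug shows up in one session out of a thousand, then across 10,000 sessions the chance that nobody hits it is 0.999^10,000, which works out to less than 0.005 percent. Somebody is all but guaranteed to see it.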

Sounds overwhelming. What should I do about it?

Most importantly, have a rigorous QA process. If you can’t afford to hire dedicated QA, then you or someone else on the team needs to do it. It should be someone other than the engineer who wrote the code — they will have already tested everything they could think of and the whole point is to get a second pair of eyes from a different vantage point.

Anything not tested before release is broken. QA engineers typically perform regression testing before a release; that is, they test existing functionality, even features that weren’t supposed to change. Teams without dedicated QA, or teams that are resource-constrained, may cut corners by testing only the features in the general vicinity of any intentional changes, but the gold standard is to test the entire application on every release.

Does this sound time consuming? It is. I have a lot of respect for QA engineers who are able to perform this sometimes monotonous work and remain alert enough to catch bugs in areas of the code that nobody expected to change. But there is hope: automation! Automated tests, which include unit tests and functional tests, can take some of the manual work out of the QA process. Code varies in how easy it is to test automatically; backend code, for example, is easier to test than frontend user interfaces. Code can also be written in a way that makes automated tests easier to create, and experienced engineers who prioritize testing will consider this when architecting the codebase. Opinions differ on what portion of a test plan to automate, but most agree that 100% automation is not a realistic goal. It is always important that a human tests new functionality, especially visual changes.
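To make that concrete, here is a toy example using Python’s built-in unittest module; the pricing function and its numbers are invented for illustration. Because the rule lives in a small, pure function with no database or browser attached, an automated test can check it in milliseconds on every release.

```
import unittest

# Hypothetical example: business logic isolated in a pure function,
# so it can be tested without spinning up the app, a browser, or a database.
def discounted_price(price_cents, is_subscriber):
    """Subscribers get 10% off; negative prices are rejected."""
    if price_cents < 0:
        raise ValueError("price_cents must be non-negative")
    if is_subscriber:
        return round(price_cents * 0.9)
    return price_cents

class DiscountedPriceTest(unittest.TestCase):
    def test_subscriber_gets_ten_percent_off(self):
        self.assertEqual(discounted_price(1000, is_subscriber=True), 900)

    def test_non_subscriber_pays_full_price(self):
        self.assertEqual(discounted_price(1000, is_subscriber=False), 1000)

    def test_negative_price_is_rejected(self):
        with self.assertRaises(ValueError):
            discounted_price(-1, is_subscriber=True)

if __name__ == "__main__":
    unittest.main()
```

A functional test covering the same rule would instead drive the real checkout flow end to end; it catches more kinds of breakage but is slower and more brittle, which is part of why most teams aim for a mix rather than all of one or the other.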

Unit tests, automated tests that verify the correctness of small units of code, are often skipped by early-stage startups even though they have become generally accepted as essential at mid-sized to large companies. The engineers may intend to add them later, but that almost never happens. The argument is typically that there’s too much time pressure to get the MVP or early iterations released in order to prove that the product is even viable, and that writing unit tests at that stage is a waste of precious time. While it’s true that the greatest benefit of unit tests is future-proofing the product, you’re still going to need to test the code in some way, so really you’re just substituting manual testing for automated tests. The former has to be repeated for every future release, while the latter only has to be written once. And nobody comes back to write unit tests once the code is no longer fresh in their minds.

As a non-engineer, you can help by instilling a culture of testing. Encourage all of the above, but also test the product yourself. “Eating your own dogfood” (or just “dogfooding”) is a popular saying in the tech world: as many people inside your company as possible should use your own product, especially beta versions that have not yet been released. Even the best developers and QA engineers can’t catch everything; the more eyes and ears using the product frequently, the more bugs you’ll find before they impact external customers. You can also help by ensuring a steady flow of communication between your engineers and customer service. The faster your engineers are made aware of problems, the faster they can fix them (on the flip side, don’t inundate your engineers with every single report from the field; not every user is reliable and not every complaint is a bug). Also, gather as many details as possible from the customer. It’s very difficult to fix a bug if you can’t reproduce it, and reproducing it is often the hardest part. Obtaining exact steps to reliably reproduce a bug will save your engineers a lot of time.

Find your balance

There is always a tradeoff between quality and development speed. The more time spent testing, the fewer bugs will be released to customers. But every product has bugs, and at some point you’ll hit diminishing returns. Depending on the maturity of your product, the impact of potential bugs, and the value that faster product releases brings to your company, the amount of time and money your company spends on QA will differ.

During college, I completed an internship at Intel, working on a new microprocessor. When you’re dealing with hardware, the stakes of a bug getting released are exponentially higher. You can’t just deploy a hotfix; you have to issue a recall of the chip (which happened to Intel once, and cost them $475 million). Well more than half of the engineers working on the chip (including myself) were validation engineers, responsible only for testing. A much smaller portion of the engineering team worked on actually implementing the chip logic. At most software companies, especially startups, the ratio is reversed. And that may be okay. Intel takes years to ship one product, and small defects would have humongous ramifications. Many software companies ship software multiple times per day and can quickly correct defects. Which is better for your business?

Hope you found this post useful! Don’t forget to follow me here on Medium, my blog WTF Is My Engineer Talking About, Facebook, Twitter, or LinkedIn. And please send feedback and topic suggestions via e-mail.
