The ‘Captain Obvious’ method for super-fast software

Often, when a group of developers gathers to discuss a performance or scale issue, the discussion goes along these lines:

Developer1: Ok, I think we’ve reached the limits of MySQL, we should really consider migrating to Couchbase.
Developer2: Yeah, but if we are already doing that we should really change the architecture to event sourcing. Facebook is using it on millions of events per second.
Developer3: Absolutely. This will really allow us to grow! But it will create an issue when we aggregate the events; we will need to cache it on Redis.
Developer2: Of course, of course, goes without saying.

So yes, it might be slightly grotesque. And yet, how many times has a similar discussion happened in your team? How many times did it seem that the only option was to ditch the old code and just rewrite everything? How many times was a cluster of micro-services suggested as the solution to every scale problem? How many times have you called a table with a mere 100 million rows ‘big data’? How many times did you find out that switching to a shiny new technology was just a new way of shooting yourself in the foot?

So really, why is our software so slow?

The real reason behind most performance issues is that people write stupid code.

Most performance issues I've encountered did not happen because we’d reached the limits of our tools or our architecture. They usually had a more mundane origin: people forgetting indexes, ignoring caches, not batching queries, writing chatty APIs, fetching way more than needed. And, in general, acting in a shortsighted manner.
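
To make one of these mundane origins concrete, here is a minimal sketch of a chatty lookup next to a batched one, using Python's built-in sqlite3 module. The users table, the CountingConnection wrapper and both fetch functions are hypothetical names invented for this illustration; they are not taken from any real system.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database with 100 hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        [(i, f"user{i}") for i in range(1, 101)],
    )
    return conn


class CountingConnection:
    """Thin wrapper that counts how many queries are sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.query_count = 0

    def execute(self, sql, params=()):
        self.query_count += 1
        return self.conn.execute(sql, params)


def fetch_names_chatty(db, user_ids):
    """The 'stupid' version: one query per id."""
    names = {}
    for uid in user_ids:
        row = db.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()
        if row is not None:
            names[uid] = row[0]
    return names


def fetch_names_bulk(db, user_ids):
    """The batched version: a single query for all ids."""
    ids = list(user_ids)
    if not ids:
        return {}
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", ids
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    db = CountingConnection(make_db())
    fetch_names_chatty(db, range(1, 51))
    print("chatty queries:", db.query_count)  # one query per id
    db.query_count = 0
    fetch_names_bulk(db, range(1, 51))
    print("bulk queries:", db.query_count)  # a single query
```

Both functions return the same data. Against an in-memory database the difference is barely visible, but once every query is a network round trip the chatty version pays that latency once per id.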

And I know what you are probably thinking right now: Gee.. thanks… really helpful. Is your point that all we need to do is hire people who don’t write stupid code?

Well, no. Because you see… there is a problem with that assertion. So allow me to fix it —

The real reason behind most performance issues is that even smart people write stupid code.

The real question here is not how to find people who don’t write stupid code, but how to prevent smart people from writing stupid code.

And to answer this question, we will need to answer a question you probably already know the answer to.

How do you make sure your software is free of bugs?

That’s a simple enough question. So let’s answer it together —

First you write automated tests, and you run them against every code change. If the tests pass, you know everything is OK, and when tests fail, you know you have a problem. And when only a specific subset of your tests fails, you can pinpoint the source of the problem in a matter of minutes.

Tests save the day, every day.

And yet, tests will not catch everything, so you will also have to build a process to monitor and analyze bugs in production. You’ll usually have monitoring and logs set up to catch errors and allow you to find and diagnose production issues quickly as they happen.

The obvious secret behind high performance software

Now that we’ve found an obvious solution for smart people writing bugs, the path to an obvious solution for fast software seems clear:

Treat the performance of your system the same way you treat the quality of your system.

Yep. That’s all. It’s that simple. Just use the same tools you use to improve your code quality to improve your performance and scalability.

When you are continuously testing and monitoring performance:

  1. You are no longer blind: you can finally separate the bulls#%t from reality and find the small, obvious optimizations that will make a difference.
  2. You can analyze the root cause of a performance issue hiding in a complicated flow in a matter of minutes.
  3. You can improve performance and then easily prove it.
  4. You prevent your performance debt from growing uncontrollably.

That’s it, all rather obvious, and amazingly effective. When you can detect and test for the root cause of your system’s slowness, triaging, diagnosing and actually fixing things becomes a simple, structured process.
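
As a rough sketch of what "no longer blind" can look like in code, the snippet below records how long each named step of a flow took and how many times it ran. FlowMetrics, track and load_dashboard are hypothetical names made up for this example, and the time.sleep calls are stand-ins for real database and API calls. A production system would more likely ship these numbers to an existing monitoring tool than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class FlowMetrics:
    """In-memory recorder of call counts and total time per named step."""

    def __init__(self):
        self.call_counts = defaultdict(int)
        self.total_seconds = defaultdict(float)

    @contextmanager
    def track(self, name):
        """Time the enclosed block and count it, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.call_counts[name] += 1
            self.total_seconds[name] += time.perf_counter() - start

    def report(self):
        """Return {step name: (call count, total seconds)}."""
        return {
            name: (self.call_counts[name], self.total_seconds[name])
            for name in self.call_counts
        }


def load_dashboard(metrics):
    """Hypothetical user-facing flow with three DB queries and one API call."""
    with metrics.track("flow.load_dashboard"):
        for _ in range(3):
            with metrics.track("db.query"):
                time.sleep(0.001)  # stand-in for a real database query
        with metrics.track("api.billing"):
            time.sleep(0.001)  # stand-in for a real external API call


if __name__ == "__main__":
    metrics = FlowMetrics()
    load_dashboard(metrics)
    for name, (count, seconds) in metrics.report().items():
        print(f"{name}: {count} call(s), {seconds * 1000:.1f} ms total")
```

With a report like this, a slow flow stops being a mystery: you can see at a glance whether the time went to one slow call or to many fast ones.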

Five obvious tips for effectively monitoring and testing performance

  1. Measure flows, not just infrastructure. Make sure you can quickly answer ‘What is the experience my user is having?’
  2. Monitor all of your I/O and external API dependencies, as they are the root cause of most performance issues.
  3. Don’t measure only time. Remember to also measure call counts and other relevant metrics, such as cache hit/miss ratios. These metrics will help you catch flows that are not necessarily slow by themselves, but become a problem in large numbers.
  4. Think in percentiles. When you monitor, don’t look at individual data points or at averages, since they tend to be noisy or skewed by outliers. Think in percentiles instead: rather than measuring the average time of the experience, ask ‘What is the performance that 90% of my users experience?’
  5. Test with percentiles. Performance tests tend to be flaky, since many factors (including hardware, data size and GC) make their results fluctuate. By asserting on a percentile, e.g. ‘95% of runs must finish within the time budget’, rather than on every single run, you create a test that is stable yet effective in the long run.
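
Tips 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: percentile, measure and assert_percentile_within are hypothetical helper names, the nearest-rank definition of a percentile is one common choice among several, and the sample timings and the 0.2 second budget are made-up numbers.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def measure(fn, runs=50):
    """Call fn repeatedly and return the duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def assert_percentile_within(durations, pct, budget_seconds):
    """Fail only if the chosen percentile exceeds the budget, so that a
    single slow run (a GC pause, a noisy neighbour) does not break the test."""
    observed = percentile(durations, pct)
    assert observed <= budget_seconds, (
        f"p{pct} was {observed:.4f}s, budget is {budget_seconds:.4f}s"
    )


if __name__ == "__main__":
    # Nine fast runs and one outlier, in seconds (made-up numbers).
    timings = [0.1] * 9 + [5.0]
    print("mean:", sum(timings) / len(timings))  # roughly 0.59, dominated by the outlier
    print("p90:", percentile(timings, 90))  # 0.1, what 90% of the runs experienced
    assert_percentile_within(timings, 90, budget_seconds=0.2)  # passes
```

In a real test you would pass the flow under test to measure and feed the result to assert_percentile_within. The budget itself is a product decision, so it is better derived from what your users should experience than from whatever the code happens to do today.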

Following these tips will allow you to quickly extend your existing automated-testing skills to catch performance issues, in addition to quality issues.

Thank You Captain Obvious! Your work here is done.