How Violating The Principle of Least Surprise Can Introduce Subtle Bugs to Your Code

Kacper Madej
u2i’s lab
Published in
4 min readSep 26, 2017

If you were to name one common thing to all software developers working in various domains and languages what would it be? I would say bugs! Bugs and a career in software development are inseparable. You introduce bugs, you discover bugs and you fix bugs introduced by others. Most of them are obvious and easy to fix but some are a nightmare to track down and can cause a lot of harm. And one of the reasons such bugs are introduced are APIs that violate the Principle of Least Surprise.

The Principle of Least Surprise (or the Principle of Least Astonishment) is a general design rule which states that your design should match the mental model of your expected audience. In the domain of API design your audience consists of the developers that will have to work with the code you wrote.
There’s a great talk by Gary Bernhardt about various Javascript features that don’t follow this principle. I highly recommend you to watch it.
I’d like to tell you a story about how we were hit by this principle after migrating a gigantic Ruby on Rails monolith to Rails 4.

“But there’s already Rails 5!” you might say — undoubtedly there is. But with such a large codebase it’s not trivial to always be up to date with the technology, especially given that most of our work is development.

The system I’m writing about has 180k lines of code and 125k lines of tests in one Rails App. Plus we also have other services living in different codebases. We’ve been building this whole ecosystem from scratch for 7 years; and for the last 3 years we’ve had 20+ developers constantly involved in it.

The application tracks various things on the Internet and generates various reports on them. For the purpose of this blog post let’s refer to the things we track as Records and additional information related to them as Metadata.

After we deployed the new version of the system to production it turned out that we had a very serious bug lurking in our code. From the outside it looked like the number of records we tracked dropped by around 30%. This is all because of a small change in our codebase related to an API which violated the Principle of Least Surprise. We were really surprised when we found it because the code itself had nothing to do with our tracking system and it wasn’t in the code’s intent to make any database changes.

It turned out that the faulty part lied in one of our data exporters. The code flow is pretty obvious:

1. Fetch the records from the database
2. Have some business logic filter out some of the records we don’t want to export
3. Serialize them
4. Export the data to S3

I will simplify the code just to show you the point. The real code is a bit more complicated and the faulty part of it was changed more than once. Here’s the simplified version:

Now, what could possibly go wrong? Below is the original version of filter_exportable_records:

In Rails 3 if you inspect record.metadata.class it is a plain Array. But in Rails 4 it's an instance of ActiveRecord::Associations::CollectionProxy which doesn’t have the select! method. So a straightforward way to upgrade the code is the following:

When reading this blog post it should be now pretty obvious to you, but in case it’s not: we’re reassigning something to an ActiveRecord relation, which… makes a database call and removes metadata that do not conform to some_business_condition? from the database! But all of the metadata is crucial to our tracking system!

But anyway we weren’t able to catch it despite having quite a rigorous release process. We had two experienced developers pair-programming on this task and two more reviewing the pull request. Those four people had years of experience with Rails. So it’s neither obvious when you change the code nor when you look at the diff. And it couldn’t be caught with tests because you simply don’t expect a piece of code that exports some data to alter the database in any way. It just has to be observed on a living system. And unfortunately nobody noticed this issue on staging so we had a lot of fire to extinguish on production.

We all know that assigning an Array to an ActiveRecord relation makes a database call, but as I’ve just shown you sometimes it’s just too easy to miss. In my opinion either the = method shouldn’t make a database call or it shouldn’even exist at all.

But it’s highly improbable that this behavior will change in the future. How can we live with it and protect ourselves against this kind of bugs? I can think of only one solution. Do not mutate ActiveRecord objects when you have no intent of actually modifying your database. So the original code would have to extract all the data it needs into some Plain Old Ruby Objects and not mess with the AR objects themselves. But it requires more code and is not as convenient.

What can we do as developers in general? I believe we should design our APIs in a way that they are simple but not too simple. We should strive to keep a balance between expressiveness and explicitness. Too much expressiveness will lead you to bugs like the one described in this blog post; too much explicitness will make your APIs inconvenient to use.

--

--

Kacper Madej
u2i’s lab

Experienced Ruby programmer. Wannabe machine learner. Available for contracting work.