3 Reasons I Hate Booleans In Databases

A lot of people use booleans when a timestamp works better. An old friend who has built systems for over 20 years got this wrong recently, but you don’t have to. Let me explain.

Photo by Jantine Doornbos on Unsplash

Reason 1: Booleans lack critical information

When you choose a boolean you are opting to information loss. You are capturing only two states true or false and that’s it. Just those two states! If you use a timestamp you can get true by the presence of a value, false by a absence of a value, and you get the temporal information of when the state changed. Once your timestamp is read from the database you can easily add a model accessor like user.is_deleted to convert the timestamp to a boolean in code.

Here’s a common business use case for replacing booleans with timestamps. When a user confirms their email they usually go through a hashed url and then a User.is_email_confirmed boolean is set on their record.

The problem with this boolean is that you can’t use it to find how long it took for the user to confirm his account. Wouldn’t it be helpful to the business to know the time between user signup and confirmation? Of course! How would we know if our confirmation email stop sending? Well sorry, you can’t because you opted for less information.

Reason 2: Timestamps perform just as well as booleans in SQL

Back in the early days of computing database programming environments were constrained and every bit of storage mattered. Did you catch what I did there? :) I can’t understand why with modern database environment we would prefer a less rich datatype.

But wait a minute… What about performance?! What about disk space?!

I’ve made a reproducible Postgres benchmark based on 10M rows to test the performance and index size. I welcome your feedback, but from what I can tell there is little to no performance gain from choosing a boolean over a timestamp.

When I first saw the results I was confused by how similar the index sizes are. After looking for a while, I reached out to Brian Stien, a Postgres guru. He explained that under the hood an index is just a series of pointers in a B-Tree which means the indexes would be similarly sized.

One trade off that we are making is storage space, by converting a 1-byte boolean to an 8-byte timestamp we are going to increase our storage 8x. But given this tradeoff between increasing storage cost and having state and temporal information I would make this trade every time!

Reason 3: Booleans can be poorly conceived state machines

I don’t want to rule out the cases where you will persist a hide/show state or an on/off. Many times booleans start as a single state toggle and then multiply into five booleans in an effort to keep track of a complex state machine. I’d like to call this anti-pattern BooleanMachines.

If you see is_approved, is_active, is_processing, and is_complete all in the same table there is a good chance you have a BooleanMachine. If you then look in the model and you have very sophisticated logic to calculate the state the entity is actually in, you’re probably looking at a BooleanMachine. One other downside of a BooleanMachine is that all of that logic to calculate the current state will need to be transferred through to analysts, reports and the data warehouse.

So what do you do if you know an entity will have multiple states like the example below?

Project Statuses
Created -> Pending -> Approved -> Completed

enum ProjectStatus {
  Created = 1, Pending = 2, Approved = 3, Completed = 4
}

Since there are more than two states this would be a better candidate for an integer-based enumerated field. Unfortunately, Enums also lack temporal data about state transitions.

Backend engineers usually seem to forget that keeping track of the current state while critical, only tells part of the story. Any complex state machine needs track its state transitions. Keeping track of transitions could be as simple as poking a record into a log table for each state transition.

select * from project_status_log limit 5
| user | project_id | status | timestamp
| 1 | 25 | 2 | 2000-10-10 00:00:00
| 1 | 25 | 3 | 2000-10-11 00:00:00
| 1 | 25 | 2 | 2000-10-12 00:00:00
| 1 | 25 | 3 | 2000-10-12 00:00:00
| 1 | 25 | 4 | 2000-10-16 00:00:00

Conclusion

I hope this article helps you become a better engineer and can help us build better systems. To recap here’s what we covered:

  1. Timestamps can store state and temporal information
  2. Performance of timestamps is on par with booleans in the database
  3. Implement state transition logging and avoid creating unwieldy state Boolean state machines.

If you liked the article please share, if you are stuck on a project and need help feel free to reach out!