(this is an answer to Jeff Potter — “3 Reasons I Hate Booleans In Databases”)
I’ll start with the “benchmark” because I like facts.
Here is your test, which I ran on fewer rows (a larger set is not needed, and a smaller one is easier to run and share from db<>fiddle). I ran the queries once beforehand to warm up the cache, and I displayed the execution plan to get a better understanding of the response time:
What you did here, by replacing the 2-state boolean with an N-state timestamp, is completely confuse the query planner's heuristics. Look at the second execution plan: 489 rows estimated instead of 49940. The optimizer then chose a different plan, one which is not optimal here (220.956 seconds instead of 27.429).
Now run the same test after an ANALYZE, so that the cost-based optimizer has more information about the number of nulls in your column. You overload the statistics metadata with many unneeded timestamps, but at least the estimation is correct:
Now the estimation is fine, and if you go to the db<>fiddle you can see why: "null_frac" in "pg_stats" shows how many nulls you have. You can also see the many "most_common_vals" that are now stored in the dictionary. They will probably never be useful, as your goal is only to query on null or not null.
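For reference, this is the kind of query that exposes those statistics. A minimal sketch, assuming the test table is called `test` and the timestamp column `deleted_at` (adjust the names to match the fiddle):

```sql
-- Gather fresh statistics so null_frac and most_common_vals are populated
ANALYZE test;

-- Inspect the per-column statistics the planner uses for its estimates
SELECT attname, null_frac, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'test' AND attname = 'deleted_at';
```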
Now that you have a correct execution plan, you can see that it is exactly the same for both of your queries: a full table scan, which is the most efficient way to read 50% of the rows. No need to compare response times: it is exactly the same amount of work.
A better test would use two different tables, and vacuum them, because in real life you don't expect 100% of rows to be outside the visibility map. You would then see an Index Only Scan and, of course, very little difference. But anyway, this is not the model you expect in real life: you will probably never create an index on one boolean column alone. Either you want quick access to the few flagged rows, and that's a partial index, or you combine this boolean column with other columns on which you have a selective predicate.
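Both options can be sketched like this (the table and column names here are hypothetical, just for illustration):

```sql
-- Partial index: only the few flagged rows are indexed, so the index
-- stays tiny and is used for the selective "not yet processed" lookup
CREATE INDEX orders_unprocessed_idx ON orders (id) WHERE NOT processed;

-- Composite index: the boolean merely refines an already
-- selective predicate on another column
CREATE INDEX orders_cust_proc_idx ON orders (customer_id, processed);
```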
lack critical information
Your physical data model has to store what you need, all that you need, and only what you need. There is no such thing as a generic domain model when you go to a platform-specific implementation: your data model is designed for your use cases. If you have a "Persons" table and you want to know who is married, you add this information as a boolean, because that is what you asked the user: "check the box if you are married". You do not store their wedding date (which is the actual timestamp related to that state). And if you want to know when they entered this information, you probably have a "last_modification" column for the whole record. In any case, the database already stores the state history (for recovery purposes) and can store it automatically for business purposes (triggers, temporal tables, …).
If you need this information, either you rely on what the database provides or you log/audit it yourself, like the "state transition logging" you mention. But not for each column and each boolean! If you go that way, what is the rationale behind storing a timestamp with "User.is_email_confirmed" but not with "User.email" to know when they changed their e-mail?
There is overhead everywhere when you replace a simple "True" with a timestamp. The optimizer statistics above were just one example. Think about the CPU cycles needed to test a boolean versus a datatype with calendar semantics. Think about the space it takes in a row, which can push the data across the limit where it stays in cache or not (at all levels of cache).
By the way, a boolean can be nullable, which means it can have 3 values. You may want to store the fact that you don't yet know whether the value is true or false. By replacing it with a timestamp, you pervert the semantics of NULL: rather than indicating the absence of a value, it now holds the value "False".
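A small sketch of that three-valued boolean, with hypothetical names:

```sql
-- A nullable boolean naturally carries three states:
-- TRUE, FALSE, and NULL ("not answered yet").
-- With a timestamp column, NULL would be forced to mean FALSE.
CREATE TABLE persons (
  id         bigint PRIMARY KEY,
  is_married boolean NULL  -- NULL = unknown, not FALSE
);
```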
poorly conceived state machines
Your third point is about the data model. Yes, from the relational-theory point of view, the need for a boolean datatype can be debated. A boolean state should be implemented by the presence of a row in a fact table. Your first example, "User.is_email_confirmed", should probably go to a table that logs the confirmation (with a timestamp, maybe the IP address of the sender, …). But beyond the theory, let's be pragmatic. One day, for legal reasons (like GDPR), you will have to remove this logged information, and you will still need a boolean to replace what you removed. The boolean is then derived information, required in the physical data model for implementation reasons.
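Here is a sketch of that layout, with assumed table and column names:

```sql
-- The confirmation event as a fact table: the timestamp and any
-- audit details live here, not on the users table
CREATE TABLE email_confirmations (
  user_id      bigint      NOT NULL REFERENCES users (id),
  confirmed_at timestamptz NOT NULL DEFAULT now(),
  sender_ip    inet
);

-- The derived boolean kept on users for the physical model;
-- it survives even after the log rows are purged for GDPR
ALTER TABLE users
  ADD COLUMN is_email_confirmed boolean NOT NULL DEFAULT false;
```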
Of course, if you need more values, like "Created -> Pending -> Approved -> Completed" in your example, you need another datatype. You suggest a NUMBER, but you don't actually need number semantics (such as doing arithmetic on the values). It could be a CHAR, but you don't need character semantics (such as a character set) either. The best solution depends on the database you use; PostgreSQL has an ENUM datatype. The most important thing, if you use a CHAR or NUMBER, is to have a check constraint, so that the optimizer knows the valid values when estimating cardinalities.
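Both variants might look like this in PostgreSQL (names are illustrative only):

```sql
-- Native ENUM: the valid states are part of the type itself
CREATE TYPE order_status AS
  ENUM ('Created', 'Pending', 'Approved', 'Completed');

-- Constrained CHAR: the check constraint tells the optimizer
-- the domain of valid values when estimating cardinalities
CREATE TABLE orders (
  id     bigint  PRIMARY KEY,
  status char(1) NOT NULL
         CHECK (status IN ('C', 'P', 'A', 'D'))
);
```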
The funny thing is that I'm not advocating for boolean datatypes at all here. I've been working for 20 years on Oracle, which does not have boolean columns, and I have never seen the need for one as a table column. A CHAR(1) NOT NULL with CHECK IN ('Y','N') is fine for me. The problem comes with views and result sets, because you need to define the correspondence with the database client program. But Oracle provides PL/SQL to encapsulate database services in stored procedures, and PL/SQL has booleans (and many other non-relational datatypes).
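A quick sketch of that pattern in Oracle syntax (table and function names are made up for the example):

```sql
-- Oracle: a constrained CHAR(1) standing in for a boolean column
CREATE TABLE app_users (
  id        NUMBER PRIMARY KEY,
  is_active CHAR(1) DEFAULT 'N' NOT NULL
            CHECK (is_active IN ('Y', 'N'))
);

-- PL/SQL has a real BOOLEAN for the service layer:
-- the mapping to 'Y'/'N' stays encapsulated here
CREATE OR REPLACE FUNCTION user_is_active(p_id NUMBER)
RETURN BOOLEAN IS
  v_flag app_users.is_active%TYPE;
BEGIN
  SELECT is_active INTO v_flag FROM app_users WHERE id = p_id;
  RETURN v_flag = 'Y';
END;
/
```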
And sorry for the long answer, but I didn't want to just reply "I disagree with all of it" without explanation ;)