Case-Insensitive Solutions in PostgreSQL

Published in

Borda Technology

5 min readMar 7, 2023

Case sensitivity is a common issue in database development. Depending on the database technology used, case sensitivity may be required in some parts of an application while not in others. This often occurs when filtering data. For example, are “ISTANBUL” and “istanbul” considered the same? The default approach to this issue varies across database technologies. For instance, MSSQL is case-insensitive by default, while PostgreSQL is case-sensitive. In this article, we will examine solutions for case-insensitivity when using PostgreSQL.

Collation

Collation defines the character usage of a language or alphabet. It determines how characters will be used in equality, sorting, and case distinctions.

According to PostgreSQL documentation:

The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation.

24.2. Collation Support

The collation feature allows specifying the sort order and character classification behavior of data per-column, or…

www.postgresql.org

Collation is used to determine the database language in its simplest form. It can be used at the database, table, column, and query levels.

Points to consider:

According to the Npgsql documentation, collation support is very new and incomplete, so it should be used carefully. All alternative options to collation should be explored. For detailed information, see https://www.npgsql.org/efcore/misc/collations-and-case-sensitivity.html?tabs=data-annotations.
Collation can be configured as deterministic or non-deterministic. Although deterministic collations have been used for a long time, non-deterministic collations came with PostgreSQL version 12. Since case insensitivity comes with non-deterministic collations, attention should be paid to the version used.
One of the most critical disadvantages is that when non-deterministic collation is used, pattern-matching functions (e.g., LIKE) become unusable.

Lower Function

One of the most basic solutions is to apply the lower function to the searched value and the column in which the value is searched. Filtering is done over lower-case letters, enabling a case-insensitive comparison.

SELECT * FROM test1 WHERE lower(col1) = lower('value');

Points to consider:

When the Lower function is used, the indexes on the relevant column become unusable. Each search will be done as a “sequential scan.” This will cause performance problems as the number of rows increases. “Expressional Index” is recommended as a solution. For details, see https://www.postgresql.org/docs/9.1/indexes-expressional.html.
Applying this method to every query that requires case-insensitive filtering means doing a similar job repeatedly. Because the Lower function will be used constantly for case-insensitive solutions in the queries at the data access layer. In the development phase, both code repetitions will occur, and it can be easily overlooked.
When the Lower function converts the given value to lower-case, it considers the collation information of the database. Language-specific upper or lower-case differences can cause errors. For example, the lower-case writing of the word “F-I-L-E” for “TR” is “f-ı-l-e,” while for “EN,” it is “f-i-l-e.”

Citext Extension

Citext is a data type that can be added as an extension in PostgreSQL. It is a trusted extension and can be added easily. It stands for case-insensitive text.

CREATE EXTENSION IF NOT EXISTS citext;

When using Entity Framework Core (EFCore), it can be added as modelBuilder.HasPostgresExtension("citext"); in the DbContext OnModelCreating method. Then, a migration is needed.

The working principle of the Citext data type is the same as the Lower function. The Lower function is called in its implementation. Therefore, all the items written for the Lower function above are also valid for Citext. The most significant advantage is that it does not require using the Lower function in queries since it makes the relevant column case-insensitive. Since it is a data type, case-insensitivity is applied to columns, not queries. You need to specify the column type to use the appropriate field in the Citext data type in EFCore.

modelBuilder.Entity<Blog>().Property(b => b.Name).HasColumnType("citext");

Points to consider in addition to the Lower function:

A limited number of PostgreSQL text functions have been prepared to be compatible with Citext. Functions other than these will behave like the text data type (case-sensitive). For the relevant functions and detailed information, see https://www.postgresql.org/docs/current/citext.html.
After defining a field as Citext, case-sensitive searching cannot be performed.

ILIKE Operator

ILIKE, a specific operator for PostgreSQL, has the same features as the LIKE operator but is case-insensitive. If special characters defined for the LIKE operator (% and _) are not found within the search text, ILIKE can be used as a case-insensitive equality operator. Although using it for equality checks is incorrect, ILIKE is the most suitable solution when case-insensitive pattern matching is desired. For detailed information, see https://www.postgresql.org/docs/8.3/functions-matching.html.

Comparison of Citext, Lower, and Collation Performance

A comparison of case-insensitive solutions in terms of performance can be found in the following source:

Case-insensitive pattern matching in PostgreSQL - CYBERTEC

www.cybertec-postgresql.com

In summary, we can examine the table below.

Conclusion

Based on the above solutions and performance comparison, we can make the following inferences:

Making a column case-insensitive is more practical than making a query case-insensitive. Thus, indicating in the software analysis document that a field is case-insensitive shows that case-insensitive work will be done in all queries related to that field. In this respect, the Citext data type is a suitable solution.
If only specific queries require a case-insensitive search instead of the entire column, using the Lower function for that query is more appropriate. The need for indexes should be evaluated with increasing data size.
Collation should be used for language, even if it is not for case sensitivity. If this parameter is not correctly specified when creating the database, all other solutions can produce unexpected results.