Choosing the Right Case Insensitive Collation in Snowflake: A Deep Dive into Upper and Lower Collations

Photo by Emile Perron on Unsplash

Collations play a critical role in comparing and sorting string data, enabling Snowflake to handle strings based on customized rules of locale, case sensitivity, accent sensitivity, and punctuation sensitivity. Snowflake’s recent full support for the upper and lower collation specifications offers more efficient options for case-insensitive string handling. This blog walks through these new additions to help you make informed decisions so that you can pick the right collation.

Understanding Upper and Lower Collations

Unlike locale-aware collations such as en-ci, Snowflake’s upper and lower collations are locale-agnostic, transforming strings to uppercase or lowercase, respectively, before performing comparison and sorting. Locale-agnostic collations bypass the need to fetch collation keys, thus eliminating lookup time and improving performance.

With the full support of upper and lower collations, Snowflake now offers three primary specifications for case-insensitive collations:

  • locale-ci: Utilizes locale-aware collation keys for language-specific comparisons.
  • upper: Converts strings to uppercase internally, eliminating the need for collation key lookups.
  • lower: Similar to upper, but converts strings to lowercase.

Choosing a Case Insensitive Collation

Generally speaking, applying collations to string columns almost always results in additional computation. Therefore, if you can manage without specifying a collation, that would be the most performant design.

If a case-insensitive collation is necessary, we generally recommend the upper and lower collations over the ci collation due to the performance improvement mentioned earlier. We understand that switching collations is not a simple task, but we encourage you to test the upper and lower collations with your slow-running queries to determine if the switch is beneficial. However, it is important to note that having consistent collations across the database is recommended for consistent join, comparison and sorting behaviors.

When deciding between upper and lower collations, consider the distribution of your data, although the difference is usually marginal. If your string data is predominantly uppercase, then upper should be your choice to align with the existing case format. The same applies to lower.

Key Considerations and Caveats

Behavioral Differences: Although ci, upper, and lower can all be used as case-insensitive collations, they do not behave identically in all scenarios. The differences fall into the following categories (refer to our documentations for more details):

If your data might contain these characters, it is crucial to understand these differences to avoid unexpected results in data comparison and sorting.

Mixed Use of Locale-aware and Locale-agnostic Collations: It is not recommended to combine locale-aware and locale-agnostic collations, such as ‘en-upper’. Although upper and lower are designed to improve performance, adding a language specifier like ‘en’ introduces unnecessary overhead by involving collation key lookup processes. However, combining upper or lower with other specifications, such as trim, is perfectly valid and can further refine your collation behavior.

Conclusion

The introduction of upper and lower collations in Snowflake provides more performant options for case-insensitive string data handling. Visit our documentation for more information, particularly regarding the behavioral differences between these collations.

Happy Collating!

--

--