SELECT LENGTH(β€˜πŸ™†β€β™€οΈβ€™)= 4. Why does this return an incorrect result?

Maeda Kentaro
Snowflake Engineering
2 min readJan 18, 2024

The LENGTH Function Fails to Accurately Count Certain Emojis and Characters (Grapheme Clusters).

The LENGTH Function Fails to Accurately Count Certain Emojis and Characters

Solution

When accuracy is crucial, counting with a UDF in Python or similar is necessary. Create a UDF that correctly counts characters, as follows

CREATE OR REPLACE FUNCTION COUNT_GC(s varchar)
returns number
language python
runtime_version = 3.11
packages = ('pyicu')
handler = 'count_grapheme_clusters'
as $$
import icu
def count_grapheme_clusters(s):
iter = icu.BreakIterator.createCharacterInstance(icu.Locale("en_US"))
iter.setText(s)
return sum(1 for _ in iter)
$$;

Experiment

Tested to count characters in emojis and Korean text using the LENGTH function, and the query returned inaccurate values.

OCTET_LENGTH is typically perceived as consuming 1 byte for each half-width character, 3 bytes for each full-width character (like in Japanese), and 4 bytes for each emoji. However, this understanding falls short when dealing with grapheme clusters.

Only when counted with UDF, the correct results are obtained.

And our Polar Bear emoji is counted as four characters!

SELECT LENGTH('🐻')=1
SELECT LENGTH('πŸ»β€β„οΈ')=4

Why Does This Happen?

This is because functions fail to correctly count grapheme clusters. In Unicode, there are characters created from multiple Unicode characters (code points). This list of code points is called grapheme clusters.

  • β€œπŸ™†β€ 1 code point
  • β€œU+200D” 1 code point
  • β€œβ™€β€ 1 code point
  • β€œU+FE0F” 1 code point
  • => πŸ™†β€β™€οΈ 4 code points

According to Unicode standards, grapheme clusters should be displayed as one character and counted as one character. [1]

When programmers need to provide users with character counts, the counts should correspond to the number of segments delimited by grapheme cluster boundaries.

However, few frameworks can handle this without additional implementation. Even Python 3.11’s len function outputs based on the number of code points.

Currently, to accurately count the number of characters in columns that include emojis or foreign languages, a UDF is necessary.

Conclusion

When counting characters in strings, be mindful of whether grapheme clusters are included.

--

--

Maeda Kentaro
Snowflake Engineering

RAKUDEJI inc. CEO | SnowPro Advanced : Data Engineer❄️