SELECT LENGTH(βπββοΈβ)= 4. Why does this return an incorrect result?
The LENGTH Function Fails to Accurately Count Certain Emojis and Characters (Grapheme Clusters).
Solution
When accuracy is crucial, counting with a UDF in Python or similar is necessary. Create a UDF that correctly counts characters, as follows
CREATE OR REPLACE FUNCTION COUNT_GC(s varchar)
returns number
language python
runtime_version = 3.11
packages = ('pyicu')
handler = 'count_grapheme_clusters'
as $$
import icu
def count_grapheme_clusters(s):
iter = icu.BreakIterator.createCharacterInstance(icu.Locale("en_US"))
iter.setText(s)
return sum(1 for _ in iter)
$$;
Experiment
Tested to count characters in emojis and Korean text using the LENGTH function, and the query returned inaccurate values.
OCTET_LENGTH is typically perceived as consuming 1 byte for each half-width character, 3 bytes for each full-width character (like in Japanese), and 4 bytes for each emoji. However, this understanding falls short when dealing with grapheme clusters.
Only when counted with UDF, the correct results are obtained.
And our Polar Bear emoji is counted as four characters!
SELECT LENGTH('π»')=1
SELECT LENGTH('π»ββοΈ')=4
Why Does This Happen?
This is because functions fail to correctly count grapheme clusters. In Unicode, there are characters created from multiple Unicode characters (code points). This list of code points is called grapheme clusters.
- βπβ 1 code point
- βU+200Dβ 1 code point
- βββ 1 code point
- βU+FE0Fβ 1 code point
- => πββοΈ 4 code points
According to Unicode standards, grapheme clusters should be displayed as one character and counted as one character. [1]
When programmers need to provide users with character counts, the counts should correspond to the number of segments delimited by grapheme cluster boundaries.
However, few frameworks can handle this without additional implementation. Even Python 3.11βs len function
outputs based on the number of code points.
Currently, to accurately count the number of characters in columns that include emojis or foreign languages, a UDF is necessary.
Conclusion
When counting characters in strings, be mindful of whether grapheme clusters are included.