Python: Output formatting double byte characters

呉
3 min read · Mar 16, 2018

Everything below still holds as of Python 3.9, 3.10 and 3.11.

I have updated the code below to be compatible with current Python rules.

Gist with updated version is here:

If you want to print data to the terminal and keep it perfectly aligned when it contains double byte, that is double-width, characters (Chinese, Japanese or Korean), the standard format in Python will not get you there. It pads based on the length of the string (the number of characters), not on the actual display width of the string.

|Some string 123 other text         |
|Some string 日本語 other text |

Yeah, that is not aligned at all. It is clearly visible that one character like 日 takes up the space of two normal characters like 12.
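The mismatch can be reproduced with a few lines. This is only a minimal sketch using the standard library; the helper name `display_width` and the field width of 30 are made up for the illustration and are not part of the code in this post.

```python
import unicodedata


def display_width(text):
    # hypothetical helper: wide (W) and fullwidth (F) characters take two columns
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in text)


ascii_text = "Some string 123 other text"
cjk_text = "Some string 日本語 other text"

# str.format pads both strings to 30 characters ...
padded_ascii = "|{:<30}|".format(ascii_text)
padded_cjk = "|{:<30}|".format(cjk_text)
print(padded_ascii)
print(padded_cjk)

# ... so len() is identical, but the CJK line occupies three more columns
print(len(padded_ascii), display_width(padded_ascii))
print(len(padded_cjk), display_width(padded_cjk))
```

Each of the three characters in 日本語 adds one extra column, which is exactly the amount the closing `|` is pushed to the right.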

So how do we fix that? First we need to find the actual display width of the string. The difference between that width and the normal length of the string is what we subtract from the format width, so the padding comes out right. We also need to shorten the string, because it is quite likely to be wider than the width given in the format.

With the three functions below we can achieve that (this needs the unicodedata import):

import unicodedata


def shortenStringCJK(string, width, placeholder='..'):
    # get the display width, counting double-width characters as two
    string_len_cjk = stringLenCJK(str(string))
    # if the display width is too big
    if string_len_cjk > width:
        # set current length and output string
        cur_len = 0
        out_string = ''
        # loop through each character
        for char in str(string):
            # set the current length if we add the character
            cur_len += 2 if unicodedata.east_asian_width(char) in "WF" else 1
            # add the character only while it still fits in front of the placeholder
            if cur_len <= (width - len(placeholder)):
                out_string += char
        # return the shortened string plus the placeholder
        return "{}{}".format(out_string, placeholder)
    else:
        return str(string)


def stringLenCJK(string):
    # return string length, counting double-width characters twice
    return sum(1 + (unicodedata.east_asian_width(c) in "WF") for c in string)


def formatLen(string, length):
    # returns the format width adjusted for a string with double-width characters:
    # take the difference between the display width and the normal length,
    # then subtract that from the original length
    return length - (stringLenCJK(string) - len(string))

stringLenCJK

This returns the display width of the string. Each double-width character counts as two, every other character as one.
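The "WF" check relies on the category codes that `unicodedata.east_asian_width` returns. A small sketch to show what comes back for a few sample characters (the sample characters are my own choice):

```python
import unicodedata

samples = [
    "A",        # plain ASCII letter
    "日",       # kanji
    "あ",       # hiragana
    "\uff21",   # fullwidth Latin capital A
    "\uff71",   # halfwidth katakana A
]

# map each character to its East Asian Width category code
categories = {char: unicodedata.east_asian_width(char) for char in samples}
for char, code in categories.items():
    print(repr(char), code)
```

Only W (wide) and F (fullwidth) are counted as two columns by the functions in this post. Na (narrow) and H (halfwidth) count as one.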

shortenStringCJK

Checks whether the display width is larger than the wanted width. If so, it loops through the string and builds a new output string that, together with the placeholder, fits into the wanted width. Again, each double-width character counts as two.
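To see the shortening in isolation, here is a condensed sketch of the same idea. The names `char_width` and `shorten_cjk` are hypothetical and exist only for this illustration; the loop stops at the first character that no longer fits, which should give the same result as the function above.

```python
import unicodedata


def char_width(char):
    # wide (W) and fullwidth (F) characters take two columns, all others one
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1


def shorten_cjk(text, width, placeholder=".."):
    # condensed variant for illustration only
    if sum(char_width(c) for c in text) <= width:
        return text
    limit = width - len(placeholder)
    out, used = "", 0
    for char in text:
        used += char_width(char)
        if used > limit:
            # this character would run into the placeholder, stop here
            break
        out += char
    return out + placeholder


print(shorten_cjk("Some string 123 other text", 26))
print(shorten_cjk("Some string 日本語 other text", 26))
print(shorten_cjk("あいうえおかきくけこさしす 1 other text", 26))
print(shorten_cjk("あいうえお", 7))
```

The last call returns a string that is only six columns wide: the third character would cross the limit, so one column stays unused. The format width adjustment takes care of padding that gap.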

formatLen

Calculates the new width for the format part based on the string that we want to output. Call it with the string after it has been shortened.
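The arithmetic is easiest to follow with a short string. A minimal sketch, with the hypothetical names `width_cjk` and `adjusted_width` standing in for the functions above, and a field width of 10 picked just for the example:

```python
import unicodedata


def width_cjk(text):
    # display width: wide (W) and fullwidth (F) characters count as two columns
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in text)


def adjusted_width(text, length):
    # same arithmetic as formatLen: remove the extra columns from the field width
    return length - (width_cjk(text) - len(text))


text = "日本語"
field = adjusted_width(text, 10)      # 10 - (6 - 3) = 7
cell = "{:<{w}}".format(text, w=field)

print(field)
print(len(cell), width_cjk(cell))     # 7 characters, 10 columns
```

The padded cell is only 7 characters long, but on a monospaced terminal it fills the 10 columns that were asked for.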

The correct print call would then be:

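format_str = "|{{:<{len}}}|"
format_len = 26
string_len = 26
# example input (the full example loops over several strings)
_string = "Some string 日本語 other text"
# shorten once and reuse the result for both the width and the value
short_string = shortenStringCJK(_string, width=string_len)
print("Normal : {}".format(
    format_str.format(
        len=formatLen(short_string, format_len)
    ).format(
        short_string
    )
))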

So we set up the output format string with two nested replacements. First, formatLen compares the display width of the shortened string with the wanted output width and gives us the correct format width, which goes into the {len} placeholder.

Then the shortened string itself is formatted into the resulting field.
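The two stages are easier to follow when split apart. A small sketch, using the width 23 that the example string ends up with after shortening:

```python
format_str = "|{{:<{len}}}|"

# stage one: the doubled braces collapse to single braces, {len} is filled in
stage_one = format_str.format(len=23)
print(stage_one)

# stage two: the result is a normal format string for the actual value
stage_two = stage_one.format("Some string 日本語 other..")
print(stage_two)

# str.format also accepts a nested field for the width, so one call is enough
single_call = "|{:<{len}}|".format("Some string 日本語 other..", len=23)
print(single_call == stage_two)
```

Both variants produce the same line. The two-stage form is handy if the same width is reused for many rows.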

Full example (see gist above)

This will output the following:

Original string
Normal (CJK len 26/len 26): |Some string 123 other text|
Normal (CJK len 29/len 26): |Some string 日本語 other text|
Normal (CJK len 30/len 26): |日本語は string 123 other text|
Normal (CJK len 52/len 26): |あいうえおかきくけこさしすせそなにぬねのまみむめも〜|
Normal (CJK len 39/len 26): |あいうえおかきくけこさしす 1 other text|
Normal (CJK len 40/len 26): |Some string すせそなにぬねのまみむめも〜|
Normal (CJK len 58/len 58): |SOME OTHER STRING THAT IS LONGER THAN TWENTYSIX CHARACTERS|
Shorten string
Calculate> format_len: 26, string_len: 26, stringLenCJK(short) 26, len(short) 26, new format_len: 26
Calculate> format_len: 26, string_len: 26, stringLenCJK(short) 26, len(short) 23, new format_len: 23
Calculate> format_len: 26, string_len: 26, stringLenCJK(short) 26, len(short) 22, new format_len: 22
Calculate> format_len: 26, string_len: 26, stringLenCJK(short) 26, len(short) 14, new format_len: 14
Calculate> format_len: 26, string_len: 26, stringLenCJK(short) 26, len(short) 14, new format_len: 14
Calculate> format_len: 26, string_len: 26, stringLenCJK(short) 26, len(short) 20, new format_len: 20
Calculate> format_len: 26, string_len: 26, stringLenCJK(short) 26, len(short) 26, new format_len: 26
Normal : |Some string 123 other text|
Normal : |Some string 日本語 other..|
Normal : |日本語は string 123 othe..|
Normal : |あいうえおかきくけこさし..|
Normal : |あいうえおかきくけこさし..|
Normal : |Some string すせそなにぬ..|
Normal : |SOME OTHER STRING THAT I..|

[Except that on Medium the font is not monospaced, so the lines above do not look aligned here]

[Image: output on the terminal]
