String Data Type in Go
Strings in Go deserve special attention because they are implemented very differently compared to other languages.
Strings are defined between double quotes, and string literals are UTF-8 encoded by default, which makes more sense in the 21st century. As UTF-8 is a superset of the ASCII character set, you don't need to worry about encoding in most cases. To understand how UTF-8 encoding works, the Wikipedia page on UTF-8 is well worth a visit.
Let’s write a simple program. To declare an empty variable of string type, use the string keyword. Check out the earlier tutorials on how to declare a variable.
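As a minimal sketch of such a declaration (the helper name emptyString is only illustrative and not part of the original tutorial), a string variable declared without a value holds the empty string:

```go
package main

import "fmt"

// emptyString declares a string variable without assigning a value.
// Go initializes it to the zero value of the string type, which is "".
func emptyString() string {
	var s string
	return s
}

func main() {
	s := emptyString()
	// %q wraps the value in double quotes so the empty string is visible.
	fmt.Printf("%q\n", s)
}
```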
To find the length of a string, you can use the len function. len is a built-in function, so you don’t need to import it from any package. It is not particular to strings: it also works on arrays, slices, maps, and channels. We will learn more about Go’s built-in functions in upcoming tutorials.
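A small sketch of len applied to a string follows. The wrapper byteLen is a hypothetical helper added only so the result is easy to check; in practice you would call len directly.

```go
package main

import "fmt"

// byteLen wraps the built-in len so the result can be reused and checked.
func byteLen(s string) int {
	return len(s)
}

func main() {
	s := "Hello World"
	fmt.Println(byteLen(s)) // 11
}
```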
Calling len on the string Hello World prints 11 to the console, as the string has 11 characters, including a space, which is also a character. All characters in Hello World are valid ASCII characters, so we expect each character to occupy only one byte in memory (ASCII characters in UTF-8 occupy 8 bits, or 1 byte). Let's check that using a for loop on the string.
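The loop described here could look like the sketch below. It assumes the string is Hello World and uses an illustrative helper, byteValues, to collect s[i] for every index.

```go
package main

import "fmt"

// byteValues collects s[i] for every index i of the string.
func byteValues(s string) []byte {
	values := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		values = append(values, s[i])
	}
	return values
}

func main() {
	s := "Hello World"
	for _, b := range byteValues(s) {
		fmt.Print(b, " ")
	}
	fmt.Println()
	// prints: 72 101 108 108 111 32 87 111 114 108 100
}
```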
Whoa! I guess you were expecting s[i] to be a letter of the string s, where i is the index of the character in the string, starting from 0. Then what are these numbers? They are the decimal values of the ASCII/UTF-8 characters in the Hello World string (see the table at http://www.asciichart.com).
In Go, a string is in effect a read-only slice of bytes. For now, imagine a slice as a simple array; we will learn about slices in upcoming lessons. Hence, in the above case, we are seeing the byte (uint8) values of the string s, which is internally a slice of bytes. s[i] yields the byte at index i, and that byte is printed as a decimal value by default. To see the individual characters, you can use the %c format verb in a Printf statement. You can also use the %v verb to see the byte value and %T to see the data type of the value.
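The three verbs can be combined in a single Printf call, as in this sketch. The helper describeByte is hypothetical and exists only to make the formatted result easy to inspect.

```go
package main

import "fmt"

// describeByte formats one byte as a character (%c), as its value (%v),
// and as the name of its type (%T).
func describeByte(b byte) string {
	return fmt.Sprintf("%c %v %T", b, b, b)
}

func main() {
	s := "Hello World"
	for i := 0; i < len(s); i++ {
		fmt.Println(describeByte(s[i]))
	}
	// first line printed: H 72 uint8
}
```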
So you can see that each letter is shown as a decimal number that occupies 8 bits, or 1 byte, of memory and has the type uint8.
As we know (see the Wikipedia page), a UTF-8 character occupies from 1 byte (ASCII compatible) to 4 bytes in memory. Hence, in Go, a single character is represented by the int32 data type (4 bytes in size).
A code unit is the smallest group of bits that an encoding uses as one unit. UTF-8 uses 8-bit code units and UTF-16 uses 16-bit code units, which means UTF-8 needs a minimum of 8 bits, or 1 byte, to represent a character.
A code point is a numerical value that identifies a character, and it is represented by one or more code units depending on the encoding. As UTF-8 is compatible with ASCII, every ASCII character is represented by a single byte (8 bits), so UTF-8 needs only 1 code unit to represent it.
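To make the relation between code points and code units concrete, the sketch below asks the standard unicode/utf8 package how many bytes each code point needs. The sample characters and the helper name codeUnits are my own choices, not taken from the original text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codeUnits returns how many UTF-8 code units (bytes) encode the code point r.
func codeUnits(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// An ASCII letter, a Latin-1 letter, the euro sign, and an emoji (U+1F600).
	for _, r := range []rune{'o', 'õ', '€', '\U0001F600'} {
		fmt.Printf("U+%04X needs %d byte(s)\n", r, codeUnits(r))
	}
}
```

The four sample code points need 1, 2, 3, and 4 bytes respectively, which covers the whole range UTF-8 allows.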
But the biggest question is: if every character is represented by int32, why did we get the uint8 type in the above example? As said earlier, a string in Go is a read-only slice of bytes. When we use the len function on a string, it returns the length of that slice. When we index the string in a for loop, we get one byte, that is, one code unit, at a time. So far, all our characters were in the ASCII character set, so each byte was a complete character; in other words, each code unit was in fact a code point. Hence the Printf statement could print a valid character from that byte value. But as we know, a UTF-8 code point, or character, can be represented by a sequence of one or more bytes (4 bytes at most). What happens in the for loop we saw earlier if we introduce non-ASCII characters?
Let's replace the o in Hello with õ (LATIN SMALL LETTER O WITH TILDE, http://www.utf8-chartable.de), which has the Unicode code point U+00F5 and is represented in UTF-8 by 2 code units (2 bytes), c3 b5 (in hexadecimal). So instead of 6f for the character o, we should expect c3 b5 for the character õ.
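A sketch of that experiment, assuming the string Hellõ World and an illustrative helper hexBytes that formats each byte in hexadecimal:

```go
package main

import (
	"fmt"
	"strings"
)

// hexBytes formats every byte of s as two hexadecimal digits,
// separated by single spaces.
func hexBytes(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		if i > 0 {
			sb.WriteByte(' ')
		}
		fmt.Fprintf(&sb, "%02x", s[i])
	}
	return sb.String()
}

func main() {
	s := "Hellõ World"
	fmt.Println(len(s))      // 12: the number of bytes, not characters
	fmt.Println(hexBytes(s)) // 48 65 6c 6c c3 b5 20 57 6f 72 6c 64

	// Printing each byte with %c treats it as a code point of its own.
	for i := 0; i < len(s); i++ {
		fmt.Printf("%c", s[i])
	}
	fmt.Println()
}
```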
From the above result, we got c3 b5 instead of 6f, but the characters of Hellõ World did not print well. We also see that len counts the number of bytes in a string (12 here, not 11), and that is what caused this problem: indexing a string, as our for loop does, accesses individual bytes, not characters. Hence each byte is printed as if it were a code point of its own: c3 (decimal 195) is printed as Ã and b5 (decimal 181) is printed as µ.
To avoid the above chaos, Go introduces the rune data type (a synonym for code point), which is an alias of int32. Earlier I told you (but have not yet proved) that Go represents a character (code point) with the int32 data type.
An interesting question is why rune is an alias of int32 rather than uint32, given that a code point value cannot be negative while the int32 data type can hold both negative and positive values.
So, instead of a slice of bytes, we need to convert a string into a slice of runes.
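The conversion could be sketched as follows. The wrapper toRunes is a hypothetical name; the conversion itself is the plain []rune(s) expression.

```go
package main

import "fmt"

// toRunes converts a string into a slice of runes, one element per code point.
func toRunes(s string) []rune {
	return []rune(s)
}

func main() {
	r := toRunes("Hellõ World")
	fmt.Println(len(r)) // 11

	for i := 0; i < len(r); i++ {
		fmt.Printf("%x ", r[i])
	}
	fmt.Println()
	// prints: 48 65 6c 6c f5 20 57 6f 72 6c 64

	fmt.Printf("%T\n", r[4]) // int32
}
```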
We converted a string into a slice of runes using type conversion. Observe f5 in the above result instead of c3 b5: we are now iterating over values of the rune data type, and the code point of õ in the Unicode table is f5 (hence the Unicode code point representation U+00F5), or decimal 245. Also, we got 11 as the length, which is correct, because there are 11 runes in the slice (11 code points, or 11 characters). And we have also shown that a code point, or character, in Go is represented by the int32 data type.
If you use a for loop with range on a string, range returns each rune along with the byte index at which that character starts.
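A sketch of ranging over the same Hellõ World string is shown below. The helper runeStarts is illustrative and only records the byte index that range reports for each rune.

```go
package main

import "fmt"

// runeStarts returns the byte index at which each rune of s begins.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	for i, r := range "Hellõ World" {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println()
	// prints: 0:H 1:e 2:l 3:l 4:õ 6:  7:W 8:o 9:r 10:l 11:d
}
```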
In the above program, index 5 is skipped because the byte at index 5 is the second code unit of the õ character. If you don’t need the index value, you can ignore it by using _ (the blank identifier) instead.
A string is a slice of bytes, simple as that. When we use a for loop with range, we get runes, because range decodes each UTF-8 character in the string into a value of the rune data type. In Go, a character is written between single quotes; this is called a character literal (or rune literal). Hence, any valid UTF-8 character within single quotes (') is a rune, and its type is int32.
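A short sketch of a rune literal follows. The helper describeRune is hypothetical; it formats the value in hexadecimal and decimal and then names its type.

```go
package main

import "fmt"

// describeRune reports the hexadecimal value, the decimal value,
// and the type name of r.
func describeRune(r rune) string {
	return fmt.Sprintf("%x %d %T", r, r, r)
}

func main() {
	r := 'õ'
	fmt.Println(describeRune(r)) // f5 245 int32
}
```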
The above program will print f5 245 int32, which are the hexadecimal value, the decimal value, and the data type of the code point of õ in the Unicode table.
Strings are immutable
As seen from the earlier definition of strings, they are a read-only slice of bytes. Hence, if we try to replace any byte in the slice, the compiler will throw an error.
The above program will not compile. The compiler throws a cannot assign to error, because the string s is a read-only slice of bytes.
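Since an assignment to s[i] cannot appear in a program that compiles, the sketch below keeps it as a comment and shows one common workaround that the original text does not cover: copy the bytes into a []byte, change the copy, and build a new string. The helper name withFirstByte is an assumption of mine.

```go
package main

import "fmt"

// withFirstByte returns a copy of s whose first byte is replaced by b.
// The original string is never modified; an empty string is returned as-is.
func withFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s) // the conversion copies the bytes of s
	buf[0] = b
	return string(buf)
}

func main() {
	s := "Hello World"
	// s[0] = 'h' // uncommenting this line makes the program fail to compile

	t := withFirstByte(s, 'h')
	fmt.Println(s) // Hello World
	fmt.Println(t) // hello World
}
```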
String literals using backtick
Instead of double quotes, we can also use the backtick (`) character to delimit a string in Go. Within double quotes (") you need to escape newlines, tabs, and other characters that do not need escaping within backticks. If you put a line break in a backtick string, it is interpreted as a '\n' character; see https://golang.org/ref/spec#String_literals
The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the backticks; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (\r) inside raw string literals are discarded from the raw string value. - GoLang documentation
Let’s see a small example
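A possible version of such an example is sketched below. The text of the literal and the two helper functions are my own and only serve to show which characters survive.

```go
package main

import (
	"fmt"
	"strings"
)

// rawGreeting is a raw string literal: everything between the backticks is
// taken as-is, including the line break, the indentation, and the double quotes.
const rawGreeting = `Hello,
	"World"\n!`

// countNewlines counts the real newline characters in s.
func countNewlines(s string) int {
	return strings.Count(s, "\n")
}

// hasLiteralBackslashN reports whether s contains a backslash followed by n,
// that is, an escape sequence that was never interpreted.
func hasLiteralBackslashN(s string) bool {
	return strings.Contains(s, `\n`)
}

func main() {
	fmt.Println(rawGreeting)
	fmt.Println(countNewlines(rawGreeting))        // 1
	fmt.Println(hasLiteralBackslashN(rawGreeting)) // true
}
```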
We can see that the original formatting of the string, with its newline, tab, and double quotes, persisted in the output. The escape sequence \n was not interpreted and did nothing, while carriage return characters (\r) in the source were discarded.
A character written in single quotes in Go is a rune, and runes can be compared because they represent Unicode code points (int32 values). Hence a character with a higher decimal value is greater than a character with a lower one.
Let’s see a very simple example.
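A minimal sketch of such a comparison, with an illustrative helper isGreater:

```go
package main

import "fmt"

// isGreater reports whether the code point x is greater than the code point y.
func isGreater(x, y rune) bool {
	return x > y
}

func main() {
	fmt.Println('a', 'b')            // 97 98
	fmt.Println('b' > 'a')           // true
	fmt.Println(isGreater('a', 'b')) // false
}
```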
As the int32 value of b is greater than that of a, the expression 'b' > 'a' is true. Let's see another example.
Since we know that characters are nothing but int32 values internally, we can do all sorts of comparisons with them. For example, a for loop can iterate over a range between two character values.
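Such a loop might look like the sketch below. The helper lettersBetween and the choice of the range a to e are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// lettersBetween returns every character from first to last, inclusive.
// It returns an empty string when first is greater than last.
func lettersBetween(first, last rune) string {
	var sb strings.Builder
	for c := first; c <= last; c++ {
		sb.WriteRune(c)
	}
	return sb.String()
}

func main() {
	for c := 'a'; c <= 'e'; c++ {
		fmt.Printf("%c ", c)
	}
	fmt.Println()
	// prints: a b c d e

	fmt.Println(lettersBetween('a', 'e')) // abcde
}
```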