GOLANG

String Data Type in Go

Strings in Go deserve special attention because they are implemented very differently in Go compared to other languages.

Uday Hiwarale
Aug 5, 2018 · 8 min read

Strings are defined between double quotes "..." and not single quotes, unlike JavaScript, where either works. Go source code is UTF-8, so string literals in Go are UTF-8 encoded by default, which makes more sense in the 21st century.

Since UTF-8 is backward compatible with the ASCII character set, you don’t need to worry about encoding in most cases. But to understand how UTF-8 encoding works, you should definitely visit my article on Character Encoding.

Let’s write a simple program. To declare an empty variable of string type, use the string type name (string is a predeclared type, not a keyword). Check out earlier tutorials on how to declare a variable.

https://play.golang.org/p/vMDoeaV3RCY
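
Here is a minimal sketch of such a program. The variable name s and the helper greeting are only illustrative assumptions. It shows that a declared string variable starts out as the empty string until a value is assigned.

```go
package main

import "fmt"

// greeting declares a string variable, records its initial (zero) value,
// then assigns a value to it and returns both.
func greeting() (string, string) {
	var s string // the zero value of a string is ""
	before := s
	s = "Hello World"
	return before, s
}

func main() {
	before, after := greeting()
	fmt.Printf("%q\n", before) // ""
	fmt.Printf("%q\n", after)  // "Hello World"
}
```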

To find the length of a string, you can use the len function. len is a built-in function of the language, hence you don’t need to import it from any package.

💡 The len function is not exclusive to strings; it also works on arrays, slices, maps and channels. We will learn about more of Go’s built-in functions in upcoming tutorials.

https://play.golang.org/p/Kqj-TJMFyXP
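
A minimal sketch of using len could look like this. The helper lengths and the slice nums are hypothetical, added only to show that len is not limited to strings.

```go
package main

import "fmt"

// lengths returns the length of a string and of a slice,
// both obtained with the built-in len function.
func lengths() (int, int) {
	s := "Hello World"
	nums := []int{10, 20, 30}
	return len(s), len(nums)
}

func main() {
	s := "Hello World"
	fmt.Println(len(s)) // 11

	strLen, sliceLen := lengths()
	fmt.Println(strLen, sliceLen) // 11 3
}
```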

In the above program, len(s) will print 11 to the console as the string s has 11 characters including a space character.

All characters in the string Hello World are valid ASCII characters, hence we expect each character to occupy only one byte in memory (an ASCII character in UTF-8 occupies 8 bits or 1 byte).

Let’s verify that using a for loop on the string s.

https://play.golang.org/p/cE32NenaYmN
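
As a rough sketch, an index-based loop over the string might look like the following. The helper byteValues is an illustrative name, not something from a library.

```go
package main

import "fmt"

// byteValues collects s[i] for every index i, i.e. the raw bytes of s.
func byteValues(s string) []byte {
	values := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		values = append(values, s[i])
	}
	return values
}

func main() {
	s := "Hello World"
	for i := 0; i < len(s); i++ {
		fmt.Print(s[i], " ") // prints a number, not a letter
	}
	fmt.Println()

	fmt.Println(byteValues(s)) // [72 101 108 108 111 32 87 111 114 108 100]
}
```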

Whoa! I guess you were expecting s[i] to be a letter of the string s, where i is the index of the character in the string starting from 0. Then what is this? Well, these are the decimal values of the ASCII/UTF-8 characters in the Hello World string (see the table at http://www.asciichart.com).

In Go, a string is in effect a read-only slice of bytes. For now, imagine slice is like a simple array. We will learn about slices in upcoming lessons.

In the above example, we are iterating over a slice of bytes (values of type uint8). Hence s[i] prints the decimal value of the byte at index i. To see individual characters, you can use the %c format verb in the Printf statement. You can also use the %v verb to see the byte value and %T to see the data type of the value.

https://play.golang.org/p/wwqhgHcTeIU
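
A small sketch of those three format verbs applied to each byte is shown here. The helper describeByte is a hypothetical name used for illustration.

```go
package main

import "fmt"

// describeByte formats a byte as its character (%c), its value (%v)
// and its data type (%T).
func describeByte(b byte) string {
	return fmt.Sprintf("%c %v %T", b, b, b)
}

func main() {
	s := "Hello World"
	for i := 0; i < len(s); i++ {
		fmt.Println(describeByte(s[i])) // first line: H 72 uint8
	}
}
```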

So you can see that each letter is stored as a decimal number of type uint8, which occupies 8 bits or 1 byte of memory.

As we know (read the Wikipedia page), a character in UTF-8 occupies from 1 byte (ASCII compatible) to 4 bytes. Hence, Go represents every character with the int32 data type (4 bytes), which is large enough to hold any code point. A code unit is the fixed-size group of bits that an encoding uses as its basic building block. UTF-8 uses 8-bit code units and UTF-16 uses 16-bit code units, which means UTF-8 needs at least 8 bits or 1 byte to represent a character.

A code point is the numerical value assigned to a character, and it is represented by one or more code units depending on the encoding. As UTF-8 is compatible with ASCII, all ASCII characters are represented in a single byte (8 bits), hence UTF-8 needs only 1 code unit to represent them.

But the biggest question is: if all characters are represented in int32, then why are we getting the uint8 type in the above example? As said earlier, in Go, a string is a read-only slice of bytes. When we use the len function on a string, it calculates the length of that slice.

When we use an index-based for loop, it walks over the slice one byte, or one code unit, at a time. So far, all our characters were in the ASCII character set, so each byte provided by the for loop was a valid character; in other words, each code unit was, in fact, a code point.

Hence %c in the Printf statement could print a valid character from that byte value. But as we know, a UTF-8 code point or character value can be represented by a series of one or more bytes (max 4 bytes). What will happen in the for loop we saw earlier if we introduce non-ASCII characters?

Let’s replace o in Hello with õ (LATIN SMALL LETTER O WITH TILDE, http://www.utf8-chartable.de), which has the Unicode code point U+00F5 and is represented by 2 code units (2 bytes), c3 b5 (hexadecimal representation). So instead of 6f for character o, we should expect c3 b5 for character õ.

https://play.golang.org/p/rhueGpn4pDc
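
The following sketch runs the same byte-by-byte loop on the new string and prints each byte in hexadecimal next to its %c rendering. The helper hexBytes is an illustrative assumption.

```go
package main

import "fmt"

// hexBytes formats every byte of s as two hex digits separated by spaces.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	s := "Hellõ World"
	fmt.Println(len(s)) // 12, because õ takes two bytes

	for i := 0; i < len(s); i++ {
		// Each byte is printed on its own, so õ is split in two.
		fmt.Printf("%x=%c ", s[i], s[i])
	}
	fmt.Println()

	fmt.Println(hexBytes(s)) // 48 65 6c 6c c3 b5 20 57 6f 72 6c 64
}
```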

From the above result, we got c3 b5 instead of 6f, but the characters of Hellõ World were not printed correctly. We also see that len(s) returns 12, because len counts the number of bytes in a string, not the number of characters, and that is what caused this problem.

This happens because indexing a string (as our for loop does) accesses individual bytes, not characters. The %c verb treats each byte value as a code point on its own, so c3 (decimal 195) is printed as Ã (U+00C3) and b5 (decimal 181) is printed as µ (U+00B5).

To avoid the above chaos, Go introduces the data type rune (a synonym for code point), which is an alias of int32. I told you (but have not proved yet) that Go represents a character (code point) in the int32 data type.

💡 Interesting answer on why rune is int32 and not uint32 (as character code point value can not be negative and int32 data type can hold both negative and positive values) is here.

So, instead of a slice of bytes, we need to convert a string into a slice of runes.

https://play.golang.org/p/ELgL-upVnz_r
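
A minimal sketch of that conversion is given here. The helper codePoints is a hypothetical wrapper around the []rune(s) type conversion.

```go
package main

import "fmt"

// codePoints converts a string into a slice of runes, one per code point.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	s := "Hellõ World"
	runes := codePoints(s)

	fmt.Println(len(runes)) // 11
	for _, r := range runes {
		fmt.Printf("%x=%c ", r, r) // õ now shows up as f5
	}
	fmt.Println()
	fmt.Printf("%T\n", runes[0]) // int32
}
```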

We converted a string into a slice of runes using type conversion. Observe f5 in the above result instead of c3 b5.

This happened because, while converting the string s to a slice of runes, the two bytes c3 b5 were decoded into the single value f5. Together, c3 b5 represent the character õ, and the code point of õ in the Unicode table is f5 (hence the code point notation U+00F5), or decimal 245.

Also, we got the length 11 of string s which is correct, because there are 11 runes in the slice (or 11 code points or 11 characters). And we also proved that a code point or a character in Go is represented by int32 data type.


Using a for loop on a string

If you use range in a for loop over a string, each iteration returns the byte index at which a character starts, followed by the character itself as a rune.

https://play.golang.org/p/Xet2cJbywLH
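
Here is a rough sketch of ranging over the string. The helper runeOffsets is an illustrative name; it only records the byte index reported for each character.

```go
package main

import "fmt"

// runeOffsets returns the byte index at which each character of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "Hellõ World"
	for i, r := range s {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println()

	fmt.Println(runeOffsets(s)) // [0 1 2 3 4 6 7 8 9 10 11]
}
```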

In the above program, index 5 is skipped because the byte at index 5 is the second code unit of the õ character. If you don’t need the index value, you can ignore it by using _ (blank identifier) instead.


What is a rune

A string is a read-only slice of bytes (uint8 integers), simple as that. When we use a for loop with range, Go decodes the UTF-8 bytes for us and hands back each character as a value of the rune data type.

In Go, a character can be written between single quotes, which is known as a character literal (the Go specification calls it a rune literal). Hence, any valid Unicode character within single quotes (') is a rune, and its type is int32.

https://play.golang.org/p/QNBsDunKTrJ
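
A small sketch of inspecting a character literal follows. The helper describeRune is a hypothetical name.

```go
package main

import "fmt"

// describeRune formats a rune as its hexadecimal value (%x),
// its decimal value (%v) and its data type (%T).
func describeRune(r rune) string {
	return fmt.Sprintf("%x %v %T", r, r, r)
}

func main() {
	r := 'õ'
	fmt.Println(describeRune(r)) // f5 245 int32
}
```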

The above program will print f5 245 int32, which are the hexadecimal value, the decimal value and the data type of the code point of õ in the Unicode table.


Strings are immutable

As seen from the earlier definition of strings, they are a read-only slice of bytes. Hence, if we try to replace any byte in the slice, the compiler will throw an error.

https://play.golang.org/p/9Uu5LqNqVkb

The above program will not compile and the compiler will throw an error, cannot assign to s[0] as the string s is a read-only slice of bytes.

However, you can create a string from a slice of bytes and not only from a string literal. But once the conversion from slice to string is done, you can not modify the string as explained in the above example.

var1 := []uint8{72, 101, 108, 108, 111} // [72 101 108 108 111]
var2 := string(var1) // Hello

💡 Remember, byte is an alias for uint8 and rune is an alias for int32. Hence, you can use them interchangeably.
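
If you need a changed version of a string, one possible approach is to copy its bytes into a slice, modify the slice, and convert the result back. The sketch below assumes the first character is ASCII, and the helper replaceFirstByte is a hypothetical name.

```go
package main

import "fmt"

// replaceFirstByte returns a copy of s whose first byte is b.
// The string s itself is never modified.
func replaceFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s) // the conversion copies the bytes of s
	buf[0] = b       // a byte slice, unlike a string, can be modified
	return string(buf)
}

func main() {
	s := "Hello"
	// s[0] = 'J' // would not compile: cannot assign to s[0]
	t := replaceFirstByte(s, 'J')
	fmt.Println(s, t) // Hello Jello
}
```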


String literals using backtick

Instead of double quotes, we can also use the backtick (`) character to represent a string in Go. Inside double quotes (") you need to escape newlines, tabs and other characters that do not need to be escaped inside backticks.

If you put a line break in a backtick string, it is interpreted as a ‘\n’ character, see https://golang.org/ref/spec#String_literals

💡 The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the backticks; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (\r) inside raw string literals are discarded from the raw string value. - GoLang documentation

Let’s see a small example

https://play.golang.org/p/9Ir-0Lxx0u3

We can see that the original formatting of the string, with its newline, tab and double quotes, persisted in the output. The escape sequence \n was not interpreted and did nothing, while the carriage return \r was discarded.
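
To compare the two kinds of literal side by side, here is a minimal sketch. The helper rawAndInterpreted and the sample text are illustrative assumptions.

```go
package main

import "fmt"

// rawAndInterpreted returns the same source characters written once as a
// raw string literal (backticks) and once as an interpreted string literal.
func rawAndInterpreted() (string, string) {
	raw := `a\tb`         // 4 bytes: a, backslash, t, b
	interpreted := "a\tb" // 3 bytes: a, tab, b
	return raw, interpreted
}

func main() {
	multi := `First line
    "quoted" and indented, with a literal \n at the end`
	fmt.Println(multi) // printed exactly as written, across two lines

	raw, interpreted := rawAndInterpreted()
	fmt.Println(len(raw), len(interpreted)) // 4 3
}
```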


Character comparison

A character written in single quotes in Go is a rune, and runes can be compared because they represent Unicode code points (int32 values). Hence, a character with a higher code point value is greater than a character with a lower one.

Let’s see a very simple example.

https://play.golang.org/p/lxGiJzNeNWO
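
A minimal sketch of such a comparison is given here. The helper isGreater is a hypothetical name.

```go
package main

import "fmt"

// isGreater reports whether rune a has a higher code point than rune b.
func isGreater(a, b rune) bool {
	return a > b
}

func main() {
	fmt.Println('b' > 'a')           // true: 98 > 97
	fmt.Println(isGreater('A', 'a')) // false: 65 < 97
}
```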

Since int32 value of b is greater than a, the expression 'b' > 'a' will be true. Let's see another example.

https://play.golang.org/p/aw8Sv8Vto-c

Since we know that characters are nothing but int32 internally, we can do all sorts of comparisons with them. For example, a for loop between two character-value range.

https://play.golang.org/p/kS4vxuSSmWg
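
As a rough sketch, a loop that counts from one character to another might look like this. The helper lettersBetween is an illustrative assumption.

```go
package main

import "fmt"

// lettersBetween returns every character from 'from' to 'to', inclusive,
// by incrementing the rune value like any other integer.
func lettersBetween(from, to rune) string {
	var letters []rune
	for c := from; c <= to; c++ {
		letters = append(letters, c)
	}
	return string(letters)
}

func main() {
	for c := 'a'; c <= 'e'; c++ {
		fmt.Printf("%c ", c)
	}
	fmt.Println()

	fmt.Println(lettersBetween('a', 'e')) // abcde
}
```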

This was a basic introduction to strings in Go, but there are many utility functions provided by the strings package that can be used to perform all sorts of operations on strings, like join, replace, search, etc. The strings package is a part of Go’s standard library.
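
To give a taste of it, here is a short sketch using a few functions from the strings package. The helper shout and the sample words are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// shout joins the words with spaces, replaces "World" with "Gopher"
// and converts the result to upper case.
func shout(words []string) string {
	joined := strings.Join(words, " ")
	replaced := strings.ReplaceAll(joined, "World", "Gopher")
	return strings.ToUpper(replaced)
}

func main() {
	fmt.Println(strings.Contains("Hello World", "World")) // true
	fmt.Println(shout([]string{"Hello", "World"}))        // HELLO GOPHER
}
```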


RunGo

A go-to guide for learning Go programming language

Uday Hiwarale

Written by

🦸‍♂️ Programmer { GO ‡ TS ‡ Dart ‡ Python } •👨‍🎓 IITI’14 • 👨‍💻 AVIZVA • 👨‍✈️ India┆GitHub ⇝ github.com/thatisuday | Email ⇝ thatisuday@gmail.com

RunGo

A place to find introductory Go programming language tutorials and learning resources. In this publication, we will learn Go in an incremental manner, starting from beginner lessons with mini examples to more advanced lessons.
