Go code is UTF-8 encoded

Michał Łowicki
golangspec
2 min read · Feb 6, 2017

Unicode plays quite nicely with Go. A programmer doesn’t have to mark Unicode strings in any special way:

package main

import "fmt"

func main() {
	fmt.Println("嗨")
	fmt.Println("Cześć")
	fmt.Println("Hello")
}

This code prints what is expected:


嗨
Cześć
Hello
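
Because the source file is UTF-8, an unescaped string literal simply holds the UTF-8 bytes that were typed into it. The following is a minimal sketch of that idea; the helper name describe is made up for illustration and is not part of any library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it reports the byte length and
// the number of runes (Unicode code points) in s.
func describe(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"嗨", "Cześć", "Hello"} {
		b, r := describe(s)
		// "% x" prints the raw bytes of the string in hexadecimal.
		fmt.Printf("%s: % x (%d bytes, %d runes)\n", s, s, b, r)
	}
}
```

For example, 嗨 is a single code point that takes three bytes (e5 97 a8), while every character of Hello takes exactly one.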

The program above uses interpreted string literals (between double quotes). The same encoding properties apply to raw string literals, which are enclosed in back quotes (`code`):

package main

import "fmt"

func main() {
	fmt.Println(`嗨
Cześć\n\n\n
Hello`)
}

Output:


嗨
Cześć\n\n\n
Hello

Raw string literals can contain newlines, backslashes have no special meaning inside them, and carriage return characters (\r) are discarded from the raw string value.
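
The rules above can be checked with a small sketch. A carriage return is hard to show inside a source file, so this example builds the literal text at run time and evaluates it with strconv.Unquote, which accepts back-quoted Go string literals. The helper name unquoteRaw is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// unquoteRaw is a hypothetical helper: it wraps body in back quotes and
// evaluates the result as a raw string literal via strconv.Unquote.
func unquoteRaw(body string) (string, error) {
	return strconv.Unquote("`" + body + "`")
}

func main() {
	interpreted := "Cześć\n" // \n is an escape sequence: a single newline byte
	raw := `Cześć\n`         // \n stays as two characters: a backslash and an n
	fmt.Println(len(interpreted), len(raw))

	// The body contains a real carriage return followed by a real newline.
	s, err := unquoteRaw("foo\r\nbar")
	if err != nil {
		fmt.Println("unquote failed:", err)
		return
	}
	// %q shows the value with escapes, so a surviving \r would be visible.
	fmt.Printf("%q\n", s)
}
```

The raw literal is one byte longer than the interpreted one, and the carriage return no longer appears in the unquoted value.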

What is less known is that the entire content of a .go file is encoded in UTF-8. Taking into account that an identifier is made up of letters and digits (the first character is always a letter), and that a letter is any Unicode code point classified as a letter (plus the underscore), then:

package main

import "fmt"

func 隨機名稱() {
	fmt.Println("It works!")
}

func main() {
	隨機名稱()
	źdźbło := 1
	fmt.Println(źdźbło)
}

is a completely valid program producing:

It works!
1
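
The identifier rule described above can be approximated with the unicode package. This is only a sketch under stated assumptions: isIdentifier is a hypothetical helper, it is not the compiler's own check, and it does not reject keywords such as func.

```go
package main

import (
	"fmt"
	"unicode"
)

// isIdentifier is a hypothetical helper that mirrors the rule from the text:
// the first character must be a letter (or underscore), and the remaining
// characters must be letters, underscores, or decimal digits.
func isIdentifier(name string) bool {
	if name == "" {
		return false
	}
	for i, r := range name {
		isLetter := unicode.IsLetter(r) || r == '_'
		if i == 0 && !isLetter {
			return false // an identifier cannot start with a digit
		}
		if !isLetter && !unicode.IsDigit(r) {
			return false // punctuation, spaces, symbols and so on
		}
	}
	return true
}

func main() {
	for _, name := range []string{"隨機名稱", "źdźbło", "x1", "1x", "some-name"} {
		fmt.Printf("%s: %v\n", name, isIdentifier(name))
	}
}
```

The first three names pass the check. 1x fails because it starts with a digit, and some-name fails because a hyphen is neither a letter nor a digit.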

Besides string literals and identifiers, non-ASCII characters can also show up in comments:

name := "Michał" // Michał == Michael
fmt.Println(name)

The implementation of the Go compiler’s lexer (aka scanner or tokenizer) lives in cmd/compile/internal/syntax/scanner.go.
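
The compiler's scanner is an internal package and cannot be imported, but the standard library ships a separate scanner in go/scanner that can be used to watch tokenization of non-ASCII source. The sketch below relies on that package, not on the compiler's own code, and the helper name identifiers is made up for illustration.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// identifiers is a hypothetical helper that returns the literal text of
// every identifier token found in src.
func identifiers(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src)) // size must match the source length

	var s scanner.Scanner
	// nil error handler; mode 0 means comments are skipped rather than returned.
	s.Init(file, []byte(src), nil, 0)

	var names []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.IDENT {
			names = append(names, lit)
		}
	}
	return names
}

func main() {
	fmt.Println(identifiers("źdźbło := 隨機名稱(1) // Michał"))
}
```

Both non-ASCII names come back as ordinary identifier tokens, and the comment containing Michał is skipped.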

play.golang.org doesn’t interpret \r, so this program:

package main

import "fmt"

func main() {
	fmt.Printf("foo")
	fmt.Printf("\r")
	fmt.Printf("bar\n")
}

gives the following output there:

foo
bar

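Run from GNU Bash in a terminal, it is handled correctly though. The carriage return moves the cursor back to the start of the line, so bar overwrites foo: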

> go run foo.go
bar

Michał Łowicki
golangspec

Software engineer at Datadog, previously at Facebook and Opera, never satisfied.