Go Walkthrough: encoding package

Ben Johnson
Aug 15, 2016 · 8 min read

So far we’ve covered working with byte streams and byte slices, but few applications simply shuttle bytes around. Bytes alone don’t convey much meaning. Once we encode data structures on top of those bytes, however, we can build truly useful applications.

This post is part of a series of walkthroughs to help you understand the Go standard library better. While generated documentation provides a wealth of information, it can be difficult to understand packages in a real world context. This series aims to provide context of how standard library packages are used in everyday applications. If you have questions or comments you can reach me on Twitter.

What is encoding exactly?

In computer science we have fancy words for simple concepts. Not only that but many times there are lots of fancy words for a single concept. Encoding is one of those words. Sometimes it’s referred to as serialization or as marshaling — it means the same thing: adding a logical structure to raw bytes.

In the Go standard library, we use the terms encoding and marshaling for two separate but related ideas. An encoder in Go is an object that applies structure to a stream of bytes, while marshaling refers to applying structure to bounded, in-memory bytes.

For example, the encoding/json package has a json.Encoder and json.Decoder for working with io.Writer and io.Reader streams, respectively. The package also has json.Marshal and json.Unmarshal for writing to and reading from byte slices.
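To make the distinction concrete, here is a minimal sketch that encodes the same value both ways. The User type and the two helper functions are hypothetical names made up for this illustration; only json.NewEncoder and json.Marshal come from the standard library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// User is a hypothetical type used only for this illustration.
type User struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// encodeToStream writes u to a stream with json.Encoder. A bytes.Buffer
// stands in for any io.Writer such as a file or a network connection.
func encodeToStream(u User) (string, error) {
	var buf bytes.Buffer
	if err := json.NewEncoder(&buf).Encode(u); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// marshalToBytes returns the encoded value as a bounded byte slice.
func marshalToBytes(u User) (string, error) {
	b, err := json.Marshal(u)
	return string(b), err
}

func main() {
	u := User{Name: "Ada", Age: 36}

	streamed, err := encodeToStream(u)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	marshaled, err := marshalToBytes(u)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}

	// The encoder terminates each value with a newline so that several
	// values can follow each other on one stream; Marshal does not.
	fmt.Printf("%q\n%q\n", streamed, marshaled)
}
```

The only visible difference in the output is the trailing newline, but the encoder can keep writing values to the same stream while Marshal always hands back one complete slice.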

Two types of encoding

There is also another important distinction in encodings. Some encoding packages operate on primitives such as strings and integers. Strings are encoded with character encodings such as ASCII or UTF-8 or any number of other character sets. Integers can be encoded differently based on endianness or by using variable length encoding. Even bytes themselves are often encoded using schemes like Base64 to convert them into printable characters.

Often when we think of encoding, though, we think of object encoding. This refers to converting complex structures such as structs, maps, and slices into a series of bytes. There are a lot of tradeoffs when doing this conversion, and many different approaches have emerged over the years.

Making trade-offs

Converting logical structures to bytes seems simple enough at first. After all, these structures are already represented in memory as bytes. Why not just use that format?

There are a lot of reasons why Go’s in-memory format isn’t suitable for converting to bytes and saving to disk or sending over the network. First is compatibility. Go’s internal data structure format doesn’t match Java’s internal format so we can’t communicate between these different systems. Sometimes we need compatibility not with another programming language but with humans. JSON, CSV, and XML are all human-readable formats that can be easily viewed and edited.

Making formats human-readable introduces a trade-off though. Formats that are easy for humans to parse are slower for computers to parse. Integers are a good example — people read in base-10 format whereas computers operate in base-2. People also read variable length numbers such as 1 or 1,000 but computers operate on fixed-sized numbers such as 32-bit or 64-bit integers. The performance difference may seem trivial for a single number but it quickly becomes a big deal when parsing millions or billions of numbers.

There are also other trade-offs we don’t think of at first. Our data structures change over time but we still need to operate on bytes that may have been encoded years ago. Some encodings, such as Protocol Buffers, allow you to write a schema for your data and version your fields: older fields can be deprecated while new fields can be added. The downside of this is that you need the schema definition in order to encode and decode objects. Go’s own gob format takes a different approach and actually includes the schema when encoding. However, the downside of this approach is that the encoded size can be much larger.

Some formats throw caution to the wind entirely and go schema-less. Formats such as JSON allow you to encode structures on the fly but provide no guarantees about safely decoding structures from an older format.

We also use systems that do encoding for us but we don’t think of as encoding. Databases, for example, are a roundabout way of taking our logical data structures and eventually persisting them as bytes on disk. It may involve network calls, SQL parsing, and query planning but it’s all essentially encoding.

Finally, if you really need speed above all else, you could use Go’s internal format to save data. I even wrote a library for this. Its encoding and decoding time is literally zero seconds. Should you use it in production? Probably not.

The 4 interfaces of encoding

If you are one of the few people who has ever looked at the encoding package, you may have been underwhelmed. It is one of the smallest packages in the standard library and it only includes 4 interfaces.

The first two interfaces are BinaryMarshaler and BinaryUnmarshaler:

type BinaryMarshaler interface {
	MarshalBinary() (data []byte, err error)
}
type BinaryUnmarshaler interface {
	UnmarshalBinary(data []byte) error
}

These are for objects that provide a way to convert to and from a binary format. They are used in a few spots in the standard library such as time.Time.MarshalBinary(). You don’t find them in more places because there’s not usually a single defined way to marshal an object to binary format. As we’ve seen, there are a multitude of serialization formats.

At the application level, however, you have probably picked a single format for marshaling. For instance, you may have chosen Protocol Buffers for all your data. There is typically no reason to support multiple binary formats for your application data so implementing BinaryMarshaler can make sense.
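As a rough sketch of what that can look like, the hypothetical Point type below implements both binary interfaces. The fixed 8-byte big endian layout is an arbitrary choice for this example, not a recommendation; an application that has settled on Protocol Buffers would call its generated marshaling code here instead.

```go
package main

import (
	"encoding"
	"encoding/binary"
	"errors"
	"fmt"
)

// Point is a hypothetical application type.
type Point struct {
	X, Y int32
}

// Compile-time checks that *Point satisfies both interfaces.
var (
	_ encoding.BinaryMarshaler   = (*Point)(nil)
	_ encoding.BinaryUnmarshaler = (*Point)(nil)
)

// MarshalBinary encodes the point as two big endian 32-bit integers.
func (p *Point) MarshalBinary() ([]byte, error) {
	data := make([]byte, 8)
	binary.BigEndian.PutUint32(data[0:4], uint32(p.X))
	binary.BigEndian.PutUint32(data[4:8], uint32(p.Y))
	return data, nil
}

// UnmarshalBinary decodes the 8-byte layout written by MarshalBinary.
func (p *Point) UnmarshalBinary(data []byte) error {
	if len(data) != 8 {
		return errors.New("point: invalid data length")
	}
	p.X = int32(binary.BigEndian.Uint32(data[0:4]))
	p.Y = int32(binary.BigEndian.Uint32(data[4:8]))
	return nil
}

func main() {
	data, err := (&Point{X: 1, Y: -2}).MarshalBinary()
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}

	var p Point
	if err := p.UnmarshalBinary(data); err != nil {
		fmt.Println("unmarshal error:", err)
		return
	}
	fmt.Printf("% x -> %+v\n", data, p)
}
```

Any code that accepts an encoding.BinaryMarshaler can now persist a Point without knowing which format sits behind the method.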

The next two interfaces are TextMarshaler and TextUnmarshaler:

type TextMarshaler interface {
	MarshalText() (text []byte, err error)
}
type TextUnmarshaler interface {
	UnmarshalText(text []byte) error
}

These two interfaces are similar to the binary marshaling interfaces except that they output in a UTF-8 format.

Some formats have their own marshaling interfaces, such as json.Marshaler, which follow the same naming style.

Overview of encoding packages

There are a lot of useful encoding packages baked into the standard library. We’ll cover these in more detail in future posts but I’d like to give an overview first. Some of these are subpackages of encoding while others are scattered in different locations.

Primitive encodings

The first package you probably used when you started with Go is the fmt package (pronounced “fumpt”). It uses C-style printf() conventions to encode and decode numbers, strings, bytes, and even includes limited support for object encoding. The fmt package is a great, simple way to build human-readable strings from templates but the template parsing can add overhead.

If you need better performance then you can avoid templating by using the string conversion package, strconv. This low-level package provides basic formatting and scanning for strings, integers, floats, and booleans and is generally pretty fast.
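Here is a small sketch of the two styles side by side. Both helper functions are hypothetical and produce the same string; the strconv version skips format-string parsing by appending each value directly to a byte slice. How much that matters depends on your workload, so measure before rewriting anything.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatWithFmt goes through fmt's format-string parsing.
func formatWithFmt(n int, ok bool) string {
	return fmt.Sprintf("%d:%t", n, ok)
}

// formatWithStrconv appends each value directly to a byte slice.
func formatWithStrconv(n int, ok bool) string {
	buf := make([]byte, 0, 16)
	buf = strconv.AppendInt(buf, int64(n), 10)
	buf = append(buf, ':')
	buf = strconv.AppendBool(buf, ok)
	return string(buf)
}

func main() {
	fmt.Println(formatWithFmt(1000, true))
	fmt.Println(formatWithStrconv(1000, true))

	// Scanning goes the other way: text back into a typed value.
	n, err := strconv.Atoi("1000")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(n + 1)
}
```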

These packages, along with Go itself, assume that you’re encoding strings using UTF-8. The near total lack of non-Unicode character encoding support in the standard library could be because the Internet has quickly converged on a standard of UTF-8 over the last several years or it could be because Rob Pike is a coauthor of both Go and UTF-8. Who knows? I’ve been lucky enough to not have to deal with any non-UTF-8 encodings in Go so far. However, there is some encoding support in packages such as unicode/utf16 and in the golang.org/x/text package tree. The “x” package tree contains a wealth of awesome packages that are part of the Go project but are not covered under the Go 1 compatibility promise.

For integer encoding, the encoding/binary package provides big endian and little endian encodings as well as variable length encodings. Endianness refers to the order in which the bytes of a value are written. For example, the uint16 representation of 1,000 (which is 0x03E8 in hex) is composed of 2 bytes: 03 and E8. With big endian encoding, the bytes are written in that order: “03 E8”. In little endian, the order is reversed: “E8 03”. Many common CPU architectures use little endian. However, big endian is typically used when sending bytes over the network, which is why it is also called network byte order.
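The 1,000 example can be reproduced with encoding/binary. The three helper functions below are hypothetical wrappers written for this sketch; the calls inside them (PutUint16 on binary.BigEndian and binary.LittleEndian, and binary.PutUvarint) are from the standard library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bigEndian16 writes the most significant byte first.
func bigEndian16(v uint16) []byte {
	b := make([]byte, 2)
	binary.BigEndian.PutUint16(b, v)
	return b
}

// littleEndian16 writes the least significant byte first.
func littleEndian16(v uint16) []byte {
	b := make([]byte, 2)
	binary.LittleEndian.PutUint16(b, v)
	return b
}

// uvarint uses the variable length encoding, which stores 7 bits of the
// value per byte and so needs fewer bytes for small numbers.
func uvarint(v uint64) []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, v)
	return buf[:n]
}

func main() {
	fmt.Printf("big endian:    % X\n", bigEndian16(1000))
	fmt.Printf("little endian: % X\n", littleEndian16(1000))
	fmt.Printf("uvarint(1):    % X\n", uvarint(1))
	fmt.Printf("uvarint(1000): % X\n", uvarint(1000))
}
```

The first two lines print 03 E8 and E8 03, matching the byte orders described above. The variable length form of 1 fits in a single byte, while a fixed uint64 would always take eight.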

Finally, for byte encoding there are a couple of packages available. Byte encoding is typically used to convert bytes into a printable format. The encoding/hex package, for example, can be used if you need to view binary data in hexadecimal format. I’ve personally only used it for debugging purposes. On the other hand, sometimes you need a printable format because you need to transport data over protocols with historically limited binary support (such as email). The encoding/base32 and encoding/base64 packages are examples of this. Another example is the encoding/pem package which is used for encoding TLS certificates.
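As a quick illustration, the sketch below runs the same bytes through hex and Base64 and then decodes the Base64 form again. The helper names are made up for this example.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// toHex renders each byte as two hexadecimal characters.
func toHex(b []byte) string {
	return hex.EncodeToString(b)
}

// toBase64 renders every 3 bytes as 4 printable characters.
func toBase64(b []byte) string {
	return base64.StdEncoding.EncodeToString(b)
}

// fromBase64 reverses toBase64.
func fromBase64(s string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(s)
}

func main() {
	data := []byte("hello")

	fmt.Println(toHex(data))
	fmt.Println(toBase64(data))

	decoded, err := fromBase64(toBase64(data))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(string(decoded))
}
```

Hex doubles the size of the data, which is fine for debugging. Base64 grows it by roughly a third, which is why it is the usual choice for transport.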

Object encodings

We find fewer packages within the standard library for object encodings. However, in practice, these packages are often all we need.

Unless you’ve been living under a rock for the past decade, you’ve probably noticed that JSON has become the default object encoding of the Internet. As mentioned above, JSON has its flaws but it’s easy to use and it has library support in nearly every language so adoption has skyrocketed. The encoding/json package provides great support for this protocol and there are also third party implementations for generating faster parsers.

While JSON has dominated as a protocol between machines, the CSV format is a more common protocol for exporting data to humans. The encoding/csv package provides a good interface for exporting tabular data in this format.
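A minimal export might look like the sketch below. The writeCSV helper and the sample rows are hypothetical; csv.NewWriter and WriteAll are the standard library calls doing the work.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// writeCSV encodes rows of strings as CSV. A bytes.Buffer stands in for
// a file or an HTTP response.
func writeCSV(records [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	// WriteAll writes every record and flushes the writer.
	if err := w.WriteAll(records); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := writeCSV([][]string{
		{"name", "note"},
		{"Ada", "likes, commas"},
	})
	if err != nil {
		fmt.Println("write error:", err)
		return
	}
	fmt.Print(out)
}
```

The writer quotes the second field of the last row because it contains a comma, which is the kind of detail that is easy to get wrong when building CSV by hand.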

If you’re interacting with a system built circa 2000 then you probably need to use XML. The encoding/xml package provides a SAX-style interface with an additional tag-driven marshaler/unmarshaler that’s similar to the encoding/json package. If you’re looking for more complex features like DOM, XPath, XSD, or XSLT then you should probably use libxml2 via cgo.
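Both styles are sketched below: the tag-driven marshaler on a hypothetical Book struct, and a token loop over xml.Decoder that collects element names as they stream past. The Book type and the helper names are assumptions made for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Book is a hypothetical type; the struct tags drive the marshaler.
type Book struct {
	XMLName xml.Name `xml:"book"`
	ID      string   `xml:"id,attr"`
	Title   string   `xml:"title"`
}

// marshalBook uses the tag-driven interface.
func marshalBook(b Book) (string, error) {
	out, err := xml.Marshal(b)
	return string(out), err
}

// elementNames uses the streaming interface and visits one token at a
// time without building a tree in memory.
func elementNames(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var names []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		if start, ok := tok.(xml.StartElement); ok {
			names = append(names, start.Name.Local)
		}
	}
}

func main() {
	doc, err := marshalBook(Book{ID: "42", Title: "Go"})
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(doc)

	names, err := elementNames(doc)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(names)
}
```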

Go also has its own stream encoding called gob. This package is used by the net/rpc package for implementing a remote procedure call interface between two Go services. Gob is easy to use, however, it does not have any cross language support. A cross-language format such as Protocol Buffers is a popular alternative if you need to communicate between different languages.
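To show how little ceremony gob needs, here is a round trip through an in-memory buffer. The Message type and roundTrip helper are hypothetical; in a real service the buffer would be a network connection shared by two Go processes.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Message is a hypothetical type. No tags or schema files are needed
// because gob describes the type inside the stream itself.
type Message struct {
	ID   int
	Body string
}

// roundTrip encodes a message, decodes it again, and reports how many
// bytes the encoded form took.
func roundTrip(in Message) (Message, int, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return Message{}, 0, err
	}
	size := buf.Len()

	var out Message
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		return Message{}, 0, err
	}
	return out, size, nil
}

func main() {
	out, size, err := roundTrip(Message{ID: 7, Body: "hello"})
	if err != nil {
		fmt.Println("gob error:", err)
		return
	}
	fmt.Printf("%+v (%d bytes encoded)\n", out, size)
}
```

The byte count is noticeably larger than the payload because the first value on a stream carries the type description along with it.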

Finally, there’s a package called encoding/asn1. There’s limited information in the documentation and the only link in the package points to a layman’s guide which is a 25 page wall of text. ASN.1 is a complex object encoding scheme that is most notably used by X.509 certificates in SSL/TLS.

Conclusion

Encoding provides the fundamental basis for layering information on top of our bytes. Without it we wouldn’t have strings or data structures or databases or any useful applications. What seems like a relatively simple concept has a rich history of implementations and a wide variety of tradeoffs.

In this post we looked at an overview of the different encoding implementations within the standard library and some of their tradeoffs. We saw how these primitive and object encoding packages built on our knowledge of byte streams and slices. In the next several posts we’ll take a deeper dive into these packages to see how to use them in a real world context.

Love the post? Hate it? Drop me a line on Twitter.
