In-depth introduction to bufio.Scanner in Golang

Michał Łowicki
golangspec

Go ships with a package for buffered I/O, a technique that optimizes read and write operations. For writes, data is stored temporarily before it is passed on (to a disk or a socket, for example) until a certain size is reached. This way fewer write operations are triggered; each one boils down to a syscall, which can be expensive when done frequently. For reads, it means retrieving more data in a single operation. This also reduces the number of syscalls and can use the underlying hardware more efficiently, for example by reading data in disk blocks.

This post focuses on the Scanner provided by the bufio package. It helps to process a stream of data by splitting it into tokens and removing the space between them:

"foo  bar   baz"

If we’re interested only in words, then the scanner helps to retrieve “foo”, “bar” and “baz” in sequence (source code):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "foo bar baz"
	scanner := bufio.NewScanner(strings.NewReader(input))
	// Split the input on white space: one word per token.
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output:

foo
bar
baz

Scanner uses buffered I/O while reading the stream: it takes an io.Reader as an argument.

If you’re dealing with data in memory, like a string or a slice of bytes, then first check utilities like bytes.Split or strings.Split. It’s probably simpler to rely on those or other goodies from the bytes or strings packages when you are not working with a data stream.
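As a small illustration of that advice, the word example from the beginning of this post needs no Scanner at all when the input is already a string. This is only a sketch: it uses strings.Fields (a close relative of strings.Split that splits on runs of white space), and the words helper is a hypothetical name introduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// words splits an in-memory string on runs of white space.
// No Scanner is needed because the whole input is already available.
func words(input string) []string {
	return strings.Fields(input)
}

func main() {
	for _, w := range words("foo  bar   baz") {
		fmt.Println(w)
	}
}
```

The Scanner becomes the better tool once the data arrives through an io.Reader and should not be loaded into memory as a whole.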

Under the hood, the scanner uses a buffer to accumulate the data it has read. When the buffer is not empty or EOF has been reached, the split function (SplitFunc) is called. So far we’ve seen one of the pre-defined split functions, but it’s possible to set any function with the signature:

func(data []byte, atEOF bool) (advance int, token []byte, err error)

The split function is called with the data read so far and can behave in three different ways, distinguished by the returned values.

1. Give me more data!

It says that the passed data is not enough to get a token. It’s done by returning 0, nil, nil. When that happens, the scanner tries to read more data. If the buffer is full, the scanner doubles it before reading. Let’s see how it works (source code):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
		// Never return a token: always ask the scanner for more data.
		return 0, nil, nil
	}
	scanner.Split(split)
	// Start with a tiny buffer so that its growth is visible in the output.
	buf := make([]byte, 2)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
}

Output:

false	2	ab
false	4	abcd
false	8	abcdefgh
false	12	abcdefghijkl
true	12	abcdefghijkl

The split function above is very simple and greedy: it always requests more data. The scanner tries to read more, but it also makes sure that the buffer has enough space. In our case we start with a buffer of size 2:

buf := make([]byte, 2)
scanner.Buffer(buf, bufio.MaxScanTokenSize)

After the split function is called for the very first time, the scanner doubles the size of the buffer, reads more data and calls the split function a second time. After the second call the scenario is exactly the same. It’s visible in the output: the first call of split gets a slice of size 2, then 4, 8 and finally 12, since there is no more data. The last call passes the same 12 bytes once more, this time with atEOF set to true.

The default initial size of the buffer is 4096 bytes.

It’s worth discussing the atEOF parameter here. It is designed to tell the split function that no more data will be available. This can happen either when EOF is reached or when a read call returns an error. If either happens, the scanner never tries to read again. The flag can be used, for example, to return an error (because of an incomplete token), which causes scanner.Scan() to return false and stops the whole process. The error can be checked later using the Err method (source code):

package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
		if atEOF {
			// No more data will arrive, so give up with an error.
			return 0, nil, errors.New("bad luck")
		}
		return 0, nil, nil
	}
	scanner.Split(split)
	buf := make([]byte, 12)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("error: %s\n", scanner.Err())
	}
}

Output:

false	12	abcdefghijkl
true	12	abcdefghijkl
error: bad luck
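A slightly more practical variant of the same idea is a split function for fixed-width records that reports an error when the input ends in the middle of a record. This is only a sketch: fixedWidth, records and errIncomplete are hypothetical names introduced for illustration, and n is assumed to be greater than zero.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// errIncomplete is a hypothetical sentinel error for this sketch.
var errIncomplete = errors.New("incomplete record")

// fixedWidth returns a split function producing tokens of exactly n bytes.
func fixedWidth(n int) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if len(data) >= n {
			return n, data[:n], nil
		}
		if atEOF && len(data) > 0 {
			// No more input will arrive, so the record can never be completed.
			return 0, nil, errIncomplete
		}
		// Either ask for more data, or finish cleanly at EOF with an empty buffer.
		return 0, nil, nil
	}
}

// records scans input into n-byte records and reports the scanner's error.
func records(input string, n int) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(fixedWidth(n))
	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out, scanner.Err()
}

func main() {
	fmt.Println(records("abcdefgh", 4))   // two complete records, no error
	fmt.Println(records("abcdefghij", 4)) // two records, then "ij" is left over
}
```

The records scanned before the failure are still delivered; only the trailing fragment turns into an error.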

The atEOF parameter can also be used to process what is left inside the buffer. One of the pre-defined split functions, which scans the input line by line, behaves exactly this way. For input like:

foo
bar
baz

there is no \n at the end of the last line, so when ScanLines cannot find a newline character it simply returns the remaining characters as the last token (source code):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "foo\nbar\nbaz"
	scanner := bufio.NewScanner(strings.NewReader(input))
	// Not actually needed since it's the default split function.
	scanner.Split(bufio.ScanLines)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output:

foo
bar
baz
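The other side of this behaviour is that a trailing newline does not produce an extra empty token: once the buffer is empty at EOF, ScanLines returns no token at all. The sketch below compares both inputs; the lines helper is a hypothetical name.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lines collects the tokens produced by the default split function, ScanLines.
func lines(input string) []string {
	scanner := bufio.NewScanner(strings.NewReader(input))
	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out
}

func main() {
	fmt.Printf("%q\n", lines("foo\nbar\nbaz"))   // no trailing newline
	fmt.Printf("%q\n", lines("foo\nbar\nbaz\n")) // trailing newline
}
```

Both calls yield the same three tokens.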

2. Token found

This happens when the split function is able to detect a token. It returns the number of bytes to move forward in the buffer and the token itself. Two values are returned because the length of the token doesn’t always equal the number of bytes to move forward. If the input is “foo foo foo” and the goal is to detect words (ScanWords), then the split function also skips over the spaces in between:

(4, "foo")
(4, "foo")
(3, "foo")
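These pairs can be reproduced by wrapping bufio.ScanWords in a split function that records what it returns. This is a minimal sketch; the step type and the traceWords helper are hypothetical names introduced for illustration, and calls that return no token (requests for more data) are not recorded.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// step is a hypothetical record of one split call that returned a token.
type step struct {
	advance int
	token   string
}

// traceWords scans input word by word and records how far each token
// moved the scanner forward.
func traceWords(input string) []step {
	var steps []step
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(func(data []byte, atEOF bool) (int, []byte, error) {
		advance, token, err := bufio.ScanWords(data, atEOF)
		if err == nil && token != nil {
			steps = append(steps, step{advance, string(token)})
		}
		return advance, token, err
	})
	for scanner.Scan() {
		// Tokens are recorded inside the split function.
	}
	return steps
}

func main() {
	for _, s := range traceWords("foo foo foo") {
		fmt.Printf("(%d, %q)\n", s.advance, s.token)
	}
}
```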

Let’s see it in action. This function looks only for contiguous occurrences of the string foo (source code):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

func main() {
	input := "foofoofoo"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// HasPrefix is safe even when fewer than 3 bytes are available.
		if bytes.HasPrefix(data, []byte("foo")) {
			// Consume 3 bytes but hand out a different, shorter token.
			return 3, []byte{'F'}, nil
		}
		if atEOF {
			return 0, nil, io.EOF
		}
		return 0, nil, nil
	}
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
}

Output:

F
F
F

3. Error

If the split function returns an error, then the scanner stops (source code):

package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// Fail on the very first call.
		return 0, nil, errors.New("bad luck")
	}
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("error: %s\n", scanner.Err())
	}
}

Output:

error: bad luck
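An error does not have to come on the first call. Tokens returned before it are still delivered, which makes a split function a convenient place for validation. The sketch below wraps bufio.ScanWords and rejects words that are not made of ASCII digits; digitsOnly and numbers are hypothetical names introduced for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// digitsOnly wraps bufio.ScanWords and rejects any word that contains
// a byte other than an ASCII digit.
func digitsOnly(data []byte, atEOF bool) (advance int, token []byte, err error) {
	advance, token, err = bufio.ScanWords(data, atEOF)
	if err != nil || token == nil {
		return advance, token, err
	}
	for _, b := range token {
		if b < '0' || b > '9' {
			return 0, nil, fmt.Errorf("not a number: %q", token)
		}
	}
	return advance, token, nil
}

// numbers returns the tokens seen before the first error, plus that error.
func numbers(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(digitsOnly)
	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out, scanner.Err()
}

func main() {
	fmt.Println(numbers("1 22 x3 4"))
}
```

Here the scan stops at "x3": the tokens "1" and "22" are returned together with the error, and "4" is never reached.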

There is one special error which doesn’t stop the scanner immediately.

ErrFinalToken

The scanner offers an option to signal a so-called final token. It’s a special token which doesn’t break the loop (Scan still returns true), but subsequent calls to Scan stop immediately (source code):

func (s *Scanner) Scan() bool {
	if s.done {
		return false
	}
	...

It was proposed in #11836 and can be used to stop scanning when a special token is found (source code):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func split(data []byte, atEOF bool) (advance int, token []byte, err error) {
	advance, token, err = bufio.ScanWords(data, atEOF)
	if err == nil && token != nil && bytes.Equal(token, []byte{'e', 'n', 'd'}) {
		// Deliver one last token and tell the scanner to stop afterwards.
		return 0, []byte{'E', 'N', 'D'}, bufio.ErrFinalToken
	}
	return
}

func main() {
	input := "foo end bar"
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("Error: %s\n", scanner.Err())
	}
}

Output:

foo
END

Neither io.EOF nor ErrFinalToken is considered a “true” error: the Err method returns nil if either of them caused the scanner to stop.
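The io.EOF case can be checked with a small sketch: a split function that wraps bufio.ScanWords and returns io.EOF when it meets the word “end”, so the scan stops without delivering that word and without reporting an error. untilEnd and scanUntilEnd are hypothetical names introduced for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// untilEnd wraps bufio.ScanWords and stops the scan at the word "end"
// without delivering it, by returning io.EOF.
func untilEnd(data []byte, atEOF bool) (advance int, token []byte, err error) {
	advance, token, err = bufio.ScanWords(data, atEOF)
	if err == nil && bytes.Equal(token, []byte("end")) {
		return 0, nil, io.EOF
	}
	return advance, token, err
}

// scanUntilEnd collects the delivered tokens and the value reported by Err.
func scanUntilEnd(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(untilEnd)
	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out, scanner.Err()
}

func main() {
	fmt.Println(scanUntilEnd("foo end bar"))
}
```

For the input “foo end bar” only “foo” is delivered and Err reports nil.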

Maximum token size / ErrTooLong

By default, the maximum length of the underlying buffer is 64 * 1024 bytes. A token has to fit into that buffer, so it cannot be longer than this limit (source code):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// A single line that fills the whole buffer, with no newline to split on.
	input := strings.Repeat("x", bufio.MaxScanTokenSize)
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Println(scanner.Err())
	}
}

The program prints bufio.Scanner: token too long. The limit can be set with the Buffer method, which also allows passing a custom buffer. We’ve already seen that in the “Give me more data!” section, but let’s look at a smaller example (source code):

buf := make([]byte, 10)
input := strings.Repeat("x", 20)
scanner := bufio.NewScanner(strings.NewReader(input))
scanner.Buffer(buf, 20)
for scanner.Scan() {
	fmt.Println(scanner.Text())
}
if scanner.Err() != nil {
	fmt.Println(scanner.Err())
}

Output:

bufio.Scanner: token too long

Protecting against endless loop

A couple of years ago, #8672 was reported. The patch for it added one more scenario in which the split function can be called: atEOF is true and the buffer is empty. Existing code could then fall into an infinite loop:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func main() {
	input := "foo|bar"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if i := bytes.IndexByte(data, '|'); i >= 0 {
			return i + 1, data[0:i], nil
		}
		if atEOF {
			// Problem: data may be empty here, which yields an empty
			// token without advancing.
			return len(data), data[:len(data)], nil
		}
		return 0, nil, nil
	}
	scanner.Split(split)
	for scanner.Scan() {
		if scanner.Text() != "" {
			fmt.Println(scanner.Text())
		}
	}
}

The split function assumes that when atEOF is true it can safely use the rest of the buffer as the token. The problem is that after the fix for #8672 the buffer can be empty, so the split function doesn’t advance the buffer and returns (0, [], nil). The change made for #9020 detects such a case and panics (source code):

foo
bar
panic: bufio.Scan: 100 empty tokens without progressing
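One way to avoid the panic, sketched here under the assumption that an empty buffer at EOF simply means “nothing left”, is to check for that case first and return no token, the same guard that bufio.ScanLines uses. splitOnPipe and fields are hypothetical names introduced for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// splitOnPipe cuts the input at every '|' and returns what is left at EOF.
func splitOnPipe(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		// Nothing left: return no token so that Scan can finish.
		return 0, nil, nil
	}
	if i := bytes.IndexByte(data, '|'); i >= 0 {
		return i + 1, data[:i], nil
	}
	if atEOF {
		// Non-empty remainder: deliver it as the last token.
		return len(data), data, nil
	}
	return 0, nil, nil
}

// fields collects all tokens produced by splitOnPipe.
func fields(input string) []string {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(splitOnPipe)
	var out []string
	for scanner.Scan() {
		out = append(out, scanner.Text())
	}
	return out
}

func main() {
	fmt.Printf("%q\n", fields("foo|bar"))
}
```

With the guard in place the scan ends after “foo” and “bar”, and the check for empty text in the loop is no longer needed for this input.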

When I first read the documentation of Scanner and SplitFunc, not everything was clear about how it works in all the cases I had in mind. Jumping into the source code didn’t help either, since Scan looks rather complicated at first glance. Hopefully this post will make things clearer for others.

Michał Łowicki
golangspec

Software engineer at Datadog, previously at Facebook and Opera, never satisfied.