Understanding Go programs with go/parser
This blog post describes the same techniques used during episode 25 of justforfunc which you can watch right below.
Previously in justforfunc
In a previous post, we used the go/scanner
package in Go’s standard library to identify which was the most common identifier in the standard library itself.
Spoiler alert: it was v
.
In order to get a somehow more meaningful list, we limited the search to those identifiers that were three letters long or more. This gave us err
and nil
, which we’ve all seen before in the (in)famous if err != nil {
expression.
Variables: package vs local
What if I wanted to know which is the most common name for local variables? What about functions or types? In order to answer this question scanners fall short because they lack context. We know what tokens we saw before, but in order to know whether a var a = 3
is declared at package, function, or even block levels we need more context.
A package has many declarations, some of those declarations may be of functions which in time could declare local variables, constants, or even more functions!
But how do we find that structure from a sequence of tokens? Well, every single programming language has a set of rules that inform us on how to build such a tree from a sequence of tokens. This looks something like:
VarDecl = "var" ( VarSpec | "(" { VarSpec ";" } ")" ) .
VarSpec = IdentifierList ( Type [ "=" ExpressionList ] | "="
ExpressionList ) .
The rules above tell us that a VarDecl
(variable declaration) starts with a var
token, followed by either a VarSpec
(variable specification) or a list of them surrounded by parentheses and separated by semicolons.
Note: Those semicolons are actually added by the Go scanner, so you might not see them but the parser does.
If we start with a piece of Go code containing var a = 3
, using go/scanner
we could obtain the following list of tokens.
[VAR], [IDENT "a"], [ASSIGN], [INT "3"], [SEMICOLON]
The rules above help us figure out this is a VarDecl
with only one VarSpec
. Then we’ll parse an IdentifierList
with a single Identifier
a
, no Type
, and an ExpressionList
with an Expression
which is an integer with value 3
.
Or, as a tree, it would look like the image below.
The rules that allow us to go from a sequence of tokens to a tree structure form a language grammar, or syntax. The resulting tree is called an Abstract Syntax Tree, often simply AST.
Using go/scanner
Enough theory for now, let’s write some code! Let’s see how we can parse the expression var a = 3
and obtain an AST from it.
This program compiles (yay!) but if you run you’ll see the following error:
1:1: expected 'package', found 'var' (and 1 more errors)
Oh, yeah. In order to parse a declaration we are calling ParseFile
, which expects a full Go file therefore starting with package
before any other code (expect for comments).
If you were parsing an expression, such as 3 + 5
or other pieces of code which you could see as a value you could pass as a parameter, then you have ParseExpr
, but that function will not help with a function declaration.
Let’s simply add package main
at the beginning of our code and see the AST we obtain.
And when we run it we get something a bit better … just a bit, though.
$ go run main.go
&{<nil> 1 main [0xc420054100] scope 0xc42000e210 {
var a
}
[] [] []}
Let’s print more detail by replacing the Println
call by fmt.Printf("%#v", f)
and try again.
go run main.go
&ast.File{Doc:(*ast.CommentGroup)(nil), Package:1, Name:(*ast.Ident)(0xc42000a060), Decls:[]ast.Decl{(*ast.GenDecl)(0xc420054100)}, Scope:(*ast.Scope)(0xc42000e210), Imports:[]*ast.ImportSpec(nil), Unresolved:[]*ast.Ident(nil), Comments:[]*ast.CommentGroup(nil)}
Ok, that’s better but I’m too lazy to read this. Let’s import github.com/davecgh/go-spew/spew
and use it to print an easier to read value.
Running this program shows us pretty much the same as before, but in a much more readable format.
$ go run main.go
(*ast.File)(0xc42009c000)({
Doc: (*ast.CommentGroup)(<nil>),
Package: (token.Pos) 1,
Name: (*ast.Ident)(0xc42000a120)(main),
Decls: ([]ast.Decl) (len=1 cap=1) {
(*ast.GenDecl)(0xc420054100)({
Doc: (*ast.CommentGroup)(<nil>),
TokPos: (token.Pos) 15,
Tok: (token.Token) var,
Lparen: (token.Pos) 0,
Specs: ([]ast.Spec) (len=1 cap=1) {
(*ast.ValueSpec)(0xc4200802d0)({
Doc: (*ast.CommentGroup)(<nil>),
Names: ([]*ast.Ident) (len=1 cap=1) {
(*ast.Ident)(0xc42000a140)(a)
},
Type: (ast.Expr) <nil>,
Values: ([]ast.Expr) (len=1 cap=1) {
(*ast.BasicLit)(0xc42000a160)({
ValuePos: (token.Pos) 23,
Kind: (token.Token) INT,
Value: (string) (len=1) "3"
})
},
Comment: (*ast.CommentGroup)(<nil>)
})
},
Rparen: (token.Pos) 0
})
},
Scope: (*ast.Scope)(0xc42000e2b0)(scope 0xc42000e2b0 {
var a
}
),
Imports: ([]*ast.ImportSpec) <nil>,
Unresolved: ([]*ast.Ident) <nil>,
Comments: ([]*ast.CommentGroup) <nil>
})
I recommend spending some time actually reading this tree and seeing how each piece matches with the original code. Feel free to ignore Scope
, Obj
, and Unresolved
as we’ll talk about those later on.
Going from AST to code
It is sometime useful to print an AST into the corresponding source code rather than its tree form. To do so, we have the go/printer
package which you can easily use as a way to see what information is stored in the AST.
Executing this program prints the source code we parsed originally. It is now a good point to see how different values of the parsing mode affect what information is kept in the AST. Replace parser.AllErrors
with parser.ImportsOnly
or other possible values.
Navigating the AST
So far we have a tree that contains all of the information we want, but how do we extract the pieces of information we care about? This is where the go/ast
package comes in handy (other than for declaring the ast.File
that parser.ParseFile
returns!).
Let’s use ast.Walk
. This function receives two arguments. The second one is an ast.Node
which is an interface satified by all the nodes in the AST. The first argument is an ast.Visitor
which is also obviously an interface.
This interface has a single method.
Ok, so we already have a node, since the ast.File
returned by parser.ParseFile
satisfies the interface, but we still need to create an ast.Visitor
.
Let’s simply write one that prints the type of the node and returns itself.
Running this program gives us a sequence of nodes, but we’ve lost the tree structure. Also, what are all of those nil
nodes? Well, we should read the documentation of ast.Walk
! Turns out the Visitor
we return is used to visit each one of the children of the current node, and we end up by calling the Visitor
with a nil
node as a way to communicate there’s no more nodes to visited.
Using that knowledge we can now print something that looks more like a tree.
The rest of the code in our program remains unchanged, and executing it prints this output:
*ast.File
*ast.Ident
*ast.GenDecl
*ast.ValueSpec
*ast.Ident
*ast.BasicLit
What’s the most common name per kind of identifier?
Ok, so now that we understand how to parse code and visit its nodes, we are ready to extract the information we want: what are the most most common names for variables declared at package level vs those declared locally inside of a function.
We will start with a piece of code very similar to what we had in the previous episode, where we used go/scanner
over a list of files passed as command line arguments.
Running this program will print the AST of each one of the files given as command line argument. You can try it on its own by running:
$ go build -o parser main.go && parser main.go
# output removed for brevity
Let’s now change our visitor
to keep track of how many times each identifier is used for each kind of variable declaration.
First let’s start by tracking all short variable declarations, since we know they are always local declarations.
For each assignment statement we’re checking whether the identifier name is _
, which should be ignored, and whether this is the declaration point of the identifier. In order to do so we use the Obj
field which keeps track of all the objects declared in a context.
If the Obj
field is nil
we know that the variable was declared in a different file, therefore it’s not a local variable declaration and we can ignore it.
If we run this program on the whole standard library we’ll see that the most common identifiers are:
7761 err
6310 x
5446 got
4702 i
3821 c
Interestingly enough, v
doesn’t appear at all! Are we missing any other ways of declaring local variables?
Counting parameters and range variables too
We’re missing a couple node types to have a more completely analysis of local variables. These are:
- function parameters, receivers, and named returned values.
- range statements.
Since the code to count a local variable identifier will be repeated all over let’s define a method on visitor
instead.
For parameters and returned values we will have a list of identifiers. We also have the same for method receivers, even though they always have only one element. Let’s define an extra method for lists of identifiers.
Then we can use that method for all the node types that might declare local variables.
Great, let’s run this and see which one is the most common local variable name for now!
$ ./parser ~/go/src/**/*.go
most common local variable names
12264 err
9395 t
9163 x
7442 i
6127 c
Handling var declarations
Let’s move into handling the var
declarations. These are more interesting because they could be local or global, and the only way to know is to check whether they’re at the ast.File
level.
To do so we’re going to create a new visitor
per file which will keep track of the declarations that are global, so we can count the identifiers correctly.
To do so we’ll add a pkgDecls
field of type map[*ast.GenDecl]bool
in our visitor, and it will be initialized by a newVisitor
function. We’ll also add a globals
field tracking how many times an identifier has been declared.
Our main program will need to create a new visitor
per file and keep track of the total results.
Ok, we just have one more piece of the puzzle to complete. We need to track the *ast.GenDecl
nodes and find all the variable declarations in them.
For each declaration we will only count those that start with a token.VAR
, therefore ignoring constants, types, and other kind of identifiers. Then, for each value declared, we’ll check whether it’s a global or local declaration, and count them accordingly and ignoring _
.
The whole program is available here.
Executing the program will give us this result:
$ ./parser ~/go/src/**/*.go
most common local variable names
12565 err
9876 x
9464 t
7554 i
6226 b
most common global variable names
29 errors
28 signals
23 failed
15 tests
12 debug
So there you go, the most common local variable name is err
, while the most common package variable name is errors
.
Which one is the most common constant name? How would you find it?
Thanks
If you enjoyed this episode make sure you share it and subscribe to justforfunc and follow me on Medium or twitter! Also, consider sponsoring the series on patreon.