Improving Go Service Reliability with Static Analysis

Jacob Walker
Published in Udacity Eng & Data
Jan 13, 2022 · 5 min read

Udacity runs on dozens of microservices. This ecosystem has grown over several years at the hands of many engineers using multiple programming languages. As with any organization of this size, there can be major differences from one service to another, especially if they’re owned by different engineering teams. Those differences are a boon when they help teams deliver on company and organizational goals, but at a certain scale such a diverse environment becomes difficult to audit.

We recently built and deployed a system with the goal of auditing our services against a specification of key reliability rules. The specification was kept small to prioritize quick and easy adoption by services. That specification is essentially these three points:

1. Services MUST enforce graceful shutdown.
2. Services MUST enforce timeouts on incoming requests.
3. Services MUST enforce timeouts on outgoing requests.

Introducing ucheck

At Udacity, we have a habit of naming things by prefixing a u to an otherwise common name. So after considering several alternatives, we decided to call our code-checking system ucheck. The system as a whole has five parts:

- ucheck-go: a static analysis tool to measure compliance in Go services
- ucheck-node: the same kind of tool but for Node.js services
- ucheck-report: a CLI tool for reporting analytics from the above tools
- ucheck-api: the API that collects analysis reports
- dashboards: visualizations of the current state of compliance

To put it another way, all of our services’ CI configs gained a task that effectively does the following:

1. Run ucheck-go or ucheck-node against the code
2. Pipe the output of the analysis into ucheck-report
3. ucheck-report makes API calls to ucheck-api
4. ucheck-api stores results in its database
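
In shell terms, each service’s CI task amounts to something like this (a sketch; the exact paths and flags vary from service to service):

$ ucheck-go . | ucheck-report

ucheck-report takes care of the API calls from there, so the task itself stays a one-liner.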

Dashboards now have fresh data for visualizations:

[Image: charts showing counts of Go and Node services in compliance. Sample of dashboard results.]

The details of ucheck-report and ucheck-api are not especially interesting, but the code in ucheck-go is worth discussing. ucheck-node is also interesting, but for now I will just say that it is a wrapper around eslint with custom rules.

The remainder of this article is a tutorial on writing a Go static analysis tool.

Problem definition

Our approach to specification rules 1 and 2 (“services MUST enforce graceful shutdown” and “services MUST enforce timeouts on incoming requests”) was to implement that behavior correctly once in ugo, our internal package for common Go utilities, and require services to use that package instead of launching HTTP servers directly. The ucheck graceful analyzer was written to detect code that was still launching servers manually.

To do this, first we must look at the Abstract Syntax Tree (AST) generated from Go source code. For example, consider this source code:
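
Something along these lines, where a service launches its HTTP server directly (the program is illustrative, but note the call on line 13):

 1  package main
 2
 3  import "net/http"
 4
 5  func handler(w http.ResponseWriter, r *http.Request) {
 6      w.Write([]byte("hello"))
 7  }
 8
 9  func main() {
10      http.HandleFunc("/", handler)
11
12      // Launch the server directly; nothing enforces graceful shutdown.
13      http.ListenAndServe(":8080", nil)
14  }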

We want our analyzer to find the code on line 13. Viewing this source code on the AST Explorer website shows us that line is an Expression Statement, specifically a Call Expression (CallExpr).

[Image: a section of the AST for our example code.]

In our analyzer we want to look through the AST of our target code for any Call Expressions that target http.ListenAndServe (or any of its variations).

Writing an analyzer

Our static analysis tool is built on the golang.org/x/tools/go/analysis framework, which is the same code that powers the go vet command. To start, we define our analyzer like this:
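
In sketch form (the exact Name and Doc strings here are illustrative):

package graceful

import (
    "golang.org/x/tools/go/analysis"
    "golang.org/x/tools/go/analysis/passes/inspect"
)

// Analyzer reports code that launches an HTTP server directly
// instead of going through ugo's graceful helpers.
var Analyzer = &analysis.Analyzer{
    Name:     "graceful",
    Doc:      "reports servers launched without graceful shutdown",
    Requires: []*analysis.Analyzer{inspect.Analyzer},
    Run:      run,
}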

The package variable Analyzer is the primary export of this package. The Name and Doc fields are strings that describe the analyzer. The Requires field is an optional list of other analyzers that will be run before this one. In my experience, most analyzers will depend on inspect.Analyzer from the framework’s passes/inspect package, because it has already done the work of building an AST inspector for the files in the analysis.

Finally, the Run field takes the function that does the actual work. For this analyzer, run is defined as:
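
In sketch form, numbered so the walkthrough below lines up (this assumes the package also imports go/ast and golang.org/x/tools/go/ast/inspector):

 1  func run(pass *analysis.Pass) (interface{}, error) {
 2      // The inspect analyzer we listed in Requires has already run
 3      // and built an AST inspector for the files in this pass. Its
 4      // result is stored in pass.ResultOf, keyed by the analyzer.
 5      //
 6      // ResultOf values are untyped, so we assert the result back
 7      // to the concrete inspector type before using it.
 8
 9      ins := pass.ResultOf[inspect.Analyzer].(*inspector.Inspector)
10
11      // Visit only the node types we care about instead of walking
12      // the entire AST ourselves.
13      ins.Preorder(
14          []ast.Node{(*ast.CallExpr)(nil)},
15          func(node ast.Node) { inspectNode(pass, node) },
16      )
17
18      return nil, nil
19  }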

The pass argument is the primary way you interact with the analysis framework. Remember the Requires field of the analyzer, where we depended on another analyzer having run first? On line 9 of this snippet we get the result of that run and type assert it to an AST inspector. This type allows us to execute code against just the CallExpr nodes of the AST without walking the entire tree ourselves, which is what we do on lines 13–16. On line 14, we specify the types of nodes we care about (just CallExpr), and on line 15 we provide an anonymous function to execute for each one. For readability, I have extracted the logic into a separate function, inspectNode, which is given the node in question as well as the pass.

The CallExpr contains enough information for us to identify the function being called, but it is a little tedious to inspect manually, so we use the typeutil.Callee function to get an object that identifies the function. With that in hand, we just have to compare the function’s full name against a list of functions and methods that launch an HTTP server. When we find an offending call, we use the pass.Reportf method to report a diagnostic at the node’s position.
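
A sketch of inspectNode along those lines (the watched list is a representative subset, and this assumes imports of go/types and golang.org/x/tools/go/types/typeutil):

// watched holds the fully qualified names of functions and methods
// that launch an HTTP server directly.
var watched = map[string]bool{
    "net/http.ListenAndServe":              true,
    "net/http.ListenAndServeTLS":           true,
    "(*net/http.Server).ListenAndServe":    true,
    "(*net/http.Server).ListenAndServeTLS": true,
    "(*net/http.Server).Serve":             true,
}

func inspectNode(pass *analysis.Pass, node ast.Node) {
    call, ok := node.(*ast.CallExpr)
    if !ok {
        return
    }

    // typeutil.Callee resolves the object a call expression targets,
    // saving us from unpacking selectors and identifiers by hand.
    fn, ok := typeutil.Callee(pass.TypesInfo, call).(*types.Func)
    if !ok {
        return
    }

    if watched[fn.FullName()] {
        pass.Reportf(node.Pos(), "replace %s with ugo/http/graceful.Serve", fn.FullName())
    }
}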

Testing our analyzer

One of my favorite features of the analysis framework is its built-in support for writing tests. First, create this simple test file:
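
Assuming the analyzer lives in a package named graceful (the module path below is illustrative):

package graceful_test

import (
    "testing"

    "golang.org/x/tools/go/analysis/analysistest"

    "github.com/udacity/ucheck-go/graceful" // illustrative import path
)

func TestGraceful(t *testing.T) {
    // analysistest compiles the packages under testdata/src and checks
    // the analyzer's diagnostics against the want comments in them.
    analysistest.Run(t, analysistest.TestData(), graceful.Analyzer, "basic")
}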

Then create the folder structure testdata/src/basic/ and in that folder create basic.go. In that file, write Go code that should trigger your analyzer, and add comments containing regular expressions that match the diagnostics you expect, like so:
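
For this analyzer, basic.go might look like the following (numbered to match the discussion below):

 1  package basic
 2
 3  import "net/http"
 4
 5  // Each server launch below should be flagged; the want comments
 6  // hold regular expressions the reported diagnostic must match.
 7
 8  func plain() {
 9      http.ListenAndServe(":8080", nil) // want `replace net/http\.ListenAndServe`
10  }
11
12  func plainTLS() {
13      certFile, keyFile := "cert.pem", "key.pem"
14      http.ListenAndServeTLS(":8443", certFile, keyFile, nil) // want `replace net/http\.ListenAndServeTLS`
15  }
16
17  func onServer() {
18      srv := &http.Server{Addr: ":8080"}
19
20      srv.ListenAndServe() // want `replace \(\*net/http\.Server\)\.ListenAndServe`
21  }
22
23  func onServerTLS() {
24      srv := &http.Server{Addr: ":8443"}
25
26      srv.ListenAndServeTLS("cert.pem", "key.pem") // want `replace \(\*net/http\.Server\)\.ListenAndServeTLS`
27  }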

The comments on lines 9, 14, 20, and 26 set our expectations. If our analyzer does not report diagnostics on those lines (or reports one we did not expect), the test fails.

Running the analyzer

The analyzer is working in tests, but we have just a little more work to do before we can run it against production code. The analysis framework makes this easy too with the multichecker package. To get a full-featured CLI tool, we just need to pass our analyzer (plus two other analyzers we wrote) into the multichecker.Main function:
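
Something like this, where the two timeout analyzers get placeholder names and the import paths are illustrative:

package main

import (
    "golang.org/x/tools/go/analysis/multichecker"

    "github.com/udacity/ucheck-go/graceful"
    "github.com/udacity/ucheck-go/incoming"
    "github.com/udacity/ucheck-go/outgoing"
)

func main() {
    // multichecker.Main wraps the analyzers in a full CLI: flag
    // parsing, package loading, and diagnostic reporting.
    multichecker.Main(
        graceful.Analyzer,
        incoming.Analyzer,
        outgoing.Analyzer,
    )
}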

Now we can compile that into ucheck-go and run it against code like this:

$ ucheck-go .
main.go:13:2: replace net/http.ListenAndServe with ugo/http/graceful.Serve

With the analysis framework and a little glue code, we can now audit our services’ compliance with our reliability specification, and we can write more analyzers as the need arises.
