Refactoring at Scale with Abstract Syntax Trees

Working on an old codebase can be stifling at times. You have a great idea for a new pattern, but there’s just so much code that you can’t feasibly fix it all. Much of the time, the team agrees to leave the codebase today as it is and only apply the pattern to new additions. This fragments the codebase into the Way Things are Done and the Dark Land of Legacy, where senior coders groan and juniors dare not tread. The more ideas the team has, the more fragments there are, until the code is lost to darkness.

It doesn’t have to be that way, however. With the proper scripting, you can make changes to vast swaths of your codebase. That’s where abstract syntax trees, or ASTs, come in.

What are ASTs?

ASTs are data structures used for representing code. Specifically, they’re the way that your language’s compiler or interpreter represents it. They are regular tree data structures in which each node is a kind of syntax. For example, something like this:

x = y + 2

is represented as an AST like this:

Assignment: "="
|- Identifier: "x"
|- BinaryOperator: "+"
|- Identifier: "y"
|- NumericLiteral: "2"

This might seem a bit verbose, but it has lots of advantages over just parsing the text yourself, including:

  1. No weird whitespace, comments, or other formatting to mess with.
  2. Order of operations is always clear.
  3. Symbols are searchable based on their meaning in code instead of their text makeup (“=” vs “==”, “(“ in a function call or declaration, etc).
  4. Compiler-inferred information such as types or errors is available.
  5. Telling people that you’re working with the compiler API makes you seem really smart.

You can use the wonderful ASTExplorer.net to compile and inspect ASTs of your code live.

A minimal linter

One of the easiest things you can do with ASTs is build your own linter, which will automatically analyze your code and report errors. For this example we’ll be using Typescript and its compiler API, but if Typescript isn’t your thing, try out the equivalent with Python, PHP, C#, or Java. We’ll be producing a script that warns whenever a function’s name starts with an uppercase letter.

First, we provide our file to the compiler:

import * as fs from 'fs';
import * as ts from 'typescript';
const fileName = process.argv[2];
const
sourceFile = ts.createSourceFile(
fileName, // File name
fs.readFileSync(fileName).toString(), // File contents
ts.ScriptTarget.ES2017, // JS dialect
true, // Build out the AST
);

Next, we write a small tree traversal function that will perform a depth-first walk over the whole tree, searching for errors:

function traverse(node: ts.Node) {
if (ts.isFunctionDeclaration(node)) {
const name = node.name.getText();
if (name.charAt(0).match(/[A-Z]/)) {
const pos = sourceFile.getLineAndCharacterOfPosition(
node.getStart()
);
console.error(
`You really messed up on line ${pos.line}` +
` when you called your function ${name}.`
);
}
}
node.forEachChild(traverse);
}
// Start the traversal
sourceFile.forEachChild(traverse);

With that, we have a full linter in only 25 lines of code! Make sure to adjust the sassiness level of your errors to fit your company’s culture.

Rewriting code

As the bosses of the world have said many times, it’s one small step from complaining about an error to fixing it. Let’s modify the code that prints out the error message to write out the correct version instead:

// Add this at the top level
const spansToReplace = [];
// Add this after the console.error call
spansToReplace.push({
start: node.getStart(),
end: node.getEnd(),
replacement: name.charAt(0).toLowerCase() + name.substr(1),
});
// Put this at the end
let newText = sourceFile.getFullText();
spansToReplace.forEach(span => {
newText = newText.substring(0, span.start)
+ span.replacement
+ newText.substring(span.end);
});
console.log(newText);

The original AST is read-only, so we need to maintain a list of replacements ourselves and then output a modified version of the original file. We can do this by keeping track of where we found the bad nodes during the tree search and then replacing everything in one go.

Now, running this on every file in your codebase will automatically fix every function declaration, regardless of the number of files you have!

Expand your horizons

This is just a simple introduction to linting and code modifications. The example above can be modified to check for any number of things, even across source files. Here are some ideas to try yourself:

  • Find var or let declarations that are never modified, and convert them to const.
  • Search across files for public fields that are never accessed, and convert them to private.
  • Replace public fields and their accesses with generated getters and setters.
  • Use the type checker to add explicit type declarations to variables based on the return types of function calls.
  • Integrate your custom linter with TSLint to get IDE highlighting.

Of course, you don’t have to do it all from scratch. ASTs are the basis of popular tools like TSLint and JSCodeshift. There are even ease-of-use libraries for the Typescript API like tsutils and ts-simple-ast. Armed with the knowledge you’ve gained here you’ll be able to take these and expand on them to bring the light back to your legacy codebase.