Using rubocop-ast to transform Ruby files using Abstract Syntax Trees

Daniel Orner
Published in Flipp Engineering
8 min read · Aug 13, 2020

Sometimes as you go about your business of being a software engineer, a problem crops up that is unusually interesting, potentially really useful, and also seems like it should be relatively easy to do, given the age of your chosen programming language and the tools already available.

Such a situation happened recently with the introduction of a new “item indexer” project at Flipp. This was an attempt to replace a patchwork of existing systems that fetched item information from retailer websites with a single microservice that could do it for all requests.

So far, so groovy — but if we were going to create a unified service, we had to figure out what to do with all the existing indexers we had written in a variety of ways. Every retailer website is different. Some are well-designed, and contain common information in easy-to-read OpenGraph headers or itemprop properties. Many others, though, don’t provide readily available information in nicely semantic tags, and instead we need to rely on xpaths that look like .//div[@class='col-right']/div[@class='product-desc']/h1 .

By the time we were ready to start the unified indexer project, the count of patchwork indexers was in the hundreds.

The new indexer service had a slightly different format than the old one, largely because we’d moved from the (now out-of-date) Watir gem to Capybara. So… either we spent days manually rewriting each indexer to adhere to the new format, or we tried doing something a bit smarter.

The Changes

Here are some of the major changes we were trying to achieve:

# Module nesting
module ContentAggregation        ->   module Indexers
  module JSParsers                      class MyRetailer < Base
    class MyRetailer < Base               ...
      ...                               end
    end                               end
  end
end
# Method renaming
def supplemental_info_price      ->   def base_price
  ...                                   ...
end                                   end
# Rename browser methods
page.at_xpath("an_xpath")        ->   page.find("an_xpath")
page.xpath("an_xpath")           ->   page.all("an_xpath")
div.inner_text                   ->   div.text
# Replace unnecessary blank checks, handled by new framework
a.blank? ? nil : a               ->   a

Attempt 1: gsub!

The first attempt at translating the indexers was just to use good old gsub and regular expressions. This can actually get us pretty far! It’s pretty good at replacing the top part of each file (the module declaration). However, it relies heavily on every item indexer being written exactly the same way, including spacing, which may or may not be true.
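The script itself isn't reproduced in this text, so here is a rough sketch of what a gsub-based pass could look like. The sample indexer, the `naive_convert` name and the exact patterns are assumptions made up for illustration, not the code we actually ran.

```ruby
# Hypothetical old-format indexer, used only to exercise the sketch.
OLD_SOURCE = <<~RUBY
  module ContentAggregation
    module JSParsers
      class MyRetailer < Base
        def supplemental_info_price
          page.at_xpath("an_xpath").inner_text
        end
      end
    end
  end
RUBY

# Text-only conversion: every step assumes the file is laid out exactly
# like OLD_SOURCE, including newlines and spacing.
def naive_convert(source)
  source
    .sub(/^module ContentAggregation\n\s*module JSParsers\n/, "module Indexers\n")
    .gsub(/\bsupplemental_info_price\b/, 'base_price')
    .gsub('.at_xpath(', '.find(')
    .gsub('.xpath(', '.all(')
    .gsub('.inner_text', '.text')
    .sub(/\nend\n\z/, "\n") # fragile: assumes the last `end` closes the outer module
end

puts naive_convert(OLD_SOURCE)
```

The output is valid Ruby for this one input, but the indentation is left one level too deep, and any file with a trailing comment or different spacing would slip past the patterns.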

One of the big shortcomings is how we’re replacing all instances of end sandwiched by newlines with nothing, and assuming that the only case of that would be the one that ends the module (since we’re reducing nesting by one level). That’s pretty dangerous!

Also note that we didn’t even attempt to do things like replacing the blank check with nil, because of the different variations on spacing and things like that. We could try to spend a lot more time making it more robust, but we’d still have a hard time (e.g.) matching the start of methods to their ends.

Attempt 2 — Climb the Tree

Instead of trying to blindly change the text of the Ruby file, I thought it would be much more powerful if we could instead parse and understand the Ruby code itself, and output changes to it. I knew that Rubocop, the Ruby linter and formatter, does that with its auto-correct (where e.g. it can change detect to find or vice versa), so it has to be possible somehow, right?

An Abstract Syntax Tree (AST) is an internal representation of a piece of code. It’s not specific to Ruby; most language implementations build one as a step on the way from text to something executable. I figured the best way to be sure that I was changing things correctly would be to have some way to key into the AST of the code and make changes at the tree level.

The Options

The first result on Google for “ruby parse code” is the parser gem, a well-maintained library for parsing Ruby code. It even comes with rewriting capability! However, I found the documentation difficult to get through, and there didn’t seem to be a simple example for reading a file, changing it, and writing it back out.

I figured I’d try to find something maybe a bit more high-level to work with. The next candidate I found was synvert, which provides the ability to create “snippets” to transform code. It is in the same spirit as transpec, a great tool that converts RSpec 2 syntax into RSpec 3.

Unfortunately, after playing around with this, it turned out it was a little too high-level. When working with very simple replacements, the DSL seems like it would be a great match, but once you go past straightforward insertions or replacements, it becomes unwieldy fast.

After taking a step back, I thought about the original inspiration for this idea, which was Rubocop. How does Rubocop actually do all of its (very complex, at times) rewriting? There are hundreds of “cops” which auto-correct, with dozens of contributors, so there must be an understandable way to write this code! Turns out, it uses good old parser under the hood, but with some additions that make it much sweeter.

In fact, the Rubocop project made so many enhancements to parser that they extracted them to a separate gem, rubocop-ast. To understand why it’s so much easier to work with, let’s dive into how parser represents some simple Ruby code:

def increment(x=1)
  x + 1
end

->

(def :increment
  (args
    (optarg :x
      (int 1)))
  (send
    (lvar :x) :+
    (int 1)))

The indented syntax is compact but hard to read, so here’s what a tree version might look like:

The blue boxes with bold text represent literals (symbols, integers) while the rest represent nodes.

When I started looking at this, I was kind of overwhelmed. What is an lvar? (Answer: A local variable being read; assigning to one is a separate lvasgn node.) Where’s the block representing the actual method definition? (Answer: Since the body is a single statement, it isn’t wrapped in a begin node.) Where are the symbols like ( and =? (Answer: They aren’t represented in the tree at all.) Digging deeper: if I only want to trigger a change on . and not on &. safe-navigation calls, how do I ensure that? (Answer: They are separate node types, send and csend.)

More of a problem is that the number of node types in parser is staggering. Node types are defined semantically rather than syntactically, meaning that the if node could represent an actual if statement, or a ternary ( foo ? 1 : 0 ), and only by inspecting the node further could you figure out which one it was.

rubocop-ast builds on top of the parser gem in two major ways. One is the introduction of specialized Node classes, with a huge number of semantic methods that allow you to deal with nodes more naturally once you have them. For example, IfNode comes with a ternary? predicate, which tells you whether the node is a ternary or a regular if; ClassNode has a parent_class method that returns the superclass named in the declaration.

The second addition is the NodePattern DSL, which is an XPath-like syntax allowing you to match a particular node against a pattern. Both of these are extensively used in Rubocop cops. I didn’t find NodePattern very comfortable, as it reminded me too much of regular expressions, which I find more difficult to reason about than actual code.

Although rubocop-ast is a great tool, its documentation was somewhat lacking. In particular, I still couldn’t find a good example of how to rewrite a file. Thankfully, the code itself was understandable enough for me to piece together the right way to do it. In addition, the amount of time I spent poring over the parser docs led me to create a new exhaustive documentation page for rubocop-ast listing the nodes by type (rather than mapping Ruby features to nodes as in the parser docs).

Rewriting Code

With all that said, how do you actually use rubocop-ast to rewrite code? Two parts of the parser gem are critical to this: TreeRewriter and Source::Map.

Each node has a Source::Map you can access by calling node.location (or node.loc). This object is a mapping of AST nodes to actual source code. For example, taking the above example of Ruby code:

def increment(x=1)
  x + 1
end

The send node representing the single statement in the function would have the following source map:

  • expression: 21...26 (end-exclusive), source: x + 1
  • selector: 23...24 (end-exclusive), source: +

Each node type will have different location keys depending on what it is (so for example, an if node might have question and colon keys for ternaries). Note that e.g. the expression key includes child nodes as well. Having this information gives us a lot of power by being able to insert, delete or replace content around any individual part of this particular node.

That’s what TreeRewriter is for: once you have one set up, you can use its methods to change the source code by giving it a range that references the source. The cool thing about TreeRewriter is that it’s incredibly smart about how to change the code. Each change is recorded against positions in the original source and put into a queue, and the changes are only applied at the end, ensuring that the intention of each change isn’t modified by the others.

For example, you could replace the + with a plus method, and then also add a space before the 1. Here’s a bad way to do that which only records the indexes where you want the changes to happen:

123456789
x + 1 (the 1 is at char 5)
x plus 1
x pl us 1 <<- OOPSIE

With TreeRewriter, all changes happen relatively, and are replayed correctly:

123456789
x + 1
x plus 1
x plus  1

The Solution

Without further ado, here’s what I landed on for my rewriting project. I created a number of processors called Rules, one for each change I wanted to make. I left it up to the TreeRewriter to process them and make sure they didn’t clobber each other. These are just three of the rules I made, but the great thing about this design is that I can make as many as I like and it’ll keep on truckin’!

I felt like this combination of tools gave me the power I needed with the ease of use so that other people could look at the code and not be incredibly confused. I’m happy with what I landed on and hope people find it useful!
