Using rubocop-ast to transform Ruby files using Abstract Syntax Trees
Sometimes as you go about your business of being a software engineer, a problem crops up that is unusually interesting, potentially really useful, and also seems like it should be relatively easy to do, given the age of your chosen programming language and the tools already available.
Such a situation happened recently with the introduction of a new “item indexer” project at Flipp. This was an attempt to replace a patchwork of existing systems that fetched item information from retailer websites with a single microservice that could do it for all requests.
So far, so groovy, but if we were going to create a unified service, we had to figure out what to do with all the existing indexers we had written in a variety of ways. Every retailer website is different. Some are well-designed, and contain common information in easy-to-read OpenGraph headers or itemprop properties. Many others, though, don’t provide readily available information in nicely semantic tags, and instead we need to rely on xpaths that look like .//div[@class='col-right']/div[@class='product-desc']/h1.
By the time we were ready to start the unified indexer project, the count of patchwork indexers was in the hundreds.
The new indexer service had a slightly different format than the old one, largely because we’d moved from the (now out-of-date) Watir gem to Capybara. So… either we spent days manually rewriting each indexer to adhere to the new format, or we tried doing something a bit smarter.
The Changes
Here are some of the major changes we were trying to achieve:
# Module nesting
module ContentAggregation          ->  module Indexers
  module JSParsers                       class MyRetailer < Base
    class MyRetailer < Base                ...
      ...                                end
    end                                end
  end
end

# Method renaming
def supplemental_info_price        ->  def base_price
  ...                                    ...
end                                    end

# Rename browser methods
page.at_xpath("an_xpath")          ->  page.find("an_xpath")
page.xpath("an_xpath")             ->  page.all("an_xpath")
div.inner_text                     ->  div.text

# Replace unnecessary blank checks, handled by new framework
a.blank? ? a : nil                 ->  a
Attempt 1: gsub!
The first attempt at translating the indexers was just to use good old gsub and regular expressions. This can actually get us pretty far! It’s pretty good at replacing the top part of each file (the module declaration). However, it relies heavily on every item indexer being written exactly the same way, including spacing, which may or may not be true.
One of the big shortcomings is how we’re replacing all instances of end sandwiched by newlines with nothing, and assuming that the only case of that would be the one that ends the module (since we’re reducing nesting by one level). That’s pretty dangerous!
Also note that we didn’t even attempt to do things like replacing the blank check with nil, because of the different variations on spacing and things like that. We could try to spend a lot more time making it more robust, but we’d still have a hard time (e.g.) matching the start of methods to their ends.
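The original gsub script isn't reproduced here, so the following is only a rough sketch of what that kind of approach could look like. The sample indexer, the `naive_translate` name and the exact regular expressions are all assumptions made up for illustration.

```ruby
# Hypothetical old-style indexer, used only as sample input.
OLD_INDEXER = <<~RUBY
  module ContentAggregation
    module JSParsers
      class MyRetailer < Base
        def supplemental_info_price
          page.at_xpath("an_xpath").inner_text
        end
      end
    end
  end
RUBY

# Text-only translation: every step depends on exact spelling and spacing.
def naive_translate(source)
  result = source.dup
  # Collapse the two old module lines into the single new one.
  result.sub!(/\Amodule ContentAggregation\n\s*module JSParsers\n/, "module Indexers\n")
  # Rename the method and the browser calls with plain substitutions.
  result.gsub!('def supplemental_info_price', 'def base_price')
  result.gsub!('.at_xpath(', '.find(')
  result.gsub!('.xpath(', '.all(')
  result.gsub!('.inner_text', '.text')
  # Dangerous: drop every unindented `end`, assuming only the outer module has one.
  result.gsub!(/\nend\n/, "\n")
  result
end

puts naive_translate(OLD_INDEXER)
```

Even in this small sketch the output keeps the old indentation, and an indexer with a blank line between the two module declarations, or a second top-level definition, would silently come out wrong.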
Attempt 2 — Climb the Tree
Instead of trying to blindly change the text of the Ruby file, I thought it would be much more powerful if we could instead parse and understand the Ruby code itself, and output changes to it. I knew that Rubocop, the Ruby linter and formatter, does that with its auto-correct (where e.g. it can change find to select or vice versa), so it has to be possible somehow, right?
An Abstract Syntax Tree (AST) is an internal representation of a piece of code. It’s not specific to Ruby; every language has its own AST as an important step on the way to turn text into machine code. I figured the best way to be sure that I was changing things correctly would be to have some way to key into the AST of the code and make changes at the tree level.
The Options
The first result on Google for “ruby parse code” is the parser gem, a well-maintained library for parsing Ruby code. It even comes with rewriting capability! However, I found the documentation difficult to get through, and there didn’t seem to be a simple example for reading a file, changing it, and writing it back out.
I figured I’d try to find something maybe a bit more high-level to work with. The next candidate I found was synvert, which provides the ability to create “snippets” to transform code. This powers transpec, a great tool that converts RSpec 2 syntax into RSpec 3.
Unfortunately, after playing around with this, it turned out it was a little too high-level. When working with very simple replacements, the DSL seems like it would be a great match, but once you go past straightforward insertions or replacements, it becomes unwieldy fast.
After taking a step back, I thought about the original inspiration for this idea, which was Rubocop. How does Rubocop actually do all of its (very complex, at times) rewriting? There are hundreds of “cops” which auto-correct, with dozens of contributors, so there must be an understandable way to write this code! Turns out, it uses good old parser under the hood, but with some additions that make it much sweeter.
In fact, the Rubocop project made so many enhancements to parser that they extracted them to a separate gem, rubocop-ast. To understand why it’s so much easier to work with, let’s dive into how parser represents some simple Ruby code:
def increment(x=1)
  x + 1
end

->

(def :increment
  (args
    (optarg :x
      (int 1)))
  (send
    (lvar :x) :+
    (int 1)))
The indented syntax is compact but hard to read, so here’s what a tree version might look like:
The blue boxes with bold text represent literals (symbols, integers) while the rest represent nodes.
When I started looking at this, I was kind of overwhelmed. What is an lvar? (Answer: A read of a local variable; assignments use a separate lvasgn node.) Where’s the block representing the actual method definition? (Answer: Since the body is a single statement, it isn’t wrapped in a begin node.) Where are the symbols like ( and =? (Answer: They aren’t represented in the tree at all.) Digging deeper: if I only want to trigger a change on . and not &. null-safe nodes, how do I ensure that? (Answer: They are separate node types, send and csend.)
More of a problem is that the number of node types in parser is staggering. Node types are defined semantically rather than syntactically, meaning that the if node could represent an actual if statement, or a ternary (foo ? 1 : 0), and only by inspecting the node children could you figure out which one it was.
rubocop-ast builds on top of the parser gem in two major ways. One is the introduction of specialized Node classes, with a huge number of semantic methods that allow you to more naturally deal with nodes when you have them. For example, the IfNode comes with a ternary? method, which tells you whether the node is a ternary or not; the class node has a parent_class method letting you reference the superclass of the declaration.
The second addition is the NodePattern DSL, which is an XPath-like syntax allowing you to match a particular node against a pattern. Both of these are extensively used in Rubocop cops. I didn’t find NodePattern very comfortable, as it reminded me too much of regular expressions, which I find more difficult to reason about than actual code.
Although rubocop-ast is a great tool, its documentation was somewhat lacking. In particular, I still couldn’t find a good example of how to rewrite a file. Thankfully, the code itself was understandable enough for me to piece together the right way to do it. In addition, the amount of time I spent poring over the parser docs led me to create a new exhaustive documentation page for rubocop-ast listing the nodes by type (rather than mapping Ruby features to nodes as in the parser docs).
Rewriting Code
With all that said, how do you actually use rubocop-ast to rewrite code? There are two important parts to the parser gem that are critical to this: TreeRewriter and Source::Map.
Each node has a Source::Map you can access by calling node.location (or node.loc). This object is a mapping of AST nodes to actual source code. For example, taking the above example of Ruby code:
def increment(x=1)
  x + 1
end
The send node representing the single statement in the function would have the following source map:
expression: 21..26, source: x + 1
selector: 23..24, source: +
Each node type will have different location keys depending on what it is (so for example, an if node might have question and colon keys for ternaries). Note that e.g. the expression key includes child nodes as well. Having this information gives us a lot of power by being able to insert, delete or replace content around any individual part of this particular node.
That’s what TreeRewriter is for: once you have one set up, you can use its methods to change the source code by giving it a range that references the source. The cool thing about TreeRewriter is that it’s incredibly smart about how to change the code. Each change is recorded against the original source and only applied at the end, ensuring that the intention of the change isn’t modified by the edits around it.
For example, you could replace the + with a plus method, and then also add a space before the 1. Here’s a bad way to do that which only records the indexes where you want the changes to happen:
123456789
x + 1      (the 1 is at char 5)
x plus 1
x pl us 1  <<- OOPSIE
With TreeRewriter, all changes happen relatively, and are replayed correctly:
123456789
x + 1
x plus 1
x plus  1
The Solution
Without further ado, here’s what I landed on for my rewriting project. I created a number of processors called Rules, one for each change I wanted to do. I left it up to the TreeRewriter to process them and make sure they didn’t clobber each other. These are just three of the rules I made, but the great thing about this design is that I can make as many as I like and it’ll keep on truckin’!
I felt like this combination of tools gave me the power I needed with the ease of use so that other people could look at the code and not be incredibly confused. I’m happy with what I landed on and hope people find it useful!