Detecting Code Indentation

The Firefox developer tools use CodeMirror as the source editor in many places (including the new WebIDE). CodeMirror provides auto-indentation when starting a new line and can convert tab presses into indents. Both of these features are frustrating, however, if the editor is set to insert the wrong indentation. That’s why many text editors detect the indentation used in a file when it’s loaded into the editor. I set about adding this indentation detection to the devtools.

The first order of business was getting a good default value for new files. Every codebase and programmer has a different preferred indentation. Some use 2 spaces per indent, some use 4, and some even use tabs. I went through a rebellious 3-space indentation phase myself, which unfortunately was one of the most prolific open source months of my life (or did 3-space indents just increase my productivity??).

It can get kind of contentious, to be honest. So I thought I might back the decision up with some data. Using the GitHub gist API, I downloaded random code files from several different languages at different times throughout the day until I got 100 files of each language. Then I manually classified each file. Here’s the breakdown per language based on this limited sample size:

So, there you go. 2-spaces edges out others in Web languages, while Python is all for 4-space indents and Ruby is all for 2-space indents. At least for these 100 files. The 2 vs. 4 difference isn’t statistically significant for JavaScript with that small sample size, but is for the other languages.

As for detecting the indentation in a file, there were a few different algorithms out there. I looked at the two popular ones: greatest common divisor and minimum width, as well as two other experiments I came up with: comparing lines and neural network.

Indentation detection isn’t completely straightforward. A file that a human would classify as 2-spaces will have indent widths of all sizes as indents get nested: 4, 6, 8, 10, 12. An example of a not-straightforward one would be a 2-space file that mainly consists of a class. So function definitions would be indented by 2 spaces, but function bodies would be indented by 4. You might only have a tiny portion of the file indented by 2 spaces, and a majority by 4.

Another problem is outliers: too-long lines chopped off and indented by, say, 37 spaces to line up with something on the previous line. Multi-line comments royally throw things off. These block comments often start at an even indent width, and the body gets indented by one more space (at least in JavaScript). If it’s a really descriptive comment, a good portion of the file could have an indent width of, say, 5.

These problems are easier to solve if you can parse the file, but I wanted a language-agnostic approach.

All these algorithms were focused on determining the indentation if you were using spaces to indent. But tab indents are possible, so in all the algorithms I classified a file as “tabs” if there were more tab indents than space indents. All the algorithms discounted all-whitespace lines. Additionally, in the gcd and min-width algorithms, I threw out indent widths that accounted for less than a certain percentage of the file’s lines. I won’t show these parts, as they were common to several of the algorithms.
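That shared pre-processing can be sketched like this (the function name, regexes, and structure are my own, not the original code):

```javascript
// Sketch of the shared pre-processing (names and details are assumptions).
// Returns "tabs", or a Map of space-indent width -> number of lines using it.
function gatherIndents(lines) {
  let tabs = 0;
  const widths = new Map();
  for (const line of lines) {
    if (/^\s*$/.test(line)) continue;           // discount all-whitespace lines
    if (/^\t/.test(line)) { tabs++; continue; } // count tab-indented lines
    const spaces = line.match(/^ */)[0].length;
    if (spaces > 0) widths.set(spaces, (widths.get(spaces) || 0) + 1);
  }
  const spaceLines = [...widths.values()].reduce((a, b) => a + b, 0);
  if (tabs > spaceLines) return "tabs";         // more tab indents than space indents
  return widths;
}
```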

The algorithms

Greatest common divisor

The greatest common divisor (gcd) is a math concept. The gcd of [4, 6, 8, 10] is 2. A file with indent widths of [4, 6, 8, 10] would also clearly be a 2-space indented file. Things go haywire when you get any outlier indents of, say, 37, though. The gcd of [4, 6, 8, 37] is 1. Multi-line comments really throw this one off, so for practicality you have to throw out odd numbers. Here’s the JavaScript code for this:
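A minimal sketch of the gcd approach (my reconstruction; `widths` is assumed to be the list of observed indent widths with odd widths and rare outliers already filtered out):

```javascript
// Euclid's algorithm for the gcd of two numbers.
function gcd(a, b) {
  return b === 0 ? a : gcd(b, a % b);
}

// The detected indent unit is the gcd of all (filtered) indent widths.
function detectByGcd(widths) {
  return widths.reduce(gcd);
}

detectByGcd([4, 6, 8, 10]); // 2
detectByGcd([4, 6, 8, 37]); // 1 -- one outlier ruins it
```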

We’ll see how this performs later.

Minimum width

The other common algorithm I saw was a simple one: just take the smallest indentation width you see in the file. This can also trip up a bit on multi-line comments. 1 is a common minimum indent in that case, so we have to chuck that:
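A sketch of the minimum-width approach (again my reconstruction, with `widths` assumed to be pre-filtered so width-1 indents and outliers are already removed):

```javascript
// The detected indent unit is simply the smallest indent width seen.
function detectByMinimum(widths) {
  return Math.min(...widths);
}

detectByMinimum([4, 2, 6, 8]); // 2
```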

Comparing lines

I thought about what I did when I manually classified a file. I noticed that when I opened a file, I would focus on a random line and scan until I hit the next indent, then I would make note of that, and do this for a few other random lines in the file. So I coded up something that mimics this procedure.

This method compares the indentation of each line with the previous line, and adds the difference to a tally. So if a line is indented by 10 spaces, and the previous by 8, one more vote would be added for 2-space indentation. The benefit to this is that a block comment could be any number of lines, but its indentation would only count twice. This means we don’t have to throw out odd-numbered widths (3-spacers rejoice!):
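The tally can be sketched like this (my reconstruction; `indents` is assumed to be the indent width of each non-blank line, in file order):

```javascript
// Vote for the difference in indentation between each pair of adjacent
// lines; the most-voted-for difference wins.
function detectByComparing(indents) {
  const votes = new Map();
  for (let i = 1; i < indents.length; i++) {
    const diff = Math.abs(indents[i] - indents[i - 1]);
    if (diff > 0) votes.set(diff, (votes.get(diff) || 0) + 1);
  }
  let best = 0, bestCount = 0;
  for (const [width, count] of votes) {
    if (count > bestCount) { best = width; bestCount = count; }
  }
  return best;
}

detectByComparing([0, 2, 4, 4, 2, 0]); // 2
```

Note that a long block comment contributes at most two votes (entering and leaving it), which is why odd widths don't need special handling here.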

Neural network

I had to see how machine learning would stack up. I wish I could feed the raw data in — the indent widths of the first n lines. But I couldn’t figure out how to turn inputs like [4, 8, 10] into the continuous (non-discrete) signals that the network required (update: immediately upon writing this I thought of a way, but it didn’t perform as well). So I fed it the popularity of each width instead. Unlike the other algorithms, I didn’t throw out anything. Outlier widths and odd widths went in with the rest of them.

I trained the network on about 100 hand-classified files, separate from the test files. Here’s the classification function:
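As a rough sketch of the input side of that setup (the width range, the normalization, and the trained `net` are all my assumptions):

```javascript
// Build the feature vector: the popularity of each indent width from
// 1 to maxWidth, normalized to sum to 1. Nothing is thrown out here --
// outlier and odd widths go in with the rest.
function toFeatures(widthCounts, maxWidth = 8) {
  const total = [...widthCounts.values()].reduce((a, b) => a + b, 0) || 1;
  const features = [];
  for (let w = 1; w <= maxWidth; w++) {
    features.push((widthCounts.get(w) || 0) / total);
  }
  return features;
}

// Hypothetical classification with an already-trained network `net`:
// const indentGuess = net.run(toFeatures(widthCounts));
```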

The results

I ran all the algorithms on the set of 500 total code files in JavaScript, HTML, CSS, Python, and Ruby. Here’s what percent of the files each algorithm detected correctly:

Unpatched (not removing outliers or odd widths):

% files correctly detected, out of 500 (unpatched):
gcd: 78.0%
minimum: 87.6%
compare: 95.4%
neuralnet: 94.8%

Patched (removing outliers and odd widths, where applicable):

% files correctly detected, out of 500 (patched):
gcd: 94.4%
minimum: 92.8%
compare: 96.8%
neuralnet: 94.8%

Out of this, the only statistically significant result is that the compare-lines algorithm is better than the minimum-width algorithm. That conclusion comes from a two-tailed Z test with p < 0.05. More differences could solidify if more gists were tested, however.
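For the curious, that result can be reproduced with a pooled two-proportion z-test on the patched numbers, 484/500 correct for compare-lines vs. 464/500 for minimum-width:

```javascript
// Pooled two-proportion z-test.
function twoProportionZ(x1, n1, x2, n2) {
  const p1 = x1 / n1, p2 = x2 / n2;
  const pooled = (x1 + x2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (p1 - p2) / se;
}

twoProportionZ(484, 500, 464, 500); // ≈ 2.85, past the 1.96 cutoff for two-tailed p < 0.05
```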

My summary is that once you patch the weaknesses of a particular one of these algorithms, it performs about as well as the others. At least on random files from the wild, where edge cases are few.

The neural network performed well, and could have performed better if I’d given it more samples. 100 training samples is not a lot. What’s nice about the neural network approach is you don’t have to figure out the outliers and special cases yourself. Look at this code I came up with for finding outliers, after several guess-and-check rounds:

function isOutlier(total, count) {
  return (total > 40 && count < (total / 20))
      || (total > 10 && count < 3)
      || (total > 4 && count < 2);
}
Even after playing around with a bunch of different values, I haven’t found the best way to detect an outlier. The network, however, will probably figure out the best way for the data it’s trained on. It’ll learn that a single indentation of width 37 shouldn’t have much sway in the final decision. In that way, it doesn’t have the weaknesses the other algorithms have (me).

You can “see” (hopefully not even notice) the compare-lines algorithm in action in the latest Firefox release.