Detecting Code Indentation

Heather Arthur
Sep 8, 2014 · 6 min read

The Firefox developer tools use CodeMirror as the source editor in many places (including the new WebIDE). CodeMirror provides auto-indentation when starting a new line and can convert tab presses into indents. Both of these features are frustrating however if the editor is set to insert the wrong indentation. That’s why many text editors will detect the indentation used in a file when it’s loaded into the editor. I set about to add this indentation detection to the devtools.

The first order of business was getting a good default value for new files. Every codebase and programmer has a different preferred indentation. Some use 2 spaces per indent, some use 4, and some even use tabs. I went through a rebellious 3-space indentation phase myself, which unfortunately was one of the most prolific open source months of my life (or did 3-space indents just increase my productivity??).

It can get kind of contentious, to be honest. So I thought I might back the decision up with some data. Using the GitHub gist API, I downloaded random code files from several different languages at different times throughout the day until I got 100 files of each language. Then I manually classified each file. Here’s the breakdown per language based on this limited sample size:

So, there you go. 2-spaces edges out others in Web languages, while Python is all for 4-space indents and Ruby is all for 2-space indents. At least for these 100 files. The 2 vs. 4 difference isn’t statistically significant for JavaScript with that small sample size, but is for the other languages.

As for detecting the indentation in a file, there were a few different algorithms out there. I looked at the two popular ones: greatest common divisor and minimum width, as well as two other experiments I came up with: comparing lines and neural network.

Indentation detection isn’t completely straightforward. A file that a human would classify as 2-spaces will have indent widths of all sizes as indents get nested: 4, 6, 8, 10, 12. An example of a not-straightforward one would be a 2-space file that mainly consists of a class. So function definitions would be indented by 2 spaces, but function bodies would be indented by 4. You might only have a tiny portion of the file indented by 2 spaces, and a majority by 4.

Another problem is outliers: too-long lines chopped off and indented by say 37 spaces to line up with something on the previous line. Multi-line comments royally throw things. These block comments often start with an even-indented width, and the body gets indented by one more space (at least in JavaScript). If it’s a really descriptive comment, a good portion of the file could have an indent width of say 5.

These problems are easier to solve if you can parse the file, but I wanted a language-agnostic approach.

All these algorithms were focused on determining the indentation if you were using spaces to indent. But tab indents are possible, so in all algorithms I classified a file as “tabs” if there were more tabs than space indents. All the algorithms discounted all-whitespace lines. Additionally, in the gcd and min-width algorithms, I threw out indent widths that were less than a certain percent of the file. I won’t show these parts as they were common to several of the algorithms.

The algorithms

Greatest common divisor

We’ll see how this performs later.

Minimum width

Comparing lines

This method compares the indentation of each line with the previous line, and adds the difference to a tally. So if a line is indented by 10 spaces, and the previous by 8, one more vote would be added for 2-space indentation. The benefit to this is that a block comment could be any number of lines, but its indentation would only count twice. This means we don’t have to throw out odd-numbered widths (3-spacers rejoice!):

Neural network

I trained the network on about 100 hand-classified files, separate from the test files. Here’s the classification function:

The results

Unpatched (not removing outliers or odd widths):

% files correctly detected, out of 500 (unpatched):
gcd: 78.0%
minimum: 87.6%
compare: 95.4%
neuralnet: 94.8%

Patched (removing outliers and odd widths, where applicable):

% files correctly detected, out of 500 (patched):
gcd: 94.4%
minimum: 92.8%
compare: 96.8%
neuralnet: 94.8%

Out of this, the only statistically significant result is that the compare-lines algorithm is better than the minimum-width algorithm. That conclusion comes from a two-tailed Z test with p < 0.05. More differences could solidify if more gists were tested, however.

My summary is that once you patch the weaknesses of a particular one of these algorithms, it performs about as well as the other. At least on random files from the wild, where edge cases are few.

The neural network performed well, and could have performed better if I’d given it more samples. 100 training samples is not a lot. What’s nice about the neural network approach is you don’t have to figure out the outliers and special cases yourself. Look at this code I came up with for finding outliers, after several guess-and-check rounds:

function isOutlier(total, count) {
return (total > 40 && count < (total / 20))
|| (total > 10 && count < 3)
|| (total > 4 && count < 2);

Even after playing around with a bunch of different values, I haven’t found the best way to detect an outlier. The network, however, will probably figure out the best way for the data it’s trained on. It’ll learn that a single indentation of width 37 shouldn’t have much sway in the final decision. In that way, it doesn’t have the weaknesses the other algorithms have (me).

You can “see” (hopefully not even notice) the compare-lines algorithm in action in the latest Firefox release.

Firefox Developer Tools

Stories and news about the Firefox developer tools…

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store