
Finding Typos in Data

Data analysis involves a lot of technicalities, but it also sometimes means accounting for human error. If data have been manually keyed in, typos are inevitable. Checking numerical values for outliers can catch some of these mistakes, but what exactly should we look for?

There are many kinds of typo, of course, and some are difficult to detect. If a 6 is pressed instead of a 7 in a long string of digits, who can say that’s not a reasonable value? You might be able to judge against historical data points, using typical outlier-detection methods, but not necessarily.

However, very often the typos we’re most concerned with are the egregious ones. We want to prioritize finding those, because there are too many modest typos to investigate individually, or to throw out en masse. Very common egregious typos include omitting a digit and duplicating one (or more than one). This mistake is especially easy to make when a figure contains a run of repeated digits, like “70,000” becoming “700,000”.

I was curious about the size of the numerical change these kinds of error would create, since that would be the best way to detect them. I wrote a quick simulation in R to randomly drop digits or duplicate them, and ran it on a dummy data set. Here are the two functions (not robust since they’re just for testing):

# Duplicate the digit at position d (a random position if d is not given).
dupdigit <- function(x, d=NULL) {
  x <- as.character(format(x, scientific=F))
  if (is.null(d)) d <- sample(1:nchar(x), 1)
  # Characters 1..d followed by d..end, so digit d appears twice.
  xd <- paste(substr(x, 1, d), substr(x, d, nchar(x)), sep='')
  return(as.numeric(xd))
}

# Drop the digit at position d (a random position if d is not given).
dropdigit <- function(x, d=NULL) {
  x <- as.character(format(x, scientific=F))
  if (is.null(d)) d <- sample(1:nchar(x), 1)
  # Characters 1..d-1 followed by d+1..end, skipping digit d.
  xd <- paste(substr(x, 1, d-1), substr(x, d+1, nchar(x)), sep='')
  return(as.numeric(xd))
}
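
By way of illustration, a run of the simulation on dummy data might look like the following sketch; the sample size, value range, and seed here are arbitrary choices for demonstration, not the original dummy data set:

# Corrupt each dummy value once with each typo function, then look at
# the ratio of the corrupted value to the original.
set.seed(42)                                     # illustrative seed
vals <- round(runif(10000, min=100, max=999999)) # illustrative dummy data
dup.ratio  <- sapply(vals, dupdigit) / vals
drop.ratio <- sapply(vals, dropdigit) / vals
summary(dup.ratio)   # ratios cluster near 10
summary(drop.ratio)  # ratios cluster near 0.1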

Results

I was half expecting to get some sort of non-intuitive result, but in fact the results are pretty straightforward.

  • When duplicating a digit, you will typically increase the value about 10-fold. The smallest increases are around 6-fold, and the largest around 11-fold.
  • When dropping a digit, values are typically reduced about 10-fold. The most modest reductions are 4-fold. The worst, though, can turn a figure into 0: start with a value like “1,000”, lose the place-keeping leading digit, and “000” is all that remains.

These findings remain broadly true even when many values are rounded off, so that the trailing digits are always zero. They change slightly if there is a natural cut-off in the digits that appear — like scores that range only from 1 to 700. The pattern appears similar across various distributions of values: uniform, normal and log-normal at least.
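
As a quick check along those lines, the same ratio summaries can be computed after swapping in different generators for the dummy values; the particular distribution parameters below are illustrative assumptions:

# Repeat the ratio summaries with normal and log-normal dummy values
# (+1 keeps the normal draws away from zero).
norm.vals  <- round(abs(rnorm(10000, mean=50000, sd=15000))) + 1
lnorm.vals <- round(rlnorm(10000, meanlog=10, sdlog=1))
summary(sapply(norm.vals, dupdigit) / norm.vals)
summary(sapply(lnorm.vals, dupdigit) / lnorm.vals)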

To detect these egregious typos, then, you can look for values at least 6 times larger than expected, or at least 4 times smaller. Some cases could even be fixed automatically (with some guesswork).
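
In practice that screen can be as simple as the following sketch; flag_typos and the expected baseline (say, a historical median for the series) are my own illustrative constructions, and the exact cut-offs would need tuning:

# Flag values whose ratio to an expected baseline suggests a
# duplicated digit (>= 6x) or a dropped digit (<= 1/4x).
flag_typos <- function(x, expected) {
  ratio <- x / expected
  ratio >= 6 | ratio <= 1/4
}
obs <- c(48000, 52000, 550000, 47000)
flag_typos(obs, expected=50000)  # flags only 550000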


Originally published at ideabyre.com.