Semi-Automated Text Cleaning in R

An OpenRefine to R Compiler

Cleaning real-life textual data is hard. Whether it’s convention inconsistencies, manual data-entry mistakes, or a myriad of other causes, reaching a consistent representation is essential. The problem is most pronounced with categorical variables (like city or job title), where an inconsistent representation gives rise to superfluous categories that do not exist in reality.

A practical case in point, which will serve as the running example throughout this post, is the Springleaf Kaggle contest, whose dataset had three columns in dire need of cleaning:

  1. City Name (column: VAR_0200)
  2. Job Title (column: VAR_0404)
  3. Occupation (column: VAR_0493)

To get a feeling for the problem, here is a snapshot of the actual data:

Sample of data (-1 is missing data)

A manual inspection and replacement of the values is out of the question due to the large number of records (around 120 thousand rows). The alternative route is fully automated clustering of the text based on some distance measure (traditionally a Levenshtein distance), but this leaves a lot of room for error. For instance, a single threshold for deeming two strings the same is too simplistic to work in all cases, giving rise to a need for multiple thresholds.
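To make the threshold problem concrete, here is a quick Python 3 sketch (not part of the pipeline) of the Levenshtein distance, along with two pairs at the same distance where a single cut-off would give opposite answers; the second pair is an illustrative example of ours, not from the Springleaf data:

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

# Both pairs are at distance 1, yet only the first should be merged:
levenshtein("REGISTERED NURSE", "REGISTERED NURSE.")  # 1 -> same occupation
levenshtein("WAITER", "WRITER")                       # 1 -> different occupations
```

A threshold of 1 merges the second pair incorrectly; a threshold of 0 misses the first. This is why the clustering choices need human review.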

What we need is an efficient tool that allows automated clustering alongside manual inspection and verification of the choices the clustering algorithms make. Here OpenRefine comes to the rescue. Think of OpenRefine as a data-cleaning Swiss Army knife. It comes loaded with many great features, but in this context we will focus on the following string clustering mechanisms (read more about them in OpenRefine’s documentation):

  • Fingerprint collision: this normalizes text in terms of case, white space, and token order, and then does an equality comparison. It is the most accurate method, and it is safe to take its results without inspection. It is also useful to run it first, before the other algorithms.
  • N-Gram Fingerprint: a variant of the fingerprint that, instead of comparing whole tokens for equality, compares n-grams of a size you choose.
  • Nearest Neighbor: standard string clustering using a similarity measure, where OpenRefine offers the standard Levenshtein distance.
  • Phonetic Fingerprint: useful for longer text, especially in Latin-based languages like English, but it usually didn’t perform well on the data we encountered.
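To make the first two methods concrete, here is a rough Python 3 sketch of the two keying functions. This is an approximation of OpenRefine’s actual normalization (which also handles diacritics and control characters), intended only to show why reordered tokens collide:

```python
import re

def fingerprint(value):
    # Lowercase, strip punctuation, then sort and de-duplicate the tokens,
    # so token order and repeated words no longer matter.
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def ngram_fingerprint(value, n=2):
    # Same idea at the character level: strip all non-word characters,
    # then sort and de-duplicate the character n-grams.
    chars = re.sub(r"\W", "", value.lower())
    grams = sorted({chars[i:i + n] for i in range(len(chars) - n + 1)})
    return "".join(grams)

fingerprint("NURSE, REGISTERED")  # 'nurse registered'
fingerprint("REGISTERED NURSE")   # 'nurse registered' -> same cluster
```

Values whose keys collide are grouped into one cluster and offered for merging.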

Clustering and editing the data can be reached from the edit menu of a specific column.

Column Menu, Edit cells, Cluster and edit

Running Fingerprint on the Occupation column (column: VAR_0493) gives the following result:

It is clear how inconsistent token ordering caused several textual values that are practically identical to be represented differently.

Following the fingerprint clustering with Nearest Neighbor using Levenshtein distance, we see clusters as follows:

Here manual inspection comes in handy: for example, the different “WASTEWATER WORKS OPERATORS” classes may actually be correct as distinct values.

After cleaning the three aforementioned columns in OpenRefine, one would typically export the cleaned data and load it back into R to continue feature extraction and modeling. However, in the whole flow of operations on a dataset, a manual step of exporting data to OpenRefine and loading it back into R sticks out like a sore thumb in an otherwise automated pipeline.


An OpenRefine to R compiler

OpenRefine comes with a handy “Extract Operation History” feature (under “Undo/Redo”) that lets one export the edits made by the clustering procedures as JSON that looks like this:

[
  {
    "op": "core/mass-edit",
    "description": "Mass edit cells in column VAR_0493",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "VAR_0493",
    "expression": "value",
    "edits": [
      {
        "fromBlank": false,
        "fromError": false,
        "from": [
          "LICENSED PRACTICAL NURSE",
          "NURSE, LICENSED PRACTICAL",
          "NURSE, PRACTICAL, LICENSED"
        ],
        "to": "LICENSED PRACTICAL NURSE"
      },
      {
        "fromBlank": false,
        "fromError": false,
        "from": [
          "BROKER ASSOCIATE",
          "ASSOCIATE BROKER"
        ],
        "to": "BROKER ASSOCIATE"
      },
      {
        "fromBlank": false,
        "fromError": false,
        "from": [
          "INTERN PHARMACIST",
          "PHARMACIST INTERN"
        ],
        "to": "INTERN PHARMACIST"
      },
      {
        "fromBlank": false,
        "fromError": false,
        "from": [
          "REGISTERED NURSE",
          "NURSE, REGISTERED"
        ],
        "to": "REGISTERED NURSE"
      },
      {
        "fromBlank": false,
        "fromError": false,
        "from": [
          "BOXING SECOND",
          "SECOND (BOXING)"
        ],
        "to": "BOXING SECOND"
      },
      {
        "fromBlank": false,
        "fromError": false,
        "from": [
          "PROFESSIONAL ENGINEER",
          "ENGINEER, PROFESSIONAL"
        ],
        "to": "PROFESSIONAL ENGINEER"
      },
      {
        "fromBlank": false,
        "fromError": false,
        "from": [
          "PROFESSIONAL COUNSELOR",
          "COUNSELOR, PROFESSIONAL"
        ],
        "to": "PROFESSIONAL COUNSELOR"
      }
    ]
  }
]

This JSON can be parsed and compiled into R code using the following script:

#!/usr/bin/env python2
# Compile an OpenRefine "Extract Operation History" JSON file into an
# R function that applies the same mass edits to a data frame.
import json
import sys
import os

if len(sys.argv) < 2:
    print "USAGE: ./utils/open_refine_to_R.py [edits.json] > r_file.R"
    exit(1)

json_file = sys.argv[-1]
conversions = json.load(open(json_file))
# The generated R function is named after the JSON file's basename.
function_name = os.path.splitext(os.path.basename(json_file))[0]

print "%s = function(df) {" % function_name
for conv in conversions:
    columnName = str(conv['columnName'])
    for edit in conv['edits']:
        to = str(edit['to'])
        for source in edit['from']:
            source = str(source)
            # One R replacement per source value in the cluster.
            print "  df[df[, %s] == %s, %s] = %s" % (repr(columnName),
                repr(source), repr(columnName), repr(to))
print "  df"
print "}"

Which can be run from the terminal as follows:

python open_refine_to_R.py edits.json > r_file.R

This creates an R file containing a function that takes in a data frame and returns it after the replacements have been applied.
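To illustrate the shape of the emitted R code, here is a hypothetical miniature of the same compile step, run in-memory (Python 3) on a single two-value edit; the JSON below is a made-up fragment, not the real history file:

```python
import json

# A minimal stand-in for an exported operation history.
edits_json = """[{"op": "core/mass-edit", "columnName": "VAR_0493",
                  "edits": [{"from": ["NURSE, REGISTERED", "REGISTERED NURSE"],
                             "to": "REGISTERED NURSE"}]}]"""

lines = []
for conv in json.loads(edits_json):
    col = conv["columnName"]
    for edit in conv["edits"]:
        for source in edit["from"]:
            # One R assignment per source value, mirroring the script above.
            lines.append("  df[df[, %r] == %r, %r] = %r"
                         % (col, source, col, edit["to"]))
print("\n".join(lines))
# prints:
#   df[df[, 'VAR_0493'] == 'NURSE, REGISTERED', 'VAR_0493'] = 'REGISTERED NURSE'
#   df[df[, 'VAR_0493'] == 'REGISTERED NURSE', 'VAR_0493'] = 'REGISTERED NURSE'
```

Note that `%r` emits single-quoted strings, which R accepts as string literals.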

Now we can simply source the generated file in our existing R pipeline, e.g. `source("r_file.R"); df <- edits(df)` (the function is named after the JSON file’s basename), to create a consistent data flow.

Before we close, two notes are in order:

  1. There is a lot of potential to improve the performance of the resulting script. We may write about that in a later post.
  2. Performing the above textual cleaning resulted in a significant improvement in model performance on the actual Kaggle contest.

How did this post come to be? Every week at Optima, everyone on the team gets five minutes or so to share a “nugget” of data science, algorithms, or related knowledge. The only rule is that it must be explainable and graspable in 5 to 10 minutes. Lately we decided to share these nuggets with the world. So here we are.