At least not without context
One of the first questions I’m always asked is “Can you detect categorical outliers?” Most people are familiar with the concept of a numerical outlier, but there is often confusion about categorical outliers. For example, if you were presented with the two datasets below, could you determine the outlier in each case?
Example 1: A, B, C, X
Example 2: Apple, Orange, Pear or Blueberry, Raspberry, Strawberry, Grape
In the first example, many people select X as the outlier. The human brain is sophisticated enough to infer a context immediately: it recognizes the pattern, associates it with the English alphabet, and notices that X breaks the sequence. Yet there is in fact no discernible outlier in Example 1. In the second example, someone might choose Orange from the first list, because science suggests that a Pear is genetically closer to an Apple than to an Orange. In the second list, we might guess Grape, since it is the only item not commonly called a berry. The fact is that no outlier exists in any of the examples above. For an outlier to exist, there must be a measure of distance. Numeric data types carry one intrinsically, because a Euclidean distance is defined between any two numbers.
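The role that distance plays can be sketched in a few lines of Python. This is a minimal illustration rather than a production detector: the function name `numeric_outliers`, the sample values, and the threshold of 2 standard deviations are all assumptions chosen for the example.

```python
import statistics

def numeric_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # Every value is identical, so nothing can be far from the mean.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up numeric sample: subtraction gives us a distance for free.
numbers = [10, 11, 9, 10, 12, 11, 10, 95]
print(numeric_outliers(numbers))

# The same arithmetic has no meaning for raw categories.
a, x = "A", "X"
try:
    x - a
except TypeError:
    print("no built-in distance between 'A' and 'X'")
```

The numeric list yields an answer because subtraction is defined for numbers. For the letters, any distance has to be supplied from outside the data, which is exactly the missing context.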
K-means to the rescue?
A few Google searches for categorical outliers will turn up people talking about k-means clustering. K-means is an unsupervised learning technique designed solely for numeric types: it measures the distance of each point from a centroid, and the centroid has to be a numeric value. I’ve even seen creative solutions where a data scientist uses string indexing or one-hot encoding to convert a categorical value from a string to a number and then runs k-means. In the research we’ve conducted at Owl Analytics, this is not an acceptable approach, because the numeric vector assigned to the string value by one-hot encoding still lacks the domain context. It does satisfy the input constraint of the k-means model, and the model will produce an output. However, that output will not select the appropriate outlier, because the input was never truly valid.
K-modes is a promising algorithm that today exists primarily in the Python community. It is designed to handle categorical values without the need for string indexing or one-hot encoding.
After reviewing and testing many approaches to this problem, we developed our own methodology, which we have found to be more performant and to yield the proper outlier results across many diverse datasets. The datasets we focused on in our back testing were: address information, including states, cities, counties and zip codes; stock datasets, including asset classes, symbols and exchanges; and customer and user tables, including first name, last name, social security number and phone number. At Owl, our focus is primarily on correcting data for the purpose of predictive data quality. After running numerous back tests on a variety of datasets, we drew on concepts like term frequency and minority classes.
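The contrast between one-hot distance and term frequency can be sketched as follows. This is not the Owl Analytics methodology, which the article does not spell out. The helper names (`one_hot`, `euclidean`, `rare_categories`), the 5% cutoff, and the sample state values are all assumptions made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def one_hot(values):
    """Map each distinct category to a one-hot vector (hypothetical helper)."""
    categories = sorted(set(values))
    return {c: [1.0 if c == other else 0.0 for other in categories]
            for c in categories}

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rare_categories(values, max_share=0.05):
    """Flag categories whose relative frequency is at or below `max_share`."""
    counts = Counter(values)
    total = len(values)
    return sorted(c for c, n in counts.items() if n / total <= max_share)

# One-hot encoding: collect the distinct pairwise distances between categories.
vectors = one_hot(["A", "B", "C", "X"])
distances = {round(euclidean(vectors[a], vectors[b]), 6)
             for a, b in combinations(vectors, 2)}
print(distances)  # only one distinct distance: every pair is equally far apart

# Term frequency: a minority class stands out without any encoding.
states = ["NY"] * 40 + ["CA"] * 35 + ["TX"] * 24 + ["N.Y."]
print(rare_categories(states))  # ['N.Y.']
```

Because every pair of one-hot vectors sits at the same distance, the encoding gives a centroid-based model nothing to separate X from A, B or C. Frequency, by contrast, is a signal the column itself carries, which is presumably why minority classes are a useful starting point.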
Blending these techniques with traditional outlier detection methods enabled us not only to identify outliers more accurately but also to adapt over time, much as the English language does. In the real world we face phenomena like the “Urban Dictionary,” where words and terms are constantly evolving. A word like survey might migrate to surveil and then surveillance. Categorical outliers are not static, which is precisely why our study suggested the need for a blended technique.
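One way to picture a detector that adapts over time is to compute term frequency over a sliding window of recent values. The sketch below is purely an assumption for illustration, not the blended technique described above: the class name `SlidingFrequencyDetector`, the window size, and the 5% cutoff are all hypothetical.

```python
from collections import Counter, deque

class SlidingFrequencyDetector:
    """Flag a value as an outlier when it is rare among the most recent observations."""

    def __init__(self, window=100, max_share=0.05):
        self.recent = deque(maxlen=window)
        self.counts = Counter()
        self.max_share = max_share

    def observe(self, value):
        """Record `value` and return True if it is currently a minority term."""
        if len(self.recent) == self.recent.maxlen:
            # The deque is about to drop its oldest entry, so drop its count too.
            oldest = self.recent[0]
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.recent.append(value)
        self.counts[value] += 1
        share = self.counts[value] / len(self.recent)
        return share <= self.max_share

detector = SlidingFrequencyDetector(window=100, max_share=0.05)
for _ in range(100):
    detector.observe("survey")

first = detector.observe("surveil")                        # new term: flagged
later = [detector.observe("surveil") for _ in range(9)]    # term becomes common
print(first, later[-1])
```

The first sighting of the new term is flagged, but once it makes up more than 5% of the window it stops being reported. The same value is an outlier at one moment and ordinary at another, which is the sense in which categorical outliers are not static.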
If you’re working on an interesting project involving categorical outliers, connect with us on LinkedIn (https://www.linkedin.com/company/owl-analytics) or at www.owl-analytics.com and we’ll share more information regarding our approach.