Categorical Outliers Don’t Exist

Kirk Haslbeck
Sep 9, 2018 · 3 min read

At least not without context

One of the first questions I’m always asked is “Can you detect categorical outliers?” Most people are familiar with the concept of a numerical outlier, but there is often confusion about categorical outliers. For example, if you were presented with the two examples below, could you determine the outlier in each case?

Example 1: A, B, C, X

Example 2: Apple, Orange, Pear or Blueberry, Raspberry, Strawberry, Grape

In the first example many people select X as the outlier. The human brain is sophisticated enough to infer a context immediately: we recognize the pattern and associate it with the domain of the English alphabet. Yet without that context there is no discernible outlier in Example 1. In the first list of Example 2 someone might choose Orange, because science suggests that a Pear is genetically closer to an Apple than an Orange is. In the second list we might guess Grape, since it’s not a berry. The fact is that no outlier exists in the examples above. For an outlier to exist there must be a measure of distance. That measure is intrinsic to numeric data types because there is a Euclidean distance between numbers.
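To make the distance point concrete, here is a minimal sketch in Python. The helper names (`distance`, `zscore_outliers`), the sample numbers and the threshold of 2 standard deviations are all assumptions chosen for illustration. It shows that a numeric column supports a distance-based outlier test out of the box, while the letters from Example 1 have no defined distance at all.

```python
import statistics

def distance(a, b):
    """Distance between two values; only meaningful when subtraction is defined."""
    return abs(a - b)

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing can stand out
    return [v for v in values if distance(v, mean) / stdev > threshold]

# Hypothetical numeric column: one value sits far from the rest.
numbers = [10, 11, 9, 10, 12, 11, 10, 95]
print(zscore_outliers(numbers))  # [95]

# The letters from Example 1 have no such measure.
letters = ["A", "B", "C", "X"]
try:
    distance(letters[0], letters[3])
except TypeError as exc:
    print("no distance defined for strings:", exc)
```

Any distance we might assign between "A" and "X" (alphabet position, keyboard layout, frequency in English) is context that we bring to the data, not something the values contain.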


K-means to the rescue? A few Google searches for categorical outliers will turn up people talking about k-means clustering. The k-means algorithm is an unsupervised learning technique designed for numeric types: it measures distance from the centroid, and the centroid needs to be a numeric value. I’ve even seen some creative solutions where a data scientist uses techniques like string indexing or one-hot encoding to convert a categorical value from a string to a numeric value and then applies k-means. In the research we’ve conducted at Owl Analytics this is not an acceptable approach, because the numeric vector assigned to the string value by one-hot encoding still lacks the domain context. It does satisfy the input constraint of the k-means model, and the model will produce an output. However, that output will not select the appropriate outlier, because the input was not truly valid.

K-modes is a promising alternative whose implementations exist mainly in the Python community today. K-modes is designed to handle categorical values without the need for string indexing or one-hot encoding.

After reviewing and testing many approaches to this problem, we developed our own methodology, which we have found to be more performant and to yield the proper outlier results across many diverse datasets. The datasets in our back testing were: address information, including states, cities, counties and zip codes; stock datasets, including asset classes, symbols and exchanges; and customer and user tables, including first name, last name, social security number and phone number. At Owl our focus is primarily to correct data for the purpose of predictive data quality. After running numerous back tests on a variety of datasets, we used concepts like term frequency and minority classes.
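Two of the points above can be checked in a few lines of Python. This is only a sketch under stated assumptions, not Owl’s methodology: the `states` column, the 5% cutoff and the helper names `one_hot`, `euclidean` and `rare_values` are all made up for illustration. It shows that one-hot encoding places every pair of distinct categories exactly the same distance apart, so the encoding itself adds no domain context, whereas a simple term-frequency count does single out a minority value.

```python
import math
from collections import Counter
from itertools import combinations

def one_hot(values):
    """Map each distinct value to an indicator vector (a single 1.0, the rest 0.0)."""
    categories = sorted(set(values))
    return {c: [1.0 if c == other else 0.0 for other in categories]
            for c in categories}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rare_values(values, max_share=0.05):
    """Return {value: count} for values whose share of rows is at most max_share."""
    counts = Counter(values)
    total = len(values)
    return {v: n for v, n in counts.items() if n / total <= max_share}

# Hypothetical state column containing one mistyped entry ("NX").
states = ["NY"] * 40 + ["CA"] * 35 + ["TX"] * 24 + ["NX"]

# Every pair of distinct categories is the same distance apart (sqrt(2)).
vectors = one_hot(states)
distances = {round(euclidean(vectors[a], vectors[b]), 6)
             for a, b in combinations(vectors, 2)}
print(distances)  # {1.414214}

# Term frequency, by contrast, separates the minority class.
print(rare_values(states))  # {'NX': 1}
```

In the encoded space "NX" is no closer to "NY" than to "CA"; the only signal that separates it is how rarely it occurs.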
Blending these techniques with traditional outlier detection methods enabled us not only to predict outliers more accurately but also to adapt over time, much as the English language does. In the real world we face phenomena like the “Urban Dictionary”, where words and terms are constantly evolving. A word like survey might migrate to surveil and then surveillance. Categorical outliers are not static, which is precisely why our study suggested the need for a blended technique.
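The idea that categorical outliers shift over time can be sketched in Python by recomputing frequencies per batch. The two batches, the 5% cutoff and the function name `rare_in_window` are assumptions for illustration only, and this is not a description of Owl’s actual technique. A term that is a minority class in an earlier batch can become ordinary in a later one, while a newer term takes its place.

```python
from collections import Counter

def rare_in_window(values, max_share=0.05):
    """Return the sorted values whose share of this batch is at most max_share."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, n in counts.items() if n / total <= max_share)

# Hypothetical batches of the same column observed at two points in time.
earlier = ["survey"] * 97 + ["surveil"] * 3
later = ["survey"] * 60 + ["surveil"] * 38 + ["surveillance"] * 2

print(rare_in_window(earlier))  # ['surveil']
print(rare_in_window(later))    # ['surveillance']
```

A fixed list of allowed values would keep flagging "surveil" forever, whereas a frequency profile that is relearned per batch stops flagging it once it becomes common.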

If you’re working on an interesting project involving categorical outliers, connect with us on LinkedIn and we’ll share more information regarding our approach.

Owl-Analytics

Predictive Data Quality — The fast and elegant way to…

Kirk Haslbeck

Written by

Founder Owl Analytics — Using Data Science to Solve Data Quality

Owl-Analytics

Predictive Data Quality — The fast and elegant way to manage data. Owl auto learns data trends to find data issues. Owl reduces most of the manual human process of writing rules to manage datasets. Use data science to solve data quality. Stop reacting. Start Predicting.

