Why Basic Outlier Detection Doesn’t Work

Brian Mearns
Published in CollibraDQ
2 min read · Sep 8, 2018

Using statistics to solve data quality problems is a compelling idea. It is especially compelling if you’ve tried a rules-based solution and ended up managing a large, static rule base. Businesses need a way to surface incorrect data values that goes beyond checking for null, empty, missing, and malformed values. They also need a way to evolve with their ever-changing data. Consider the following simple examples:

Example 1
Example 2

Columnar statistics have been around for a long time and offer great insight into descriptive analytics problems. In the examples above, however, column-level statistics provide very little value in solving a data quality problem. HMNY, a penny stock, commonly trades at around 0.02 dollars a share, while Berkshire Hathaway trades at around 321,000 dollars a share. The min, max, mean, and so on of a price column that mixes the two tell us almost nothing of value. This is where basic outlier detection based solely on column-level analysis falls short: it produces a poor signal-to-noise ratio and leaves the end user without compelling insight.

To solve this more elegantly, we need to broaden the scope of the problem and look at neighboring columns. The team at Owl Analytics is passionate about solving this problem. We are constantly evolving algorithms that fitness-test the surrounding columns to measure the strength of the relationships between them. The internal optimizer determines the best path based on the lowest error rate and begins to learn from the column values above and below as well as to the left and right. The longer Owl observes a dataset, the smarter it gets. This approach has shown significant benefits over rules for weather trends, energy trends, financial data, and engine data.

Connect with us on LinkedIn or at Owl-Analytics.com to see more examples of how ML can be applied in practice to data quality.
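To make the contrast between column-level and neighbor-aware checks concrete, here is a minimal Python sketch. It is not Owl's algorithm. The rows, the function names, and the thresholds are all hypothetical, and the grouped check simply uses a common median/MAD "modified z-score" conditioned on one neighboring column (the symbol). A price of 2.00 for HMNY is planted as the bad value.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Hypothetical rows of (symbol, price); the last row is a planted error.
rows = [
    ("HMNY", 0.02), ("HMNY", 0.021), ("HMNY", 0.019), ("HMNY", 0.02),
    ("BRK.A", 321000.0), ("BRK.A", 320500.0),
    ("BRK.A", 321800.0), ("BRK.A", 320900.0),
    ("HMNY", 2.0),  # roughly 100x the usual HMNY price
]

def column_outliers(rows, threshold=3.0):
    """Flag rows whose z-score over the whole price column exceeds threshold."""
    prices = [price for _, price in rows]
    mu, sigma = mean(prices), pstdev(prices)
    if sigma == 0:
        return []
    return [row for row in rows if abs(row[1] - mu) / sigma > threshold]

def grouped_outliers(rows, threshold=3.5):
    """Flag rows far from their own symbol's median (modified z-score)."""
    by_symbol = defaultdict(list)
    for symbol, price in rows:
        by_symbol[symbol].append(price)

    flagged = []
    for symbol, price in rows:
        med = median(by_symbol[symbol])
        # Median absolute deviation within this symbol's prices.
        mad = median(abs(p - med) for p in by_symbol[symbol])
        if mad > 0 and 0.6745 * abs(price - med) / mad > threshold:
            flagged.append((symbol, price))
    return flagged

print("column-level:", column_outliers(rows))
print("grouped by symbol:", grouped_outliers(rows))
```

On this toy data the column-level check flags nothing, because the spread between the two symbols swamps the error, while the grouped check flags only the planted HMNY row. A real system would still have to discover which neighboring columns are worth conditioning on, which this sketch takes as given.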

Visit www.owl-analytics.com for more information.

Brian Mearns
CollibraDQ

Co-Founder and Engineer. Interested in Solving Problems to Save Time and Money. www.linkedin.com/company/owl-analytics