Improving Data Quality: Anomaly Detection Made Simple

Ariela Douglas
Fulcrum Analytics
Sep 8, 2020

From business managers to data scientists to UX developers, anyone who works with data knows anomalies can be a chore to find and an even bigger chore to resolve. Incorrect or faulty data can cause a business to miss revenue opportunities or to make poor decisions based on erroneous analysis. For organizations whose processes rely on unifying customer or employee data sources, or that are facing regulatory scrutiny, managing the risks associated with data quality is always a high priority.

Many organizations rely on operational business rules during the ETL process to flag data that falls outside predefined parameters, but such business rules cannot capture nuanced fluctuations in the data, such as a long stretch of borderline-acceptable values or a continual but gradual shift in the mean or standard deviation of the data.
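To make that limitation concrete, here is a minimal sketch (not from the article) contrasting a fixed business-rule threshold with a simple one-sided CUSUM-style drift check over hypothetical daily batch means; all data, thresholds, and constants are illustrative assumptions.

```python
import numpy as np

# Hypothetical daily batch means with a slow upward drift that never breaches
# a static business-rule limit on any single day.
rng = np.random.default_rng(0)
batch_means = 100 + 0.4 * np.arange(60) + rng.normal(0, 1.0, 60)

STATIC_UPPER = 130.0                      # fixed ETL business-rule threshold
rule_flags = batch_means > STATIC_UPPER   # the static rule never fires here

# One-sided CUSUM-style drift check: accumulate small, consistent deviations
# from the historical baseline and flag once the cumulative sum grows large.
baseline_mean, baseline_std = 100.0, 1.0
cusum, drift_flagged_at = 0.0, None
for day, value in enumerate((batch_means - baseline_mean) / baseline_std):
    cusum = max(0.0, cusum + value - 0.5)             # 0.5 = slack for noise
    if cusum > 5.0 and drift_flagged_at is None:      # 5.0 = decision limit
        drift_flagged_at = day

print("static rule fired:", bool(rule_flags.any()))
print("drift check flagged on day:", drift_flagged_at)
```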

Accurately detecting and dealing with data anomalies is often tricky, but it doesn’t have to be.

Especially within the world of highly regulated banking, the accuracy of data is of critical importance. A Data Anomaly Detector can innovate processes within a bank’s development environment to optimize data quality by:

  • Detecting anomalous distributions
  • Utilizing multiple detection algorithms (a brief sketch follows this list)
  • Alerting users to anomalies via a trigger notification system
  • “Learning” over time through model training
  • Measuring the impact of configuration/threshold changes on alert triggers
  • Executing on a regular schedule
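As a rough illustration of the first two points, the sketch below combines a per-column z-score check with scikit-learn’s IsolationForest over hypothetical batch-level aggregates. The column choices, thresholds, and data are assumptions for illustration, not the DAD’s actual implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-batch aggregates computed during ETL, one row per batch:
# columns are [row_count, null_rate, mean_transaction_amount].
rng = np.random.default_rng(1)
history = np.column_stack([
    rng.normal(50_000, 1_000, 200),
    rng.normal(0.02, 0.005, 200),
    rng.normal(120.0, 5.0, 200),
])
new_batch = np.array([[48_500, 0.09, 118.0]])  # null rate far above normal

# Detector 1: per-column z-scores of the new batch against historical batches.
mu, sigma = history.mean(axis=0), history.std(axis=0)
z_flag = bool((np.abs((new_batch - mu) / sigma) > 3.0).any())

# Detector 2: Isolation Forest over the joint distribution of the aggregates.
forest = IsolationForest(random_state=0).fit(history)
forest_flag = bool(forest.predict(new_batch)[0] == -1)

# In a full DAD, a positive result from either detector would feed the
# trigger notification system and the dashboard report.
if z_flag or forest_flag:
    print("ALERT: batch aggregates look anomalous",
          {"zscore": z_flag, "forest": forest_flag})
```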

Before deployment, validating incoming data requires manually writing and testing rules against aggregate values of each incoming data batch. A Data Anomaly Detector (DAD) improves the data anomaly detection process by working in parallel with the ETL process, alerting the operations team to any anomalies in need of attention and tracking them in a dashboard report. Data engineers can review and compare the flags and alerts of the two systems as they attend to the anomalies. As part of deployment testing, artificial anomalies should be introduced to help tune the configurations.
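One way that tuning step might look, as a hedged sketch: inject synthetic anomalies into a copy of a known-good batch and check how many are recovered at each candidate threshold. The helper, thresholds, and data here are all assumed for illustration.

```python
import numpy as np

def inject_anomalies(values, frac=0.005, scale=8.0, rng=None):
    """Return a corrupted copy of a clean column plus the injected indices."""
    rng = rng or np.random.default_rng(0)
    corrupted = values.copy()
    idx = rng.choice(len(values), size=max(1, int(frac * len(values))), replace=False)
    corrupted[idx] *= scale
    return corrupted, idx

# Hypothetical clean column of transaction amounts from a known-good batch.
rng = np.random.default_rng(0)
clean = rng.normal(120.0, 5.0, 10_000)
corrupted, injected_idx = inject_anomalies(clean, rng=rng)

# Sweep candidate z-score thresholds and see how many injected anomalies each
# one recovers, which helps calibrate the configuration before go-live.
mu, sigma = clean.mean(), clean.std()
for threshold in (2.5, 3.0, 4.0):
    flagged = np.flatnonzero(np.abs((corrupted - mu) / sigma) > threshold)
    recall = len(np.intersect1d(flagged, injected_idx)) / len(injected_idx)
    print(f"threshold={threshold}: {len(flagged)} records flagged, "
          f"{recall:.0%} of injected anomalies caught")
```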

An alternative setup is to direct data conditional on the DAD test outcomes. This is a more automated system where only validated “Good” batches are sent through the ETL process and into production, while the batches with questionable patterns or records are routed to staging for review and release prior to being integrated into the production-level dataset.
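In code, that conditional routing might look something like the sketch below; the function names and the trivial null-rate detector are placeholders standing in for a bank’s own pipeline steps, not a prescribed design.

```python
from enum import Enum

class Verdict(Enum):
    GOOD = "good"
    REVIEW = "review"

# Placeholder hooks standing in for the bank's own pipeline steps.
def load_to_production(batch_id):
    print(f"{batch_id}: loaded into the production-level dataset")

def move_to_staging(batch_id):
    print(f"{batch_id}: held in staging for review and release")

def route_batch(batch_id, aggregates, is_anomalous):
    """Send validated batches on through ETL; park questionable ones in staging."""
    if is_anomalous(aggregates):
        move_to_staging(batch_id)
        return Verdict.REVIEW
    load_to_production(batch_id)
    return Verdict.GOOD

# Example: a trivial detector that questions batches with a high null rate.
route_batch("batch-0421", {"null_rate": 0.08},
            is_anomalous=lambda agg: agg["null_rate"] > 0.05)
```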

Through this deployment, a bank would be able to improve the accuracy of its business reporting by removing erroneous data before it is published and used at the production level. Going forward, the DAD parameters should be modified and calibrated by the client’s own team for maximum customization and sensitivity control.

An optimized DAD Database would contain the configuration of the machine-learning code, the user-specified Batch/Column expectations, and the record of the job and alert history. Though this example refers to the banking space, an automated Data Anomaly Detector transforms the formerly tedious task of troubleshooting outlier data for any industry, leading to data-driven decision making that is both faster and more accurate.
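As a purely illustrative sketch of those three kinds of content, here is one minimal table layout; every table and column name below is an assumption, not the DAD’s actual schema.

```python
import sqlite3

# Illustrative only: configuration, Batch/Column expectations, and
# job/alert history kept in a small SQLite database.
schema = """
CREATE TABLE IF NOT EXISTS detector_config (
    config_id   INTEGER PRIMARY KEY,
    algorithm   TEXT NOT NULL,      -- e.g. 'zscore', 'isolation_forest'
    parameters  TEXT NOT NULL,      -- JSON blob of thresholds and settings
    active_from TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS column_expectations (
    expectation_id INTEGER PRIMARY KEY,
    table_name     TEXT NOT NULL,
    column_name    TEXT NOT NULL,
    expected_min   REAL,
    expected_max   REAL,
    max_null_rate  REAL
);
CREATE TABLE IF NOT EXISTS job_history (
    job_id   INTEGER PRIMARY KEY,
    batch_id TEXT NOT NULL,
    run_at   TEXT NOT NULL,
    verdict  TEXT NOT NULL          -- 'good' or 'review'
);
CREATE TABLE IF NOT EXISTS alert_history (
    alert_id    INTEGER PRIMARY KEY,
    job_id      INTEGER REFERENCES job_history(job_id),
    column_name TEXT,
    detail      TEXT
);
"""

with sqlite3.connect("dad.db") as conn:
    conn.executescript(schema)
```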

To read more pieces like this one, check out our blog.

Ariela Douglas
Content and Marketing Specialist at Fulcrum Analytics