ML to supercharge data quality validation processes

Max Lukichev
Telmai
Jul 31, 2023

Maintaining high data quality is imperative in the ever-expanding data landscape. Machine learning techniques step in, providing innovative solutions to enhance and automate data quality validation. This post delves into how machine learning can help improve the accuracy and efficiency of your data validation processes.

Here are some ways to utilize ML for data quality validation:

  1. Data profiling and feature engineering
  2. Anomaly Detection
  3. Data Cleansing
  4. Data Interpolation
  5. Deduplication
  6. Data Validations
  7. Error Prediction
  8. Continuous learning and improvement

Data profiling and feature engineering: Machine learning can assist in profiling data and extracting relevant features. By analyzing data distributions and correlations, algorithms can identify patterns, predict data quality metrics such as completeness, uniqueness, or distribution, and highlight potential quality issues or anomalies for further investigation. These features can then be used to build robust data quality validation models.
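
As a rough illustration, the sketch below (assuming pandas and a small illustrative DataFrame) computes a few simple profile features (completeness, uniqueness, and basic distribution statistics) that could feed downstream validation models:

```python
# A minimal profiling sketch with pandas; the DataFrame is illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "amount": [10.5, 20.0, 20.0, None, 8.75],
})

profile = pd.DataFrame({
    "completeness": df.notna().mean(),      # share of non-null values per column
    "uniqueness": df.nunique() / len(df),   # share of distinct values per column
    "mean": df.mean(numeric_only=True),     # basic distribution statistics
    "std": df.std(numeric_only=True),
})
print(profile)
```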

Anomaly Detection: Machine Learning algorithms can be trained to identify anomalies such as outliers and drifts in the data by analyzing historical trends and existing datasets.

Anomaly detection techniques can help uncover data quality issues such as out-of-range values, unexpected patterns, duplicates, etc.

There are two different approaches to this:

Anomalies based on historical learnings using time-series analysis: Time-series data offer a gold mine of information, and specific techniques can be employed to detect anomalies in them:

Statistical Methods: Statistical approaches such as moving averages and linear regression can be used to identify instances that deviate significantly from the expected statistical distribution of the data. The idea behind a moving average is to smooth the time-series data so that patterns, seasonal effects, and trends are easier to see. It is computed as the average of the data points in a time window that ‘moves’ along the series (hence the term ‘moving average’), which makes it particularly useful for detecting outliers or unusual spikes and drops. Any data point that deviates significantly from the moving average could be considered an anomaly.
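
As a minimal sketch of this idea (assuming pandas and an evenly spaced numeric series, synthetic here), the example below flags points that sit far from a rolling mean:

```python
# A minimal moving-average anomaly sketch; the series is synthetic.
import numpy as np
import pandas as pd

series = pd.Series(np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.1, 200))
series.iloc[120] += 3                              # inject a spike

rolling_mean = series.rolling(window=20, center=True).mean()
rolling_std = series.rolling(window=20, center=True).std()

# Points more than 3 rolling standard deviations from the moving average
anomalies = series[(series - rolling_mean).abs() > 3 * rolling_std]
print(anomalies)
```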

Linear regression: This involves fitting a line to the time-series data, explaining it through a linear relationship between time and the observed value. When a new data point deviates significantly from this linear relationship, it could be considered an anomaly.
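
A comparable sketch with NumPy, using a synthetic series for illustration, fits a trend line and flags large residuals:

```python
# A minimal linear-trend anomaly sketch; the series is synthetic.
import numpy as np

t = np.arange(100)
values = 2.0 * t + np.random.normal(0, 5, 100)
values[70] += 60                                   # inject an anomaly

slope, intercept = np.polyfit(t, values, deg=1)
residuals = values - (slope * t + intercept)
anomalies = np.where(np.abs(residuals) > 3 * residuals.std())[0]
print(anomalies)                                   # indices of flagged points
```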

Seasonal Decomposition: Time-series data often exhibit seasonal or periodic patterns. Decomposing the time series into its seasonal, trend, and residual components can help identify anomalies in the residual component.
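
For instance, a minimal sketch using statsmodels' seasonal_decompose (on a synthetic daily series with a weekly period) can flag unusually large residuals:

```python
# A minimal seasonal-decomposition sketch; the series and period are illustrative.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2023-01-01", periods=120, freq="D")
series = pd.Series(10 + 2 * np.sin(2 * np.pi * np.arange(120) / 7)
                   + np.random.normal(0, 0.3, 120), index=idx)
series.iloc[60] += 5                         # inject an anomaly

result = seasonal_decompose(series, model="additive", period=7)
resid = result.resid.dropna()

# Flag residuals more than 3 standard deviations from their mean
anomalies = resid[np.abs(resid - resid.mean()) > 3 * resid.std()]
print(anomalies)
```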

Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) networks, can capture temporal dependencies and patterns in time-series data. They can be trained to predict the next data point, and instances with high prediction errors can be considered anomalies.
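
A compact sketch of this approach, assuming TensorFlow/Keras and a synthetic series, trains a small LSTM to predict the next point and flags observations with unusually high prediction error:

```python
# A minimal LSTM prediction-error sketch; data, window size, and model are illustrative.
import numpy as np
import tensorflow as tf

def make_windows(series, window=24):
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y

series = np.sin(np.linspace(0, 40, 800)) + np.random.normal(0, 0.05, 800)
X, y = make_windows(series)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(X.shape[1], 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

# Points whose prediction error is far above the typical error are flagged
errors = np.abs(model.predict(X, verbose=0).ravel() - y)
anomalies = errors > errors.mean() + 3 * errors.std()
```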

Anomalies within a data snapshot: Anomalies within a data snapshot refer to data points or instances that deviate significantly from the expected patterns or behaviors observed in the majority of the data. Various factors, such as errors in data collection, measurement errors, system malfunctions, fraudulent activities, or rare events, can cause these anomalies.

Issues like out-of-range values are hard to catch with static validation checks and typically require either human-in-the-loop review or Machine Learning-based techniques.

Examples of such techniques include Isolation Forest, k-nearest neighbors, and one-class SVM, which can help identify outliers so they can be removed or corrected.
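
As a minimal sketch with scikit-learn (on synthetic two-dimensional data), Isolation Forest can flag the points most likely to be outliers:

```python
# A minimal Isolation Forest sketch; the feature matrix is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (200, 2)),   # normal points
               rng.uniform(-6, 6, (5, 2))])  # a few injected outliers

clf = IsolationForest(contamination=0.03, random_state=42)
labels = clf.fit_predict(X)                  # -1 marks predicted anomalies
outliers = X[labels == -1]
```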

Data Cleansing: Machine Learning algorithms can efficiently identify and extract different PII elements such as names, addresses, social security numbers, email addresses, and phone numbers from unstructured text such as medical records, customer feedback, or online reviews. ML algorithms can recognize and mask sensitive information while ensuring uniformity by standardizing formats. By leveraging Machine Learning for PII data cleaning, organizations can comply with data protection regulations, maintain data integrity, and confidently use the data for analysis, research, and business insights while respecting individual privacy.
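
A rough sketch of PII masking, assuming spaCy with its small English model (en_core_web_sm) installed, combines named entity recognition with simple regular expressions; the entity labels and patterns are illustrative, not exhaustive:

```python
# A minimal PII-masking sketch; entity labels and regex patterns are illustrative.
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_pii(text: str) -> str:
    masked = text
    # Mask named entities commonly tied to identity (reversed so offsets stay valid)
    for ent in reversed(nlp(text).ents):
        if ent.label_ in {"PERSON", "GPE", "ORG"}:
            masked = masked[:ent.start_char] + f"[{ent.label_}]" + masked[ent.end_char:]
    # Regex patterns for emails and US-style SSNs
    masked = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", masked)
    masked = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", masked)
    return masked

print(mask_pii("John Smith (john.smith@example.com, SSN 123-45-6789) visited Boston."))
```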

Data Interpolation: Machine Learning Techniques can also be applied to interpolate missing or incorrect data values. By learning patterns from existing data, Machine Learning models can predict and fill in missing values, reducing data incompleteness and improving overall data quality.

For example, a forecasting library such as Prophet can be particularly effective for estimating missing data points in time-series datasets, such as forecasting future sales in retail or predicting web traffic for resource allocation. By learning from historical patterns and trend behavior, Prophet can fill gaps and provide more accurate and reliable predictions.
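
A minimal sketch, assuming the prophet package and a synthetic daily series with a gap, fits the model on the observed points and fills the missing values with its predictions:

```python
# A minimal gap-filling sketch with Prophet; the series is synthetic.
import pandas as pd
from prophet import Prophet

df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=90, freq="D"),
    "y": range(90),
}).astype({"y": float})
df.loc[40:45, "y"] = None            # simulate missing observations

m = Prophet()
m.fit(df.dropna())                   # fit on the observed points only
forecast = m.predict(df[["ds"]])     # predict over the full date range

# Fill the gaps with Prophet's fitted values
filled = df["y"].fillna(pd.Series(forecast["yhat"].values, index=df.index))
```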

Deduplication: Record deduplication identifies duplicated entities in a dataset. Such records are not always identical: names can be spelled differently, and addresses can differ because they were entered at different points in time. But when all evidence (attributes) is considered together, a person or a system can decide with high certainty whether two records describe the same entity (e.g., a person).

The classic approach to this problem was to write many matching rules (e.g., if the name and address are the same, then merge the records because they describe the same person); the rules could become very complex. Even then, it would be hard to verify that all duplicates were handled, and handled correctly.

Machine Learning can be a great help in such a task. Whatever approach is taken, however, it needs to account for the fact that training data will be limited. One approach in this case is Active Learning, where human input is used to retrain the model after each answer, which can dramatically reduce the need for an extensive training corpus. Active learning is no silver bullet, though, and in many applications it has not produced good results, because you cannot make up information from nothing. But for record deduplication specifically, it has proved very effective.
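
A rough sketch of such an active-learning loop, not tied to any specific deduplication library (the records, pairwise features, and labeling prompt are illustrative), might look like this: the model proposes the most uncertain record pair, a human labels it, and the model is retrained.

```python
# A minimal active-learning deduplication sketch; records and features are illustrative.
from difflib import SequenceMatcher
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier

records = [
    {"name": "Jon Smith",  "zip": "02139"},
    {"name": "John Smith", "zip": "02139"},
    {"name": "Ann Lee",    "zip": "94105"},
    {"name": "Anne Lee",   "zip": "94105"},
    {"name": "Bob Ray",    "zip": "10001"},
]
pairs = list(combinations(records, 2))

def featurize(a, b):
    # Name similarity plus exact ZIP match as simple pairwise features
    return [SequenceMatcher(None, a["name"], b["name"]).ratio(),
            float(a["zip"] == b["zip"])]

X = np.array([featurize(a, b) for a, b in pairs])
y = np.full(len(X), -1)                      # -1 = not yet labeled by a human

clf = RandomForestClassifier(random_state=0)
for _ in range(4):                           # a few labeling rounds
    labeled = y != -1
    if labeled.sum() < 2 or len(set(y[labeled])) < 2:
        ask = int(np.where(~labeled)[0][0])  # bootstrap: ask about any unlabeled pair
    else:
        clf.fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[~labeled])[:, 1]
        ask = int(np.where(~labeled)[0][np.argmin(np.abs(proba - 0.5))])
    a, b = pairs[ask]
    answer = input(f"Same entity? {a['name']} / {b['name']} (0/1): ")
    y[ask] = int(answer)                     # the human label retrains the model
```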

Data Validations: Data validation checks verify the integrity and quality of data. Traditionally, these checks are implemented using programming languages, SQL queries, data validation tools, or dedicated data quality management platforms.

This approach requires constant monitoring and manual adjustments to the threshold as data trends change over time.

A Machine Learning-Based Automatic Threshold Approach addresses this gap. With an ML-based automatic threshold approach, you can leverage historical data to train a model that learns the patterns and automatically determines the acceptable threshold. The ML model can analyze historical data, identify patterns, and calculate the threshold dynamically based on statistical measures, such as standard deviation or percentile ranges. The model can automatically flag any deviation beyond the determined threshold as potential data quality issues.
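
As a minimal sketch of the idea (using a rolling mean and standard deviation over a synthetic history of a daily row-count metric), the acceptable range can be derived from the data itself rather than hard-coded:

```python
# A minimal dynamic-threshold sketch; the metric history is synthetic.
import numpy as np
import pandas as pd

history = pd.Series(np.random.normal(10_000, 300, 90))   # past daily row counts
window = history.rolling(30)

# Dynamic bounds from the rolling mean and standard deviation
lower = window.mean() - 3 * window.std()
upper = window.mean() + 3 * window.std()

todays_value = 8_200
if todays_value < lower.iloc[-1] or todays_value > upper.iloc[-1]:
    print("Potential data quality issue: metric outside learned threshold")
```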

Benefits of Machine Learning-Based Automatic Threshold:

  1. Adaptability: Machine Learning models can adapt to changing data patterns and adjust the threshold automatically, reducing the need for manual threshold adjustments.
  2. Flexibility: Machine Learning models can consider various factors, such as seasonality, promotions, or market trends, to determine the appropriate threshold dynamically.
  3. Efficiency: Machine Learning-based approaches can process large volumes of data quickly, allowing for real-time or near real-time analysis and identification of data quality issues.
  4. Accuracy: Machine Learning models can capture complex patterns and relationships in the data, enabling more accurate identification of outliers or unusual data points.

While Machine Learning models offer powerful tools for data analysis, their effectiveness hinges on customization. This isn’t a one-size-fits-all game. The Machine Learning model for automatic threshold determination requires historical training data and regular monitoring. The collaborative efforts of data scientists, domain experts, and data stewards truly fine-tune the system: interpreting Machine Learning results, validating thresholds, and managing exceptional cases or domain-specific considerations. Remember, there’s always a critical need for a platform that lets these key players efficiently tailor models to each unique situation. In the realm of data, flexibility and adaptability are vital.

Error Prediction: Machine Learning models are akin to our memory. They learn from historical events, in this case, data errors and quality issues. By meticulously training on past data quality problems and the intricate patterns surrounding their characteristics, Machine Learning can go beyond merely reacting to data errors. It becomes a valuable, proactive shield that predicts potential issues before they take root.

Consider a manufacturing data pipeline where sensor data is constantly fed into the system. If inconsistencies or outliers have occurred in the past under specific conditions, Machine Learning models can recognize those conditions and alert stakeholders to a potential error, almost like a data guardian angel foreseeing trouble. This can save significant resources that would otherwise be spent rectifying the error down the line.
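
As a rough sketch (with synthetic features standing in for sensor readings), a classifier trained on historical batches that were later flagged as erroneous can score incoming batches by their risk of containing errors:

```python
# A minimal error-prediction sketch; features and labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical features: sensor temperature, batch size, share of null readings
X = rng.normal(size=(500, 3))
y = (X[:, 2] > 1.0).astype(int)      # past batches flagged as erroneous

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Probability that an incoming batch will contain errors; alert above a cutoff
risk = clf.predict_proba(X_test)[:, 1]
alerts = risk > 0.8
```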

Continuous learning and improvement

Continuous monitoring and drift detection: Machine Learning models can continuously monitor data quality metrics and detect drifts or changes in the data distribution over time. They can learn from historical data to identify when data quality deviates from expected patterns and trigger alerts or corrective actions.
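
One simple way to sketch drift detection is a two-sample Kolmogorov-Smirnov test from SciPy, comparing a historical baseline with the latest batch of values (the data here is synthetic):

```python
# A minimal drift-detection sketch with a KS test; both samples are synthetic.
import numpy as np
from scipy import stats

baseline = np.random.normal(50, 5, 1_000)    # historical distribution
latest = np.random.normal(55, 5, 200)        # new batch with a shifted mean

statistic, p_value = stats.ks_2samp(baseline, latest)
if p_value < 0.01:
    print(f"Drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
```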

Feedback-driven improvement: Machine Learning models can incorporate feedback from data stewards or domain experts to improve performance. Feedback can be used to refine models, update rules, and adapt to evolving data quality requirements.

Summary

Applying Machine Learning techniques for data quality validation requires quality training data, appropriate feature engineering, and careful model selection and evaluation. It’s important to note that while ML can significantly enhance data quality management, it should be used in conjunction with human expertise and validation. Human involvement is crucial for interpreting results, understanding context-specific nuances, and making informed decisions regarding data quality. Collaboration between data scientists, data stewards, and domain experts is critical to successfully applying Machine Learning techniques for supercharging data quality.
But by incorporating Machine Learning into data quality validation processes, organizations can streamline and automate validation tasks, accelerate error detection and correction, and improve overall data accuracy and reliability at a fraction of the cost.

This article was co-authored by Anoop Gopalam and originally published on Telmai.
