Discovering the Keys to Solving for Data Quality Analysis in Streaming Time Series Datasets
Focus on Data Quality for High Value IIoT and IoT Business Outcomes
by Chris Herrera
Data Quality for Time Series is Challenging
Ok, I’ll frankly admit it: working with and analyzing time series datasets to achieve high-value business outcomes can present substantial challenges and, at times, can be extremely frustrating.
Within Hashmap Labs, we are working to simplify ways to approach this issue, and today, I want to specifically provide insights into mechanisms for data quality analysis in streaming time series datasets.
The premise we will operate under is that the dimensions of data quality that need to be evaluated, and potentially remediated, are highly dependent on the type of analytics to be performed on that data.
What Issues Will I Run Into?
There are numerous potential sources of issues with acquired time series data. We will start by looking at four key problem dimensions of data quality: Validity, Completeness, Precision, and Timeliness.
Issues in any of these dimensions lead to problems when attempting to analyze the data.
Manufacturing Use Case Example for Time Series Data Quality
To take a high-level view of the problem, let’s assume that we are attempting to correlate data from a pricing system with a stream of data showing reduced output from one of our manufacturing lines.
In this case, we have the issue of ensuring there is enough metadata to generate criteria to join on. Additionally, we must ensure the units are correct so that we can accurately report the impact in the analysis, and finally, we must ensure that the data being used in this process was not calculated from invalid data.
The key issue here is: how do you ensure that the data being received is not contrived, altered, or inaccurate, whether through human error or sensor malfunction?
We will dive into methods for assessing relative quality and beginning to remediate it; however, it should be noted that when this high-level situation was posited, it was done in an order that should not be missed: the analysis type was given before the required data was noted.
This is an important attribute of the problem statement.
The reason is that…
An organization should not embark on the incredibly expensive and never-ending path of ensuring…
1. that every single point is captured
2. at the highest quality
3. with the lowest latency
without the use case to back it up.
Driving Data Quality Correction with a Use Case
Definition of Real Time
It is at this point that we need to define what real-time means. It has multiple meanings depending on the context, but in data flow it has a concrete definition:
Real time can be defined as the actual time during which a process or event occurs.
This is very important when dealing with questions of latency, for example. The reason is that reducing latency often comes with a cost, whether in implementation complexity, completeness of the data, or the power of the infrastructure hosting the system.
Definition of “Good”
Many assume that automatically fixing data quality issues is the correct way to proceed; however, perceived fixes do not always improve data quality.
For example, automatically interpolating data to align it to a stride or to fill gaps means that an application further down the analytical pipeline that attempts to identify a trend or seasonal component, and that decimates the data or smooths it using averaging, will perform its analysis on a dataset derived over two generations of computation rather than one.
Similarly, attempting to historize the data by using outlier detection to flag sensor faults and automatically remove them will result in “interesting” data points being discarded, precisely because those anomalous points are often exactly what you are looking for.
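For instance, a simple z-score filter of the kind often used for automatic outlier removal (a minimal sketch with made-up readings and an assumed threshold, not a recommended cleansing rule) will happily discard the very excursion a fault analysis is trying to find:

```python
import numpy as np

# Hypothetical pressure readings: steady operation plus one genuine excursion.
readings = np.array([100.1, 99.8, 100.3, 100.0, 99.9, 137.5, 100.2, 100.1])

# Typical automatic cleansing: drop anything more than 2 standard deviations
# from the mean.
z_scores = (readings - readings.mean()) / readings.std()
cleansed = readings[np.abs(z_scores) < 2]

print(cleansed)  # the 137.5 excursion, the point of interest, has been removed
```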
Additionally, performing a panel analysis on a dataset that was aligned, where the majority of the data was interpolated, could skew the results because the analysis rests largely on contrived data (see the Validity dimension above).
There is a high chance that a “good” dataset that was automatically cleansed will result in the output of the analysis (the end result) being derived from predominantly manufactured data.
This is an example of introducing interpolation bias into the sampling system. Without knowing why a particular interpolation method was chosen, or how the stride was chosen, there is no way to account for how those choices have altered the results.
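To make the two-generations problem concrete, here is a minimal pandas sketch (hypothetical timestamps and values, and an assumed 5-minute smoothing window) in which a gap is first interpolated and then smoothed downstream, so the “trend” inside the gap is computed entirely from manufactured points:

```python
import numpy as np
import pandas as pd

# Hypothetical 1-minute sensor readings with a 10-minute gap (sensor offline).
idx = pd.date_range("2019-01-01 00:00", periods=30, freq="1min")
raw = pd.Series(np.sin(np.linspace(0, 3, 30)) * 10 + 100, index=idx)
raw.iloc[10:20] = np.nan  # the gap

# Generation 1: "automatic cleansing" fills the gap by linear interpolation.
cleansed = raw.interpolate(method="linear")

# Generation 2: a downstream application smooths the data to extract a trend.
gap = raw.isna()
trend_from_raw = raw.rolling("5min").mean()
trend_from_cleansed = cleansed.rolling("5min").mean()

# Inside the gap, the smoothed values from the cleansed series look plausible,
# but every input to them was manufactured by the interpolation step.
print(pd.DataFrame({"from_raw": trend_from_raw[gap],
                    "from_cleansed": trend_from_cleansed[gap]}))
```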
Is There a Data Quality Issue?
A key step, after determining the analysis type that will be performed, is identifying whether there is a data quality issue at all.
Again, referring to the definition of “good” above, there is a high chance that an automatically cleansed dataset will lead to an analysis derived from predominantly manufactured data, and will potentially introduce bias into the sampling system.
There are a number of other potential issues that could arise when trying to determine the quality of a time series dataset. For example, ideally every data quality metric would be an actionable metric; however, with large datasets and multiple dimensions of quality, the number of potential metrics could be in the hundreds, making actionable information difficult to achieve.
Distilling Quality Down to Key Dimensions
In order to effectively analyze the relative quality of a dataset, distilling the many potential issues down to a few key dimensions is an essential first step. For this, we will use the dimensions detailed in the introduction:
- Validity
- Completeness
- Precision
- Timeliness
This allows you to build a composite quality score for the entire set, which in turn allows a user or data manager to begin to understand which dimensions are adversely affecting the data. Again, these dimensions and their relative weights in the composite score will not be constant across datasets because, as mentioned above, there are a number of cases where a “perceived” data quality issue is not an issue at all.
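As a rough sketch of what such a composite score might look like (the dimension scores, weights, and names below are illustrative assumptions, not a prescribed formula):

```python
from dataclasses import dataclass


# Illustrative per-dimension scores in [0, 1]; how each score is computed is
# dataset- and use-case-specific.
@dataclass
class DimensionScores:
    validity: float      # e.g. share of points passing range/type checks
    completeness: float  # e.g. share of expected points actually present
    precision: float     # e.g. share of points at the required resolution
    timeliness: float    # e.g. share of points arriving within the SLA


def composite_score(scores: DimensionScores,
                    weights: dict[str, float]) -> float:
    """Weighted average of the dimension scores.

    The weights are chosen per dataset and per analysis type; they are not
    constant across datasets.
    """
    total = sum(weights.values())
    return sum(getattr(scores, dim) * w for dim, w in weights.items()) / total


# Hypothetical example: an analysis that is sensitive to gaps but tolerant of
# late-arriving data weights completeness heavily and timeliness lightly.
scores = DimensionScores(validity=0.97, completeness=0.82,
                         precision=0.99, timeliness=0.65)
weights = {"validity": 0.3, "completeness": 0.4,
           "precision": 0.2, "timeliness": 0.1}
print(f"Composite quality score: {composite_score(scores, weights):.2f}")
```

The particular arithmetic matters less than the fact that the weights are an explicit, reviewable choice driven by the analysis type rather than a hidden default.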
Once the criteria for each dimension are determined, a further drilldown can be completed. For example, for completeness, the following factors could be tracked:
1. Empty fields
2. Missing unit of measure
3. Missing time zone
These factors can also be measured on each individual tag, allowing analysis at every level of the data acquisition and management system.
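A minimal sketch of that per-tag drilldown, assuming streamed records arrive as dictionaries with hypothetical field names (tag, value, unit, timezone):

```python
from collections import defaultdict

# Hypothetical streamed records; the field names are assumptions for
# illustration, not a required schema.
records = [
    {"tag": "line1.flow", "value": 12.3, "unit": "m3/h", "timezone": "UTC"},
    {"tag": "line1.flow", "value": None, "unit": "m3/h", "timezone": "UTC"},
    {"tag": "line1.temp", "value": 81.0, "unit": None,   "timezone": None},
]


def completeness_by_tag(records):
    """Count completeness factors (empty fields, missing unit of measure,
    missing time zone) per tag."""
    counts = defaultdict(lambda: {"records": 0, "empty_value": 0,
                                  "missing_unit": 0, "missing_timezone": 0})
    for rec in records:
        c = counts[rec["tag"]]
        c["records"] += 1
        c["empty_value"] += int(rec.get("value") is None)
        c["missing_unit"] += int(rec.get("unit") is None)
        c["missing_timezone"] += int(rec.get("timezone") is None)
    return dict(counts)


for tag, c in completeness_by_tag(records).items():
    issues = c["empty_value"] + c["missing_unit"] + c["missing_timezone"]
    print(tag, c, f"-> completeness issues per record: {issues / c['records']:.2f}")
```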
Final Thoughts on Data Quality for Time Series
Data quality is not a simple IT task with global rules, and there is no packaged application that can fix every data quality problem. Data quality is as much a domain issue as it is a technology one.
It is easy to fall into the trap of automatically turning a “bad” data point into a “good” one; however, it is imperative that some thought be given to the downstream applications and analyses.
Additionally, it is of paramount importance that high-level, actionable metrics are generated at various levels of granularity to allow a data manager or analyst to rectify the problem.
Your Feedback is Always Welcome!
Please let me know if you have any questions or comments as you look to drive business outcomes with higher-quality time series data analysis, and please reach out for any questions or use case discussions on our Tempus IIoT/IoT Framework, which provides a quick path to high-value outcomes and rapid application creation when dealing with time series datasets.
Feel free to share on other channels, and be sure to keep up with all new content from Hashmap at https://medium.com/hashmapinc.
Chris Herrera is a Senior Enterprise Architect at Hashmap working across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers. You can follow Chris on Twitter @cherrera2001 and connect with him on LinkedIn at linkedin.com/in/cherrera2001.