Data Quality Issues and How to Fix Them

Rohan Shah
Google Cloud - Community
Dec 30, 2022
Is your data good enough?

It’s often claimed that data has overtaken oil as the world’s most valuable resource. Whether or not this is true, it’s important to note that, just like oil, data is useless until it’s mined and handled (i.e. analysed) in the right way.

Data-driven organisations depend on modern technologies and AI to get the most out of their data assets. According to O'Reilly's report on the state of data quality in 2020, 56% of organisations face at least four different types of data quality issues, while 71% face at least three different types.

Organisations currently spend a great deal of time, resources and money designing data quality frameworks and fixing data quality issues. But to get good results, it is important for them to understand the exact nature of these issues and how they end up in the system in the first place.

Moreover, complex data quality frameworks are designed and advanced technology is adopted to ensure fast and accurate data quality management. All these efforts are made in the hope of making the clean-data dream come true. But none of this is possible without understanding what is polluting the data in the first place and where exactly it is coming from. 100% accuracy and completeness don't exist, and they are not the point. Instead, the point is to pick your battles and improve quality to an acceptable threshold.

Let's look at some of the Data Quality (DQ) issues that are commonly present in a company's organisational data.

1. Data is Incomplete

This is the most common issue organisations come across when dealing with DQ. Typical causes include key columns with missing or insufficiently detailed information, information not being available for every attribute or record, and partial failures of the ETL jobs that generate the data for some streams. For example, zip codes may be missing for some records because the method used to derive them fails for those records.

The best way to fix this is to put in place a reconciliation framework control. The control would check the number of records passing through your analytical layers and alert when records have gone missing.
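
As a minimal sketch of such a control (the DataFrame names, the two-layer setup and the alerting hook are all assumptions, not part of any specific framework), a record-count reconciliation between two layers could look like this:

```python
import pandas as pd

def reconcile_counts(source_df: pd.DataFrame, target_df: pd.DataFrame,
                     layer_name: str, tolerance: int = 0) -> bool:
    """Compare record counts between two pipeline layers and flag missing records."""
    missing = len(source_df) - len(target_df)
    if missing > tolerance:
        # Replace print with your own alerting channel (email, Slack, monitoring tool).
        print(f"[ALERT] {layer_name}: {missing} records lost between layers")
        return False
    return True

# Usage: source_df and target_df are hypothetical extracts of the same feed
# at two points in the pipeline (e.g. landing zone vs. analytical layer).
```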

2. Data Overload

While we focus on data-driven analytics and its benefits, too much data might not seem like a data quality issue, but it is. Overwhelming the system with loads of data buries the key insights under irrelevant data. The additional overhead of capturing, organising and sorting all of it is not only expensive but also ineffective. The sheer volume makes it difficult to analyse trends and patterns, identify outliers and introduce changes, because of the amount of time it takes. Data coming from different sources needs to be cleaned by filtering out the irrelevant parts and organising what remains; this keeps your data relevant yet complete. Other data quality issues also become more severe as volumes grow, especially with streaming data and large files or databases.

To fix this, data-capturing principles need to be agreed upon. Each data attribute should have an end goal; otherwise, it should not be captured. This ensures that only required and relevant data is captured, which avoids overloading the system with too much data.
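
As a rough sketch of such a principle in code (the allow-list below is hypothetical and would come from your own governance process), you could flag and drop columns that have no agreed end goal before loading:

```python
import pandas as pd

# Hypothetical allow-list of attributes that have an agreed business purpose.
AGREED_ATTRIBUTES = {"customer_id", "order_id", "order_date", "amount", "zip_code"}

def enforce_capture_principles(df: pd.DataFrame) -> pd.DataFrame:
    """Report columns captured without an agreed purpose, then keep only the agreed ones."""
    unexpected = set(df.columns) - AGREED_ATTRIBUTES
    if unexpected:
        # Surface these to the data owner instead of silently ingesting them.
        print(f"Columns captured without an agreed end goal: {sorted(unexpected)}")
    return df[[c for c in df.columns if c in AGREED_ATTRIBUTES]]
```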

3. Duplicate Data

This issue looks straightforward to spot but is quite tricky to fix. Modern organisations face an onslaught of data from all directions: local databases, cloud data lakes and streaming data. Additionally, they may have application and system silos. There is bound to be a lot of duplication and overlap across these sources. If a critical attribute is populated with dirty, duplicated data, it will break key downstream processes, and it can cause other DQ issues too. Duplicated contact details, for example, affect customer experience significantly: marketing campaigns suffer if some prospects are missed out while others are contacted again and again. Duplicate data increases the probability of skewed analytical results and, as training data, it can also produce skewed ML models.

To fix this, we can follow one of two recommended approaches. The first is to create a master data management control, even one as basic as a uniqueness check. This control looks for exact duplicates of records and purges all but one, and it can send a notification about the other record to the data engineer or steward for investigation. The second is rule-based data quality management, which helps you keep a check on duplicate and overlapping records. With predictive DQ, rules are auto-generated and continuously improved by learning from the data itself. Predictive DQ identifies fuzzy and exactly matching data, quantifies it into a likelihood score for duplicates, and helps deliver continuous data quality across all applications.
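
A minimal sketch of the first approach, a uniqueness check plus a crude similarity score for near-duplicates (only pandas and the standard-library difflib are used; the column names and the 0.9 threshold are assumptions, and the pairwise loop is for illustration only, not for large tables):

```python
import pandas as pd
from difflib import SequenceMatcher

def dedupe_contacts(df: pd.DataFrame) -> pd.DataFrame:
    """Purge exact duplicate contacts and flag likely fuzzy duplicates for review."""
    # 1. Exact duplicates: keep the first record, drop the rest.
    df = df.drop_duplicates(subset=["first_name", "last_name", "email"], keep="first")

    # 2. Fuzzy duplicates: score pairwise similarity on a combined key (O(n^2), demo only).
    keys = (df["first_name"] + " " + df["last_name"] + " " + df["email"]).str.lower().tolist()
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            score = SequenceMatcher(None, keys[i], keys[j]).ratio()
            if score > 0.9:  # likelihood threshold is a tuning decision
                print(f"Possible duplicate (score {score:.2f}): {keys[i]!r} vs {keys[j]!r}")
    return df
```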

4. Inconsistent Keys

Imagine building a new data warehouse with Primary & Surrogate keys for your core data model. Once the data warehouse matures and receives new data daily, including seasonal peaks, you realise that the natural keys are not unique. This finding could break the model’s design, leading to a breach of referential integrity.

To fix this, comprehensive profiling of the data has to be carried out, including seasonal data, to ensure the key on which the surrogate key is dependent is always unique.
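
A quick profiling sketch along these lines (the key columns in the example call are hypothetical):

```python
import pandas as pd

def profile_natural_key(df: pd.DataFrame, key_cols: list[str]) -> None:
    """Report how many natural-key values occur more than once."""
    counts = df.groupby(key_cols).size()
    clashes = counts[counts > 1]
    print(f"{len(clashes)} natural-key values are not unique "
          f"({clashes.sum()} rows affected out of {len(df)})")
    if not clashes.empty:
        print(clashes.sort_values(ascending=False).head(10))

# e.g. profile_natural_key(orders_history, ["order_number", "store_code"])
```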

5. Orphaned data

This issue relates to data inconsistency problems, where data exists in one system but not the other. To explain it, consider a scenario where a customer exists in table A but their account doesn't exist in table B; that customer would be classed as an orphan customer. On the other hand, if an account exists in table B but has no associated customer, it would be classed as an orphan account.

To fix this, we need a data quality rule in place that checks for consistency each time data is ingested into tables A and B; this will help spot the issue. It is also important to remediate it by investigating the underlying cause of the inconsistency.
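
A minimal version of such a rule, sketched as pandas anti-joins (the customers and accounts DataFrames and the customer_id join key are assumptions standing in for tables A and B):

```python
import pandas as pd

def find_orphans(customers: pd.DataFrame, accounts: pd.DataFrame) -> None:
    """Flag customers without an account and accounts without a customer."""
    merged = customers.merge(accounts, on="customer_id", how="outer", indicator=True)
    orphan_customers = merged[merged["_merge"] == "left_only"]
    orphan_accounts = merged[merged["_merge"] == "right_only"]
    print(f"Orphan customers (no account): {len(orphan_customers)}")
    print(f"Orphan accounts (no customer): {len(orphan_accounts)}")
```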

6. Default Values

We often find date values such as 01/01/1900 used as a birthdate, or in the worst case as a transaction date. Unless your customer base actually comprises 130-year-old individuals or you have transactions that are 130 years old, this is likely a case of default values being used. Looked at closely, the severity of this issue can be quite serious, as the outliers it introduces can lead to inaccurate ML model results.

The best way to fix this is to profile the data and understand why default values were used. Usually, engineers fall back on them when a real date is unavailable. Documentation around the default values also helps data engineers and data scientists use these values correctly and account for them when reporting or running ML models.
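
A small profiling sketch for default dates (the sentinel list and column name are assumptions; extend them with whatever defaults your own systems use):

```python
import pandas as pd

# Placeholder dates commonly used when the real value is unknown (assumed; adjust to your systems).
SENTINEL_DATES = {pd.Timestamp("1900-01-01"), pd.Timestamp("1970-01-01"), pd.Timestamp("9999-12-31")}

def profile_default_dates(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Report how often sentinel/default dates appear in a date column."""
    dates = pd.to_datetime(df[date_col], errors="coerce")
    hits = df[dates.isin(SENTINEL_DATES)]
    print(f"{len(hits)} of {len(df)} rows use a default value in {date_col!r}")
    return hits  # hand these to the data steward, or exclude them from model training
```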

7. Data Format Inconsistencies

Mismatches in the same information across multiple data sources lead to data inconsistencies, and consistency is important to leverage the data correctly. This occurs when multiple systems store information without an agreed data format, and the inconsistencies may arise from different units or languages. For example, a distance may be expressed in kilometres where metres were required. String columns predominantly suffer from this problem, since text can be stored in many formats: a customer's first and last name stored in different cases, or an email address without the correct formatting. This disrupts business operations and needs to be addressed at the source so that the data pipelines provide trusted data.

To fix this, data needs to be homogenised (standardised) in the source systems, or at least in the data pipeline that feeds the data lake or warehouse. Make all the desired conversions before the migration and introduce validity constraints. Constant monitoring of data quality can also help you identify these inconsistencies.
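
A minimal standardisation step for the pipeline (the column names and the kilometres-to-metres conversion are illustrative assumptions):

```python
import pandas as pd

def standardise(df: pd.DataFrame) -> pd.DataFrame:
    """Homogenise string and unit formats before loading into the lake or warehouse."""
    df = df.copy()
    df["first_name"] = df["first_name"].str.strip().str.title()
    df["last_name"] = df["last_name"].str.strip().str.title()
    df["email"] = df["email"].str.strip().str.lower()
    # Convert distances recorded in kilometres to the agreed unit (metres).
    km_rows = df["distance_unit"].eq("km")
    df.loc[km_rows, "distance"] = df.loc[km_rows, "distance"] * 1000
    df.loc[km_rows, "distance_unit"] = "m"
    return df
```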

8. Data Transformation Errors

Converting data from one format to another can lead to mistakes. As a simple example, you may have a spreadsheet that you convert to a comma-separated values (CSV) file. Because data fields inside CSV files are separated by commas, you may run into issues when performing this conversion if some of the entries in your spreadsheet contain commas themselves. Unless your conversion tools are sufficiently smart, they won't know the difference between a comma that is supposed to separate two data fields and one that is an internal part of a data entry. This is a basic example; things get much more complicated when you must perform complex data conversions, such as taking a mainframe database that was designed decades ago and converting it to NoSQL.
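
A tiny sketch of the comma problem and the usual fix, proper quoting, using Python's standard csv module (the sample rows are made up):

```python
import csv
import io

rows = [["id", "company"], ["1", "Acme, Inc."], ["2", "Globex"]]

# Naive conversion: the comma inside "Acme, Inc." silently creates an extra column.
naive = "\n".join(",".join(r) for r in rows)
assert len(naive.splitlines()[1].split(",")) == 3  # "Acme, Inc." broke into two fields

# csv.writer quotes fields containing the delimiter, so the round trip is lossless.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
recovered = list(csv.reader(io.StringIO(buf.getvalue())))
assert recovered == rows
```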

The best way to fix this is to put in place a reconciliation framework control. Checks on the source and target will make sure the transformations carried out are correct.

9. Data Downtime

Data downtime refers to durations when the data is in a partial, erroneous or inaccurate state. It is extremely costly for data-driven organisations that rely heavily on behavioural data to run their operations. Common factors that cause data downtime include unexpected schema changes, migration issues, network or server failures, and incompatible data. What's important is to measure the downtime continuously and minimise it through automated solutions.

Downtime can be reduced by introducing data observability from source to consumption. Data observability is an organisation's ability to understand data health and improve it by employing best practices. Accountability and SLAs help control data downtime, but what you really need is a comprehensive approach to ensuring constant access to trusted data. Predictive DQ can track issues to continuously deliver high-quality data pipelines that are always ready for operations and analytics.
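
A minimal observability-style health check for a freshly loaded batch (the expected schema, thresholds and metrics are assumptions; dedicated observability tooling goes much further):

```python
import pandas as pd

EXPECTED_COLUMNS = {"event_id", "user_id", "event_time", "event_type"}  # assumed contract

def health_check(df: pd.DataFrame, expected_rows: int) -> list[str]:
    """Return a list of data-health warnings for a freshly loaded batch."""
    issues = []
    if set(df.columns) != EXPECTED_COLUMNS:
        issues.append(f"Schema drift: got {sorted(df.columns)}")
    if len(df) < 0.5 * expected_rows:
        issues.append(f"Volume drop: {len(df)} rows vs ~{expected_rows} expected")
    null_rate = df.isna().mean().max()
    if null_rate > 0.1:
        issues.append(f"High null rate: {null_rate:.0%} in at least one column")
    return issues  # route non-empty results to your alerting or SLA dashboard
```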

10. Data Received Too Late

It is very important that the correct data is made available to the correct team at the correct time; data needs to be timely enough to support the critical decisions made in that period. For instance, if your marketing campaigns run weekly, you must receive the required data by the set day of the week to trigger them. Otherwise, data that arrives too late can lead to poor or mistimed campaign responses.

To fix this, agree on an appropriate time window with the engineering team, and work backwards to ensure your source systems can adhere to those SLAs (Service Level Agreements).
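
A simple timeliness check against such an SLA (the 24-hour window and the timestamp source are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def check_arrival_sla(last_arrival: datetime, sla_hours: int = 24) -> bool:
    """Return True if the latest data arrived within the agreed SLA window."""
    # last_arrival must be timezone-aware, e.g. the load timestamp of the latest partition.
    age = datetime.now(timezone.utc) - last_arrival
    if age > timedelta(hours=sla_hours):
        print(f"[ALERT] Feed is {age} old; campaign triggers may fire on stale data")
        return False
    return True
```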

11. Human Error

Even with all the automation in place, data is still typed into various web interfaces, so there is a high possibility of typographical mistakes leading to inaccurate data. This data entry is done both by customers and by employees: customers may write the correct data into the wrong field, and employees may make mistakes while handling or migrating the data. Experts recommend automating the process to minimise the amount of data captured by humans.

Below are a few ways to tackle this issue:
1. Real-time validation of forms using data quality tools (a minimal sketch follows this list).
2. Proper training for the employees.
3. Using definitive lists to lock down what customers can enter.
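
A minimal real-time validation sketch for a handful of form fields, using only the standard library (the field rules and the definitive state list are illustrative assumptions, not a complete validation suite):

```python
import re

US_STATES = {"CA", "NY", "TX", "WA"}  # definitive list, truncated for illustration

def validate_form(email: str, zip_code: str, state: str) -> list[str]:
    """Return a list of validation errors to show the user before submission."""
    errors = []
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        errors.append("Email address is not correctly formatted")
    if not re.fullmatch(r"\d{5}(-\d{4})?", zip_code):
        errors.append("ZIP code must be 5 digits (optionally ZIP+4)")
    if state.upper() not in US_STATES:
        errors.append("State must be chosen from the definitive list")
    return errors

# e.g. validate_form("jane.doe@example", "1234", "ZZ") returns three errors
```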

12. Hidden data

Most companies use only about 20% of their data when making business intelligence decisions, leaving 80% to sit in a metaphorical dumpster. For example, customer data held by sales may not get shared with the customer service team, losing an opportunity to create more accurate and complete customer profiles. Hidden data means missing out on opportunities to improve services, design innovative products and optimise processes. Hidden data is most valuable with regard to customer behaviour: customers interact with companies today through a variety of mediums, from in person to over the phone to online, and data on when, how and why they interface with a company can be invaluable, yet it is rarely utilised.

Capturing hidden data with a tool like the Datumize Data Collector (DDC) can give you many more insights from the data you have already obtained. You can also consider investing in a data catalogue solution. Centralising your data is another effective way to overcome this problem.

Conclusion

Gosh, that was a long list. Unfortunately, it is still not exhaustive enough to cover every possible DQ issue. Getting DQ right is a top priority for many organisations, and investing in it will pay dividends. The best way to think about data quality problems is to recognise them as inevitable: it's not that your data management process is flawed, it's that the types of issues described above are impossible for even the best-run data operation to avoid entirely. If you have encountered more DQ issues in your day-to-day work, please share them by leaving a comment below.

If you enjoyed reading this post or found it useful in any way, please drop some claps and comments. I would love to hear your feedback and encouragement to write more on Data Engineering.

Feel free to engage and connect with me on LinkedIn
