How to spot bad data in your datasets?

Shanga Saadallah
4 min read · Jan 23, 2024

What is bad data, and how much will it cost you if you act too late?

Business data is the data you collect through your business operations and processes. Data is an asset for any business, and with wise investment it can generate revenue just as a product does. But this is only possible when you collect good-quality data that supports your use cases and business strategy goals.

Mr. Chris Orwa is a data scientist with 12 years of experience building analytical solutions in the banking, telecom, fintech, logistics, e-commerce, and BPO industries. He shares:

It doesn’t matter what stage of data maturity you are at; data quality is one of the most important things to consider.

He has worked with many types of data, and one of the most challenging parts was that data entry errors and currency changes tended to produce inaccurate insights that could cost a company a lot.

Therefore, the first step in making data work for you is not collecting data but defining the strategy behind its use and ensuring good quality. Businesses that collect data without defined goals and strategies can still benefit from it in some cases, but only if the data is of good quality.

What makes data “good” or “bad”?

By definition, data is of high quality if it is fit for its intended uses in operations, decision-making, planning, and data science.

What does bad data look like in your dataset?

  1. Inaccurate data → This can have various causes, such as data entry mistakes and faulty sensors or data collectors.
  2. Outdated data → In some cases, data that is old or not up to date can be misleading, especially in fast-changing industries. What was relevant a year ago might not hold any significance today.
  3. Incomplete data → This refers to data that is missing values or lacks certain attributes. For instance, a database of customers where some entries don’t have contact numbers or addresses would be considered incomplete.
  4. Duplicate data → Sometimes, the same piece of data can be entered into a database multiple times. This can skew analysis and result in inefficiencies in operations.
  5. Inconsistent data → This arises when different parts of an organization use different formats or units for the same type of data. For example, one department might record data in metric units while another uses imperial units.
  6. Irrelevant data → This pertains to data that does not add value to the particular context or analysis at hand. Having excess irrelevant data can make data processing slower and more cumbersome.
  7. Unstructured data → While not “bad” in the traditional sense, data that isn’t structured (like plain text) can be hard to analyze without the right tools or processes.
  8. Non-compliant data → Especially in industries where data governance and regulations are strict (like healthcare or finance), using or storing non-compliant data can have significant legal and financial repercussions.
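
A few of the categories above (incomplete, duplicate, and outdated data) can be checked with very little code. The sketch below is only an illustration using plain Python: the customer records, the field names, and the one-year freshness threshold are all assumptions made up for this example, not rules from any particular tool.

```python
from datetime import date

# Hypothetical customer records; field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Amina", "phone": "0712000001", "updated": date(2024, 1, 10)},
    {"id": 2, "name": "Brian", "phone": "", "updated": date(2023, 12, 5)},
    {"id": 3, "name": "Amina", "phone": "0712000001", "updated": date(2024, 1, 10)},
    {"id": 4, "name": "Chen", "phone": "0712000004", "updated": date(2021, 6, 30)},
]

REQUIRED_FIELDS = ("name", "phone")
MAX_AGE_DAYS = 365  # assumed freshness threshold; pick one that fits your use case


def find_incomplete(rows):
    """Return ids of rows where any required field is missing or blank."""
    return [
        r["id"]
        for r in rows
        if any(not str(r.get(field) or "").strip() for field in REQUIRED_FIELDS)
    ]


def find_duplicates(rows):
    """Return ids of rows whose (name, phone) pair has already been seen."""
    seen = set()
    duplicates = []
    for r in rows:
        key = (r["name"].strip().lower(), r["phone"].strip())
        if key in seen:
            duplicates.append(r["id"])
        else:
            seen.add(key)
    return duplicates


def find_outdated(rows, today):
    """Return ids of rows last updated more than MAX_AGE_DAYS before today."""
    return [r["id"] for r in rows if (today - r["updated"]).days > MAX_AGE_DAYS]
```

Calling `find_incomplete(records)`, `find_duplicates(records)`, and `find_outdated(records, date(2024, 1, 23))` flags records 2, 3, and 4 respectively. What counts as "required" or "too old" depends on the intended use of the data, so treat these thresholds as starting points.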

How do you turn bad data into good-quality data?

There are two kinds of methods:

Preventive methods: these aim to prevent data quality issues from occurring by enforcing data quality rules, standards, and policies at the source of data generation or collection. For example, you can use data validation, data profiling, and data governance techniques to ensure that the data is accurate and consistent from the beginning.
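
As a rough illustration of validation at the point of entry, here is a minimal Python sketch that checks a record before it is saved. The field names, the phone pattern, and the list of allowed currencies are assumptions chosen for the example; real rules should come from your own data standards.

```python
import re

# Assumed rules for illustration only; replace with your own standards.
ALLOWED_CURRENCIES = {"EUR", "KES", "USD"}
PHONE_PATTERN = re.compile(r"^\+?\d{9,15}$")


def validate_entry(entry):
    """Return a list of rule violations; an empty list means the entry passes."""
    errors = []
    if not str(entry.get("customer", "")).strip():
        errors.append("customer is required")
    if not PHONE_PATTERN.match(str(entry.get("phone", ""))):
        errors.append("phone must be 9 to 15 digits")
    amount = entry.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or amount <= 0:
        errors.append("amount must be a positive number")
    if entry.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency must be one of " + ", ".join(sorted(ALLOWED_CURRENCIES)))
    return errors


good = {"customer": "Amina", "phone": "+254712000001", "amount": 1500, "currency": "KES"}
bad = {"customer": "", "phone": "12ab", "amount": -5, "currency": "kes"}
```

Here `validate_entry(good)` returns an empty list, while `validate_entry(bad)` returns four messages, one per broken rule. Rejecting or flagging the entry at this stage is usually cheaper than repairing it after it has spread into reports.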

Corrective methods: these aim to fix existing data quality issues by applying data cleaning, data transformation, and data enrichment techniques. For example, you can use data cleansing tools, such as Trifacta, Talend, and Pentaho, to remove, replace, or modify erroneous or incomplete data.
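
To make the corrective idea concrete, the sketch below cleans a small batch of records in plain Python: it tidies names, converts all weights to kilograms, and drops repeated rows. It is not how any of the tools named above work internally; the shipment records and field names are hypothetical.

```python
LB_TO_KG = 0.45359237  # one international pound in kilograms


def clean_shipments(rows):
    """Normalize weights to kilograms, tidy names, and drop repeated rows."""
    cleaned = []
    seen = set()
    for row in rows:
        # Collapse extra whitespace and use consistent capitalization.
        name = " ".join(row["customer"].split()).title()
        unit = row["unit"].strip().lower()
        weight = float(row["weight"])
        if unit in ("lb", "lbs"):
            weight *= LB_TO_KG
        elif unit != "kg":
            # Unknown unit: skip rather than guess. In practice, log it for review.
            continue
        weight = round(weight, 2)
        key = (row["order_id"], name, weight)
        if key in seen:
            continue  # same order already kept
        seen.add(key)
        cleaned.append({"order_id": row["order_id"], "customer": name, "weight_kg": weight})
    return cleaned


# Hypothetical raw records mixing formats and units.
raw = [
    {"order_id": 1, "customer": "  amina  otieno", "weight": "10", "unit": "kg"},
    {"order_id": 1, "customer": "Amina Otieno", "weight": 10.0, "unit": "KG"},
    {"order_id": 2, "customer": "brian k", "weight": "22", "unit": "lbs"},
    {"order_id": 3, "customer": "chen", "weight": "5", "unit": "stone"},
]
cleaned = clean_shipments(raw)
```

With this input, two rows survive: the two copies of order 1 collapse into one, order 2 is converted from pounds to kilograms, and order 3 is held back because its unit is not recognized.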

How much does poor data quality cost businesses?

Data use is becoming part of every business, and poor-quality data will be very costly. According to Gartner research, “the average financial impact of poor data quality on organizations is $9.7 million per year.” IBM also estimated that in the US alone, businesses lose $3.1 trillion annually due to poor data quality.

What are the top tools for ensuring data quality?

1. Astera

2. Talend

3. IBM InfoSphere

4. Data Ladder

5. Ataccama ONE

6. Experian Aperture Data Studio

7. OpenRefine

8. Informatica

Criteria for Selecting the Right Data Quality Tools

  • Scalability and Performance.
  • Data Profiling and Cleansing Capabilities.
  • Data Monitoring Features.
  • Seamless Integration with Existing Systems.
  • User-Friendly Interface.
  • Flexibility and Customization Options.
  • Vendor Support and Community.
  • Pricing and Licensing Options.

If your data lives only in an Excel sheet, Excel’s data validation will do a great job of preventing bad data, and adopting a few basic practices now will help you avoid big issues later. As your business data and use cases grow more complex, you can choose a data quality tool that fits them so that only good-quality data is produced.

Stay tuned for our next topic and save your business from a lot of data challenges!
