Data Quality 101

Oswin Rahadiyan Hartono
Data Folks Indonesia
4 min read · Oct 27, 2022
Source: Pexels

What is Data Quality?

Data quality refers to the condition of data measured against various factors or metrics. By monitoring data quality, we can pinpoint potential data issues and guarantee that the data is usable and trustworthy for data users.

What are Data Quality Metrics?

Data quality can be measured through various metrics, including:

  • Uniqueness
  • Completeness
  • Accuracy
  • Consistency
  • Timeliness

Uniqueness

Uniqueness is a data quality measure that ensures there are no duplicates within a dataset, meaning each record should appear only once. It’s a fundamental metric for improving the quality of data.

Achieving a high level of uniqueness in data requires deduplication, a process involving various techniques and approaches.

In our exploration, we’ll delve into deduplication techniques using SQL and Python. Before removing duplicated rows, it’s prudent to validate the data for any duplication. Once validated, we can proceed with implementing deduplication techniques.

Check for duplicates (BigQuery Standard SQL)

SELECT
column_1, column_2, column_3, COUNT(1)
FROM table_name
GROUP BY 1,2,3
HAVING COUNT(1) > 1

Deduplication with SQL Distinct

SELECT
DISTINCT column_1, column_2, column_3
FROM table_name

Deduplication with SQL Aggregate Function

SELECT
column_1, column_2, column_3
FROM table_name
GROUP BY 1,2,3

Deduplication with SQL Window Function

WITH CTE AS (
SELECT
column_1, column_2, column_3,
ROW_NUMBER() OVER (PARTITION BY column_1, column_2, column_3 ORDER BY column_1) AS duplicate_count
FROM table_name
)
SELECT *
FROM CTE
WHERE duplicate_count = 1

Deduplication with Python Pandas

import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('sample.csv')

# Select duplicate rows based on all columns
duplicate = df[df.duplicated()]
display(duplicate)  # display() works in notebooks; use print() in plain scripts

# Deduplication: drop duplicate rows in place
df.drop_duplicates(inplace=True)

Completeness

Data completeness is measured by the percentage of missing data entries. For example, a column with 500 rows and 100 missing values would have a completeness degree of 80%. A missing value is typically defined as a null value, especially for primary key columns where null values are not permissible.
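
As a quick illustration, per-column completeness can be computed with pandas. This is only a minimal sketch that reuses the sample.csv file from the deduplication example above.

import pandas as pd

# Load the data (reusing the sample file from the deduplication example)
df = pd.read_csv('sample.csv')

# Completeness per column: share of non-null values, expressed as a percentage
completeness = df.notna().mean() * 100
print(completeness)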

To address null values, standardization can be applied with certain conditions. For instance, null values in numeric columns can be replaced with 0, while null values in string columns can be replaced with an empty string or another agreed-upon value.

It’s important to note that not every missing value requires replacement. Agreement within the team regarding expected and acceptable values is crucial, especially for columns designed to allow null values.

Monitoring the daily trend of data is also essential. For example, tracking and comparing total rows over time can reveal significant drops on particular days, signaling a need for further investigation or the establishment of completeness thresholds.
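
One lightweight way to monitor that trend is to compare each day’s row count against a recent rolling average. The sketch below assumes a created_date column and an illustrative 30% drop threshold; both are assumptions, not part of the examples above.

import pandas as pd

# Assumed: the data has a created_date column recording when each row arrived
df = pd.read_csv('sample.csv', parse_dates=['created_date'])

# Total rows per day
daily_counts = df.groupby(df['created_date'].dt.date).size().sort_index()

# Flag days that fall more than 30% below the 7-day rolling average (example threshold)
rolling_avg = daily_counts.rolling(window=7, min_periods=1).mean()
suspicious_days = daily_counts[daily_counts < 0.7 * rolling_avg]
print(suspicious_days)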

Below is a query to check nullness at the column level using BigQuery Standard SQL.

SELECT
COUNT(column_1) AS non_null_count,
COUNTIF(column_1 IS NULL) AS null_count
FROM table_name

To check the nullness of an entire table (all columns):

SELECT 
column_name,
COUNT(1) AS null_count
FROM table_name AS b,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(b), r'"(\w+)":null')) AS column_name
GROUP BY column_name

To replace nulls with a specific value:

SELECT
IFNULL(column_1,"this is null") AS column_1
FROM table_name
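
The same kind of replacement can be done in pandas with fillna. The column names below are placeholders, not columns from a real table.

import pandas as pd

df = pd.read_csv('sample.csv')

# Replace nulls with agreed-upon defaults (placeholder column names)
df['numeric_column'] = df['numeric_column'].fillna(0)
df['string_column'] = df['string_column'].fillna('')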

Accuracy

Accuracy in data pertains to the correctness of data formats and types.

For instance, if a column is expected to contain dates in the format DD-MM-YYYY with the data type DATE, it’s imperative to ensure that all rows adhere to this format and data type. This requires standardizing the column’s data type and validating its values.

Moreover, establishing standards for expected values — whether in lowercase, uppercase, or defined by regular expressions — is essential. Predefined rules for data cleansing, such as logical criteria for valid phone numbers, can also enhance accuracy.

For data type standardization, SQL offers a useful tool: casting, which converts a column to a specified data type.

Data type conversions are typically as follows:

  • Numerical columns ideally should be integers or floats.
  • Textual columns should ideally be strings.
  • Time-related columns should ideally be of types date, datetime, or timestamp.
  • Boolean columns should ideally be boolean.

SELECT
CAST(column_1 AS INTEGER) AS column_1,
CAST(column_2 AS STRING) AS column_2,
DATE(column_3) AS column_3
FROM table_name
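
A comparable standardization in pandas could use its conversion helpers; errors='coerce' turns values that do not fit the target type into nulls so they can be reviewed rather than silently kept. The column names mirror the query above and are placeholders.

import pandas as pd

df = pd.read_csv('sample.csv')

# Cast columns to their expected types; unparseable values become NaN/NaT for review
df['column_1'] = pd.to_numeric(df['column_1'], errors='coerce').astype('Int64')
df['column_2'] = df['column_2'].astype('string')
df['column_3'] = pd.to_datetime(df['column_3'], errors='coerce')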

Expected values

  • Lowercase / uppercase standard
  • Date format
  • Conditional rules (CASE WHEN, regex), etc.

SELECT
DATE(column_4) AS column_4,
CASE WHEN column_5 = 1 THEN "One"
WHEN column_5 = 2 THEN "Two"
WHEN column_5 = 3 THEN "Three" END AS column_5
FROM table_name
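
For rule-based checks such as the phone-number example mentioned earlier, a regular expression can flag rows that do not match the expected format. The pattern and the phone_number column below are illustrative assumptions.

import pandas as pd

df = pd.read_csv('sample.csv')

# Illustrative rule: an optional "+" followed by 9 to 15 digits
phone_pattern = r'^\+?\d{9,15}$'

# Rows whose phone_number does not match the pattern (nulls are flagged too)
invalid_phones = df[~df['phone_number'].astype('string').str.match(phone_pattern, na=False)]
print(invalid_phones)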

Consistency

In data warehousing, normalization processes may lead to multiple tables containing identical columns and information. Therefore, maintaining consistency becomes crucial in such scenarios.

Data consistency refers to whether the same data stored in different locations matches. It can be assessed by comparing two or more tables using statistical tests and aggregation techniques.

Consistency checks should be performed at various levels of granularity, including:

  • Consistency of column names (adherence to naming conventions and data nomenclature)
  • Consistency of data types
  • Consistency of values

By implementing techniques such as ensuring data uniqueness, completeness, and accuracy, we can minimize data inconsistency.
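
As a minimal sketch of such a check, the comparison below assumes two copies of the same data have been loaded into pandas DataFrames (the file names are made up) and compares column names, data types, and a simple aggregate.

import pandas as pd

# Two copies of the same data stored in different locations (file names are assumptions)
df_a = pd.read_csv('table_a.csv')
df_b = pd.read_csv('table_b.csv')

# Consistency of column names
print("Same columns:", set(df_a.columns) == set(df_b.columns))

# Consistency of data types
print("Same dtypes:", df_a.dtypes.to_dict() == df_b.dtypes.to_dict())

# Consistency of values, via a simple aggregate on shared numeric columns
shared = df_a.select_dtypes('number').columns.intersection(df_b.columns)
print(df_a[shared].sum().compare(df_b[shared].sum()))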

Timeliness

Timeliness pertains to the expectation of when information should be accessible and available. It can be quantified as the duration between when information is anticipated and when it becomes readily usable.

Establishing Service Level Agreements (SLAs) for data timeliness is essential, ideally with the agreement of stakeholders.

For example, if a dataset is expected to be updated every morning at 8 AM UTC, the SLA for that dataset is set at 8 AM UTC. Any updates occurring after this time would be considered late, and tracking latency information (possibly through the creation of a metadata table) becomes imperative. Integration with an alerting system can then trigger alerts to notify relevant stakeholders.
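
A minimal latency check against such an SLA might look like the sketch below. The metadata file, its last_updated_utc column, and the 8 AM UTC deadline are assumptions based on the example above, and the alerting hook is left as a placeholder.

import pandas as pd
from datetime import datetime, time, timezone

# Hypothetical metadata table with one row per dataset refresh
metadata = pd.read_csv('dataset_metadata.csv', parse_dates=['last_updated_utc'])

sla_deadline = time(hour=8)  # 8 AM UTC, as in the example above
now = datetime.now(timezone.utc)
latest = metadata['last_updated_utc'].max()

# Late if the deadline has passed today and the latest refresh did not land by then
deadline_passed = now.time() > sla_deadline
refreshed_on_time = latest.date() == now.date() and latest.time() <= sla_deadline

if deadline_passed and not refreshed_on_time:
    # In practice this would trigger an alert (Slack, email, PagerDuty, ...)
    print(f"SLA breached: last update at {latest}, expected by {sla_deadline} UTC")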

Conclusion

Not all datasets require the same level of quality assurance. Assigning severity ratings to tests helps prioritize critical issues and avoid unnecessary delays. By differentiating between critical and non-critical datasets, we can allocate resources more effectively and enforce stricter change management on crucial data. This way we avoid compromising speed while still ensuring accuracy where it matters most.
