An approach to analyzing the data quality of dates using basic statistical methods (with Python)

A user-friendly approach to detect data quality issues and outliers in dates

Marc Alvarez Brotons
6 min read · Dec 31, 2022

The main problems that cause defects in data quality are related to: problems in data entry applications, incorrect migration of data from old systems, degradation of information systems, process integration problems, etc.

All of them lead to information degradation, generate high management costs at the operational level, and have a negative impact on information analysis and decision-making.

To detect possible data quality problems in an imported dataset, it is recommended to carry out the following steps:

  1. Define the use case to analyze
  2. Explore dataset
  3. Detection and cleansing of main technical errors
  4. Analysis of distribution and frequencies

1. Define the use case to analyze

It is important to keep in mind the use case the analyzed data refers to, because the analysis must apply both the technical criteria of the data and the functional ones. An example could be the analysis of the dates of birth of the customers of an online store: the expected age range goes from 18 to 100 years, so records outside this range have a high probability of being errors.
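
As a minimal sketch of this functional rule (the sample values below are invented for illustration; the article's real dataset is loaded in the next step), we could flag every birth date that falls outside the 18-to-100-year range:

from datetime import date

import pandas as pd

# Hypothetical sample of birth dates; the real dataset is imported in step 2
birthdates = pd.to_datetime(pd.Series(["1985-06-12", "2015-03-01", "1890-11-23", "1972-01-30"]))

today = pd.Timestamp(date.today())
min_valid = today - pd.DateOffset(years=100)  # oldest plausible customer: 100 years old
max_valid = today - pd.DateOffset(years=18)   # youngest plausible customer: 18 years old

# Flag records outside the functional range defined by the use case
suspect = birthdates[(birthdates < min_valid) | (birthdates > max_valid)]
print(suspect)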

2. Explore dataset

Once the use case is known, we will import the birth dates dataset from a public GitHub repository. Once the dataset is loaded, we will print its general information and preview a sample of the data, where we can see that it contains two columns: one with ids and the other with dates of birth in datetime format.

Additionally, we will obtain the maximum and minimum dates to get an idea of the range of dates we are working with.

import pandas as pd

#Import dataset birthdates from github
url = "https://raw.githubusercontent.com/mabrotons/datasets/master/birthdates.csv"

df = pd.read_csv(url, index_col=0, parse_dates=['birthdates'])

print("Data frame info: ")
print(df.info())

print("\nData frame head: ")
print(df.head())

dates = df['birthdates']
print("\nMin date: " + str(min(dates)))
print("Max date: " + str(max(dates)))

3. Detection and cleansing of main technical errors

As a first step of the analysis, we will plot all the data in the dataset to get a first view of its distribution and range, and to detect possible outliers.

import matplotlib.pyplot as plt

f = plt.figure()
f.set_figwidth(15)
f.set_figheight(5)

plt.hist(dates, bins=50, edgecolor='black')
plt.xticks(rotation=30)
plt.title("Dates")
plt.show()

Once this global view is available, showing dates ranging from the year 1801 to 2029, we can see that:
- some years fall outside a trustworthy range (a quick count of these records is sketched after this list)
- the bulk of the dates is concentrated between the years 1940 and 2000
- the plot shows a clear anomaly close to the year 2000
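
A quick sanity check can quantify those observations. This is a sketch that reuses the `dates` Series loaded above; the 1900 lower bound is an assumption for illustration, not a value given by the dataset:

# Count records outside a plausible range of birth dates
lower_bound = pd.Timestamp(1900, 1, 1)  # assumed oldest plausible birth date
upper_bound = pd.Timestamp.today()      # birth dates in the future are clearly wrong

print("Dates before 1900: " + str((dates < lower_bound).sum()))
print("Dates in the future: " + str((dates > upper_bound).sum()))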

If we focus on the outliers around the year 2000, we can see a spike on 01/01/2000 (dd/mm/yyyy), which might be caused by a technical error or by that date being used as a dummy value.

from datetime import date

# Zoom in on the dates around the year 2000 (pd.Timestamp matches the parsed dates)
dates_outlier = [item for item in dates if pd.Timestamp(1999, 12, 15) < item < pd.Timestamp(2000, 1, 15)]
f = plt.figure()
f.set_figwidth(15)
f.set_figheight(5)
plt.xticks(rotation=30)
plt.hist(dates_outlier, bins=30, edgecolor='black')
plt.title("Outliers")
plt.show()
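
To quantify the spike, we can count how many records share that exact date (a sketch that reuses the `dates` Series from above):

# Frequency of the suspected dummy date
dummy_count = (dates == pd.Timestamp(2000, 1, 1)).sum()
print("Records dated 2000-01-01: " + str(dummy_count) + " of " + str(len(dates)))

# Listing the most frequent dates is another quick way to spot dummy values
print(dates.value_counts().head())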

Once the potential outliers are removed, we can visualize the plot again without the erroneous data, which makes the real distribution easier to interpret.

# Keep everything except the suspicious dummy date 2000-01-01
dates_clean = [item for item in dates if item != pd.Timestamp(2000, 1, 1)]

f = plt.figure()
f.set_figwidth(15)
f.set_figheight(5)
plt.xticks(rotation=30)
plt.hist(dates_clean, bins=40, edgecolor='black')
plt.title("Dates without outliers")
plt.show()

4. Analysis of distribution and frequencies

The next step is to analyze the distribution of the dates component by component. First, we can analyze the dates (excluding the previously detected outliers) by year; the result looks very similar to the global view.

dates_clean_year = [item.year for item in dates_clean]

f = plt.figure()
f.set_figwidth(15)
f.set_figheight(5)
plt.hist(dates_clean_year, bins=40, edgecolor='black')
plt.title("Years")
plt.show()

Now we can repeat the same analysis using the month as the frequency component.

dates_clean_month = [item.month for item in dates_clean]

f = plt.figure()
f.set_figwidth(10)
f.set_figheight(5)
plt.hist(dates_clean_month, bins=12, edgecolor='black')
plt.title("Months")
plt.show()

Viewing the dates by day of the month, we can detect a new outlier candidate: a disproportionately large number of dates fall on the first day of the month compared to all the other days.

dates_clean_day = [item.day for item in dates_clean]

f = plt.figure()
f.set_figwidth(10)
f.set_figheight(5)
plt.hist(dates_clean_day, bins=31, edgecolor='black')
plt.title("Days")
plt.show()
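
To put a number on this observation, we can compute the share of records that fall on day 1 (a sketch that reuses `dates_clean`; the uniform baseline of roughly 1/30 is only a rough reference):

# Share of dates falling on the first day of the month
day_one = sum(1 for item in dates_clean if item.day == 1)
share = day_one / len(dates_clean)
print("Records on day 1: " + str(day_one) + " (" + str(round(share * 100, 1)) + "%)")
print("Rough uniform baseline: " + str(round(100 / 30, 1)) + "%")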

Finally, we can analyze the frequency by day of the week.

weekDays = ("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
dates_clean_weekday = [weekDays[item.weekday()] for item in dates_clean]

f = plt.figure()
f.set_figwidth(10)
f.set_figheight(5)
plt.hist(dates_clean_weekday, bins=7, edgecolor='black')
plt.title("WeekDays")
plt.show()
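
Note that plt.hist applied to a list of strings orders the categories by first appearance, so the bars may not follow the natural Monday-to-Sunday order. A small sketch with collections.Counter and plt.bar keeps a fixed order:

from collections import Counter

# Count occurrences per weekday and plot them in calendar order
weekday_counts = Counter(dates_clean_weekday)

f = plt.figure()
f.set_figwidth(10)
f.set_figheight(5)
plt.bar(weekDays, [weekday_counts[d] for d in weekDays], edgecolor='black')
plt.title("WeekDays")
plt.show()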

In this use case, since the dataset contains dates of birth, we can either analyze them as plain dates (as we have done so far) or transform them into ages.

def calculate_age(birth_date):
    # Subtract one year if the birthday has not happened yet this year
    today = date.today()
    age = today.year - birth_date.year - ((today.month, today.day) < (birth_date.month, birth_date.day))
    return age


ages = [calculate_age(item) for item in dates_clean]
f = plt.figure()
f.set_figwidth(10)
f.set_figheight(5)

plt.hist(ages, bins=40, edgecolor='black')
plt.title("Ages")
plt.show()

Now, let's calculate the basic statistics of the ages and visualize them easily with a boxplot, which shows:
- the lower whisker (smallest value above Q1 - 1.5*IQR)
- Q1 (25th percentile)
- the median
- Q3 (75th percentile)
- the upper whisker (largest value below Q3 + 1.5*IQR)
- the outliers beyond both whiskers

import statistics

print("Min ages: " + str(min(ages)))
print("Max ages: " + str(max(ages)))
print("Mean ages: " + str(statistics.mean(ages)))
print("Median ages: " + str(statistics.median(ages)))
print("Standard deviation ages: " + str(statistics.stdev(ages)) +"\n")

# Creating plot
f = plt.figure()
f.set_figwidth(10)
f.set_figheight(5)

red_circle = dict(markerfacecolor='red', marker='o')
plt.boxplot(x=ages, vert=False, patch_artist=True, flierprops=red_circle, labels=['Ages'])
plt.title("Ages")
plt.show()
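
To make the 1.5*IQR fences explicit, we can compute them directly and count how many ages the boxplot flags as outliers (a sketch that reuses the `ages` list built above):

import numpy as np

# Quartiles and interquartile range of the ages
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print("Q1: " + str(q1) + ", Q3: " + str(q3) + ", IQR: " + str(iqr))
print("1.5*IQR fences: [" + str(lower_fence) + ", " + str(upper_fence) + "]")
print("Ages flagged as outliers: " + str(len(outliers)))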

Conclusion

We can use basic statistical methods to perform a data quality analysis and detect potential errors and outliers. With Python, we can easily calculate these statistics and represent them graphically to help with their interpretation.

I hope you liked this content. Leave a comment below if you have any questions or suggestions.
