Data Profiling: A Tutorial on Initial Dataset Quality Assurance

Okta "Oktushka" N.
Machine Learning & Data Analytics
8 min read · Sep 8, 2021

As a prerequisite to data analytics, upon receiving a dataset, a data scientist may want to run initial checks to verify the dataset’s profile against the information published or announced by the dataset provider, before any further analysis is done. In other words, data profiling double-checks the dataset against the description claimed by the provider. This is worth doing because human errors are common, and a dataset of sufficient quality improves the analytics outcome.

The high-level objectives of data profiling are as follows:

  1. Structure discovery: checks data consistency and format compliance
  2. Content discovery: checks data quality. For example, in a parcel delivery dataset, an incorrectly formatted address may mean a customer cannot be reached
  3. Relationship discovery: checks for connections between different columns or attributes and/or between datasets

The first two objectives above (structure and content discovery) can be achieved by computing summary statistics such as the count of nulls, maximum value, minimum value, and frequency distribution. Additionally, data type and format analysis, cross-column analysis, and inter-table analysis may also be conducted.
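
In pandas, much of this summary can be obtained in a few lines, as in the minimal sketch below (the CSV file name here is hypothetical; replace it with your own dataset):

import pandas as pd

# Hypothetical file name; replace with your own dataset
df = pd.read_csv("data.csv")

# Structure discovery: column names, inferred data types, and non-null counts
df.info()

# Content discovery: nulls per column and basic summary statistics
print(df.isnull().sum())
print(df.describe(include="all"))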

The third objective can be attained by checking the dataset for normalization violations (similar checks as in relational database table normalization), i.e., 1NF, 2NF, and 3NF verifications.

Structure & Content Discoveries on Dataset about Obesity Levels in Mexico, Peru, and Colombia

Firstly, download the respective dataset from this link: https://archive.ics.uci.edu/ml/machine-learning-databases/00544/ObesityDataSet_raw_and_data_sinthetic%20(2).zip

Step 1. Load the dataset on a Python Integrated Development Environment (IDE) like Jupyter Notebook.

import pandas as pd
df = pd.read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")

Step 2. Verify the attributes in the dataset against the dataset provider’s information.

Show the first 50 rows of the dataset, then observe each column’s data type (text, numerical, or categorical) and compare it with the information given by the source article at this link: https://www.sciencedirect.com/science/article/pii/S2352340919306985?via%3Dihub

df.head(50)

Figure 1. Snapshot of the First 50 Rows of the Obesity Data (cropped for brevity)

Attribute Verification Outcome

  • Gender (text & categorical): Male/Female. Conclusion: consistent with the provider article
  • Age (numerical): age. Conclusion: consistent with the provider article
  • Height (numerical): height. Conclusion: consistent with the provider article
  • Weight (numerical): weight. Conclusion: consistent with the provider article
  • family_history_with_overweight (text & categorical): whether a family member suffers or has suffered from overweight; yes/no. Conclusion: consistent with the provider article
  • FAVC (text & categorical): high-calorie food intake; yes/no. Conclusion: consistent with the provider article
  • FCVC (numerical & categorical): whether the person eats vegetables in their meals; Never (1), Sometimes (2), and Always (3). Conclusion: consistent with the provider article
  • NCP (ambiguous, could be numerical or categorical): number of main meals taken daily. The provider article says there are three possible answers: between 1 & 2 meals (1), 3 meals (2), and more than 3 meals (3). However, we notice values of ‘4’ in the rows, so the column is ambiguous: the values could be the actual number of meals, or ‘4’ could be an unexplained category. Conclusion: not consistent with the provider article
  • CAEC (text & categorical): whether a person eats any food between meals; No/Sometimes/Frequently/Always. Conclusion: consistent with the provider article
  • SMOKE (text & categorical): whether a person smokes; yes/no. Conclusion: consistent with the article
  • CH2O (numerical & categorical): the amount of water a person drinks daily; Less than a liter (1), Between 1 and 2 L (2), and More than 2 L (3). Conclusion: consistent with the article
  • SCC (text & categorical): whether a person monitors their daily calorie intake; yes/no. Conclusion: consistent with the article
  • FAF (numerical & categorical): frequency of weekly physical activity; I do not have (0), 1 or 2 days (1), 2 or 4 days (2), and 4 or 5 days (3). Conclusion: consistent with the article
  • TUE (numerical & categorical): frequency of technological device usage such as cell phone, video games, television, computer, and others; 0–2 hours (0), 3–5 hours (1), and More than 5 hours (2). Conclusion: consistent with the article
  • CALC (text & categorical): frequency of drinking alcohol; I do not drink/Sometimes/Frequently/Always. Conclusion: consistent with the article
  • MTRANS (text & categorical): mode of transportation; Automobile/Motorbike/Bike/Public Transportation/Walking. Conclusion: consistent with the article
  • NObeyesdad (text & categorical): the provider article does not describe this column, but by observing the values in the dataset we can infer that it represents obesity level; Normal_Weight/Overweight_Level_I/Overweight_Level_II. Conclusion: not explained by the provider article

From the above attribute verification conclusions, it can be inferred that data profiling may discover uncertain attributes or columns that are not explained by the dataset provider; in this case, NCP and NObeyesdad.
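
A quick way to make the discrepancy visible is to summarize the two columns directly (a short sketch; the output is omitted here):

# NCP: the maximum in the summary reveals values reaching 4, which the provider article does not explain
print(df["NCP"].describe())

# NObeyesdad: list the categories actually present in the data
print(df["NObeyesdad"].value_counts())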

Step 3. Select a numerical attribute (Weight) to analyze and count the number of its observations or rows (including null values).

length = len(df["Weight"])
print(length)
2111

Step 4. Count the number of its observations or rows (excluding null values).

observations = df["Weight"].count()
print(observations)
2111

Conclusion: with only 2,111 rows, the dataset is small considering that it covers three countries.

Step 5. Count the number of missing values.

numberOfMissingValues = length - observations
print(numberOfMissingValues)
0

Conclusion: having no missing values is good for analytics; it simplifies the work of machine learning algorithms later on.

Step 6. Identify the minimum value.

print("Minimum value: ", df["Weight"].min())
Minimum value: 39.0

Step 7. Identify the maximum value.

print("Maximum value: ", df["Weight"].max())
Maximum value: 173.0

Conclusion: the large gap between the minimum and maximum values suggests that the dataset may mix data from people of different ages, i.e., teenagers and adults. This may not be the best practice; a more targeted dataset would cover either teenagers or adults, and such bespoke data would deliver a more accurate analytics outcome since there is less noise in the data. This assumption of a data mixture will be verified in the next step.

Step 8. Data Mixture Assumption Verification.

The previous assumption about mixing teenagers’ and adults’ data can be verified by identifying the minimum and maximum age values as follows:

print("Minimum age: ", df["Age"].min())
Minimum age: 14.0
print("Maximum age: ", df["Age"].max())
Maximum age: 61.0

Conclusion: since the minimum age (14) implies a teenager and the maximum age (61) an adult, the assumption is proven correct. If this mixed data is still to be used for analytics, it is recommended to add attributes that further distinguish obesity factors between teenagers and adults, such as metabolic rate: the amount of energy expended over a period of time, measured in joules, calories, or kilocalories.

Step 9. Identify the most frequent value (mode).

print("Mode value: ", df["Weight"].mode()[0])
Mode value: 80.0

Conclusion: 80 kg is a common body weight among teenagers and adults, therefore this mode value is expected.

Step 10. Identify the average value (mean).

print("Mean value: ", df["Weight"].mean())
Mean value: 86.58605812648037

Conclusion: the average weight of 86.6 kg is not far from the mode value (80 kg), hence it is expected.

Step 11. Identify the median value.

print("Median value: ", df["Weight"].median())
Median value: 83.0

Conclusion: the median weight of 83 kg is not far from the mode value (80 kg), thus it is also an expected value.

Step 12. Identify the standard deviation value.

print("Standard deviation value: ", df["Weight"].std())
Standard deviation value: 26.19117174520469

Conclusion: the high standard deviation supports our earlier finding that the dataset mixes teenagers’ and adults’ data.
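
One way to probe this further is to compare the weight spread within rough age groups (a sketch; the 18-year cutoff for ‘adult’ is an assumption, not something stated by the provider):

# Split the sample at age 18 (assumed cutoff) and compare weight statistics per group
age_group = df["Age"].apply(lambda a: "teenager" if a < 18 else "adult")
print(df.groupby(age_group)["Weight"].agg(["count", "mean", "std"]))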

Step 13. Calculate the quantile statistics.

quantile = df["Weight"].quantile([0.25, 0.5, 0.75])
print(quantile)
0.25 65.473343
0.50 83.000000
0.75 107.430682

Conclusion: 25% of the observed weights are at or below 65.5 kg, 50% are at or below 83 kg (this confirms the median value), and 75% are at or below 107.4 kg.

Step 14. Visualize the data distribution.

import seaborn as sns
sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))
sns.distplot(df["Weight"].dropna())  # excluding null values; newer seaborn versions replace distplot with histplot/displot

Figure 2. Data Distribution Visualization

The bell-like shape of the data distribution implies that the weight observations are approximately normally distributed.
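
This visual impression can be cross-checked numerically, for instance with skewness and excess kurtosis (a small sketch; values close to 0 indicate a roughly symmetric, normal-like shape):

# Skewness and excess kurtosis of the weight distribution
print("Skewness:", df["Weight"].skew())
print("Excess kurtosis:", df["Weight"].kurtosis())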

Step 15. Calculate the correlations among numerical attributes.

df[["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"]].corr()

Figure 3. Correlation Scores among Numerical Attributes

From the above Pearson correlation scores, it can be seen that Height has the highest correlation with Weight (0.46), FCVC (vegetable intake) the second highest (0.216), and CH2O (water intake) the third highest (0.200).

The correlation findings make intuitive sense; however, no attribute is strongly correlated with weight. This may be due to the limited dataset size.
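
For a quicker visual read of the same matrix, the correlation scores can also be plotted as a heatmap (a short sketch using seaborn and matplotlib):

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the Pearson correlations among the numerical attributes
numeric_cols = ["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"]
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()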

Relationship Discovery on Dataset about Obesity Levels in Mexico, Peru, and Colombia

After completing the structure and content discoveries on the same dataset, now we will do relationship discovery to continue our data profiling journey.

Step 1. Identify the functional dependencies among the attributes.

Height, Weight, family_history_with_overweight, FAVC, NCP, CAEC, SMOKE, SCC, FAF, CALC, and MTRANS are assumed to determine NObeyesdad (this is a personal assumption and has not been verified with a medical doctor, nutritionist, or biologist).
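
This assumed dependency can be probed mechanically by checking whether each combination of determinant values maps to exactly one NObeyesdad value (a sketch; it only tests the dependency within this particular sample, not as a general rule):

# Check whether the assumed determinants functionally determine NObeyesdad in this sample
determinants = ["Height", "Weight", "family_history_with_overweight", "FAVC", "NCP",
                "CAEC", "SMOKE", "SCC", "FAF", "CALC", "MTRANS"]
violations = df.groupby(determinants)["NObeyesdad"].nunique()
print("Determinant combinations mapping to more than one NObeyesdad value:", (violations > 1).sum())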

Step 2. Normalize the dataset if necessary.

  • First Normal Form (1NF) Violation Check:

The dataset does not contain multi-valued attributes, thus it is in 1NF (a rough automated check is sketched at the end of this step).

  • Second Normal Form (2NF) Violation Check:

Based on the dependencies identified in Step 1, it is assumed that NObeyesdad depends on the full set of determinants, i.e., there is no partial dependency, hence the dataset is in 2NF.

  • Third Normal Form (3NF) Violation Check:

Transitive dependency does not exist, especially because there is only one dependent attribute (or non-UID attribute, in relational database terms), therefore the dataset is in 3NF.

Since the dataset has been shown to be in 3NF, it can be declared to be in normalized form; in other words, no further normalization is required.
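
For the 1NF check above, a rough automated scan for multi-valued cells can also be done (a sketch, assuming multi-valued entries would be encoded with a comma or semicolon inside a single text cell):

# Flag text cells that appear to hold multiple values separated by "," or ";"
text_cols = df.select_dtypes(include="object").columns
multi_valued = df[text_cols].apply(lambda col: col.str.contains(r"[,;]", na=False).sum())
print(multi_valued)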

Summary

  1. The outcome of the structure and content discoveries suggests that the dataset is not sizeable enough. For a dataset that claims to hold observations from three countries (Mexico, Peru, and Colombia), 2,111 samples are arguably insufficient, especially for a medical dataset, where analytics outcomes can affect people’s lives.
  2. The attribute verification reveals that the NCP column is inaccurately described by the dataset provider, and the NObeyesdad column is not explained in the published article at all.
  3. It is also found that the dataset mixes teenagers’ and adults’ data, which may distort the analytics unless more distinguishing attributes are added, e.g., metabolic rate.
  4. The Pearson correlation scores suggest that height, vegetable intake, and water intake are the attributes most correlated with weight. This sounds logical, although the respective scores are all below 0.5.
  5. The relationship discovery verifies that the dataset is already in 1NF, 2NF, and 3NF, i.e., fully normalized from the start.

Final Words

All the steps explained above should kickstart your data profiling journey; however, more profiling steps can be done, such as the ones mentioned below.

  • Unique ID (UID)/primary key identification: check whether one or more attributes act as an identifier or a composite key (in the case of multiple UID columns); a quick check is sketched after this list
  • Foreign key identification: check whether one or more attributes are derived or imported from another dataset
  • String pattern identification: check the attributes and their values for any repetitive string pattern
  • Additional data profiling activities can be found in this reference article: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/2017/SIGMOD_2017_Tutorial_Data_Profiling.pdf
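
As a starting point for the first and third items above, a minimal sketch (the regular expression is only an illustrative pattern and would need tailoring to the data at hand):

# UID candidate check: columns whose values are unique across all rows
uid_candidates = [col for col in df.columns if df[col].is_unique]
print("Possible UID columns:", uid_candidates)

# String pattern check: count how many MTRANS values consist only of letters and underscores
matches = df["MTRANS"].astype(str).str.match(r"^[A-Za-z_]+$").sum()
print("MTRANS values matching the example pattern:", matches)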
