Is your data quality good enough to train a machine learning algorithm?
A quick aggregation of data quality checks
Machine learning practitioners spend much of their time on data because data quality issues inherently hamper the learning of any algorithm. The “garbage in, garbage out” principle applies: poor data quality can only result in poor learning. Good-quality data is crucial to the success of any machine learning algorithm.
You may be wondering why I emphasize data quality so much, so let us look at the definition of ‘machine learning’ and infer its significance:
As per Wikipedia:
“Machine Learning involves computers learning from data provided so that they carry out certain tasks”
An algorithm is fundamentally built to learn whatever the data suggests. If the input data has errors, it will learn those errors and reflect them as predictions gone wrong.
Over time, I have developed the habit of performing a few key data quality checks, and I will share those with you in today’s article:
- Typos: These are data-entry mistakes in which a record does not match other related records because of typographical errors.
- Duplication by mistake: A record gets repeated multiple times, making it difficult to tell whether the repetition is an error or valid in the dataset.
- Data-entry errors: A very common one is the digit ‘0’ being replaced with the letter ‘O’. Other errors change the order of magnitude of a number and are generally caught by statistical tests. For example, if the leading 1 in 10020 gets removed by mistake, the value becomes 20, a gigantic change in the record; likewise, an ‘Age’ of 112 may in fact be a typo for 12.
- Check for missing values: In one dataset on credit risk associated with customers, I observed that the field ‘Savings_accounts’ had missing values. An in-depth analysis of the data suggested that the variable is simply empty whenever the customer has no savings account. As a result, I replaced the null values with a ‘no_account’ entry.
- Column swapping: I have encountered this more often than I would like. I once worked on German credit risk modeling where the data had no headers, so I assigned column names in their order of appearance in the description. That turned out to be wrong and produced disastrous misnomers: the ‘Credit Purpose’ field had values ranging from 4 to 184, yet there could not be that many types of credit purpose. Upon diagnosis, it turned out that these values corresponded to the feature ‘Residence_since’.
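A few of the checks above can be sketched quickly with pandas. The toy DataFrame below is hypothetical (the column names ‘Age’ and ‘Savings_accounts’ just mirror the examples in this list), and the IQR rule used for outliers is one common choice among many:

```python
import pandas as pd

# Toy dataset with deliberate issues: a missing value and an
# implausible 'Age' of 112 (likely a typo for 12).
df = pd.DataFrame({
    "Age": [25, 34, 112, 12, 41],
    "Savings_accounts": ["little", None, "rich", "no_account", "moderate"],
})
# Simulate an accidental duplicate of the second record
df = pd.concat([df, df.iloc[[1]]], ignore_index=True)

# 1. Duplicated records
dup_count = df.duplicated().sum()

# 2. Missing values per column
missing = df.isna().sum()

# 3. Simple outlier screen on a numeric column via the IQR rule
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Age"] < q1 - 1.5 * iqr) | (df["Age"] > q3 + 1.5 * iqr)]

print(f"duplicates: {dup_count}")
print(f"missing values:\n{missing}")
print(f"suspicious Age values: {outliers['Age'].tolist()}")
```

Here the screen flags 112 as suspicious but keeps 12, which matches the intuition that only the typo, not the plausible value, should be reviewed.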
Now that we are aware of some of the common data entry issues, let's discuss how we can avoid them:
- Ideally, quality data would be ensured by putting validations in place at the source itself. But the source often cannot be controlled, so let us do our best to catch issues even at a later stage.
- Write a script that performs data quality checks for the possible presence of errors and flags them for the user. If an issue is flagged but left unresolved, it is best practice to include a comment for the end-user’s analysis.
- Create checkpoints, for example, removing duplicates to make distinct records. Or, applying checksum or total fields to check for data consistency.
- Date errors: Months should range from 1 to 12, and the delivery date of a product should not fall before the order date.
- Perform statistical tests to spot outliers and judge whether they are mistakes or legitimate. There are instances where a variable contains values that cannot exist, given the nature of the data. For example, blood pressure readings cannot be negative.
- Variable names can get duplicated through whitespace issues, e.g. ‘Credit History’ vs ‘CreditHistory’. Similar issues arise in variable values: a month column could contain both ‘March’ and the abbreviated ‘Mar’, or ‘September’ and ‘Sept’.
- Missing values: Are the missing values of one variable correlated with those of another, or are they missing at random? For example, the quantity a customer requested and the quantity ultimately confirmed by the vendor are generally highly correlated, and so is their missingness.
- Major flag: If rows or certain columns have a large percentage of values missing, highlight them and omit them from data modeling.
- Receiving all fields merged into a single flat file is bliss, which unfortunately isn’t always the case. At times, a lot of data manipulation, joins, and merges are required, along with domain expertise. One such example can be read here.
- Another caution when merging multiple files is to keep track of the shape of the data after every merge. Also watch any variable, say variable1, that was present in both input data frames before the merge: it gets renamed to variable1_x and variable1_y.
- It is good practice to maintain a business document that records all metadata about the variables, their types, and their values.
- There are huge issues when one is specifically working on OCR projects; a few are:
a) 1 and l, 1 and !, 1 and I.
b) 8 and 5
c) Data loss because of text overlap, generally due to stamp/handwritten text over printed documents.
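The validation ideas above can be sketched as a small flagging script. The table, column names, and thresholds below are all hypothetical, chosen only to illustrate the checks; the final lines also demonstrate the variable1_x / variable1_y renaming that pandas applies on a merge:

```python
import pandas as pd

# Hypothetical table with deliberate issues: a delivery before its order,
# a value that cannot exist, a column name with trailing whitespace, and
# inconsistent month labels.
df = pd.DataFrame({
    "order_date":     pd.to_datetime(["2021-03-01", "2021-03-05"]),
    "delivery_date":  pd.to_datetime(["2021-03-04", "2021-03-02"]),
    "blood_pressure": [120, -80],
    "month ":         ["Mar", "March"],
})

flags = []  # collected issues, to be reported to the end-user

# Date check: delivery must not precede the order
bad_dates = df["delivery_date"] < df["order_date"]
if bad_dates.any():
    flags.append(f"{bad_dates.sum()} delivery date(s) before order date")

# Values that cannot exist given the nature of the data
if (df["blood_pressure"] < 0).any():
    flags.append("negative blood pressure reading(s)")

# Normalise column names and categorical labels
df.columns = df.columns.str.strip()
df["month"] = df["month"].replace({"Mar": "March", "Sept": "September"})

# Columns with a large share of missing values (50% is an arbitrary cutoff)
sparse = [c for c in df.columns if df[c].isna().mean() > 0.5]
if sparse:
    flags.append(f"mostly-missing columns: {sparse}")

print(flags)

# Merge caution: a column present in both frames is suffixed _x / _y
left = pd.DataFrame({"id": [1, 2], "variable1": ["a", "b"]})
right = pd.DataFrame({"id": [1, 2], "variable1": ["c", "d"]})
merged = left.merge(right, on="id")
print(merged.columns.tolist())
```

In a real pipeline, the flags list would be written to a report or log rather than printed, so unresolved issues travel with the data to the end-user.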
I have tried to cover all the data quality and assessment checks that I generally follow. If you wish to add more, or there are key data checks that I have missed, please do highlight them. I would be happy to learn from you 😊
Thanks for reading.