Bad Data Equals Bad Predictive Model
Always question the source and quality of your data before using it for analysis or model building
Data is key to any data science or machine learning task. It comes in different flavors, such as numerical, categorical, text, image, sound, and video data. The predictive power of a model depends on the quality of the data used to build it. Before performing any data science task, such as exploratory data analysis or model building, it is therefore essential to ask yourself the following questions:
1. What is the source of my data?
Data used for analysis or model building could be obtained from several sources:
a) Purchase of raw data from organizations or companies that mine and store data
b) Data from surveys
c) Data from experiments
d) Data from sensors
e) Simulated data
f) Data from the internet
Whatever the source of your data, it’s important that you understand how the data was collected. For example, data collected from surveys may contain lots of missing values, and participants may provide false information about themselves. If the data was simulated, how well does it match the real sample? For instance, the simulation may have assumed the sample data was normally distributed, which may not be the case. Sometimes it is appropriate for the data scientist to work closely with engineers and other personnel for guidance on the source and reliability of the data.
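As an illustration, here is a minimal sketch of how you might check such a normality assumption and compare simulated data against the real sample, assuming SciPy is available; the `sample` and `simulated` arrays below are hypothetical stand-ins for your own data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Stand-in for the real sample the simulation was calibrated against
# (deliberately non-normal here).
sample = rng.exponential(scale=2.0, size=500)

# Simulated data generated under a (possibly wrong) normality assumption.
simulated = rng.normal(loc=sample.mean(), scale=sample.std(), size=500)

# Is the original sample plausibly normal? A small p-value says no.
_, p_normal = stats.normaltest(sample)
print(f"normality test on sample: p = {p_normal:.4f}")

# Do the simulated and real distributions actually match?
_, p_ks = stats.ks_2samp(sample, simulated)
print(f"two-sample KS test: p = {p_ks:.4f}")
```

A small p-value in either test is a signal to go back and question how the simulated data was generated before using it any further.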
2. What is the quality of my data?
Several factors can affect the quality of your data, such as:
a) Error in Data Collection
Data collection can introduce errors at different levels. For instance, a survey may be designed for collecting data, but participants do not always provide accurate information: a participant may enter the wrong age, height, marital status, income, etc. Errors can also occur in the system designed for recording and collecting the data; for instance, a faulty sensor could cause a thermometer to record erroneous temperature readings.
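As a sketch of catching such entry errors early, one can run simple range checks on each field; the column names and plausible ranges below are assumptions for illustration, not authoritative limits:

```python
import pandas as pd

# Hypothetical survey responses; 199, -3, and 300 are implausible entries.
df = pd.DataFrame({
    "age":       [25, 199, 42, -3, 31],
    "height_cm": [170, 165, 300, 180, 172],
})

# Assumed plausible ranges for each field.
rules = {"age": (0, 120), "height_cm": (50, 250)}

# Flag any row that violates a range rule.
for col, (lo, hi) in rules.items():
    bad = ~df[col].between(lo, hi)
    if bad.any():
        print(f"{col}: {bad.sum()} out-of-range value(s)")
        print(df.loc[bad, [col]])
```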
b) Error in Data Storage
Storing data can introduce errors, as some data could be saved incorrectly or lost during the storage process.
c) Error in Data Retrieval
Retrieving data can also introduce errors, as part of the data may be missing or corrupted.
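One hedge against silent storage and retrieval errors is to verify the data after a save/load round trip. The sketch below assumes a local CSV file with a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({"sensor_id": [1, 2, 3], "reading": [20.1, 19.8, 20.4]})
df.to_csv("readings.csv", index=False)  # hypothetical storage step

restored = pd.read_csv("readings.csv")  # retrieval step

# Basic integrity checks: same shape, same columns, no values lost.
assert restored.shape == df.shape, "row/column count changed"
assert list(restored.columns) == list(df.columns), "columns changed"
assert restored.isnull().sum().sum() == 0, "missing values appeared"
pd.testing.assert_frame_equal(restored, df)  # raises if any value differs
print("round trip OK")
```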
d) Data Imputation Error
Often, removing samples or dropping entire feature columns is simply not feasible, because we might lose too much valuable data. In this case, we can use interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common techniques is mean imputation, where we simply replace a missing value with the mean of the entire feature column. Other options are median imputation and most-frequent (mode) imputation; the latter replaces missing values with the most frequent value in the column, which makes it useful for categorical features. Whatever imputation method you employ, keep in mind that imputation is only an approximation and can therefore introduce error into the final model. If the data supplied to you was already preprocessed, find out how missing values were handled: What percentage of the original data was discarded? What imputation method was used to estimate missing values?
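As a minimal sketch of these strategies, scikit-learn’s SimpleImputer supports mean, median, and most-frequent imputation; the toy DataFrame below is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, 58_000, np.nan],
    "color":  ["red", "blue", np.nan, "blue", "blue"],
})

# Report what fraction of each column is missing before imputing.
print(df.isnull().mean())

# Mean imputation for the numeric column (strategy="median" also works).
num_imputer = SimpleImputer(strategy="mean")
df["income"] = num_imputer.fit_transform(df[["income"]]).ravel()

# Most-frequent (mode) imputation for the categorical column.
cat_imputer = SimpleImputer(strategy="most_frequent")
df["color"] = cat_imputer.fit_transform(df[["color"]]).ravel()

print(df)
```

Reporting the fraction of missing values before imputing, as above, is also how you answer the “what percentage was discarded or estimated” question for anyone who inherits the data.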
3. Does my data depend on space and time?
It’s important to know about the space and time dependence of your data. Was the data collected at different geographic locations? When was the data collected? Does it vary with time? Knowing the spatial and temporal dependence of the data will help you decide what kind of model is suitable for it.
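As a rough sketch of probing time dependence, one can look at aggregates over time and at autocorrelation; the synthetic daily series below is a hypothetical stand-in for real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=365, freq="D")

# Synthetic daily readings: a seasonal component plus random noise.
values = 10 + 5 * np.sin(2 * np.pi * idx.dayofyear / 365) \
         + rng.normal(0, 1, len(idx))
df = pd.DataFrame({"value": values}, index=idx)

# Monthly means: a visible pattern suggests the data is time-dependent,
# which argues for a time-aware model and a time-based train/test split.
print(df["value"].groupby(df.index.month).mean())

# Lag-1 autocorrelation: values near 0 suggest little serial dependence.
print("lag-1 autocorrelation:", round(df["value"].autocorr(lag=1), 3))
```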
Case Study: Real-World Example
I worked on a data science project with an industry partner. I will not disclose the full details of the project here, but it involved using data generated in real time to predict anomalies in the operation of a very important device. My team had to work with a multidisciplinary team of mechanical engineers, electrical engineers, systems engineers, and field technicians to frame the right questions to address with the available data. Because we lacked domain knowledge, we relied on the engineers and other personnel to help us determine which features were crucial and what the target variable was. By looking at the data from different angles and performing several data preprocessing tasks, we found that a huge chunk of the available data had been collected from faulty sensors. After several meetings and discussions, the engineers were able to generate a new set of reliable, error-free data. The lesson learnt from this experience is that before building any model, you must understand your data very well: understand how it was collected and, if necessary, ask questions of the people who collected it. That way you can ensure that the data is error-free and of high quality.
In academic training programs, we are often provided with a clean dataset that is error-free. In a real-world data science project, the data has to be scrutinized and evaluated carefully to ensure that the source is reliable, and that the data is error-free and of high quality. Using high-quality data for analysis or model building has four main advantages:
Advantages of high-quality data
a) High-quality data is less likely to produce errors.
b) High-quality data can lead to low generalization error, i.e., the model captures real-life effects and can be applied to unseen data for predictive purposes.
c) High-quality data produces reliable results with small uncertainties, i.e., the data captures real effects with little random noise.
d) High-quality data containing a large number of observations reduces variance error: the variance of a sample mean scales as σ²/n, so its standard error falls off as 1/√n as the sample grows (illustrated in the sketch after this list).
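Here is a minimal sketch of that last point, showing empirically that the spread of a sample mean shrinks roughly as 1/√n:

```python
import numpy as np

rng = np.random.default_rng(1)

for n in [100, 1_000, 10_000]:
    # Draw 1,000 samples of size n and look at the spread of their means.
    means = rng.normal(loc=0.0, scale=1.0, size=(1_000, n)).mean(axis=1)
    print(f"n={n:>6}: std of sample mean = {means.std():.4f} "
          f"(theory 1/sqrt(n) = {1 / np.sqrt(n):.4f})")
```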
In summary, we’ve discussed why it’s extremely important to fully understand your data before using it for analysis or model building. Always carefully scrutinize and evaluate your data for reliability and quality first. Keep in mind that bad data will produce flawed and unreliable predictive models.
For questions and inquiries, please email me: email@example.com