Image Source: Benjamin O. Tayo

Bad Data Equals Bad Predictive Model

Always question the source and quality of your data before using it for analysis or model building

Benjamin Obi Tayo Ph.D.
Mar 7 · 5 min read

Data is key to any data science and machine learning task. Data comes in different flavors, such as numerical, categorical, text, image, sound, and video data. The predictive power of a model depends on the quality of the data used to build it. Before performing any data science task, such as exploratory data analysis or model building, it is therefore extremely important to ask yourself the following questions:

1. What is the source of my data?

a) Purchase of raw data from organizations or companies that mine and store data

b) Data from Surveys

c) Data from Experiments

d) Data from Sensors

e) Simulated Data

f) Data from the Internet

Whatever the source of your data, it’s important to understand how the data was collected. For example, data collected from surveys may contain many missing values, and participants may provide false information. If the data was simulated, how well does the simulated data compare with the sample data? For instance, the simulated data could have been generated under the assumption that the sample data is normally distributed, which may not be the case. Sometimes it is appropriate for the data scientist to work closely with engineers and other personnel to understand the source and reliability of the data.
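The point about simulated data can be made concrete with a small check. The sketch below is only an illustration, not a method from the original article: it assumes a hypothetical right-skewed "observed" sample (drawn here from a log-normal distribution), generates simulated data under a normality assumption, and compares the skewness of the two. The helper name sample_skewness and all parameter values are assumptions chosen for the example.

```python
import random
import statistics

def sample_skewness(values):
    """Return the (biased) sample skewness: the mean cubed z-score."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "real" sample: strictly positive and right-skewed.
observed = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Simulated data generated under the (wrong) assumption of normality,
# reusing the mean and standard deviation of the observed sample.
mu = statistics.fmean(observed)
sigma = statistics.stdev(observed)
simulated = [rng.gauss(mu, sigma) for _ in range(5000)]

skew_observed = sample_skewness(observed)
skew_simulated = sample_skewness(simulated)

print(f"skewness of observed sample: {skew_observed:.2f}")
print(f"skewness of simulated data:  {skew_simulated:.2f}")
print(f"minimum of observed sample:  {min(observed):.2f}")
print(f"minimum of simulated data:   {min(simulated):.2f}")
```

The simulated data matches the observed mean and standard deviation, yet it is roughly symmetric and contains negative values that the observed sample never produces. A model built on such simulated data would learn from values that cannot occur in practice.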

2. What is the quality of my data?

a) Error in Data Collection

b) Error in Data Storage

c) Error in Data Retrieval

d) Data Imputation Error
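As one illustration of how imputation can introduce error, the sketch below fills missing values with the observed mean and compares the resulting spread with that of the complete data. It is a minimal example under stated assumptions: the measurements are hypothetical, about 30% of the values are removed at random, and mean imputation stands in for whatever imputation method a real project might use.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical complete measurements (in practice these are never seen).
complete = [rng.gauss(50.0, 10.0) for _ in range(1000)]

# Remove roughly 30% of the values at random to mimic missing entries.
with_gaps = [v if rng.random() > 0.3 else None for v in complete]

# Mean imputation: replace every missing entry with the observed mean.
observed = [v for v in with_gaps if v is not None]
fill_value = statistics.fmean(observed)
imputed = [fill_value if v is None else v for v in with_gaps]

std_complete = statistics.stdev(complete)
std_imputed = statistics.stdev(imputed)

print(f"missing entries:             {with_gaps.count(None)}")
print(f"std. dev. of complete data:  {std_complete:.2f}")
print(f"std. dev. after imputation:  {std_imputed:.2f}")
```

Because every imputed entry sits exactly at the mean, the imputed dataset looks less variable than the real one. Any uncertainty estimate computed from it would be too optimistic.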

3. Does my data depend on space and time?

Case Study: Real-World Example

In academic training programs, we are often provided with a clean dataset that is error-free. In a real-world data science project, the data has to be scrutinized and evaluated carefully to ensure that the source is reliable, and that the data is error-free and of high quality. Using high-quality data for analysis or model building has 4 main advantages:

Advantages of high-quality data

b) High-quality data can lead to low generalization error, i.e., the model captures real-life effects and can be applied to unseen data for predictive purposes.

c) High-quality data will produce reliable results due to small uncertainties, i.e., the data captures real effects and contains less random noise.

d) High-quality data containing a large number of observations will reduce variance error (for independent observations, the variance of an estimate such as the sample mean falls in proportion to 1/n as the sample size n grows).
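The effect of sample size on variance can be checked with a short simulation. The sketch below is an illustration under simple assumptions (independent draws from a standard normal distribution, with a hypothetical helper named spread_of_sample_means): it repeatedly draws samples of size n, records each sample mean, and compares the spread of those means with the theoretical value 1/sqrt(n).

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

def spread_of_sample_means(sample_size, repeats=500):
    """Standard deviation of the sample mean over many repeated samples."""
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

# Measure the spread for increasing sample sizes and compare it with
# the theoretical value sigma / sqrt(n), where sigma = 1 here.
spreads = {n: spread_of_sample_means(n) for n in (10, 100, 1000)}

for n, spread in spreads.items():
    print(f"n={n:5d}  observed={spread:.4f}  theory={1 / math.sqrt(n):.4f}")
```

Each hundredfold increase in the number of observations cuts the spread of the estimate by about a factor of ten. This holds only when the observations are independent and of comparable quality. Adding many noisy or biased records does not give the same benefit.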

In summary, we’ve discussed why it’s extremely important to fully understand your data before using it for further analysis or model building. Always scrutinize and evaluate your data carefully for reliability and quality first. Keep in mind that bad data will produce flawed and unreliable predictive models.

Additional Data Science/Machine Learning Resources

Essential Maths Skills for Machine Learning

3 Best Data Science MOOC Specializations

5 Best Degrees for Getting into Data Science

5 reasons why you should begin your data science journey in 2020

Theoretical Foundations of Data Science — Should I Care or Simply Focus on Hands-on Skills?

Machine Learning Project Planning

How to Organize Your Data Science Project

Productivity Tools for Large-scale Data Science Projects

A Data Science Portfolio is More Valuable than a Resume

Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot

Data Science 101 — A Short Course on Medium Platform with R and Python Code Included

For questions and inquiries, please email me: benjaminobi@gmail.com

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building, and implementing AI.

Written by Benjamin Obi Tayo Ph.D.

Physicist, Data Science Educator, Writer. Interests: Data Science, Machine Learning, AI, Python & R, Predictive Analytics, Materials Sciences, Biophysics
