Data Analytics 101 Series — The ‘Process’ Phase

Adith Narasimhan Kumar
Published in Analytics Vidhya
4 min read · Aug 18, 2023

Did you know that bad data, or poor data quality, costs US businesses an estimated $600 billion annually? Data is one of the most influential and important currencies in the world, but real-world data is rarely perfect: almost any collection of data contains inaccuracies and errors. It is the data analyst’s job to find these errors and rectify them. Want to know how the process works? Let’s find out!

Imagine this: you are a data analyst, hard at work, and you receive a dataset. Voila! The data is clean! And then you wake up.

Clean, perfect data is very rare and almost impossible to find. Real-world data is often messy and contains many inconsistencies that we have to deal with, especially because it can come from first, second, and third parties. In this article, we’ll look at the most common inconsistencies and errors found in a dataset and how to rectify them. Correcting them is key because they directly impact the outcome of our analysis. This can be a time-consuming and challenging task, but it is essential for ensuring that the data is accurate and reliable.

Once the data is cleaned, it needs to be transformed so that we can find patterns in it and draw possible explanations from them. This involves getting the data into a format that suits our analysis. Key tasks include changing data types, removing outliers, and creating derived variables that add context and meaning to the analysis. The specific steps may vary with the dataset, but at a high level they remain the same.
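To make "derived variables" a little more concrete, here is a minimal sketch using pandas. The table, its column names (`order_date`, `unit_price`, `quantity`) and the two derived columns are made up for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical order data, invented for this example.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-07-30", "2023-08-02", "2023-08-15"]),
    "unit_price": [4.0, 2.5, 10.0],
    "quantity": [3, 4, 1],
})

# Derived variable 1: revenue per order, computed from two existing columns.
orders["revenue"] = orders["unit_price"] * orders["quantity"]

# Derived variable 2: the month of the order, handy for grouping by month later.
orders["order_month"] = orders["order_date"].dt.month

print(orders)
```

Neither new column adds information that was not already in the table, but both make later questions such as "how much revenue came in each month?" much easier to answer.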

Common Errors in a Dataset

Some errors are specific to a particular dataset, but most are common and can turn up in any dataset. These include:

1. Null Values:

Null values are empty or missing values in a dataset. They occur when a particular value was not captured or was corrupted.

Null value table

Solution: Remove the rows with missing values, or impute the missing values, for example by using the column mean as a substitute.
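As a rough sketch of the two options above, assuming the data sits in a pandas DataFrame (the `name` and `age` columns are hypothetical sample data):

```python
import pandas as pd

# Hypothetical sample data; None marks a value that was never captured.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [25.0, None, 35.0, 30.0],
})

# First, count the missing values in each column.
missing_per_column = df.isna().sum()

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute, here by substituting the column mean.
age_mean = df["age"].mean()  # mean() skips missing values by default
imputed = df.assign(age=df["age"].fillna(age_mean))

print(missing_per_column)
print(dropped)
print(imputed)
```

Which option fits depends on the data: dropping rows is simple but loses information, while imputing keeps every row at the cost of introducing values that were never observed.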

2. Duplicate Rows:

Certain rows may be duplicated, which increases file size and adds redundancy to the data. Duplicates can also lead to inaccurate results, for example by counting the same record twice.

table with duplicated rows

Solution: Remove duplicate rows, keeping only the first occurrence. If repeated records disagree on a numeric value, one option is to collapse them into a single row using the median.
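Here is one possible way to do this with pandas. The `order_id` and `amount` columns are invented for the example, and treating `order_id` as the key that identifies a record is an assumption.

```python
import pandas as pd

# Hypothetical sample data in which order 102 was recorded twice.
df = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [250, 400, 400, 150],
})

# Flag rows that exactly repeat an earlier row.
is_repeat = df.duplicated(keep="first")

# Drop the repeats and keep the first occurrence of each row.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)

# Alternative: one row per order_id, using the median amount.
# This is mainly useful when repeated records disagree on the value.
collapsed = df.groupby("order_id", as_index=False)["amount"].median()

print(deduped)
print(collapsed)
```

Note that `drop_duplicates` only removes rows that match in every column; records that differ slightly need a rule such as the median step to decide which value survives.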

3. Inconsistent data types:

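Inconsistent data types occur when a column holds values of a type other than the one it is meant to hold, for example numbers or dates stored as text, or a single column that mixes several types. This causes errors when these columns are used to produce derived variables.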

table with inconsistent data

Solution: Convert each column to the data type it is supposed to hold.
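A small sketch of such a conversion in pandas, assuming a hypothetical table where prices and dates arrived as text:

```python
import pandas as pd

# Hypothetical sample data: everything was read in as text.
df = pd.DataFrame({
    "price": ["10.5", "20", "n/a", "7.25"],
    "order_date": ["2023-08-01", "2023-08-02", "2023-08-03", "2023-08-04"],
})

# Convert prices to numbers. errors="coerce" turns entries that cannot be
# parsed (such as "n/a") into missing values instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Convert the date strings to real datetime values.
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d")

print(df.dtypes)
```

Coercing bad entries to missing values is a design choice: it keeps the conversion from failing, but it also means the null-value step has to be revisited afterwards.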

4. Outliers:

Outliers are data points that differ significantly from the rest of the data in a dataset. They are also known as anomalies or aberrations. Outliers can be caused by errors in data collection or data entry, although some are genuine extreme values.

Solution: Remove the outliers, impute them, or treat them by transforming the data.
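The article does not say how to detect outliers, so the sketch below assumes the common 1.5 × IQR rule as one possible choice. The `delivery_minutes` values are made up, and capping with `clip` is just one way to read "transforming".

```python
import pandas as pd

# Hypothetical sample data with one obviously extreme value.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0],
                   name="delivery_minutes")

# Assumed detection rule: anything beyond 1.5 * IQR from the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Option 1: remove the outliers.
removed = values[~is_outlier]

# Option 2: impute the outliers, here with the median.
imputed = values.where(~is_outlier, values.median())

# Option 3: transform, here by capping values at the IQR fences.
clipped = values.clip(lower=lower, upper=upper)

print(is_outlier.sum(), "outlier(s) found")
```

Before removing anything, it is worth checking whether a flagged point is an entry error or a real extreme value, since the right treatment differs.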

Conclusion

The process phase is one of the most crucial phases of the data analytics process, if not the most crucial. Clean data plays a vital role because it influences the conclusions we draw and the decisions we make. Data cleaning is time-consuming, but it is worth the effort, as clean data produces great results!

Hope you liked my article! Do share your thoughts and inputs, and I’ll try to address them to the best of my knowledge in future articles!

Happy Learning!

Check out my other articles on Blockchain and Machine Learning/Deep Learning. Let me know about any other topics to cover in the future!

Catch my previous article here 👇
