Pipeline for Exploratory Data Analysis and Data Cleaning.
Exploratory data analysis (EDA), according to Wikipedia, is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. In a data science context, EDA refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Data cleaning, on the other hand, is the process of making sure a given data set is error-free, consistent, and usable by detecting any errors or corruptions in the data and correcting, deleting, or manually processing them as needed, so that they do not corrupt the final analysis.
EDA and data cleaning go pari passu as we prepare our data for analysis. Their importance cannot be overemphasized when carrying out data science tasks on a data set. Hence the need for this article.
In this article I am going to walk through the various steps a data scientist should perform on their data and the importance of those steps in obtaining a clean data set for analysis. The steps will be highlighted by bullet points and will combine both EDA and data cleaning techniques. I have documented all the code examples I'm going to be using here, so you can understand and follow through.
- Visual inspection of data: This involves viewing the first five and last five rows of the data frame to get an idea of how the data is presented, as shown below.
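A minimal sketch of this step, assuming a hypothetical CSV file named airquality.csv loaded into a pandas DataFrame called df:

```python
import pandas as pd

# "airquality.csv" is a hypothetical file name for illustration.
df = pd.read_csv("airquality.csv")

print(df.head())  # first five rows
print(df.tail())  # last five rows
```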
From the output above we can already spot some missing data.
- Carry out summary statistics: This gives us a quick and simple description of the data, such as the mean, median, mode, and variance, and outliers can also be spotted. It can be performed on columns that contain numeric data, as in the sketch below.
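A minimal sketch, reusing the hypothetical df from above; describe() covers most of the summary statistics in one call:

```python
# count, mean, std, min, quartiles and max for every numeric column;
# unusually large or small values hint at outliers.
print(df.describe())

# Median and mode have to be requested separately.
print(df.median(numeric_only=True))
print(df.mode().iloc[0])
```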
- Converting to suitable data types: Data types may come in the wrong format, which makes analysis difficult; for instance, a column that should contain numeric data may be stored as strings, or a column that should be categorical may be stored as plain strings. Hence it is always necessary to convert columns to suitable data types to make analysis easier.
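A sketch of both conversions, assuming hypothetical columns temperature (numbers stored as strings) and city (categories stored as strings):

```python
# errors="coerce" turns unparseable entries into NaN instead of raising.
df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")

# The category dtype saves memory and marks the column as categorical.
df["city"] = df["city"].astype("category")
```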
- Frequency count: This shows us the count of each unique value, whether object or numeric, in a column.
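For example, with the hypothetical city column:

```python
# Counts of each unique value; dropna=False also counts missing entries.
print(df["city"].value_counts(dropna=False))
```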
- Data visualization using Matplotlib: This package is used to visualize data in Python. It provides statistical plots such as the box plot, histogram, scatter plot, etc. A histogram shows the distribution of a single numeric column, a scatter plot shows the relationship between two numeric variables, and a box plot gives basic summary statistics for a numeric variable across the levels of a non-numeric variable. All three are sketched below.
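A sketch of all three plots, again assuming the hypothetical temperature, humidity, and city columns:

```python
import matplotlib.pyplot as plt

# Histogram: distribution of a single numeric column.
df["temperature"].plot(kind="hist")
plt.show()

# Scatter plot: relationship between two numeric columns.
df.plot(kind="scatter", x="temperature", y="humidity")
plt.show()

# Box plot: a numeric variable summarized per category.
df.boxplot(column="temperature", by="city")
plt.show()
```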
- Checking for tidy data: Data frames have to be checked to see whether they comply with the tidy data principle. What is the tidy data principle, you may ask? It states that columns represent separate variables, rows represent individual observations, and each type of observational unit forms a table. This gives a standard way to organize data values within a data set. For further reading, Hadley Wickham's paper on tidy data is worth the read. To solve the problem of untidy data, the pandas function melt always comes in handy. The example below shows a before and after of melting; after melting you can always parse out the desired column variable.
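A small made-up before-and-after, with one column per year as the untidy starting point:

```python
import pandas as pd

# Wide (untidy) data: each year is its own column.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2018": [10, 20],
    "2019": [12, 24],
})

# melt() turns the year columns into (year, cases) rows,
# so each row becomes a single observation.
tidy = pd.melt(wide, id_vars="country",
               var_name="year", value_name="cases")
print(tidy)
```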
- Combining suitable data sets: Data may not always come in one data file or table for you to load. Sometimes this is because it is easier to share and store data files that way; other times it is because there is a new data file for each day, stock series data being a ready example. Whatever the case may be, it is always best practice to combine these data sets for analysis. Pandas' concat and join functions, together with Python's glob module, are very useful here: glob can find files matching a particular pattern, and concat can stack them into a single data frame. An example is shown below:
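A sketch using a hypothetical file pattern data/stock_*.csv:

```python
import glob
import pandas as pd

# Find every daily file matching the pattern.
files = glob.glob("data/stock_*.csv")

# Read each file and stack the frames row-wise into one data frame.
frames = [pd.read_csv(f) for f in files]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```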
From the block of code above we can see how the matching data sets are combined into one. This is the beauty of pairing glob with pandas. The full code is accessible here
- Handling inconsistent data entries: Inconsistent data entries are basically similar entries that are stored in ways that make them look dissimilar due to factors such as white space, different letter cases, punctuation marks, etc. It is always necessary to match and combine all such entries, as in the sketch below.
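A minimal sketch on the hypothetical city column, normalizing white space, case, and punctuation so that variants like " Lagos." and "lagos" end up identical:

```python
df["city"] = (df["city"]
              .str.strip()                          # remove stray whitespace
              .str.lower()                          # unify letter case
              .str.replace(".", "", regex=False))   # drop punctuation
```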
- String manipulation: Most data cleaning will involve string manipulation, because most of the world's data is unstructured text. String manipulation is an essential way of obtaining numeric data from strings. Python's built-in regular expression module, re, always comes in handy here: a regex uses pattern matching to match the specified variable and can be reused as often as needed. An example of string manipulation using regex is shown below.
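A small example pulling dollar amounts out of messy strings with the re module; the strings and pattern are made up for illustration:

```python
import re

prices = ["total: $12.50", "total: $7.99"]

# \$(\d+\.\d{2}) matches a dollar sign, then digits, a dot, two digits.
pattern = re.compile(r"\$(\d+\.\d{2})")
amounts = [float(pattern.search(p).group(1)) for p in prices]
print(amounts)  # [12.5, 7.99]
```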
- Duplicate and missing data: Duplicate and missing data are problematic because they can affect your analysis in undesirable ways. It is always necessary to find out why data is missing or duplicated; whatever the reason, duplicates can be dropped, and missing data can be dropped or filled using a summary statistic or binary values, i.e. ones and zeros. Whichever method you choose, it is always best to know why the data is duplicated or missing in the first place. A sketch follows below.
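A sketch of both options on the hypothetical columns from earlier:

```python
# Drop exact duplicate rows.
df = df.drop_duplicates()

# Fill missing numeric values with a summary statistic...
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# ...or drop rows whose missing values cannot be recovered.
df = df.dropna(subset=["city"])
```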
- Using assert statements: An assert statement is a simple way to verify certain information in our data set. It helps us detect errors early and gives us confidence that our code is running correctly. Assert statements work like this: if you give one a statement that is true, it returns nothing; if you give it a statement that is false, it raises an error. Below is a code block with assert statements.
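A couple of assertions on the hypothetical df; the first passes silently once all missing values are handled, the second raises an AssertionError if any temperature is negative:

```python
# True for every cell, so this returns nothing.
assert df.notnull().all().all()

# Raises AssertionError if the condition is False anywhere.
assert (df["temperature"] >= 0).all()
```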
Conclusion
The exploratory data analysis (EDA) and data cleaning techniques listed in this article are among the various techniques used in preparing your data for analysis. It is important to note that not all of them will work for every data set, so it is always important to study your data holistically to know which techniques to employ. You can find the code for each technique stated in this article in this GitHub repository.
If you have any comments or additions, be sure to leave them in the comment section, and if you enjoyed the article, do not forget to give a round of applause.