Visualising data to help with data cleaning

Rani Khalil
5 min read · Aug 16, 2020


The role of a data scientist involves uncovering hidden relationships in massive amounts of structured or unstructured data, with the aim of meeting or refining certain business criteria. In recent times the importance of this role has been greatly magnified, as businesses look to expand their insight into the market and their customers using easily obtainable data.

It is the data scientist's job to take that data and return a deeper understanding of the business problem or opportunity. This often involves scientific methods such as machine learning (ML) or neural networks (NN). While these structures can find meaning in thousands of data points far faster than a human can, they become unreliable if the data fed into them is messy.

Messy data can have very negative consequences for your models. It comes in many forms, including:

Missing data:

Represented as ‘NaN’ (short for Not a Number) or as ‘None’, the Python singleton object.
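Pandas treats the two representations uniformly when detecting gaps, as this small sketch shows:

import numpy as np
import pandas as pd

# pandas stores both np.nan and None as missing values in a float column
s = pd.Series([1.0, np.nan, None])
print(s.isnull())  # flags both gaps uniformly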

Sometimes the best way to spot these problems is also the simplest.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic training set and print a per-column summary
df = pd.read_csv('train.csv')
df.info()

A quick inspection of the returned summary shows that the non-null count is inconsistent across columns: most hold 891 entries, but ‘age’, ‘cabin’ and ‘embarked’ hold fewer, a clear sign of missing information. We also notice some fields are of type “object”; we'll look at that next.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
survived 891 non-null int64
pclass 891 non-null int64
name 891 non-null object
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
ticket 891 non-null object
fare 891 non-null float64
cabin 204 non-null object
embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB

Alternatively, you can plot the missing values on a heatmap using seaborn, though this can be very time consuming for big dataframes.

sns.heatmap(df.isnull(), cbar=False)
Heatmap of missing data
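For large dataframes, a plain per-column count is much quicker than rendering a heatmap, and once the gaps are located the simplest remedies often suffice. As a hedged sketch using the Titanic columns above (median imputation and the drop decisions are assumptions to illustrate the idea, not rules):

# Quicker than a heatmap on big dataframes: count the gaps per column
missing = df.isnull().sum()
print(missing[missing > 0])

# Simple remedies, tuned to what the counts show
df['age'] = df['age'].fillna(df['age'].median())  # moderate gaps: impute with the median
df = df.drop(columns=['cabin'])                   # mostly empty: drop the column
df = df.dropna(subset=['embarked'])               # only two gaps: drop the rows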

Inconsistent data:

  • Inconsistent column types: Columns in a dataframe can hold different types, such as objects, integers or floats, as we saw above. That is normal in itself, but a mismatch between a column's type and the kind of values it holds can be problematic. Chief among these format types is datetime, used for date and time values, which often arrives stored as plain strings.
  • Inconsistent value formatting: This problem mainly arises in categorical values, where misspellings or typos create spurious categories. It can be checked with the following:
df['age'].value_counts()

This returns the number of times each value appears throughout the dataset, which makes one-off misspellings easy to spot.
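Fixing both issues usually comes down to explicit type conversion and string clean-up. A small sketch (the Titanic columns used here are already clean, so treat it as illustrative; ‘booking_date’ is a hypothetical column):

# Cast an object column that should be numeric; unparseable entries become NaN
df['fare'] = pd.to_numeric(df['fare'], errors='coerce')

# Parse date strings into a proper datetime64 column (hypothetical column)
# df['booking_date'] = pd.to_datetime(df['booking_date'], errors='coerce')

# Normalise categorical text so 'Male', ' male ' and 'MALE' collapse together
df['sex'] = df['sex'].str.strip().str.lower()
df['sex'].value_counts()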

Outlier data:

A dataframe column holds information about a specific feature within the data, so we usually have a basic idea of the range its values should fall in. Take age: we know it will lie between 0 and 100. That does not mean outliers cannot be present within that range.

A simple illustration of this can be seen by graphing a boxplot:

sns.boxplot(x=df['age'])
plt.show()

The values seen as dots on the right-hand side can be considered outliers in this dataframe, as they fall outside the range of commonly witnessed values.
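Those dots are the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles, which is the rule the boxplot whiskers are drawn by. The same rule can flag them directly; a small sketch (the 1.5 multiplier is simply the usual convention):

# Flag values with the same 1.5 * IQR rule the boxplot whiskers use
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['age'] < lower) | (df['age'] > upper)]
print(f'{len(outliers)} potential outliers outside [{lower:.1f}, {upper:.1f}]')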

Multicollinearity:

Multicollinearity is not messy data as such; it simply means that columns or features in the dataframe are correlated with one another. For example, if you had a column for “price”, a column for “weight” and a third for “price per weight”, we would expect high multicollinearity between these fields. This can be solved by dropping some of the highly correlated columns.

f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr(numeric_only=True)  # numeric_only skips the object columns
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax)

In this case we can see that the values do not exceed 0.7 in either direction, so it can be considered safe to continue.
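Had some pairs crossed that mark, the drop could also be done programmatically. A sketch, with 0.7 as an assumed cutoff:

# Drop one column from every pair whose absolute correlation exceeds the cutoff
threshold = 0.7  # assumed cutoff; tune it to your problem
corr_abs = df.corr(numeric_only=True).abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df = df.drop(columns=to_drop)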

Making this process easier:

While data scientists often go through these initial tasks repetitively, the process can be made easier by creating structured functions that allow easy visualisation of this information. Let's try:

from quickdata import data_viz  # File found in repository
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
print(data['DESCR'][:830])
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = data['target']

1- Checking Multicollinearity:

The function below returns a heatmap of collinearity between independent variables as well as with the target variable.

data = independent variable df X

target = dependent variable list y

remove = list of variables not to be included (default as empty list)

add_target = boolean of whether to view heatmap with target included (default as False)

inplace = apply the changes made with remove/add_target to your df (default as False)

*If remove is passed a column name, a regplot of that column against the target is also presented, to help review the change before proceeding*

data_viz.multicollinearity_check(data=X, target=y, remove=['Latitude'], add_target=False, inplace=False)
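For reference, here is a rough sketch of what such a helper could look like. It only illustrates the documented behaviour above; it is not the actual quickdata implementation:

# Illustrative sketch only: not the actual quickdata implementation
def multicollinearity_check(data, target, remove=None, add_target=False, inplace=False):
    remove = remove or []
    # Plot each column slated for removal against the target first,
    # so the relationship can be judged before dropping it
    for col in remove:
        sns.regplot(x=data[col], y=target)
        plt.show()
    if inplace:
        data.drop(columns=remove, inplace=True)
        view = data
    else:
        view = data.drop(columns=remove)
    if add_target:
        view = view.assign(target=target)  # include the target in the heatmap
    f, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(view.corr(), cmap=sns.diverging_palette(220, 10, as_cmap=True),
                square=True, ax=ax)
    plt.show()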

2- Viewing Outliers:
This function returns a side-by-side view of outliers through a regplot and a boxplot visualisation of the input data and target values over a specified split size.

data = independent variable df X

target = dependent variable list y

split_size = adjust the number of plotted rows, as a decimal between 0 and 1 or as an integer

data_viz.view_outliers(data=X, target=y, split_size=0.3)
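Again for reference, a rough sketch of what this helper could look like, not the actual quickdata code:

# Illustrative sketch only: not the actual quickdata implementation
def view_outliers(data, target, split_size=1.0):
    # Interpret split_size as a fraction if <= 1, otherwise as a row count
    n = int(len(data) * split_size) if split_size <= 1 else int(split_size)
    for col in data.columns:
        f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 3))
        sns.regplot(x=data[col][:n], y=target[:n], ax=ax1)  # feature vs target
        sns.boxplot(x=data[col][:n], ax=ax2)                # distribution and outliers
        plt.show()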

It is important that these charts are read by the data scientist and not automated away to the machine. Not all datasets follow the same rules, so it is important that a human interprets the visualisations and acts accordingly.

I hope this short run-through helps you produce clearer visualisations of your data, to better fuel your decisions when data cleaning.

The functions used in the examples above are available here:

Feel free to customise these as you see fit!
