A creative introduction to EDA

Jacopo Pasqualini
MLJC
Oct 18, 2020

What is Exploratory Data Analysis (EDA)?

What should you do when you are given a new, unknown dataset?

Here we are going to review some basic techniques for a successful exploration of data. There are many motivations behind this set of procedures. If you get meaningful insights you may be able to design outperforming models while saving computational effort. If your intuition about the data is clear, it will be easier to develop appropriate hypotheses on which to build a good model. And so on…
These insights will have different consequences on your dataset. You may be led to keep or modify some features, or to remove or create features from the old ones. Such procedures have different names, feature selection among others. Here we will use the term EDA as a (positive) umbrella term.
A good EDA can help you decide where you should spend more or less time: preprocessing, building baselines, modeling, etc.

One important counterpart is data visualization. If there is a pattern in the data, it may be found, with some luck, through an appropriate visualization. Here we will show some EDA-oriented visualization techniques: the resulting plots will be mid-quality, but very easy to implement.

Given a problem you know nothing about, a good strategy is to make up your mind and clarify what you should do. Mostly, we don't need to know that much about the topic, since data science is a very generic discipline and covers a wide range of topics. Just a couple of tips:

1 search other people's solutions to similar problems on GitHub, Medium…
2 Google and search Wikipedia to understand what the features mean

Disclaimer

Keep in mind that EDA is not a linear, sequential procedure. You will never be given a recipe or a list of steps to follow. Here we are in the realm of intuition, and the reader should interpret this notebook as a collection of tips, techniques and good practices.

Some lines of code are not extensively explained and some passages may seem loose ends: they are suggestions, and it is up to you to develop what you find interesting.

We strongly encourage you to use your creativity!

LET’S START

You can download the notebook from the link:

https://github.com/MLJCUnito/HowToTackleAMLCompetition

You will rarely use just one library, but rather a combination of several.
Be bright, and develop your intuition about how to combine at least the most popular Python libraries.

First we import some libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
%matplotlib inline

Here we set a variable, the path that will tell the computer where the data are stored.

You may want to download the dataset directly from the Jupyter notebook, but that is quite a complicated story that depends on the machine you are working on…
You can download the dataset from your browser and move it to your working directory with your favorite procedure.

You can find the dataset here: https://www.kaggle.com/fayomi/advertising/download

DATA_FOLDER = "."  # your path here
adv = pd.read_csv(os.path.join(DATA_FOLDER, 'advertising.csv'))

Let's take a look at the data and check what features it contains.
This step is crucial, because if we already know the research field the features come from, we can start to figure out how to proceed with our analysis, and we may easily figure out what the baselines of the problem are. We need to be as informed as possible about the meaning of the features. It is often useful to answer one question in different ways, i.e. to understand how different commands can help us solve the same problem. Sometimes, further on, these little differences in the output will be helpful!

What can we expect to see when exploring the basic features of a dataframe?
How many entries are there in the dataset?
Which features are contained?

print("Number of entries:", adv.shape[0])
print("Number of features:", adv.shape[1])
print("Your features are: \n")
print(adv.columns)
print("Feature types: \n")
print(adv.dtypes)

The standard access to a dataframe column is `dataframe["featurename"]`. We can have easier access, avoiding string usage, in the format `dataframe.featurename`, as if it were an object attribute.

We can do that if we slightly change the column names, removing all the spaces and renaming the columns with these new strings:

features = list(adv.columns)
for f in features:
    phi = f.replace(' ', '')
    adv = adv.rename(columns={f: phi})
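
As a quick sanity check (our addition, not strictly needed), both access styles should now return the same column:

# Hypothetical check: after renaming, dot access and string access point to the same column
print(adv["DailyTimeSpentonSite"].equals(adv.DailyTimeSpentonSite))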

The following command just shows the first rows of our dataset. In the first column we have the feature names; the remaining columns are the entries of the dataset. This kind of visualization may be more pleasing to the eye than the former ones if the number of features is not too large.

adv.head().T

Another important dataframe method is `info`, which summarizes the dataframe content.
With this command we can even see that there are no missing values in the data which, as we will see in future posts, may be something important to think about.

adv.info()
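
If you prefer an explicit count, the missing-value question can be answered in a different way (a minimal sketch we add here, in the spirit of answering one question with different commands):

# Count missing values per column: all zeros means there is no missing data
print(adv.isnull().sum())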

It is in the spirit of EDA to understand which feature of the raw dataset plays which role.

Since this is an advertisement dataset we can say that there is one feature more important than all the others, ClickedonAd, which is a binary feature. Let's see what it contains:

adv.ClickedonAd.head(11)

Now we can check whether the dataset is balanced, i.e. whether there are as many samples of one kind as of the other:

n = len(pd.unique(adv['ClickedonAd']))

print("Number of unique values:", n)
print(adv['ClickedonAd'].value_counts())

We were lucky! We found that our dataset is perfectly balanced. In real situations this almost never happens, and it is something that absolutely needs to be checked every time we analyze a new dataset.

Now, a brief digression. This perfect balance is a feature of `case study` datasets: data preprocessed by the provider in order to be easier to analyze. Even if they are `toy problems` they are very helpful, because you can learn the basics of EDA without worrying too much about other messy things such as decoding, preprocessing, augmentation and so on.

For the sake of simplicity we can save the target feature in a different object and then remove it from the data.

target = adv.ClickedonAd
adv = adv.drop(["ClickedonAd"],axis=1)

Here we have a binary classification problem!

We can investigate feature correlations by plotting the 2d histogram of a pair of variables. This exploration may help us decide how to proceed with data preprocessing or, in general, with further investigations.

f, ax = plt.subplots(figsize=(10, 10))
sns.kdeplot(adv.Age, adv.DailyTimeSpentonSite, color="b", ax=ax)
sns.rugplot(adv.Age, color="r", ax=ax)
sns.rugplot(adv.DailyTimeSpentonSite, vertical=True, ax=ax)
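
The KDE above smooths the densities; if you want the literal 2d histogram mentioned before, a minimal matplotlib sketch (our addition) could be:

# A raw 2D histogram of the same pair of variables: counts of samples per (Age, time) bin
plt.figure(figsize=(6, 6))
plt.hist2d(adv.Age, adv.DailyTimeSpentonSite, bins=20, cmap="Blues")
plt.xlabel("Age")
plt.ylabel("DailyTimeSpentonSite")
plt.colorbar(label="Counts")
plt.show()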

From the former plot we see that we may proceed with a clustering analysis or, in general, unsupervised learning procedures.

We can extend the former reasoning by plotting all the pairs of numeric features. We could do the same for other feature types; we leave that as an exercise for the reader. A reason for our choice is that pandas can encode integer and categorical features in the same way. This happens because it may not be clear, for some variables such as age, whether they should be treated as integers or as categorical. How to use these amphibious kinds of variables is up to you and your intuition/knowledge about the data.

from pandas.plotting import scatter_matrix

numeric_adv = adv.select_dtypes(include=[float])
numeric_adv.head()

scatter_matrix(numeric_adv, alpha=0.3, figsize=(10, 10))

This plot is not that clear, but it suggests that we should investigate the correlation between `DailyTimeSpentonSite` and `DailyInternetUsage`. A clustering there would identify two communities: the first composed of users who spend a lot of their daily internet time on the site, the other of users who do the opposite.

f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(adv.DailyInternetUsage, adv.DailyTimeSpentonSite, color="b", ax=ax)
sns.rugplot(adv.DailyInternetUsage, color="r", ax=ax)
sns.rugplot(adv.DailyTimeSpentonSite, vertical=True, ax=ax)
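
To make the "two communities" idea concrete, here is a minimal clustering sketch (our addition, assuming scikit-learn is installed); the algorithm and the number of clusters are a choice, not something dictated by the data:

# Hypothetical sketch: split the two features into two groups with k-means
from sklearn.cluster import KMeans

X = adv[["DailyInternetUsage", "DailyTimeSpentonSite"]].values
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="coolwarm", s=10)
plt.xlabel("DailyInternetUsage")
plt.ylabel("DailyTimeSpentonSite")
plt.show()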

We can try to create a new feature, the `faithfulness` of a user. We would like this feature to encode the clustering structure. We expect this feature to exhibit a bimodal histogram, so it will be easier for our algorithm to build correlations between data and target. We try the following combination:

from scipy.stats import norm

adv["Faithfulness"] = (
    adv["DailyTimeSpentonSite"] * adv["DailyInternetUsage"] /
    (adv["DailyTimeSpentonSite"] + adv["DailyInternetUsage"])
)
sns.distplot(adv['Faithfulness'], hist=False, color='r', rug=True);

We have combined the information of two features into a new, maybe better one. At this point we may be tempted to drop the two old features. This may lead to some issues during training, but this is not the place to discuss that problem. We are happy to have found such a new feature, and for the sake of simplicity we will drop the old features and keep only the new one. This procedure is known as feature reduction.
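
If you do decide to drop them, the operation is a one-liner (a sketch of our choice; keep a copy of the original dataframe if you are unsure):

# Keep only the engineered feature and drop the two raw ones it was built from
adv = adv.drop(["DailyTimeSpentonSite", "DailyInternetUsage"], axis=1)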

Now let’s focus on the features encoded by pandas as objects:

object_features = ['AdTopicLine', 'City', 'Country']
adv[object_features].describe(include=['O'])

As we can see from the table above, all the values in the "AdTopicLine" column are unique, while the "City" column contains 969 unique values out of 1000. There are too many unique elements within these two categorical columns, and it is generally difficult to make predictions from an (almost) flat histogram. We can consider those features as noise, since either they have an almost uniform distribution or they are constant, which does not carry information either. In the next post we will see some more robust procedures to detect such features. Because of that, they will be omitted from the analysis.
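
One quick way to quantify "too many unique values" (a small sketch we add here; more robust procedures will come in the next post) is the fraction of unique values per column:

# Fraction of unique values per object column: values close to 1 behave like IDs, i.e. noise
print(adv[object_features].nunique() / len(adv))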

The third categorical variable, i.e. "Country", has a most frequent element (France) that repeats 9 times. Additionally, we can determine the countries with the highest number of visitors. The table below shows the 10 most represented countries in our DataFrame.

(pd.crosstab(index=adv['Country'], columns='count')
   .sort_values(['count'], ascending=False)
   .head(10))

As we have already seen, there are 237 different unique countries in our dataset and no single country is too dominant. A large number of unique elements will not allow a model to establish meaningful relationships. For that reason, this variable will be excluded too: it is too difficult to learn from an (almost) flat distribution, which is equivalent to white noise.

adv = adv.drop(['AdTopicLine', 'City', 'Country'], axis=1)

Now, let's focus on a single feature, *Timestamp*. It contains the date and time of the day when the entry was recorded, in the format *date&hour*. It is likely that, if we split it into new features like *day* and *hour*, we will be able to observe different kinds of patterns in each of them. The idea here is that individual activities are determined by time management: we manage our time differently on the weekly scale than on the 24h scale. It is often a good idea to create new features from the raw ones. With a physics analogy, it is as if we were separating the different time scales of our problem. First, let's take a look at this "mother" feature:

print(adv.Timestamp)

Our purpose now is to create three new features by splitting one column in three. If we split features and operate on them properly, we may get something useful.

The "Day of the week" variable contains values from 0 to 6, where each number represents a specific day of the week (from Monday to Sunday). These values are effectively categories, which may be useful to detect periodic patterns in our data. Hours will be categorical too. The date will remain in pandas time format.

We can proceed as follows:

1 convert the Timestamp column (string format) into pandas datetime format
2 through the dt accessor, split time and date, extracting the hour and the day of the week, and assign them to new columns
3 eliminate the Timestamp column: there is no information loss here
4 check the output

adv['Timestamp'] = pd.to_datetime(adv['Timestamp'])
adv_times = adv['Timestamp']
adv['Hour'] = adv_times.dt.hour
adv['Date'] = adv_times.dt.date
adv['WDay'] = adv_times.dt.weekday
adv['Click'] = target
adv = adv.drop(['Timestamp'], axis=1)
adv.head()

We can evaluate how much time our dataset spans and fix the timescale of our problem:

print('Train min/max date:', adv.Date.min(), adv.Date.max() )
print('Dataset time span: ',adv.Date.max()-adv.Date.min())

First, one very simple histogram: which day of the week carries more information? Is there a "special day"?
It does not seem so, but this kind of plot lets us look for periodic patterns in the data. We can do the same for the hours of the day.

ax = adv['WDay'].value_counts(sort=False).plot(kind='bar')
ax.set_xlabel("Day of the week")
ax.set_ylabel("Records")
plt.show()  # draw the first plot before starting the second one

ax = adv['Hour'].value_counts(sort=False).plot(kind='bar')
ax.set_xlabel("Hour of the day")
ax.set_ylabel("Records")

To obtain this result we never filtered the data or, equivalently, we didn't transform our features: they are still raw. Can we extract more information, as in the first feature-reduction case, by modifying the original data and applying some kind of filtering? If so, this may lead us to meaningful insights.
We can try to drop all the entries where the user did not click on the ad, in order to investigate whether there is a daily-based preference in clicking.

adv_time = pd.DataFrame()
adv_time['WDay'] = adv['WDay']
#adv_time['Hour'] = pd.to_datetime(adv['Hour'])
adv_time['Hour'] = adv['Hour']
adv_time['Click'] = target
adv_time.info()

Can we see whether there are hours of the day where we should do targeted advertisement? The idea here is to separate a variable of interest according to the target feature (ClickedonAd). This may not lead to a particular result, but it helps reinforce our intuition about the data. Now let's focus only on the time features: we can try to see whether the appreciation (or not) of our advertisement exhibits different patterns that we can exploit in some way. Here we use the mask concept: a mask can be thought of as a boolean grid that filters our dataframe according to a condition, and helps us split the data into more and more parts.

mask_one = (adv_time['Click'] == 1)
adv_time_one = adv_time[mask_one]
mask_zero = (adv_time['Click'] == 0)
adv_time_zero = adv_time[mask_zero]

Let’s see if there are some evident patterns in the features we found, with a simple plot

fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(221)
adv_time_one['WDay'].value_counts(sort=False).plot(kind='bar')
ax.set_xlabel("WDay people click")
ax.set_ylabel("Records")
ax = fig.add_subplot(222)
adv_time_zero['WDay'].value_counts(sort=False).plot(kind='bar')
ax.set_xlabel("WDay people DON'T click")

fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(221)
adv_time_one['Hour'].value_counts(sort=False).plot(kind='bar', title="Hour people click")
ax.set_xlabel("Hour people click")
ax.set_ylabel("Records")
ax = fig.add_subplot(222)
adv_time_zero['Hour'].value_counts(sort=False).plot(kind='bar', title="Hour people DON'T click")
ax.set_xlabel("Hour people DON'T click")

There may be patterns in the data, but it is not obvious how to proceed. It is not obvious which model we should choose, since there is no evident clustering or periodicity in the preference for clicking or not clicking.

With this little failure we face the need to introduce an automatic feature extraction mechanism, possibly designed according to some optimization principle. This is the basic idea behind machine learning or, more specifically, neural networks!

Conclusions:

In this post we introduced some good practices for an efficient EDA. If the number of features is not too large we can try to study small groups of features, according to their type. Once this is done we have a wide range of choices. With some patience we can investigate the basic correlations between our features: for a first step a basic scatter plot is enough. From the insights obtained there we can investigate the pairs of variables that exhibit particular patterns. If this is the case we can try to create new features as combinations of the raw ones and, if such a combination reasonably resembles the clustering properties of the pair, we can replace the old variables with the new one. This procedure is known as feature reduction. We can also get feature reduction by investigating the variability of the values that our features take: we can group those values in a histogram and, if it happens to be flat, consider the feature as pure noise and drop it.

We can follow the opposite practice too: feature augmentation. This procedure allows us to create new features from a raw one, as happened with the temporal data. From a yy/mm/dd hh:mm entry we created three new features, the hour, the day of the week and the date of the advertising-user interaction, starting from a physics-like reasoning. Unfortunately, further investigation showed that this procedure does not lead to clear patterns. This little failure is not purely a bad thing, and we can learn something important from it: if the data does not exhibit a clear clustering, we may imagine introducing an algorithm able to extract or design meaningful combinations of the features through an automatic procedure. That's why, with highly dimensional problems, we need to introduce machine learning.

I hope this article was useful for you. If you have any questions, do not hesitate to post a comment! Thanks!

References:

https://www.coursera.org/learn/competitive-data-science

https://stackabuse.com/predicting-customer-ad-clicks-via-machine-learning/

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, “Learning from Data”
