Doing EDA on a classification project? pandas.crosstab will change your life.

Exploratory Data Analysis, or EDA, makes up a good portion of a data scientist’s work. In fact, according to O’Reilly’s 2017 data science survey, basic EDA is the data scientist’s most common task. And there’s a good reason for that. Barring sheer luck, your most intricate, robust models won’t do very much if you don’t understand your data. They certainly won’t perform as well as they could.

An understanding of your data, its characteristics, and its distributions is vital to any successful data science task, whether it’s inference or prediction. And contrary to what you might expect, the reason for EDA’s importance is not technical, and has nothing to do with programming. It’s the thing that separates a mediocre data scientist from a great one: decisions.

As a data scientist, you program. You use statistics. You build models. But the most important, and probably the most difficult, thing you do is make lots of choices. And your choices have consequences. Sometimes large consequences. So you do the best you can to make responsible, informed choices, and EDA is the best time you have to inform them. Good EDA is what separates working blind and hoping for the best from making deliberate decisions to achieve your goals.

Okay, you get the point: EDA is important. Now, on to the magic of pd.crosstab().

Recently, I did a project using the Bank Marketing Data Set from the UCI Machine Learning Repository. Here’s what it looks like:
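If you want to follow along, loading the data looks something like this. A minimal sketch: the filename assumes you’ve downloaded the ‘bank-additional’ version of the data set into your working directory, and the UCI files are semicolon-delimited rather than comma-delimited.

```python
import pandas as pd

# Assumed filename: the 'bank-additional' version of the UCI
# Bank Marketing Data Set, saved in the working directory.
# The UCI files use semicolons as separators, not commas.
bank = pd.read_csv('bank-additional-full.csv', sep=';')

# A quick look at the shape and the first few rows.
print(bank.shape)
print(bank.head())
```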

Each observation is a potential customer of the bank, and the ‘y’ variable is whether or not they subscribed to a new term deposit. As you can see, the data includes personal information about each customer, as well as information about the bank’s previous efforts in marketing to that client. It’s a relatively clean data set, and quite a fun project. The primary challenge is dealing with the unbalanced classes, since only about 11% of customers subscribed to the term deposit. Like I said, it’s a fun project and well worth a look. But back to the EDA.

One of the first things to decide is which variables to include in the model. As an example, let’s look at the ‘job’ column. It lists the field or position that each customer has. There are twelve types of jobs that occur in this data set. To decide if you want to use the job information in your model, you have to determine if the job that a person has is correlated with an increased likelihood to subscribe to the deposit. To look at the question in the extreme, if every single ‘technician’ in the data set subscribed to the deposit, I’m going to assume that technicians are more likely to subscribe, and I’m going to use that information to make predictions. On the other hand, if each type of job has the same proportion of people who subscribed to the deposit, then the job information won’t be helpful in making predictions.
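You can see those job types for yourself with a quick value_counts (assuming the bank DataFrame loaded above):

```python
# List every job type and how many customers hold it.
print(bank['job'].value_counts())

# Confirm the number of distinct job types.
print(bank['job'].nunique())
```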

So how can you figure out the percentage of people who subscribed for each job type? Here’s the approach I used to use: bank.groupby('job').y.value_counts(), which groups the observations by job type and counts the occurrences of each value of ‘y’ within each group.
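As a runnable snippet, that’s:

```python
# Group the observations by job type, then count how many 'no'
# and 'yes' values of y fall in each group.
counts = bank.groupby('job').y.value_counts()
print(counts)
```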

That returns a Series with a two-level index: each job type, with the counts of ‘no’ and ‘yes’ nested under it.

That gives us the answer we need, but it looks pretty messy. What if we could turn the values of ‘y’ into a table? And… here comes pd.crosstab(bank.job, bank.y).
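Here’s that call as a snippet (nothing assumed beyond the bank DataFrame from earlier):

```python
# Rows are job types, columns are the values of y ('no'/'yes'),
# and each cell is a count of customers.
table = pd.crosstab(bank.job, bank.y)
print(table)
```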

Now that’s much prettier.

Okay great, so we have the table. We can eyeball it and figure out that about 10% of housemaids and service workers subscribe to the deposit. And it’s clear that a much higher percentage of retired people subscribe to the deposit. So we can get an idea of the distributions... But we want more. We want exact percentages of subscription for each job type.

pd.crosstab() has you covered once more. Just add the ‘normalize’ parameter, like this: pd.crosstab(bank.job, bank.y, normalize='index').
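Here it is as a snippet, with one small extra beyond the original: sorting by the ‘yes’ column to rank job types by subscription rate.

```python
# normalize='index' divides each cell by its row total, so each
# row sums to 1.0: the fraction of each job type that subscribed.
rates = pd.crosstab(bank.job, bank.y, normalize='index')
print(rates)

# Extra step: sort descending by 'yes' to rank the job types.
print(rates.sort_values('yes', ascending=False))
```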

Looking good!

Perfect. A percentage breakdown of the target variable for each job type. Exactly what we need. Now it’s clear that job type is probably relevant information for our models.

You can run this function on any categorical variable in a classification task. It’s the perfect method for assessing how the target breaks down across each category. Using this function and this function alone, you can get a significant amount of insight into a data set. Use it wisely, and use it well!
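To run that breakdown over every categorical column at once, a simple loop does the trick. A sketch, assuming the categorical columns came in as object dtype (which is what read_csv gives you for strings) and excluding the target ‘y’ itself:

```python
# Infer the categorical columns from their dtype, leaving out
# the target variable itself.
categorical_cols = bank.select_dtypes(include='object').columns.drop('y')

for col in categorical_cols:
    print(f"\n--- {col} ---")
    # Row-normalized crosstab: subscription rate per category.
    print(pd.crosstab(bank[col], bank.y, normalize='index'))
```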