Distribution visualizations for data science

Milo · Published in spikelab · Apr 16, 2021

As data scientists, we have to decide which features to use in a given machine learning problem. We often take every feature we have available, create others from lag history or intuitive combinations of raw variables, and feed all of them to the model, only to check after training what each feature contributed (with some variable-importance metric) and decide whether it was a good idea to have those features available in the first place.

In this article, I will show some methods to visualize the dependence of the response variable on the other features in our dataset, to see whether they are of value to our specific machine learning algorithm and, in some cases, to impose monotone constraints on our model or gain preliminary insights into the data distribution.

All the plots, examples, and visualization methods shown in this article are available for you to use in this package, which can be installed using pip:

Histograms of features partitioned by the response variable

For a binary classification problem, there is a natural binary partition of our dataset: the subset where y=1 and the subset where y=0. With this in mind, for every feature F we have its distribution on each of these subsets, which in mathematical notation we can write as:

D_1(x) = P(F = x | y = 1) and D_2(x) = P(F = x | y = 0), the conditional densities of the feature F given the response variable

If we plot these two densities together in the same figure, we can see whether the feature F changes in a relevant way when the response variable changes (that is, when we switch partitions).
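As a minimal sketch of this idea with plain pandas and plotly (on toy data with hypothetical column names, not using the package itself):

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go

# Toy data: a binary response y and one numerical feature F (hypothetical names)
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.integers(0, 2, size=2000)})
df["F"] = rng.normal(loc=df["y"], scale=1.0)  # feature shifted by the response

# Conditional samples: F | y=1 and F | y=0
f1 = df.loc[df["y"] == 1, "F"]
f0 = df.loc[df["y"] == 0, "F"]

# Overlaid normalized histograms approximate the densities D_1 and D_2
fig = go.Figure()
fig.add_trace(go.Histogram(x=f1, histnorm="probability density",
                           name="D_1: F | y=1", opacity=0.6))
fig.add_trace(go.Histogram(x=f0, histnorm="probability density",
                           name="D_2: F | y=0", opacity=0.6))
fig.update_layout(barmode="overlay", xaxis_title="F", yaxis_title="density")
fig.show()
```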

If the feature F is not relevant, we will see a plot like this (the second graph shown is the cumulative distribution):

We see that there is no significant difference between distributions D1 and D2

If the feature F is relevant we may see a plot like this:

We see that at many points D1 is smaller than D2, so an algorithm whose task is to classify the response variable will probably use this feature (in combination with some other relevant features)

or like this:

In this case, the feature F seems to be almost perfectly correlated with the response variable

This last case is rarely seen in real data analysis unless we have some kind of data leakage. Note that a simple step-function classifier is almost perfect if this kind of feature is available in our dataset.

Can we measure this relevance quantitatively?

The features shown above are sorted by a kind of relevance to our machine learning objective, and it is not so crazy to try to quantify this relevance using some discrepancy measure between the two distributions. For example, we can use:

  • If M1 is the mean of D1 and M2 is the mean of D2, we can compute
    |M1-M2| as a measure of relevance
  • We can compute the KL-divergence between D1 and D2
  • We can compute the Wasserstein distance between D1 and D2

As a maths person, I personally prefer the third option because of its robustness properties. The Wasserstein distance is a proper mathematical distance metric and has a nice, plug-and-play implementation in scipy.
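As a quick sketch of these three measures with numpy/scipy (the histogram-based KL estimate is only one simple way to approximate it from samples; the data and names below are illustrative):

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

# f1, f0: samples of the feature F conditioned on y=1 and y=0 (toy data)
rng = np.random.default_rng(0)
f1 = rng.normal(0.5, 1.0, size=1000)
f0 = rng.normal(0.0, 1.0, size=1000)

# 1) Absolute difference of the means
mean_diff = abs(f1.mean() - f0.mean())

# 2) KL-divergence between histogram estimates of D1 and D2
#    (shared bins; a small epsilon avoids empty-bin divisions by zero)
bins = np.histogram_bin_edges(np.concatenate([f1, f0]), bins=30)
p1, _ = np.histogram(f1, bins=bins, density=True)
p0, _ = np.histogram(f0, bins=bins, density=True)
kl = entropy(p1 + 1e-9, p0 + 1e-9)

# 3) Wasserstein (earth mover's) distance, computed directly on the samples
wd = wasserstein_distance(f1, f0)

print(mean_diff, kl, wd)
```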

So if we have a list of features F_j, we can sort them by their relevances R_j and then plot the histograms (as in the examples above) to gain more insight into the dependence between the features and the response variable.

How we compute the relevance

But there is a small catch: since the features F_j may have different scales, we need to perform a min-max normalization so that the R_j are comparable.
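A rough sketch of this relevance computation (my own pandas/scipy version with hypothetical names, not the package's implementation):

```python
import pandas as pd
from scipy.stats import wasserstein_distance

def feature_relevances(df: pd.DataFrame, features, target: str = "y") -> pd.Series:
    """Wasserstein distance between F | y=1 and F | y=0 for each min-max-scaled feature."""
    relevances = {}
    for f in features:
        col = df[f].astype(float)
        # Min-max normalization so the distances are comparable across features
        scaled = (col - col.min()) / (col.max() - col.min())
        d1 = scaled[df[target] == 1].dropna()
        d2 = scaled[df[target] == 0].dropna()
        relevances[f] = wasserstein_distance(d1, d2)
    # Sorted so the most relevant features come first
    return pd.Series(relevances).sort_values(ascending=False)
```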

The examples shown above have relevance 0.005, 0.11, and 0.86 respectively.

What about categorical variables?

For a categorical feature F, we follow the same logic, with the difference that the relevance metric has to be computed in a different way.
In this case, we use the total variation distance, which is essentially the L1 distance between the category weights of the two distributions (up to the usual factor of 1/2). The formula looks like this:

TV(D_1, D_2) = 1/2 · Σ_c |D_1(c) − D_2(c)|, where the sum runs over the categories c of F

Example of histograms of a categorical feature with relevance = 0.17. Note: cat_1, cat_2, … are the categories of the categorical feature F
Example of histograms of a categorical feature with relevance = 0.92
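A minimal sketch of this total variation computation with pandas (hypothetical column names, not the package's API):

```python
import pandas as pd

def total_variation(df: pd.DataFrame, feature: str, target: str = "y") -> float:
    """Total variation distance between the category weights of F | y=1 and F | y=0."""
    w1 = df.loc[df[target] == 1, feature].value_counts(normalize=True)
    w0 = df.loc[df[target] == 0, feature].value_counts(normalize=True)
    # Align on the union of categories, filling missing ones with weight 0
    w1, w0 = w1.align(w0, fill_value=0.0)
    return 0.5 * (w1 - w0).abs().sum()
```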

An example with real data

We use the stroke dataset from Kaggle to demonstrate these methods.

This dataset is used to predict whether a patient is likely to get a stroke based on the input parameters like gender, age, various diseases, and smoking status.

Numerical Features

Let’s see what we get if we take the 3 numerical features age, avg_glucose_level, and bmi.

Using this line of code:

We get the plots (sorted by relevance) and a dataframe with those relevances (distances):

distance or relevance of the features

First, we get age. It is very clear that the older a person is, the more likely they are to have a stroke:

From the cumulative distribution plot, we can quickly get insights about these two populations using the interactivity of the plotly library.

Only ~9% (blue bullet marker) of people who had a stroke are younger than 50 years (black bullet marker)

Then we get avg_glucose_level:

~68% of people who had a stroke have an average glucose level below ~178 (vs ~90% for those who did not have a stroke)

And finally, we get bmi:

Only ~3% of people who had a stroke have a bmi below ~21 (vs ~14% for those who did not have a stroke)

Categorical Features

Analogously to the numerical feature case, we can use this piece of code:

And get the relevance dataframe:

We see that ever_married is the most relevant feature, probably due to a high correlation with the age feature. We can actually see this correlation using this line of code:

The difference in the age distribution between the ever-married and never-married populations

For the sake of completeness, we show the histograms of the second most relevant and the least relevant categorical features (hypertension and gender, respectively)

It’s clear that people with hypertension are more likely to have a stroke: only ~9% of people who didn't have a stroke have hypertension, vs ~26% for people who had one
There is a small difference in gender: the male proportion seems to be slightly higher among people who had a stroke

Finally, we demonstrate two slightly different uses of this tool.

Firstly, we are going to see the differences in bmi distributions for different work types using this line of code:

Secondly, we can compare distributions using different condition queries encoded in a query dictionary. Here we are going to compare the avg_glucose_level for different ages:
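The idea can be sketched with plain pandas and plotly, using a dictionary that maps labels to pandas query strings (the queries, column names, and file path below are illustrative assumptions, not the package's API):

```python
import pandas as pd
import plotly.graph_objects as go

# Kaggle stroke dataset (file name/path assumed; adjust as needed)
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Hypothetical query dictionary: label -> pandas query string
queries = {
    "age < 30": "age < 30",
    "30 <= age < 60": "age >= 30 and age < 60",
    "age >= 60": "age >= 60",
}

fig = go.Figure()
for label, q in queries.items():
    sub = df.query(q)  # sub-population defined by the condition query
    fig.add_trace(go.Histogram(x=sub["avg_glucose_level"],
                               histnorm="probability density",
                               name=label, opacity=0.6))
fig.update_layout(barmode="overlay", xaxis_title="avg_glucose_level")
fig.show()
```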

Now we are going to jump to a slightly different data visualization technique.

Quantile and mean dependence plots

When the target is numerical (regression)

Suppose we have a regression task with target variable y, and we want to see how y depends on some feature F. To do this, we bin the feature F into bins B_i, compute a statistic (like the mean, the median, or some other quantile) of y on every bin, and then plot the result to see that dependency. We will write y[B_i] for the statistic computed on the bin B_i:

y[B_i] = (1 / |B_i|) · Σ_{k : F_k ∈ B_i} y_k (example: the mean statistic on bin B_i, where |B_i| is the number of data points whose value of F falls in B_i)

By plotting the function y[B_i] we can gain insights into that dependency:

Example of this visualization for the mean, the 10% and 90% quantiles, and the median. We see a clear dependency on the feature, with the variance getting bigger as the feature increases. We can also see that the distribution of y is very symmetrical, because the mean and the median are very close
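Here is my own rough pandas version of this binning-and-statistics step on synthetic data (not the package's code used for the figure; the names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy regression data whose variance grows with the feature
rng = np.random.default_rng(0)
F = rng.uniform(0, 10, size=5000)
y = F + rng.normal(0, 0.2 + 0.3 * F)
data = pd.DataFrame({"F": F, "y": y})

# Bin the feature and compute per-bin statistics of y
data["bin"] = pd.cut(data["F"], bins=20)
stats = data.groupby("bin", observed=True)["y"].agg(
    mean="mean",
    median="median",
    q10=lambda s: s.quantile(0.10),
    q90=lambda s: s.quantile(0.90),
)
print(stats)  # one row per bin B_i, ready to plot against the bin midpoints
```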

The example plot shown above was created using these lines of code:

For a real data example, we explore the relationship between age and bmi in the stroke dataset using this line of code:

We see that bmi increases with age over almost the whole range, except for the last bin of 60 to 100 years.

Mathematically/statistically, what we are trying to estimate here are statistics under the conditional probability P(· | F = x) for different values x of the feature F. For example, in the case of the mean statistic, what we are trying to estimate is the function shown below:

x ↦ E[y | F = x], the conditional expectation of y given F = x

This method is very similar to the partial dependence plots (PDPs) used for interpretability of machine learning algorithms, with the difference that in PDPs we take statistics over a model's output and not directly over the observed data distribution. For more details, see this article.

For a categorical feature, we follow the same logic, with the difference that we don't need to bin the feature and we use bar plots instead of lines.

With the next line of code, we can see the relationship between work_type and bmi:

When the target is categorical (classification)

For a classification task (with target variable y), what we do is look, in each bin, at the proportion of the data points that have a particular target class c. So for the mean statistic, we compute the following value:

y[B_i] = |{k : F_k ∈ B_i, y_k = c}| / |B_i|, the proportion of data points in B_i whose target is y = c

In this case, to get a notion of a confidence interval for the proportion computed above, we need to do a little more work than in the regression case (where we only needed to compute some lower and upper quantiles). To get this confidence interval, we model the proportion statistic as a Binomial random variable (where n = |B_i| and p = y[B_i] in the formula shown above) and then estimate a confidence interval for the proportion using this method, which gives broader intervals when n is smaller (less confidence in the estimation of the proportion).
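One common way to get such an interval is a per-bin binomial confidence interval; here is a sketch using the Wilson score interval from statsmodels (the exact method the package uses may differ, and the file path and column names for the stroke dataset are assumptions):

```python
import pandas as pd
from statsmodels.stats.proportion import proportion_confint

# Kaggle stroke dataset (file name/path assumed); 'stroke' is the 0/1 target
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Per-bin proportion of positives with a 95% confidence interval
df["age_bin"] = pd.cut(df["age"], bins=10)
grouped = df.groupby("age_bin", observed=True)["stroke"]

counts = grouped.sum()   # number of data points with y = 1 in each bin
nobs = grouped.count()   # bin sizes |B_i|

lower, upper = proportion_confint(counts, nobs, alpha=0.05, method="wilson")
result = pd.DataFrame({"p": counts / nobs, "lower": lower, "upper": upper})
print(result)
```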

Let's see this method in practice with our stroke dataset.

We first explore the relationship between age and the stroke categorical target using these lines of code (we set a confidence level of 0.95):

We get this plot:

The proportions of people with stroke history for each age bin.

Similarly, for the categorical feature work_type, we can use these lines:

Conclusion

Data visualization is a very important tool for data scientists and data practitioners in general. It can help us understand the behavior of the system we are dealing with, gain quick insights for data-driven storytelling, debug our data preprocessing pipelines and models, and impose constraints and prior knowledge on our ML models.

In this article, we went in depth on some easy-to-use methods to visualize, interact with, and understand some of the distributions and dependencies of our variables, and ultimately to decide which variables we should pick for our model.
