Important Machine Learning Data Visualization Tools

A guide to understanding what your model is telling you.

Nabil M Abbas
Sep 24, 2019 · 6 min read

Seaborn and Matplotlib

Two important packages you will be using during your data science journey are Seaborn and Matplotlib. They are both very useful tools for Exploratory Data Analysis and creating Data Visualizations.

I recently completed a project in which I explored Spotify’s API to build a classification machine learning model. The models I implemented used a number of features that ultimately had an impact on their ability to interpret my data.

The first step of the project (and frankly of any project you may find yourself doing) was exploratory data analysis, otherwise known as EDA. Before running any models, I needed to understand my data. What did my data mean? Were any of the features interacting with one another in unique ways? How were my features distributed? Which features might be redundant? Which data needed to be converted to categorical values?

These are all important questions Data Scientists need to be able to answer about their project data. It is naive to assume your data is perfect and ready to be thrown into a model if you have not done any exploratory data analysis yet. But that’s why we’re here!

Data Visualization tools help us understand our data before and after we run it through a model, giving us insight as to what changes may need to be made going forward.

Without further ado, here are some of the code snippets and visualizations from my recent Spotify project that I used to validate my analyses. I’ve provided the code to create each visualization along with the output it produced while I was working on the project. Much of the code is credited to Flatiron School staff and fellow classmates, who were very collaborative in assisting me with my project.

Double Bar Graphs For Categorical Data

This is really helpful for understanding how a binary dependent variable is distributed across a binary categorical feature. When I revisit this project in the future, I will likely consider plotting all the time signatures in one bar graph to compare how each of them is distributed across the “speechy” and musical categories.

Blue indicates a binary value of 1 and orange a binary value of 0; the blue bars are data with time signature 5, and the orange bars are data without it.
import numpy as np
import matplotlib.pyplot as plt

xs1_5 = len(speechy[speechy['time_signature_4'] == 1])  # time signature is 5
xs0_5 = len(speechy[speechy['time_signature_4'] == 0])  # time signature is NOT 5
xm1_5 = len(musical[musical['time_signature_4'] == 1])  # time signature is 5
xm0_5 = len(musical[musical['time_signature_4'] == 0])  # time signature is NOT 5

X = ['Speechy', 'Musical']
Y = [xs1_5, xm1_5]  # counts where the flag is 1
Z = [xs0_5, xm0_5]  # counts where the flag is 0

_X = np.arange(len(X))
plt.bar(_X - 0.2, Y, 0.4)
plt.bar(_X + 0.2, Z, 0.4)
plt.xticks(_X, X)  # set labels manually
plt.ylabel('Time_Signature_4')
plt.title('Blue = 1  Orange = 0')
plt.show()

Feature Importance

This visualization is very helpful because it ranks all the features that were run through your model by how relevant they were in determining your predictions. It comes after you run your model, which makes sense: you need a model to take in the features before you can see how they perform.

My code is in need of updating, but I found this terrific resource with code provided by scikit-learn.org to plot your feature importances!

https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
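In the spirit of that scikit-learn example, here is a minimal sketch of a feature-importance bar plot. The data set and model below are illustrative stand-ins, not the ones from my project:

```python
# Minimal sketch: rank features by a random forest's importances.
# The synthetic data set here is purely illustrative.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = clf.feature_importances_
order = np.argsort(importances)[::-1]  # most important feature first

plt.bar(range(X.shape[1]), importances[order])
plt.xticks(range(X.shape[1]), order)  # label bars with feature indices
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()
```

Random forest importances always sum to 1, so the bar heights can be read as each feature’s share of the model’s total splitting power.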

Confusion Matrix

Confusion matrices help because they visualize your predictions against your actual values. Within the matrix you can see the count of your True Positives, True Negatives, False Positives and False Negatives.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, normalize=False, title=None, cmap=plt.cm.Reds):
    labels = ['Speechy', 'Musical']
    # Compute confusion matrix
    cmat = pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
    print(cmat)
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=labels, yticklabels=labels,
           title=title,
           ylabel='True Classification',
           xlabel='Predicted Classification')
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

np.set_printoptions(precision=1)
# Plot non-normalized confusion matrix
plt.show()

Random Forest Trees

Although I didn’t create a random forest visualization for my model, I found a very helpful author who provided code and a visualization for a random forest classifier on the classic iris data set. Have a look!

https://towardsdatascience.com/how-to-visualize-a-decision-tree-from-a-random-forest-in-python-using-scikit-learn-38ad2d75f21c
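The linked article uses export_graphviz plus the graphviz tool for high-quality output; as a quicker sketch, scikit-learn also ships a built-in plot_tree that can draw a single tree pulled out of the forest, shown here on the same iris data set:

```python
# Quick sketch: visualize one estimator from a random forest fit on iris,
# using scikit-learn's built-in plot_tree (simpler than export_graphviz,
# at the cost of rendering quality).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

iris = load_iris()
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(iris.data, iris.target)

# Pick a single tree out of the forest and draw it
plt.figure(figsize=(12, 6))
plot_tree(rf.estimators_[0],
          feature_names=iris.feature_names,
          class_names=list(iris.target_names),
          filled=True)
plt.show()
```

Capping max_depth keeps the drawing legible; a full-depth tree from a forest is usually too bushy to read.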

ROC and AUC

ROC stands for “Receiver Operating Characteristic.” AUC stands for “Area Under the Curve.” ROC curves plot the true positive rate against the false positive rate of a classifier model. Ideally, a good ROC curve should “hug” the upper left portion of the plot, as such:

If you’re interested in the math behind true positive rate and false positive rate, have a look:

True Positive Rate
False Positive Rate
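Both rates fall straight out of the confusion-matrix counts: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A small sketch with toy labels (illustrative only):

```python
# True positive rate (sensitivity) and false positive rate (1 - specificity)
# computed from confusion-matrix counts. Toy labels for illustration only.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # share of actual positives correctly flagged
fpr = fp / (fp + tn)  # share of actual negatives wrongly flagged
print(tpr, fpr)  # → 0.75 0.25
```

An ROC curve is simply this pair of numbers recomputed at every probability threshold the classifier could use.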

As stated earlier, the curve should hug the top left of the chart, maximizing the AUC (area under the curve); the closer to 1, the better. I created an ROC curve for my random forest classifier, as shown in the following figure. There is additional EDA that needs to be done to optimize the model and hence maximize the AUC, but the main takeaway is that this is what an ROC curve looks like. The AUC gives you a single number summarizing how well your model trades off true positives against false positives across all thresholds.

AUC (pred_proba): 0.7108428446005266
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def ROC_func(clf, X_test, y_test):
    y_pred_prob = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    plt.figure(figsize=(7, 7))
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.title(f'ROC Curve - {clf.__class__.__name__}')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    print('AUC (pred_proba): ', roc_auc_score(y_test, y_pred_prob))
    plt.show()
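ROC_func works with any fitted classifier that exposes predict_proba. A self-contained usage sketch on synthetic data (the data set and split below are illustrative, not my Spotify data), using the same roc_curve / roc_auc_score calls:

```python
# Usage sketch: fit a random forest on synthetic data and plot its ROC
# curve, mirroring ROC_func above. Data is illustrative only.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred_prob = clf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_pred_prob):.3f}')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line (AUC = 0.5)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```

The dashed diagonal is the no-skill baseline; any useful classifier’s curve sits above it.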

Kernel Density Estimation (KDE)

KDE plots are very helpful for understanding whether or not your data is normally distributed. Within the scope of my project, I was able to view and understand how valence was distributed for the musical and “speechy” category data.

Distribution of Valence for Musical and Speechy Data
import seaborn as sns

sns.kdeplot(df2.loc[df2['SPEECH_musical'] == 1, 'valence'], shade=True, label="musical")
sns.kdeplot(df2.loc[df2['SPEECH_musical'] == 0, 'valence'], shade=True, label="speechy")

Additional Visualizations for Data Exploration

There are so many other tools available to help data scientists visualize their data. I found the seaborn website very helpful for creating some of these visualizations while I was working on my project.

https://seaborn.pydata.org/tutorial/distributions.html

These tools all come together to communicate your analysis findings in a way that can be understood. Mastery of data visualization is a must-have skill for aspiring data scientists.


Nabil M Abbas

Written by

Data Scientist, with a background in Mechanical Engineering from NYU. Interests include sports, mental health, humanitarian support and tech news.

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +785K followers.

