Machine Learning Visualizations with Yellowbrick

An open-source Python toolkit that accelerates model selection with visual analysis and diagnostic tools.

Hritik Bhandari
DataX Journal
8 min read · Apr 24, 2020

Every now and then we come across Python packages designed to simplify our tasks and help us model ML algorithms, but not all of them manage to earn a place in our regular machine learning workflow. Yellowbrick is one such library that I came across about a month ago, and it has certainly earned its place in my ML toolkit. In this blog, we’ll see what Yellowbrick is and how it can improve our model understanding and make the model selection process easier.

Introduction

Yellowbrick is an open-source Python project that wraps the scikit-learn and matplotlib APIs to create publication-ready figures and interactive data explorations. It is essentially a diagnostic visualization platform for machine learning that lets us steer the model selection process by helping to evaluate the performance, stability, and predictive value of our models, and assists in diagnosing problems in our workflow.

Installation

The simplest way to install Yellowbrick is from PyPI with pip, Python’s preferred package installer.
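```
pip install yellowbrick
```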

To upgrade Yellowbrick to the latest version, use pip with the -U flag:
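```
pip install -U yellowbrick
```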

Fun-fact

The Yellowbrick package gets its name from the fictional element in the 1900 novel The Wonderful Wizard of Oz. In the book, the yellow brick road is the path that the protagonist must travel in order to reach her destination in the Emerald City.

[Figure: the Model Selection Triple]

Using Yellowbrick

The primary interface in Yellowbrick is a Visualizer: an object that learns from data to produce a visualization. Visualizers are scikit-learn Estimator objects with a similar interface, plus methods for drawing. To use a visualizer, follow the same workflow as with a scikit-learn model: import the visualizer, instantiate it, call its fit() method (some visualizers also use transform() or score()), and then call its show() method to render the figure.
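As a minimal sketch of that workflow, using the ParallelCoordinates feature visualizer and scikit-learn’s iris dataset purely as placeholders for your own visualizer and data (this particular visualizer uses fit_transform() to fit and draw in one step):

```
from sklearn.datasets import load_iris
from yellowbrick.features import ParallelCoordinates

# Placeholder data; substitute your own feature matrix X and target y
X, y = load_iris(return_X_y=True)

visualizer = ParallelCoordinates()   # import and instantiate the visualizer
visualizer.fit_transform(X, y)       # fit it to the data and draw
visualizer.show()                    # finalize and render the figure
```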

Some of the popular visualizers are:

  • Feature Analysis Visualizers
  • Target Visualizers
  • Regression Visualizers
  • Classification Visualizers
  • Clustering Visualizers
  • Model Selection Visualizers
  • Text Visualizers

We’ll code and implement all of these visualizers one by one:

1. Feature Analysis Visualizers

Feature analysis visualizers are used to detect features or targets that might impact downstream fitting. Here, we’ll use the Rank1D and Rank2D visualizers to evaluate single features and pairs of features using a variety of metrics that score the features on a scale of [-1, 1] or [0, 1].

A one-dimensional ranking of features (Rank1D) utilizes a ranking algorithm that takes into account only a single feature at a time.
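A minimal sketch of Rank1D, assuming the default Shapiro-Wilk ranking algorithm and using scikit-learn’s breast-cancer dataset as a stand-in for your own features:

```
from sklearn.datasets import load_breast_cancer
from yellowbrick.features import Rank1D

# Placeholder dataset; use your own feature matrix and target
X, y = load_breast_cancer(return_X_y=True)

# Rank each feature on its own with the Shapiro-Wilk algorithm
visualizer = Rank1D(algorithm='shapiro')
visualizer.fit(X, y)        # fit the data to the visualizer
visualizer.transform(X)     # transform the data and draw the ranks
visualizer.show()           # finalize and render the figure
```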

Rank 1D Visualizer

A two-dimensional ranking of features (Rank2D) utilizes a ranking algorithm that takes into account pairs of features at a time.
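And a corresponding sketch of Rank2D, assuming the Pearson correlation ranking (covariance is another option):

```
from sklearn.datasets import load_breast_cancer
from yellowbrick.features import Rank2D

X, y = load_breast_cancer(return_X_y=True)

# Rank pairs of features by their Pearson correlation
visualizer = Rank2D(algorithm='pearson')
visualizer.fit(X, y)
visualizer.transform(X)
visualizer.show()
```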

Rank 2D Visualizer

2. Target Visualizers

These visualizers specialize in visually describing the dependent variable for supervised modeling, often referred to as y or the target. Here, we’ll look at the ClassBalance visualizer. Class imbalance in the training data is one of the biggest challenges for classification models, and before we start dealing with it, it is important to understand the class balance of the training data.

The ClassBalance visualizer supports this by creating a bar chart of the support for each class, which is the frequency of the classes’ representation in the dataset.
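A minimal sketch, again using a scikit-learn toy dataset as a stand-in for your own labels:

```
from sklearn.datasets import load_breast_cancer
from yellowbrick.target import ClassBalance

# Placeholder labels; use your own target vector here
data = load_breast_cancer()
y = data.target

# Draw a bar chart of the support (frequency) of each class
visualizer = ClassBalance(labels=list(data.target_names))
visualizer.fit(y)
visualizer.show()
```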

Class Balance Visualizer

3. Regression Visualizer

Regression models attempt to predict a target in a continuous space. Regressor score visualizers display the instances in model space to help us better understand how the model is making predictions. In this blog, we’ll look at the Prediction Error Plot which plots the expected vs. actual values in model space.
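A minimal sketch of the prediction error plot, with scikit-learn’s diabetes dataset and a Lasso regressor as placeholders for your own data and model:

```
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import PredictionError

# Placeholder regression dataset and model
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

visualizer = PredictionError(Lasso())
visualizer.fit(X_train, y_train)   # fit the wrapped model on the training split
visualizer.score(X_test, y_test)   # predict on the test split and draw the plot
visualizer.show()                  # finalize and render the figure
```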

Prediction Error Plot

4. Classification Visualizer

Classification models attempt to predict a target in a discrete space, that is, to assign an instance of the dependent variable to one or more categories. Classification score visualizers display the differences between classes as well as a number of classifier-specific visual evaluations.

We’ll look at the ConfusionMatrix visualizer via the quick method confusion_matrix, which builds the ConfusionMatrix object with the associated arguments, fits it, and renders it.
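A minimal sketch of the quick method (assuming Yellowbrick 1.x, with the iris dataset and a logistic regression as placeholders):

```
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import confusion_matrix

# Placeholder classification dataset and model
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The quick method builds the ConfusionMatrix, fits it on the training
# split, scores it on the test split, and renders the figure in one call
visualizer = confusion_matrix(
    LogisticRegression(max_iter=1000),
    X_train, y_train, X_test, y_test,
)
```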

Confusion Matrix Visualizer

5. Clustering Visualizer

Clustering models are unsupervised methods that attempt to detect patterns in unlabeled data. Yellowbrick provides the yellowbrick.cluster module to visualize and evaluate clustering behavior.

The KElbowVisualizer helps us select the optimal number of clusters by fitting the model with a range of values for k. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point. In this visualizer, the elbow is annotated with a dashed line.
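A minimal sketch of the elbow method with KMeans, using synthetic blobs as a placeholder dataset:

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer

# Synthetic, unlabeled placeholder data
X, _ = make_blobs(n_samples=1000, centers=6, random_state=42)

# Fit KMeans for k = 2..11 and annotate the elbow with a dashed line
visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2, 12))
visualizer.fit(X)
visualizer.show()
```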

K elbow Visualizer

6. Model Selection Visualizers

The yellowbrick.model_selection package provides us with visualizers for inspecting the performance of cross-validation and hyperparameter tuning. Many visualizers wrap functionality found in sklearn.model_selection and others build upon it for performing multi-model comparisons.

Model validation is used to determine how effective an estimator is on the data it has been trained on, as well as how well it generalizes to new input. To measure a model’s performance, we first split the dataset into training and test splits, fit the model on the training data, and score it on the reserved test data. The model’s hyperparameters must then be selected so that the model operates best in the specified feature space and maximizes its score.

In this example, we’ll use the ValidationCurve visualizer with a regression dataset.
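A minimal sketch, assuming a decision tree regressor on scikit-learn’s diabetes dataset and sweeping its max_depth hyperparameter:

```
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from yellowbrick.model_selection import ValidationCurve

# Placeholder regression dataset
X, y = load_diabetes(return_X_y=True)

# Score the model over a range of max_depth values with 10-fold cross-validation
visualizer = ValidationCurve(
    DecisionTreeRegressor(),
    param_name="max_depth",
    param_range=np.arange(1, 11),
    cv=10,
    scoring="r2",
)
visualizer.fit(X, y)
visualizer.show()
```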

Validation Curve

7. Text Visualizers

The last visualizer that we’ll check out in this blog is from the yellowbrick.text module for text-specific visualizers. The TextVisualizer class specifically deals with datasets that are corpora rather than simple numeric arrays or DataFrames, providing utilities for analyzing word dispersion and distribution, showing document similarity, or simply wrapping some of the other standard visualizers with text-specific display properties. Here, we’ll check out the FreqDistVisualizer for plotting the token frequency distribution.
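A minimal sketch, using a tiny made-up corpus and scikit-learn’s CountVectorizer to produce the token counts the visualizer expects:

```
from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer

# A tiny placeholder corpus; substitute your own documents
corpus = [
    "the yellow brick road winds through the emerald city",
    "machine learning models benefit from visual diagnostics",
    "token frequency plots help you understand a corpus",
]

# Vectorize the corpus into token counts
vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(corpus)
features = list(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn

# Plot the frequency distribution of the most common tokens
visualizer = FreqDistVisualizer(features=features, n=10)
visualizer.fit(docs)
visualizer.show()
```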

Frequency Visualizer

Conclusion

With this, we have covered some of the commonly used visualizers from Yellowbrick that can prove to be very useful. But this is not the limit of Yellowbrick: it offers many more visualizers in each category, such as RadViz, PCA Projection, Feature Correlation, Residual Plot, Cross-Validation Score, and more, which are equally useful and convenient. You can check them out in the Yellowbrick documentation.

I hope this blog was useful in introducing you to the Yellowbrick library, and that you will now be able to incorporate it into your ML workflow.

For the complete documentation of Yellowbrick, see https://www.scikit-yb.org.
