Machine Learning Visualizations with Yellowbrick
An open-source Python toolkit that accelerates model selection with visual analysis and diagnostic tools.
Every now and then we come across Python packages designed to simplify our tasks and help us model ML algorithms, but not all of them earn a place in our regular machine learning workflow. Yellowbrick is one library that I came across about a month ago, and it has certainly earned its place in my ML toolkit. In this blog, we’ll see what Yellowbrick is and how it can improve our model understanding and make the model selection process easier.
Introduction
Yellowbrick is an open-source Python project that wraps the scikit-learn and matplotlib APIs to create publication-ready figures and interactive data explorations. It is essentially a diagnostic visualization platform for machine learning that lets us steer the model selection process by helping to evaluate the performance, stability, and predictive value of machine learning models, and further assists in diagnosing problems in our workflow.
Installation
The simplest way to install Yellowbrick is from PyPI with pip, Python’s preferred package installer.
$ pip install yellowbrick
To upgrade Yellowbrick to the latest version, use pip with the -U flag:
$ pip install -U yellowbrick
Fun-fact
The Yellowbrick package gets its name from a fictional element in the 1900 novel The Wonderful Wizard of Oz. In the book, the yellow brick road is the path the protagonist must travel to reach her destination in the Emerald City.
Using Yellowbrick
The primary interface in Yellowbrick is the Visualizer, an object that learns from data to produce a visualization. Visualizers are scikit-learn Estimator objects with a similar interface, plus methods for drawing. To use a visualizer, you simply follow the same workflow as with a scikit-learn model: import the visualizer, instantiate it, call its fit() method, and then, to render it, call its show() method.
Some of the popular visualizers are:
- Feature Analysis Visualizers
- Target Visualizers
- Regression Visualizers
- Classification Visualizers
- Clustering Visualizers
- Model Selection Visualizers
- Text Visualizers
We’ll code and implement all of these visualizers one by one:
1. Feature Analysis Visualizers
Feature analysis visualizers are used to detect features or targets that might impact downstream fitting. Here, we’ll use the Rank1D and Rank2D visualizers to evaluate single features and pairs of features using a variety of metrics that score the features on the scale [-1, 1] or [0, 1].
A one-dimensional ranking of features [Rank1D] utilizes a ranking algorithm that takes into account only a single feature at a time:
from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank1D
# Load the credit dataset
X, y = load_credit()
# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(algorithm='shapiro')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show()      # Finalize and render the figure
Note: I have used Yellowbrick’s pre-loaded datasets to implement all the visualizers.
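Rank1D’s core idea, scoring each feature independently, can be mimicked without Yellowbrick. As a rough stand-in for the Shapiro normality statistic, the sketch below scores toy features one at a time by absolute skewness using plain NumPy; the data and scoring function are illustrative assumptions, not Yellowbrick’s implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy feature matrix: one roughly Gaussian column, one heavily skewed column
X = np.column_stack([
    rng.normal(0.0, 1.0, 500),      # approximately normal
    rng.exponential(1.0, 500),      # strongly right-skewed
])

def abs_skewness(col):
    # Absolute third standardized moment: near 0 for symmetric, normal-like data
    centered = col - col.mean()
    return abs(np.mean(centered ** 3) / col.std() ** 3)

# One score per feature, computed one feature at a time -- the same shape of
# output that Rank1D draws as horizontal bars
scores = [abs_skewness(X[:, j]) for j in range(X.shape[1])]
print(scores[0] < scores[1])  # True: the Gaussian column scores closer to 0
```

The point is the shape of the computation, one score per feature in isolation, rather than the particular metric.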
A two-dimensional ranking of features [Rank2D] utilizes a ranking algorithm that takes into account pairs of features at a time:
from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank2D
# Load the credit dataset
X, y = load_credit()
# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(algorithm='pearson')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show() # Finalize and render the figure
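The Pearson scores that Rank2D draws are just pairwise feature correlations. As a rough illustration of what the visualizer computes (on hypothetical data, not the credit dataset), plain NumPy produces the same matrix:

```python
import numpy as np

# Hypothetical feature matrix: 5 samples, 3 features
X = np.array([
    [1.0,  2.0, 0.5],
    [2.0,  4.1, 0.4],
    [3.0,  6.2, 0.9],
    [4.0,  7.9, 0.1],
    [5.0, 10.3, 0.7],
])

# np.corrcoef expects variables as rows, so transpose the matrix.
# Each entry [i, j] is the Pearson correlation between features i and j,
# on the same [-1, 1] scale that Rank2D renders as a heatmap.
corr = np.corrcoef(X.T)

print(corr.shape)         # (3, 3)
print(corr[0, 1] > 0.99)  # True: features 0 and 1 are almost perfectly correlated
```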
2. Target Visualizers
These visualizers specialize in visually describing the dependent variable for supervised modeling, often referred to as y or the target. Here, we’ll look at the class balance visualizer. Class imbalance in the training data is one of the biggest challenges for classification models, and before we start dealing with it, it is important to understand the class balance in the training data.
The ClassBalance visualizer supports this by creating a bar chart of the support for each class, that is, the frequency of each class’s representation in the dataset.
from yellowbrick.datasets import load_game
from yellowbrick.target import ClassBalance
# Load the classification dataset
X, y = load_game()
# Instantiate the visualizer
visualizer = ClassBalance(labels=["draw", "loss", "win"])
visualizer.fit(y) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
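The support that ClassBalance plots is simply the per-class frequency. A minimal standard-library sketch on toy labels (not the actual game dataset) shows the numbers behind the bars:

```python
from collections import Counter

# Hypothetical target vector; the real load_game() labels are draw/loss/win
y = ["win", "loss", "win", "draw", "win", "loss", "win", "loss", "win"]

# Counter gives the support (frequency) of each class -- the bar heights
# that ClassBalance would draw
support = Counter(y)
print(support)  # Counter({'win': 5, 'loss': 3, 'draw': 1})

# A quick imbalance check: ratio of the largest class to the smallest
ratio = max(support.values()) / min(support.values())
print(ratio)  # 5.0
```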
3. Regression Visualizers
Regression models attempt to predict a target in a continuous space. Regressor score visualizers display the instances in model space to help us better understand how the model is making predictions. In this blog, we’ll look at the Prediction Error Plot, which plots the actual target values against the values predicted by the model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import PredictionError
# Load a regression dataset
X, y = load_concrete()
# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the linear model and visualizer
model = Lasso()
visualizer = PredictionError(model)
visualizer.fit(X_train, y_train) # Fit the training data
visualizer.score(X_test, y_test) # Evaluate the model
visualizer.show() # Render the figure
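Behind the prediction error plot, the R2 score the visualizer reports compares the model’s errors against a predict-the-mean baseline. A hand-rolled version on toy numbers (not the concrete dataset) makes the formula concrete:

```python
# Toy actual targets and model predictions
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.3, 2.9, 6.8]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares (the model's errors) vs. total sum of squares
# (the errors of always predicting the mean)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.974 -- close to 1, so the model beats the baseline easily
```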
4. Classification Visualizers
Classification models attempt to predict a target in a discrete space, that is, to assign each instance of the independent variables to one or more categories. Classification score visualizers display the differences between classes as well as a number of classifier-specific visual evaluations.
We’ll look at the confusion matrix visualizer via the quick method confusion_matrix, which builds the ConfusionMatrix object with the associated arguments, fits it, and then lets us render it.
import matplotlib.pyplot as plt
from yellowbrick.datasets import load_credit
from yellowbrick.classifier import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split as tts
# Load the classification dataset
X, y = load_credit()
# Create the train and test data
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2)
# Instantiate the visualizer with the classification model
confusion_matrix(
LogisticRegression(),
X_train, y_train, X_test, y_test,
classes=['not_defaulted', 'defaulted']
)
plt.tight_layout()
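The matrix itself is easy to tabulate by hand. This toy sketch (hypothetical labels, unrelated to the credit data) shows the counts the visualizer colors in, with rows as true classes and columns as predictions:

```python
# Toy binary labels: 0 = not_defaulted, 1 = defaulted
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

# matrix[actual][predicted]: rows are true classes, columns are predictions
matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

print(matrix)  # [[3, 1], [1, 3]]
# 3 true negatives, 1 false positive, 1 false negative, 3 true positives
```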
5. Clustering Visualizers
Clustering models are unsupervised methods that attempt to detect patterns in unlabeled data. Yellowbrick provides the yellowbrick.cluster
module to visualize and evaluate clustering behavior.
The KElbowVisualizer
helps us select the optimal number of clusters by fitting the model with a range of values for ‘k’. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point. In this visualizer, “elbow” will be annotated with a dashed line.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer
# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
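The distortion score that KElbowVisualizer plots for each k is, by default, the sum of squared distances from each point to its nearest cluster center. A minimal NumPy version for fixed, hand-picked centers on toy data (not a real k-means fit) shows why the score drops sharply when k matches the true structure:

```python
import numpy as np

def distortion(X, centers):
    # Distance from every point to every center, shape (n_points, n_centers)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Each point contributes its squared distance to its *nearest* center
    return float((dists.min(axis=1) ** 2).sum())

# Two tight toy blobs, around (0, 0) and (10, 10)
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])

one_center = np.array([[5.0, 5.0]])
two_centers = np.array([[0.05, 0.0], [10.05, 10.0]])

# Distortion collapses once k matches the true cluster count -- that sharp
# drop is the "elbow" the visualizer annotates
print(distortion(X, one_center) > distortion(X, two_centers))  # True
```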
6. Model Selection Visualizers
The yellowbrick.model_selection package provides visualizers for inspecting the performance of cross-validation and hyperparameter tuning. Many of these visualizers wrap functionality found in sklearn.model_selection, and others build upon it to perform multi-model comparisons.
Model validation is used to determine how effective an estimator is on the data it was trained on, as well as how well it generalizes to new input. To measure a model’s performance, we first split the dataset into training and test splits, fit the model on the training data, and score it on the reserved test data. The model’s hyperparameters must then be selected so that the model operates best in the specified feature space and maximizes its score.
In this example, we’ll explore the ValidationCurve visualizer with a regression dataset.
import numpy as np
from yellowbrick.datasets import load_energy
from yellowbrick.model_selection import ValidationCurve
from sklearn.tree import DecisionTreeRegressor
# Load a regression dataset
X, y = load_energy()
viz = ValidationCurve(
DecisionTreeRegressor(), param_name="max_depth",
param_range=np.arange(1, 11), cv=10, scoring="r2"
)
# Fit and show the visualizer
viz.fit(X, y)
viz.show()
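The behavior a validation curve reveals, training error falling as model complexity grows while validation error tells a different story, can be sketched with a plain NumPy polynomial fit over a range of degrees on a held-out split. The data, degrees, and split below are illustrative assumptions, not the energy dataset or Yellowbrick’s cross-validation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem: a noisy sine wave
x = np.linspace(0, 3, 40)
y = np.sin(x) + rng.normal(0, 0.1, size=x.shape)

# Simple train/validation split (even vs. odd indices)
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def mse(x_eval, y_eval, degree):
    # Fit a polynomial of the given degree on the training split,
    # then measure mean squared error on the evaluation points
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x_eval) - y_eval) ** 2))

# Sweep the complexity parameter, as ValidationCurve does with max_depth
train_scores = {d: mse(x_train, y_train, d) for d in (1, 3, 9)}
val_scores = {d: mse(x_val, y_val, d) for d in (1, 3, 9)}

print(train_scores[9] < train_scores[1])  # True: training error always shrinks
print(val_scores[3] < val_scores[1])      # True: the right complexity generalizes better
```

Plotting both curves across the full parameter range is exactly what the visualizer automates.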
7. Text Visualizers
The last visualizer we’ll check out in this blog is from the yellowbrick.text module for text-specific visualizers. The TextVisualizer class deals specifically with datasets that are corpora rather than simple numeric arrays or DataFrames, providing utilities for analyzing word dispersion and distribution, showing document similarity, or simply wrapping other standard visualizers with text-specific display properties. Here, we’ll check out the FreqDistVisualizer for token frequency distribution.
from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer
from yellowbrick.datasets import load_hobbies
# Load the text data
corpus = load_hobbies()
vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(corpus.data)
features = vectorizer.get_feature_names_out()
visualizer = FreqDistVisualizer(features=features, orient='v')
visualizer.fit(docs)
visualizer.show()
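A token frequency distribution is ultimately a word count. A standard-library sketch on a toy corpus (standing in for load_hobbies()) shows the counts the visualizer ranks and plots:

```python
import re
from collections import Counter

# Hypothetical mini-corpus in place of the real hobbies dataset
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Lowercase word tokenization, roughly what CountVectorizer does by default
tokens = []
for doc in corpus:
    tokens.extend(re.findall(r"[a-z]+", doc.lower()))

freq = Counter(tokens)

# The visualizer draws these counts as a bar chart ranked by frequency
print(freq.most_common(1))  # [('the', 4)]
```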
Conclusion
With this, we have covered some of the commonly used visualizers from Yellowbrick that can prove very useful. But this is not the limit of Yellowbrick: it offers many more visualizers in each category, such as RadViz, PCA Projection, Feature Correlation, Residual Plot, Cross-Validation Score, and more, which are equally useful and convenient. You can check them out in the Yellowbrick documentation.
I hope this blog was useful in introducing you to the Yellowbrick library, and that you will now be able to incorporate it into your ML workflow.