Mlxtend — Plotting Made Easier

Pratyaksh Bhalla
Published in DataX Journal · 10 min read · Mar 23, 2020
Photo by Luke Chesser on Unsplash

Introduction:

Every piece of exploratory analysis, and every step taken toward building a machine learning model, leans heavily on plots, graphs, and visualizations of the dataset and its features. Mlxtend makes that not only possible but extremely easy. Category scatter plots, decision region plots, cumulative distribution functions, and other complicated visualizations are built into the library as functions that, given a few parameter definitions, produce these plots in a few lines of code.

For the complete documentation of the Mlxtend.plotting library, click here

We will walk through the following plots in this article, but the mlxtend library reaches well beyond the plots covered here, and we encourage you to explore the rest as well. It is an extremely underused library, but it is extremely helpful. We promise!!

Let's Begin:

category_scatter:

This is a scatter plot with category influence: it plots the data points colored and marked by their respective categories.

import pandas as pd
from io import StringIO
csvfile = """label,x,y
class1,10.0,8.04
class1,10.5,7.30
class2,8.3,5.5
class2,8.1,5.9
class3,3.5,3.5
class3,3.8,5.1
class1, 10.0, 5.5
class2, 10, 6.3
class1, 7.9, 5.2
class3, 1.2, 5.9
class3, 10.0,0.3
class1, 7.2, 0.4
class2,5.0, 0.2
class1, 5.0, 7.0
class2, 10, 0.9
class3, 10, 8.9
class2, 5.3, 1.9
class1, 2.0, 9.8
class3, 3.0, 4.2
class1,12.0,9.04
class1,4.5,6.30
class2,6.3,3.5
class2,7.1,4.9
class3,8.5,2.5
class3,2.8,4.1
class1, 9.0, 4.5
class2, 9.1, 9.3
class1, 8.2, 3.2
class3, 10.2, 1.9"""
df = pd.read_csv(StringIO(csvfile))
df

This builds the dataset; next we fill in the parameters of the category_scatter() function.

parameters:

  • x : str or int

DataFrame column name for the x-axis values, or the column index if the data is provided as a NumPy array.

  • y : str or int

DataFrame column name for the y-axis values, or the column index if the data is provided as a NumPy array.

  • label_col : str

Name of the column containing the category labels.

  • data : Pandas DataFrame object
  • markers : str

Markers that are cycled through the label category.

  • colors : tuple

Colors that are cycled through the label category.

  • alpha : float (default: 0.7)

Parameter to control the transparency.

  • markersize : float (default: 20.0)

Parameter to control the marker size.

  • legend_loc : str (default: 'best')

Location of the plot legend: {'best', 'upper left', 'upper right', 'lower left', 'lower right'}. No legend if legend_loc=False.

AND NOW COMES THE CODE:

from mlxtend.plotting import category_scatter

fig = category_scatter(x='x', y='y', label_col='label',
                       data=df, legend_loc='upper left')

These few lines of code produce the category scatter plot for the DataFrame built above.

category scatter plot.
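The optional styling parameters from the list above compose naturally; here is a minimal sketch (the marker string, color tuple, alpha, and marker size are illustrative choices, not part of the original example):

from mlxtend.plotting import category_scatter

# cycle custom markers and colors through the three classes
fig = category_scatter(x='x', y='y', label_col='label', data=df,
                       markers='sxo',
                       colors=('blue', 'green', 'red'),
                       alpha=0.5,
                       markersize=40,
                       legend_loc='lower right')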

checkerboard_plot:

A checkerboard plot is nothing but a 2D representation of an n x m matrix.

Creating the dataset for the checkerboard plot:

from mlxtend.plotting import checkerboard_plot
import matplotlib.pyplot as plt
import numpy as np
ary = np.random.random((5, 4))

Looking into the parameters:

  • ary : array-like, shape = [n, m]

A 2D array.

  • cell_colors : tuple or list (default: ('white', 'black'))

Tuple or list containing the two colors of the checkerboard pattern.

  • font_colors : tuple or list (default: ('black', 'white'))

Font colors corresponding to the cell colors.

  • figsize : tuple (default: (2.5, 2.5))

Height and width of the figure

  • fmt : str (default: '%.1f')

Python string formatter for cell values. The default ‘%.1f’ results in floats with 1 digit after the decimal point. Use ‘%d’ to show numbers as integers.

  • row_labels : list (default: None)

List of the row labels. Uses the array row indices 0 to n by default.

  • col_labels : list (default: None)

List of the column labels. Uses the array column indices 0 to m by default.

  • fontsize : int (default: None)

Specifies the font size of the checkerboard table. Uses matplotlib’s default if None.

If we apply some of these parameters to the checkerboard_plot() function, we can beautify the visualization:

plotting the checkerboard plot:

checkerboard_plot(ary,
                  col_labels=['category %d' % i for i in range(1, 5)],  # 4 columns
                  row_labels=['sample %d' % i for i in range(1, 6)],    # 5 rows
                  cell_colors=['red', 'black'],
                  font_colors=['black', 'white'],
                  figsize=(7.5, 5),
                  fontsize=20)
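As a quick sketch of the fmt parameter described above, an integer-valued matrix can be rendered with '%d' (the random array here is an illustrative assumption):

import numpy as np
from mlxtend.plotting import checkerboard_plot

counts = np.random.randint(0, 100, size=(3, 3))  # integer-valued cells
checkerboard_plot(counts, fmt='%d')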

heatmap:

For this particular plot we will be using the dataset from here.

A heatmap is a graphical representation of a matrix in which each cell's value is encoded as a color, showing the distribution and density of values across the matrix at a glance.

Here is the code to load the data:

import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
                 'python-machine-learning-book-2nd-edition'
                 '/master/code/ch10/housing.data.txt',
                 header=None,
                 sep=r'\s+')
df.columns = ['sample %d' % i for i in range(1, 15)]
df.head()
dataset

heatmap parameters:

  • conf_mat : array-like, shape = [n_rows, n_columns]. Basically, a 2D array.
  • hide_spines : bool (default: False)

Hides axis spines if True.

  • hide_ticks : bool (default: False)

Hides axis ticks if True

  • figsize : tuple (default: (2.5, 2.5))

Height and width of the figure

  • cmap : matplotlib colormap (default: None)

Provides predefined color sequences to the map

  • colorbar : bool (default: True)

Shows a colorbar if True

  • row_names : array-like, shape = [n_rows] (default: None)

List of row names to be used as y-axis tick labels.

  • column_names : array-like, shape = [n_columns] (default: None)

List of column names to be used as x-axis tick labels.

  • column_name_rotation : int (default: 45)

Number of degrees for rotating column x-tick labels.

  • cell_fmt : string (default: '.2f')

Format specification for cell values.

  • cell_font_size : int (default: None)

Font size for cell values

The code to display the heatmap:

from mlxtend.plotting import heatmap
import matplotlib.pyplot as plt

cols = ['sample 1', 'sample 5', 'sample 9', 'sample 12', 'sample 14']
cm = np.corrcoef(df[cols].values.T)  # correlation matrix of the selected columns
heatmap(cm,
        column_names=cols,
        row_names=cols,
        cmap='magma',
        figsize=(7.5, 7.5),
        cell_font_size=20)
plt.show()
heatmap
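The same call scales to the full correlation matrix of all 14 columns; a sketch, using pandas' corr() as an equivalent alternative to np.corrcoef:

from mlxtend.plotting import heatmap
import matplotlib.pyplot as plt

corr = df.corr().values  # 14 x 14 correlation matrix
heatmap(corr,
        column_names=df.columns,
        row_names=df.columns,
        figsize=(12, 12))
plt.show()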

plot_confusion_matrix:

As the name suggests, this method plots a confusion matrix; it is extremely straightforward.

Code to create a sample confusion matrix:

import numpy as np
cm1 = np.array([[1, 4, 5], [4, 7, 9], [2, 9, 8]])

PARAMETERS:

  • conf_mat : array-like, shape = [n_classes, n_classes]

Confusion matrix input

  • hide_spines : bool (default: False)

Hides axis spines if True.

  • hide_ticks : bool (default: False)

Hides axis ticks if True

  • figsize : tuple (default: (2.5, 2.5))

Height and width of the figure

  • cmap : matplotlib colormap (default: None)

color sequence

  • colorbar : bool (default: False)

Shows a color bar if True

  • show_absolute : bool (default: True)

Shows absolute confusion matrix coefficients if True. At least one of show_absolute or show_normed must be True.

  • show_normed : bool (default: False)

Shows normed confusion matrix coefficients if True. The normed confusion matrix coefficients give the proportion of training examples per class that are assigned the correct label. At least one of show_absolute or show_normed must be True.

  • class_names : array-like, shape = [n_classes] (default: None)

List of class names. If not None, ticks will be set to these values.

Code for the confusion matrix plot:

from mlxtend.plotting import plot_confusion_matrix

plot_confusion_matrix(cm1, cmap='winter_r', figsize=(7.5, 7.5))
confusion matrix
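A sketch combining the normalization and labelling options listed above (the class names are made-up placeholders):

from mlxtend.plotting import plot_confusion_matrix

plot_confusion_matrix(cm1,
                      show_absolute=True,
                      show_normed=True,  # also show per-class proportions
                      class_names=['class 1', 'class 2', 'class 3'],
                      figsize=(7.5, 7.5))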

plot_decision_regions:

This is the most important and heroic of the methods. In fact, the other methods are also available in other libraries, but this particular one reduces roughly 14 lines of code to 2.

We will be using the iris dataset for this plot, thus the code for the dataset is pretty straightforward:

from sklearn import datasets
# Loading some example data
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

This is a simple classification problem, so fitting a classifier is very straightforward; we will be using logistic regression:

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X, y)

Unlike the other plots, I will first show the code to plot the decision regions without the library:

from matplotlib.colors import ListedColormap

X_set, y_set = X, y
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('black', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('Logistic Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
contour plot

The code and plot above are quite hard to write and comprehend, so now we use the mlxtend library:

from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X, y, clf=classifier, zoom_factor=10)
plot_decision_regions

Isn't it simpler, far more convenient, and beautiful? So now let's see the parameters:

  • X : array-like, shape = [n_samples, n_features]

Feature Matrix.

  • y : array-like, shape = [n_samples]

True class labels.

  • clf : Classifier object.
  • feature_index : array-like (default: (0,) for 1D, (0, 1) otherwise)

Feature indices to use for plotting. The first index in feature_index will be on the x-axis, the second index will be on the y-axis.

  • filler_feature_values : dict (default: None)

Only needed when the number of features > 2. Dictionary of feature index-value pairs for the features not being plotted (see the sketch after this list).

  • filler_feature_ranges : dict (default: None)

Only needed when the number of features > 2. Dictionary of feature index-range pairs for the features not being plotted. Will use the ranges provided to select training samples for plotting.

  • X_highlight : array-like, shape = [n_samples, n_features] (default: None)

An array with data points that are used to highlight samples in X.

  • res : float or array-like, shape = (2,) (default: None)

This parameter was used to define the grid width, but it has been deprecated in favor of determining the number of points automatically from the figure DPI and size, for optimal results and computational efficiency. To increase the resolution, it is recommended to provide a dpi argument via matplotlib, e.g., plt.figure(dpi=600).

  • zoom_factor : float (default: 1.0)

Controls the scale of the x- and y-axis of the decision plot.

  • hide_spines : bool (default: True)

Hide axis spines if True.

  • legend : int (default: 1)

Integer to specify the legend location. No legend if legend is 0.

  • markers : str

Scatterplot markers.

  • colors : str (default: 'red, blue, limegreen, gray, cyan')

Comma-separated list of colors.

  • scatter_kwargs : dict (default: None)

Keyword arguments for underlying matplotlib scatter function.

  • contourf_kwargs : dict (default: None)

Keyword arguments for underlying matplotlib contourf function.

  • scatter_highlight_kwargs : dict (default: None)

Keyword arguments for underlying matplotlib scatter function.
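To make filler_feature_values and filler_feature_ranges concrete, here is a sketch that plots two of the four iris features while holding the other two fixed at their means (the chosen feature indices and ranges are illustrative assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions

X4, y4 = iris_data()  # all four iris features
clf4 = LogisticRegression(max_iter=1000).fit(X4, y4)

# hold features 1 and 3 at their means; plot features 0 and 2
values = {1: X4[:, 1].mean(), 3: X4[:, 3].mean()}
ranges = {1: np.ptp(X4[:, 1]), 3: np.ptp(X4[:, 3])}
plot_decision_regions(X4, y4, clf=clf4,
                      feature_index=(0, 2),
                      filler_feature_values=values,
                      filler_feature_ranges=ranges)
plt.show()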

plot_learning_curves:

A very useful plot for understanding how a model learns as the training set grows, on both the training and the test split of the data. This method simplifies the operation: instead of fitting several models and plotting their scores together by hand, we just call the method with the right parameters. Let's see how it's done.

code for the dataset:

from mlxtend.preprocessing import shuffle_arrays_unison
from mlxtend.data import iris_data
from sklearn.neighbors import KNeighborsClassifier
X, y = iris_data()
X, y = shuffle_arrays_unison(arrays=[X, y], random_seed=123)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]
clf = KNeighborsClassifier(n_neighbors=5)

This is a simple classification problem, and we are using the k-nearest neighbors method to solve it.

Parameters:

  • X_train : array-like, shape = [n_samples, n_features]

Feature matrix of the training dataset.

  • y_train : array-like, shape = [n_samples]

True class labels of the training dataset.

  • X_test : array-like, shape = [n_samples, n_features]

Feature matrix of the test dataset.

  • y_test : array-like, shape = [n_samples]

True class labels of the test dataset.

  • clf : Classifier object.
  • train_marker : str (default: 'o')

Marker for the training set line plot.

  • test_marker : str (default: '^')

Marker for the test set line plot.

  • suppress_plot : bool (default: False)

Suppress matplotlib plots if True. Recommended for testing purposes.

  • print_model : bool (default: True)

Print model parameters in plot title if True.

  • style : str (default: 'fivethirtyeight')

Matplotlib style

  • legend_loc : str (default: 'best')

Where to place the plot legend: {‘best’, ‘upper left’, ‘upper right’, ‘lower left’, ‘lower right’}

Code for the plot:

from mlxtend.plotting import plot_learning_curves
plot_learning_curves(X_train, y_train, X_test, y_test, clf)
plt.show()

It's a pretty clear and extremely helpful plot: it shows the exact training and testing curves. Analysts use these plots extensively, and this library does exactly what you need.
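A sketch of the styling options from the parameter list above (the style name is just one of matplotlib's built-in styles):

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_learning_curves

plot_learning_curves(X_train, y_train, X_test, y_test, clf,
                     style='ggplot',      # any matplotlib style name
                     print_model=False,   # keep the plot title clean
                     legend_loc='upper right')
plt.show()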

plot_linear_regression:

We have come across multiple plots for classification and other models, but when it comes to linear regression we cannot overlook visualization, and neither does mlxtend. So let's code the dataset:

import numpy as np

X = np.array([4, 8, 13, 26, 31, 10, 8, 30, 18, 12, 20, 5, 28, 18, 6, 31, 12, 12, 27, 11, 6, 14, 25, 7, 13, 4, 15, 21, 15])
y = np.array([14, 24, 22, 59, 66, 25, 18, 60, 39, 32, 53, 18, 55, 41, 28, 61, 35, 36, 52, 23, 19, 25, 73, 16, 32, 14, 31, 43, 34])

Parameters:

  • X : numpy array, shape = [n_samples,]

Samples.

  • y : numpy array, shape = [n_samples,]

Target values.

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_linear_regression

intercept, slope, corr_coeff = plot_linear_regression(X, y)
plt.show()
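Note that plot_linear_regression returns the fitted parameters, so we can reuse them directly; a sketch (the new x value is an arbitrary illustration):

x_new = 22
y_pred = slope * x_new + intercept  # predict with the fitted line
print('intercept=%.2f, slope=%.2f, r=%.2f, prediction=%.2f'
      % (intercept, slope, corr_coeff, y_pred))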

CONCLUSION:

With this, we reach the end of our tour of the library and its capabilities. We really hope that this article and the underdog library mlxtend make your analysis easier. Stay tuned for further helpful and exciting blogs in the field of Data Science. Till then, happy analysis!!!
