Understanding Marketing Analytics in Python. [Part 5] — Exploratory Data Analysis Using the Subplot Function And Scatterplot Matrices with example and code.

Kamna Sinha
Published in Data At The Core !
5 min read, Sep 21, 2023

This is part 5 of the series on Marketing Analytics. Have a look at the series introduction, with details of each part, here. For context and reference, please refer to the previous parts [ 1 , 2 , 3 , 4 ].

Matplotlib’s subplot function is a powerful tool for creating multiple plots in the same figure. It’s especially useful when you want to compare different views of your data side by side. Scatter plots, in particular, are a good way to visualize the relationship between two variables.

It is important to clearly understand the difference between a subplot, a scatterplot, and a scatter matrix.

Simple subplots :

As we saw in part 4 of this series, we can get scatterplots by using either the plot() function or the subplots() function. Taking the examples from that part:

First, we used the plot function with the kind argument set to ‘scatter’, which produces a scatterplot.

Then we used plt.subplots() together with ax.scatter to draw a scatterplot on the returned Axes.

subplots() without arguments returns a Figure and a single Axes.

This is actually the simplest and recommended way of creating a single Figure and Axes. More on this here.
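As a reminder of those two approaches, here is a minimal sketch. It does not use the real customer data: demo_df is a small synthetic stand-in with made-up values, and only its column names mirror cust_df. The Agg backend line is there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for cust_df (values are invented for illustration only)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "distance_to_store": rng.lognormal(mean=2.0, sigma=0.8, size=200),
    "store_spend": rng.gamma(shape=2.0, scale=50.0, size=200),
})

# Approach 1: the pandas plot() method with kind='scatter'
ax1 = demo_df.plot(kind="scatter", x="distance_to_store", y="store_spend")

# Approach 2: plt.subplots() with no arguments returns one Figure and one Axes,
# and ax.scatter() draws the scatterplot on that Axes
fig, ax2 = plt.subplots()
ax2.scatter(demo_df["distance_to_store"], demo_df["store_spend"], s=8)
ax2.set_xlabel("distance_to_store")
ax2.set_ylabel("store_spend")
```

Both approaches end up with a matplotlib Axes holding one scatterplot; the second gives you the Figure and Axes objects up front, which is handy when you want to customize the plot further.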

Subplots with arguments :

Next, we will see how to use subplot with its own arguments to get a collection of plots that are all part of a single figure.

For instance, suppose we wish to examine whether customers who live closer to stores spend more in store, and whether those who live further away spend more online. Those questions involve different spending variables and thus need separate plots. If we plot each of them individually, we end up with many separate charts. With subplot, we can instead place them all in a single figure, as the following code shows.

import matplotlib.pyplot as plt  # cust_df is the customer dataframe from the earlier parts

# Upper left (index 1): in-store spend vs. distance
plt.subplot(221)
plt.scatter(x=cust_df.distance_to_store, y=cust_df.store_spend, c='none', edgecolor='darkblue', s=8)
plt.title('store')
plt.ylabel('Prior 12 months in-store sales ($)')
# Lower left (index 3): online spend vs. distance
plt.subplot(223)
plt.scatter(x=cust_df.distance_to_store, y=cust_df.online_spend, c='none', edgecolor='darkblue', s=8)
plt.title('online')
plt.xlabel('Distance to store')
plt.ylabel('Prior 12 months online sales ($)')
# Upper right (index 2): in-store spend on log-log axes; +1 keeps zero spend plottable on a log scale
plt.subplot(222)
plt.scatter(x=cust_df.distance_to_store, y=cust_df.store_spend+1, c='none', edgecolor='darkblue', s=8)
plt.title('store, log')
plt.xscale('log')
plt.yscale('log')
# Lower right (index 4): online spend on log-log axes
plt.subplot(224)
plt.scatter(x=cust_df.distance_to_store, y=cust_df.online_spend+1, c='none', edgecolor='darkblue', s=8)
plt.title('online, log')
plt.xlabel('Distance to store')
plt.xscale('log')
plt.yscale('log')
# Adjust spacing so titles and axis labels do not overlap
plt.tight_layout()

# Instead of four separate plots from the individual plot() or scatter() commands,
# this code produces a single graphic with four panels
A single graphic object consisting of multiple plots shows that distance to store is related to in-store spending, but seems to be unrelated to online spending. The relationships are easier to see when spending and distance are plotted on a log scale in the two right panels.

Code Explanation :
1. Prior to each plotting command, we specify the subplot in which we want that plot to appear.
2. The argument to subplot is of the form rows, columns, index.
3. The index is numbered from left to right and top to bottom. In this case we wanted two rows and two columns. We can select the upper left panel in such an arrangement using plt.subplot(221) or, equivalently, plt.subplot(2, 2, 1).
4. The upper right has the index 2, the lower left 3, and lower right 4.
5. plt.tight_layout() adjusts the spacing so that all labels are visible.
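To make the index numbering concrete, the following sketch creates an empty 2 x 2 grid and records where each panel lands. The panels dictionary is a helper introduced only for this illustration, and the Agg backend line is there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))

# Create the four panels of a 2 x 2 grid and remember each panel's position
panels = {}
for index in (1, 2, 3, 4):
    ax = plt.subplot(2, 2, index)      # rows, columns, index
    ax.set_title(f"index {index}")
    panels[index] = ax.get_position()  # bounding box in figure coordinates

# The three-digit shorthand selects the same upper left slot as (2, 2, 1)
upper_left = plt.subplot(221)

# x0 grows to the right and y0 grows upward in figure coordinates
for index, box in panels.items():
    print(index, round(box.x0, 2), round(box.y0, 2))
```

Indexes 1 and 2 share the top row, 3 and 4 share the bottom row, and odd indexes sit in the left column, which matches the left-to-right, top-to-bottom numbering described above.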

Observation from the graph :
We see in the upper right panel that there may be a negative relationship between customers’ distances to the nearest store and in-store spending: customers who live further from their nearest store spend less in store. However, in the lower right panel, we don’t see an obvious relationship between distance and online spending.

Scatterplot Matrices

When you have several variables in the dataset, it is good practice to examine scatterplots between all pairs of variables before moving on to more complex analyses.
When performing EDA on a dataset, it is important to visualize correlations. Scatter matrices and heatmaps are two of the best ways to achieve this. We shall see the scatter matrix in this story and heatmaps in a later story of this series.

In our customer data, we have a number of variables that might be associated with each other: age, distance_to_store, and email all might be related to online and offline transactions and to spending.

Pandas provides the convenient function pandas.plotting.scatter_matrix(dataframe), which makes a separate scatterplot for every combination of variables:

_ = pd.plotting.scatter_matrix(cust_df, figsize=(15,15), c='none', edgecolor='darkblue') 

Code Explanation :
1. scatter_matrix() will produce output given just a dataframe of numeric data, but there are a variety of optional arguments, such as figsize, which we use here to set the size of the figure.
2. It will accept any arguments that the matplotlib scatter() function does. In this case, we specified the c and edgecolor parameters to create unfilled, dark blue markers.
3. We have set the call equal to ‘_’, which is used as a placeholder in Python when we know we will not use the output or, in this case, to suppress the automatic printing of the returned object. This is used often when a function will return multiple objects, but we only care about a subset of those.
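To see what that returned object actually is, the sketch below keeps it in a variable instead of assigning it to ‘_’. The data is synthetic (demo_df is a made-up stand-in, not the real cust_df), the styling arguments are left out to keep the example minimal, and the Agg backend line is there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data; only the column names echo the customer dataset
rng = np.random.default_rng(1)
visits = rng.poisson(lam=20, size=150)
demo_df = pd.DataFrame({
    "online_visits": visits,
    "online_trans": rng.binomial(n=visits, p=0.3),  # transactions depend on visits
    "age": rng.normal(loc=35, scale=5, size=150),
})

# scatter_matrix returns a NumPy array of Axes, one per panel
axes = pd.plotting.scatter_matrix(demo_df, figsize=(6, 6), diagonal="hist")

print(type(axes).__name__, axes.shape)
```

With three numeric columns the result is a 3 x 3 array of Axes. In a notebook, leaving that array as the last expression of a cell would print its representation, which is what assigning to ‘_’ avoids.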

Resulting Plot :
1. Each position in this matrix shows a scatterplot between two variables, except along the diagonal which has a histogram for each variable.
2. The diagonal argument allows selection of either ’hist’, for a histogram as we have here, or ’kde’ for a kernel density estimation plot in panels along the diagonal.
3. In the fourth row and fifth column, we see a strong linear association between online_visits and online_trans: customers who visit the website more frequently make more online transactions.
4. We also see that customers with a higher number of online transactions have higher total online spending.
5. Customers with more in-store transactions also spend more in-store.

This simple command produced a lot of information to consider.

Scatter Matrices with specific columns from the dataframe.
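One way to restrict the matrix to specific columns is to select them from the dataframe before calling scatter_matrix. The sketch below illustrates that idea under stated assumptions: demo_df is a synthetic stand-in with made-up values, and the particular columns chosen are only an example, not necessarily the selection intended for the real cust_df.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for cust_df (values invented for illustration only)
rng = np.random.default_rng(2)
n = 200
demo_df = pd.DataFrame({
    "age": rng.normal(loc=35, scale=5, size=n),
    "distance_to_store": rng.lognormal(mean=2.0, sigma=0.8, size=n),
    "online_spend": rng.gamma(shape=2.0, scale=40.0, size=n),
    "store_spend": rng.gamma(shape=2.0, scale=50.0, size=n),
})

# Select only the columns of interest, then build the matrix from that subset
cols = ["age", "distance_to_store", "online_spend"]
axes = pd.plotting.scatter_matrix(demo_df[cols], figsize=(8, 8))
```

Selecting columns first keeps the matrix small and readable: three columns give a 3 x 3 grid instead of one panel for every pair of variables in the full dataframe.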

In part 6, we will look at pairgrids, which are highly customized matrices of plots.
