Understanding Marketing Analytics in Python. [Part 3] Single Variable Visualization — histograms, boxplots, maps and more. With example and code.

Kamna Sinha
Published in Data At The Core !
7 min read · Sep 13, 2023

This is part 3 of the series on Marketing Analytics. Have a look at the introduction to the entire series, with details of each part, here.

We continue our story from the previous parts of the series: in part 1 we created our sales data set for 2 products using various techniques, and in part 2 we summarized the data and did a preliminary inspection of the dataset.

We now move on to using the matplotlib library to study our data visually and find out more about its distribution, skewness, etc. before planning further analysis.

We shall look at the following usecases :

  1. Histograms (using matplotlib): a step-by-step analysis of the data using histograms, changing parameters and adding more detail with every step.
  2. Boxplots: a compact way to represent a distribution. We shall move from a simple boxplot to comparing sales of a product across stores using boxplots, and then on to checking if promotions make a difference in sales for a specific product.
  3. QQ plot to check normality: a graphical method to evaluate a distribution more formally. We shall use this to check whether the sales data are normally distributed and, if not, how to go about further analysis that assumes normally distributed data.
  4. Cumulative distribution of sales data using empirical cumulative distribution function (ECDF).
  5. Plot our marketing data on the map using the cartopy library.

usecase 1 : Histograms

fig : A basic histogram using hist(). We see that the weekly sales for product 1 range from a little less than 100 to a bit more than 250.

That plot was easy to make, but the visual elements are less than pleasing, so we will improve it. We go through the intermediate steps here so you can see how to evolve a graphic in Python.
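A minimal sketch of that first histogram is shown here. The data from part 1 is not reproduced in this post, so the store_sales frame below is a synthetic stand-in (an assumption, not the series' real data), and the shape of your plot will differ.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for the store_sales data built in part 1.
rng = np.random.default_rng(42)
store_sales = pd.DataFrame({"p1_sales": rng.lognormal(4.85, 0.2, 2080).round()})

# Series.hist() draws a histogram with the defaults: 10 bins, grid on.
ax = store_sales["p1_sales"].hist()
```

With the defaults we get ten bars, no title and no axis labels, which is why the next steps are about presentation.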

This is improved but not perfect; it would be nice to have more granularity (more bars) in the histogram. Let's also tweak the appearance by removing the background, coloring the bars and adding borders to them. The function plt.box(False) removes the plot background and plt.grid(False) removes the grid. We can also set the font through matplotlib's rcParams settings (plt.rcParams).
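One possible version of this step looks as follows. The bin count, colors, labels and font names are our own illustrative choices, and the data is again a synthetic stand-in for store_sales.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for the store_sales data built in part 1.
rng = np.random.default_rng(42)
store_sales = pd.DataFrame({"p1_sales": rng.lognormal(4.85, 0.2, 2080).round()})

# Font settings live in plt.rcParams; the second name is a fallback font.
plt.rcParams["font.family"] = "sans-serif"
plt.rcParams["font.sans-serif"] = ["Arial", "DejaVu Sans"]

fig, ax = plt.subplots()
# More bars (bins=30), a fill color and a border around each bar.
store_sales["p1_sales"].hist(bins=30, color="lightblue", edgecolor="darkblue", ax=ax)
ax.set_title("Product 1 weekly sales frequencies, all stores")
ax.set_xlabel("Product 1 sales (units)")
ax.set_ylabel("Count")
plt.box(False)   # remove the plot background/frame
plt.grid(False)  # remove the grid that hist() switches on
```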

Next we make two changes. We plot relative frequencies instead of counts using the argument density=True, and instead of using the default tick marks (axis numbers) for hist(), we specify the x-axis numbers explicitly using the plt.xticks() function. With plt.xticks(), we have to tell it where to put the labels; the range() function is a convenient way to generate that sequence of numbers.
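A short sketch of the relative-frequency version follows. The tick positions from range(60, 300, 20) are an assumption chosen to suit the synthetic stand-in data, so adjust them to your own sales range.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for the store_sales data built in part 1.
rng = np.random.default_rng(42)
store_sales = pd.DataFrame({"p1_sales": rng.lognormal(4.85, 0.2, 2080).round()})

fig, ax = plt.subplots()
# density=True rescales the bars so that their total area is 1.
store_sales["p1_sales"].hist(bins=30, density=True, color="lightblue",
                             edgecolor="darkblue", ax=ax)
ax.set_ylabel("Relative frequency")
plt.xticks(range(60, 300, 20))  # explicit tick positions: 60, 80, ..., 280

# Sanity check of what density=True means: bar areas sum to 1.
area = sum(p.get_height() * p.get_width() for p in ax.patches)
```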

Finally, we add a smoothed estimation line. To do this, we use the plot.density() method of the p1_sales series. The density plot disrupts the axis autoscaling, so we also use plt.xlim() to specify the x-axis range.

The figure is now very informative. Even someone who is unfamiliar with the data can see that this plot describes weekly sales for product 1 and that typical sales range from about 80 to 200.
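The overlay can be sketched like this, again on synthetic stand-in data. Note that pandas computes the density curve with SciPy, so SciPy needs to be installed; the axis limits are our own choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for the store_sales data built in part 1.
rng = np.random.default_rng(42)
store_sales = pd.DataFrame({"p1_sales": rng.lognormal(4.85, 0.2, 2080).round()})

fig, ax = plt.subplots()
store_sales["p1_sales"].hist(bins=30, density=True, color="lightblue",
                             edgecolor="darkblue", ax=ax)
# Kernel density estimate drawn on the same axes as the histogram.
store_sales["p1_sales"].plot.density(ax=ax, color="darkred", linewidth=2)
# The density curve extends past the data, so set the x-range by hand.
plt.xlim(60, 300)
```

The histogram must use density=True here, otherwise the counts and the density curve would be on very different y-scales.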

usecase 2 : Boxplots

The pandas plot.box() method is straightforward; we add labels, use the argument vert=False to rotate the plot 90° so it looks better, and use sym='k.' to specify the outlier marker:

The boxplot presents the distribution more compactly than a histogram. The median is the center line, while the 25th and 75th percentiles define the box. The outer lines are whiskers at the points of the most extreme values that are no more than 1.5 times the width of the box (the interquartile range) away from the box. Points beyond the whiskers are outliers drawn as individual points. This is also known as a Tukey boxplot (after the statistician John Tukey) or as a box-and-whiskers plot.
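A sketch of such a call is given here. We assume the product 2 column is the one being plotted, and the data and label texts are illustrative stand-ins.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for the store_sales data built in part 1.
rng = np.random.default_rng(42)
store_sales = pd.DataFrame({"p2_sales": rng.lognormal(4.55, 0.25, 2080).round()})

# vert=False lays the box on its side; sym="k." draws outliers as black dots.
ax = store_sales["p2_sales"].plot.box(vert=False, sym="k.")
ax.set_title("Weekly sales of P2, all stores")
ax.set_xlabel("Weekly sales of P2 (units)")
ax.set_ylabel("P2")
```

Recent matplotlib releases are moving from vert to an orientation argument, so you may see a deprecation notice for vert=False depending on your installed version.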
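To make the whisker rule concrete, the pieces of a Tukey boxplot can be computed by hand with NumPy. This is a sketch on synthetic stand-in data; the variable names are our own.

```python
import numpy as np

# ASSUMPTION: synthetic stand-in for one product's weekly sales.
rng = np.random.default_rng(42)
sales = rng.lognormal(4.55, 0.25, 2080).round()

q1, median, q3 = np.percentile(sales, [25, 50, 75])
iqr = q3 - q1                   # width of the box
lower_fence = q1 - 1.5 * iqr    # whiskers may not extend past these fences
upper_fence = q3 + 1.5 * iqr

inside = sales[(sales >= lower_fence) & (sales <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()  # most extreme non-outliers
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print(f"box: {q1:.0f} to {q3:.0f}, median {median:.0f}")
print(f"whiskers: {whisker_low:.0f} to {whisker_high:.0f}, outliers: {outliers.size}")
```

Note that the whiskers end at actual data points, not at the fences themselves.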

Boxplots are even more useful when you compare distributions by some other factor. How do different stores compare on sales of product 2? The boxplot() method makes it easy to compare these with the by argument, which specifies the column by which to group. The column argument indicates the column represented by the boxplot distribution, p2_sales in this case. Here p2_sales is the response variable, which we plot with respect to the explanatory variable store_num:

Stores are roughly similar in sales of product 2.

Note that calling plt.suptitle() with an empty string clears the default title that the boxplot() method adds, as we'd prefer to specify a more informative title.
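The grouped boxplot might look like the sketch below. The store numbers 101 to 120 with 104 weeks each are an assumption made for this synthetic stand-in; only the column names p2_sales and store_num come from the text.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in, 20 hypothetical stores x 104 weeks.
rng = np.random.default_rng(42)
store_sales = pd.DataFrame({
    "store_num": np.repeat(np.arange(101, 121), 104),
    "p2_sales": rng.lognormal(4.55, 0.25, 2080).round(),
})

fig, ax = plt.subplots(figsize=(10, 4))
# One box per store: `column` is the response, `by` is the grouping factor.
store_sales.boxplot(column="p2_sales", by="store_num", sym="k.", ax=ax)
plt.suptitle("")  # clear the automatic "Boxplot grouped by ..." figure title
ax.set_title("Weekly sales of P2 by store")
ax.set_xlabel("Store")
ax.set_ylabel("Weekly unit sales")
```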

Our next analysis would be to check if promotion makes a difference in sales. In this case, our explanatory variable would be the promotion variable for P2, so we use boxplot() now replacing store_num with the promotion variable p2_promo.
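Swapping the grouping column is the only change needed, as this sketch shows. Be aware that the promotion lift here is baked into the synthetic data by us (a hypothetical 40% uplift), so the gap it shows is by construction and says nothing about the real data set.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in with a hypothetical promotion effect.
rng = np.random.default_rng(42)
p2_promo = rng.binomial(1, 0.15, 2080)              # 1 = promoted that week
p2_sales = rng.poisson(96 * (1 + 0.4 * p2_promo))   # promoted weeks sell more
store_sales = pd.DataFrame({"p2_promo": p2_promo, "p2_sales": p2_sales})

fig, ax = plt.subplots()
store_sales.boxplot(column="p2_sales", by="p2_promo", sym="k.", ax=ax)
plt.suptitle("")
ax.set_title("Weekly sales of P2 with and without promotion")
ax.set_xlabel("P2 promoted in store? (0 = no, 1 = yes)")
ax.set_ylabel("Weekly unit sales")

# The same comparison as numbers: median sales per promotion status.
medians = store_sales.groupby("p2_promo")["p2_sales"].median()
```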

There is a clear visual difference in sales on the basis of in-store promotion!

To wrap up: boxplots are powerful tools to visualize a distribution and make it easy to explore how an outcome variable is related to another factor.

usecase 3: QQ Plots

Quantile-quantile (QQ) plots are a good way to check your data against a distribution that you think it should come from. A QQ plot can show whether the distribution is, in fact, normal by plotting the observed quantiles of your data against the quantiles that would be expected for a normal distribution.

To do this, we can use the probplot() function from the scipy.stats library, which compares data vs. a specified distribution, for example the normal distribution. We check p1_sales to see whether it is normally distributed:
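A minimal sketch of the call is shown here, using synthetic right-skewed data as a stand-in for p1_sales. Passing the axes as plot= makes probplot() draw the points and the fitted line; it also returns the fit, which we unpack.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# ASSUMPTION: synthetic, right-skewed stand-in for store_sales.p1_sales.
rng = np.random.default_rng(42)
p1_sales = pd.Series(rng.lognormal(4.85, 0.3, 2080).round(), name="p1_sales")

fig, ax = plt.subplots()
# osm: theoretical normal quantiles, osr: ordered sample values,
# r: correlation between the two (closer to 1 means closer to normal).
(osm, osr), (slope, intercept, r) = stats.probplot(p1_sales, dist="norm", plot=ax)
ax.set_title("QQ plot of P1 weekly sales vs. normal")
```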

The distribution of p1_sales is far from the line at the ends, suggesting that the data are not normally distributed. The upward curving shape is typical of data with high positive skew.

What should you do in this case? If you are using models or statistical functions that assume normally distributed data, you might wish to transform your data. As we've already noted, a common pattern in marketing data is a logarithmic distribution. We examine whether p1_sales is closer to normal after a log() transform:
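The comparison can be sketched as follows. The stand-in data is drawn from a log-normal distribution, so the log transform helps by construction; with real data you would have to look at the plot and judge.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# ASSUMPTION: synthetic log-normal stand-in for store_sales.p1_sales.
rng = np.random.default_rng(42)
p1_sales = pd.Series(rng.lognormal(4.85, 0.3, 2080).round(), name="p1_sales")

# Fit quality on the raw scale (no plot), for comparison.
_, (_, _, r_raw) = stats.probplot(p1_sales, dist="norm")

# QQ plot of the log-transformed sales (all values must be positive).
fig, ax = plt.subplots()
_, (_, _, r_log) = stats.probplot(np.log(p1_sales), dist="norm", plot=ax)
ax.set_title("QQ plot of log(P1 weekly sales) vs. normal")

print(f"QQ correlation raw: {r_raw:.4f}, log: {r_log:.4f}")
```

The correlation r is only a rough numeric companion to the picture; the plot itself shows where the departures from the line occur.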

The points are much closer to the solid line, indicating that the distribution of log(store_sales.p1_sales) is more approximately normal than the untransformed variable.

usecase 4: Cumulative Distribution

We often use cumulative distribution plots both for data exploration and for presenting data to others. They are a good way to highlight data features such as discontinuities in the data, long tails, and specific points of interest.

Below we plot the empirical cumulative distribution function (ECDF), which is simply a plot that shows the cumulative proportion of data values in the sample. This is an easy way to inspect a distribution and to read off percentile values.

We plot the ECDF of p1_sales by combining a few steps. We can use the ECDF() function from the statsmodels library to find the ECDF of the data. Then we put the results into plot(), adding options such as titles.
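To show what the ECDF actually contains, the sketch below computes it by hand with NumPy rather than calling the statsmodels ECDF() function, which keeps the example free of that dependency. The data is a synthetic stand-in for p1_sales.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for store_sales.p1_sales.
rng = np.random.default_rng(42)
p1_sales = pd.Series(rng.lognormal(4.85, 0.2, 2080).round(), name="p1_sales")

# Hand-rolled ECDF: sort the values; the i-th smallest value has
# cumulative proportion i / n.
x = np.sort(p1_sales.to_numpy())
y = np.arange(1, len(x) + 1) / len(x)

fig, ax = plt.subplots()
ax.step(x, y, where="post", color="black")
ax.set_title("Cumulative distribution of P1 weekly sales")
ax.set_xlabel("P1 weekly sales, all stores")
ax.set_ylabel("Cumulative proportion")
```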

Example : Suppose we also want to know the value for which 90% of weekly sales of P1 will be lower than that value, i.e., the 90th percentile for weekly sales of P1.

We can use plot() to add vertical and horizontal lines at the 90th percentile. We do not have to specify the exact value at which to draw a line for the 90th percentile; instead we use the quantile() method with q=0.9 to find it. The 'k--' positional argument indicates that we want the lines to be black and dashed, and alpha=0.5 sets the transparency level of the lines to 50%:
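A sketch of the annotated plot follows. The ECDF is again computed by hand with NumPy on synthetic stand-in data, so the percentile value you see will not match the 171 units reported for the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic stand-in for store_sales.p1_sales.
rng = np.random.default_rng(42)
p1_sales = pd.Series(rng.lognormal(4.85, 0.2, 2080).round(), name="p1_sales")

x = np.sort(p1_sales.to_numpy())
y = np.arange(1, len(x) + 1) / len(x)
p90 = p1_sales.quantile(q=0.9)  # 90th percentile, found from the data

fig, ax = plt.subplots()
ax.step(x, y, where="post", color="black")
# "k--" = black dashed line; alpha=0.5 makes it 50% transparent.
ax.plot([p90, p90], [0, 0.9], "k--", alpha=0.5)      # vertical line at p90
ax.plot([x[0], p90], [0.9, 0.9], "k--", alpha=0.5)   # horizontal line at 90%
ax.set_xlabel("P1 weekly sales, all stores")
ax.set_ylabel("Cumulative proportion")
```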

Output : Cumulative distribution plot with lines to emphasize the 90th percentile. The chart identifies that 90% of weekly sales are lower than or equal to 171 units. Other values are easy to read off the chart. For instance, in roughly 10% of weeks fewer than 100 units are sold, and in the upper 5% more than 200 units are sold.

usecase 5: Maps

We often need to plot marketing data on a map. A common variety is a choropleth map, which uses graphics or color to indicate values of a variable such as income or sales. We consider how to do this for a world map using the cartopy library.

output :

World map for P1 sales by country, using cartopy

This concludes our analysis using visualization.

Going forward, we shall look into relationships between continuous variables in part 4 of the series.
