Understanding Marketing Analytics in Python [Part 4]: Analyzing Relationships Between Variables, with examples and code.

Kamna Sinha
Published in Data At The Core !
Sep 20, 2023


This is part 4 of the series on Marketing Analytics; have a look at the entire series introduction, with details of each part, here.

We shall now go ahead and analyze the data that we created and began examining in the previous parts of this series:

Part 1 is about simulating store data using pandas.

Part 2 is about summarizing and inspecting variables and the entire dataframe.

Part 3 looks at single-variable visualization.

In this story, we shall first create retailer data for store visits, both online and in-store, and then explore associations between variables with scatterplots and histograms.

More valuable insight emerges when we understand relationships such as “Customers who live closer to our store visit more often than those who live farther away,” or “Customers of our online shop buy as much in person at the retail shop as do customers who do not purchase online.”

Identifying these kinds of relationships helps marketers understand how to reach customers more effectively.

We focus on understanding the relationships between pairs of variables in multivariate data, and examine how to visualize the relationships and compute statistics that describe their associations (correlation coefficients).

These are the most important ways to assess relationships between continuous variables.

I. Creating Retailer data

We simulate a dataset that describes 1000 customers of a multi-channel (stores and online) retailer and their transactions for 1 year. Within this data there will also be a subset of customers for whom we have survey data on product satisfaction.
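Below is a minimal sketch of how such a dataset could be simulated. The seed, the credit-score formula, the email proportions, and the distance parameters are illustrative assumptions, as is the sample-size variable n_cust:

import numpy as np
import pandas as pd

np.random.seed(21821)   # assumed seed, for reproducibility only
n_cust = 1000           # number of simulated customers

cust_df = pd.DataFrame({'cust_id': pd.Categorical(range(n_cust))})

# 1. AGE: normal distribution, mean 35, standard deviation 5
cust_df['age'] = np.random.normal(loc=35, scale=5, size=n_cust)

# 2. CREDIT SCORE: normal distribution whose mean rises with age (assumed formula)
cust_df['credit_score'] = np.random.normal(loc=3 * cust_df.age + 620,
                                           scale=50, size=n_cust)

# 3. EMAIL: yes/no flag for an email address on file (assumed 80/20 split)
cust_df['email'] = pd.Categorical(np.random.choice(a=['yes', 'no'],
                                                   p=[0.8, 0.2], size=n_cust))

# 4. DISTANCE TO STORE: exponential of a normal draw, so all values are positive
cust_df['distance_to_store'] = np.exp(np.random.normal(loc=2, scale=1.2,
                                                       size=n_cust))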

Explanation of the above code:

  1. AGE: The customers' ages in years (age) are drawn from a normal distribution with mean 35 and standard deviation 5, using numpy.random.normal(loc, scale, size).
  2. CREDIT SCORE: Credit scores (credit_score) are also simulated with a normal distribution; in that case we specify that the mean of the distribution is related to the customer's age, with older customers having higher credit scores on average.
  3. EMAIL: We create a variable (email) indicating whether the customer has an email address on file, using numpy.random.choice().
  4. DISTANCE TO STORE: The last piece of basic CRM data is distance_to_store, in miles, which we assume follows the exponential of a normal distribution. That gives distances that are all positive, with many customers living relatively close to the nearest store and fewer living far from a store (as shown in the histogram below).
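As a quick check of that right-skewed shape, a histogram of distance_to_store can be drawn as follows (the bin count and labels are just one possible choice):

import matplotlib.pyplot as plt

# Right-skewed: many customers live near a store, a few live very far away
plt.hist(cust_df.distance_to_store, bins=50)
plt.xlabel('Distance to nearest store (miles)')
plt.ylabel('Count of customers')
plt.title('Distribution of distance_to_store')
plt.show()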

II. Simulating Online and In-store Sales Data

What we need: one-year totals for each customer for online visits and transactions, plus total spending. To model counts of events over time, we simulate the number of visits with a negative binomial distribution.


Like the lognormal distribution, the negative binomial distribution generates positive values and has a long right-hand tail, meaning that in our data most customers make relatively few visits and a few customers make many visits. Data from the negative binomial distribution can be generated using numpy.random.negative_binomial(n, p, size):
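Here is a sketch of that step. The dispersion value n = 0.3 and the age coefficient 0.7 are assumptions, and prob follows from the mean parameterization p = n / (n + mu):

# Target mean (mu): baseline of 15 visits, plus 15 for customers with an
# email on file, plus more visits for customers younger than the median age
mu = 15 + 15 * (cust_df.email == 'yes') \
        - 0.7 * (cust_df.age - cust_df.age.median())

n = 0.3               # dispersion parameter (assumed value)
prob = n / (n + mu)   # success probability implied by the target mean

cust_df['online_visits'] = np.random.negative_binomial(n=n, p=prob,
                                                       size=n_cust)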

Explanation of the above code:

  1. numpy.random.negative_binomial() takes n and p as shape parameters, where n is the target number of successes (sometimes referred to as the dispersion parameter, as it sets the degree of dispersion in the samples) and p is the probability of a single success.
  2. Calculating the probability: We model the mean (mu) of the negative binomial with a baseline value of 15. We add an average of 15 online visits for customers who have an email on file ((cust_df.email == 'yes') * 15). Finally, we add or subtract visits from the target mean based on the customer's age relative to the sample median; younger customers are simulated to make more online visits. We then calculate prob from mu and n (for this parameterization, prob = n / (n + mu)).

3. For each online visit that a customer makes, we assume there is a 30% chance of placing an order and use numpy.random.binomial() to create the variable online_trans.

4. We assume that the amounts spent in those orders (the variable online_spend) are lognormally distributed.

5. The random value for the amount spent per transaction, sampled with numpy.exp(numpy.random.normal()), is multiplied by the number of transactions to get the total amount spent.
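A sketch of steps 3 to 5; the lognormal parameters (loc=3, scale=0.1) are assumptions:

# 3. Each online visit has a 30% chance of producing an order
cust_df['online_trans'] = np.random.binomial(n=cust_df.online_visits,
                                             p=0.3, size=n_cust)

# 4-5. Lognormal spend per transaction, multiplied by the transaction count
cust_df['online_spend'] = (np.exp(np.random.normal(loc=3, scale=0.1,
                                                   size=n_cust))
                           * cust_df.online_trans)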

Note: most customers who visit a physical store make a purchase, and even if customers did visit without buying, the company probably couldn't track the visit. We assume that in-store transactions follow a negative binomial distribution, with lower average numbers of visits for customers who live farther away. We model in-store spending as a lognormally distributed variable, simply multiplied by the number of transactions:
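A sketch of the in-store counterpart; the dispersion value, the distance effect, and the lognormal parameters are assumptions:

# Store transactions: fewer on average for customers who live farther away
mu_store = 3 / np.sqrt(cust_df.distance_to_store)
n_store = 5
cust_df['store_trans'] = np.random.negative_binomial(
    n=n_store, p=n_store / (n_store + mu_store), size=n_cust)

# Store spend: lognormal amount per transaction times the transaction count
cust_df['store_spend'] = (np.exp(np.random.normal(loc=3.5, scale=0.4,
                                                  size=n_cust))
                          * cust_df.store_trans)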

III. Simulating Satisfaction Survey Responses

Our last simulation step is to create survey data for a subset of the customers. We assume that each customer has an unobserved overall satisfaction with the brand. We generate this overall satisfaction from a normal distribution.

The survey collects information on two items: satisfaction with service, and satisfaction with the selection of products.
We assume that customers’ responses to the survey items are based on unobserved levels of satisfaction overall (sometimes called the “halo” in survey response) plus the specific levels of satisfaction with the service and product selection.

To create such a score from a halo variable, we add sat_overall (the halo) to a random value specific to the item, drawn using numpy.random.normal().
Because survey responses are typically given on a discrete, ordinal scale (i.e., “very unsatisfied”, “unsatisfied”, etc.), we convert our continuous random values to discrete integers using the numpy.floor() function.

Now we add these columns to the cust_df dataframe:
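A sketch of the survey simulation; the distribution parameters, the 1-5 scale bounds, and the rule for selecting non-respondents are assumptions:

# Unobserved overall satisfaction ("halo") for each customer
sat_overall = pd.Series(np.random.normal(loc=3.1, scale=0.7, size=n_cust))

# Item scores = halo + item-specific noise, floored to integer ratings
sat_service = np.floor(sat_overall
                       + np.random.normal(loc=0.5, scale=0.4, size=n_cust))
sat_selection = np.floor(sat_overall
                         + np.random.normal(loc=-0.2, scale=0.6, size=n_cust))

# Keep responses on a 1-5 scale
sat_service = sat_service.clip(lower=1, upper=5)
sat_selection = sat_selection.clip(lower=1, upper=5)

# Only a subset of customers answered the survey; the rest are missing (NaN)
no_response = np.random.binomial(n=1, p=0.3, size=n_cust).astype(bool)
sat_service[no_response] = np.nan
sat_selection[no_response] = np.nan

cust_df['sat_service'] = sat_service
cust_df['sat_selection'] = sat_selection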

IV. Exploring Associations between Variables with Scatterplots

We first review the structure of the dataframe cust_df using the head() method to begin our analysis:
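For example (the resulting table is not reproduced here):

cust_df.head()     # first five rows: one customer per row
cust_df.dtypes     # which columns are numeric and which are categorical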

Details of this data structure are important to note for further analysis. Each row represents a different customer. For each customer,
1. there is a flag indicating whether the customer has an email address on file (email),
2. along with the customer’s age, credit_score, and distance to the nearest physical store (distance_to_store).
3. Additional variables report 1-year total visits to the online site (online_visits),
4. online and in-store transaction counts (online_trans and store_trans),
5. 1-year total spending online and in store (online_spend and store_spend).
6. survey ratings of satisfaction with the service and product selection at the retail stores (sat_service and sat_selection). Some of the survey values are NaN for customers without survey responses.

All values are numeric, except that cust_df.cust_id and cust_df.email are factors (categorical).

IV. a. Creating a Basic Scatterplot with plot()

We explore three relationships:

  1. Between each customer’s age and credit score.
  2. Do customers who buy more online buy less in stores?
  3. Is the propensity to buy online vs. in-store related to email efforts?

We begin by exploring the relationship between each customer's age and credit score, using the plot() dataframe method, which is a wrapper around matplotlib for a variety of plot types:
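A minimal sketch of that call:

cust_df.plot(kind='scatter', x='age', y='credit_score')
plt.show()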

Simple scatterplot

About the scatterplot: it is a fairly typical scatterplot. There is a large mass of customers in the center of the plot, with age around 35 and credit score around 725, and fewer customers at the margins.
Observation: there are not many younger customers with very high credit scores, nor older customers with very low scores, which suggests an association between age and credit score.

Putting more information into the plot:
We modify the figure to make it more interpretable, given the high density of points at the center of the figure (a full sketch follows this list):
1. remove the fill using c='none', and
2. specify a color for the edge using edgecolor='darkblue'.
3. xlim and ylim set a range for each axis.
4. plt.title(), plt.xlabel() and plt.ylabel() provide a descriptive title and axis labels for the chart.

5. Adding lines at the mean values of both axes, to indicate the average age and average credit score in the data, using the basic plt.plot() function:

a. We add a horizontal line at cust_df.credit_score.mean() by specifying the x values to match the x-limits and the y values to be equal to the mean credit score.
b. For a vertical line at the mean age we do the same, but with the mean age as the x values and the y values set to match the y-limits.
c. 'k:' specifies that we want this to be a black dotted line.
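Putting those pieces together, a sketch might look like this (the axis limits, title text, and label text are illustrative):

# Open circles with dark blue edges, and fixed axis ranges
cust_df.plot(kind='scatter', x='age', y='credit_score',
             c='none', edgecolor='darkblue',
             xlim=[15, 55], ylim=[500, 900])

# Dotted black ('k:') reference lines at the mean credit score and mean age
plt.plot([15, 55],
         [cust_df.credit_score.mean(), cust_df.credit_score.mean()], 'k:')
plt.plot([cust_df.age.mean(), cust_df.age.mean()], [500, 900], 'k:')

plt.title('Customer age vs. credit score')
plt.xlabel('Customer age (years)')
plt.ylabel('Customer credit score')
plt.show()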

Next comes the marketing question (relationship 2):

Do customers who buy more online buy less in stores?

We start by plotting online sales against in-store sales.
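A sketch of that plot (axis labels are illustrative); the s=8 argument is discussed below:

cust_df.plot(kind='scatter', x='store_spend', y='online_spend', s=8)
plt.xlabel('Prior 12 months in-store sales ($)')
plt.ylabel('Prior 12 months online sales ($)')
plt.show()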

Resulting plot: typical of the skewed distributions that are common in behavioral data such as sales or transaction counts;
most customers rarely make a purchase, so the data are dense near zero.
The plot has a lot of points along the axes; we use the s=8 argument, which scales down the plotted points so that we can see them a bit more clearly (the argument specifies the marker size in points squared).

The plot shows that there are a large number of customers who didn’t buy anything on one of the two channels (the points along the axes),
and a smaller number of customers who purchase fairly large amounts on one of the channels.

Further investigation required:
Because of the skewed data, the plot does not yet give a good answer to our question about the relationship between online and in-store sales. We investigate further with a histogram of just the in-store sales.
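A sketch of that histogram (the bin count is a judgment call):

cust_df.store_spend.hist(bins=100)
plt.xlabel('Prior 12 months in-store sales ($)')
plt.ylabel('Count of customers')
plt.show()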

Observations:
1. A large number of customers bought nothing in the store.
2. The distribution of sales among those who do buy has a mode around $20 and a long right-hand tail, with a few customers whose 12-month spending was high. Such distributions are typical of spending and transaction counts in customer data.

NOTE:
Scatterplots and histograms are complementary to each other.
Scatterplots reveal relationships between two variables, but do not perform well when many values are very similar and overlay each other, as in this case. We can use a histogram to better visualize the actual density of points in those regions.

IV. b. Color-Coding Points on a Scatterplot

For the next relationship, we try to find a connection between the sales of customers who receive emails and those who don't.

We can add the email dimension to the plot by coloring in the points for customers whose email address is known to us.
We use groupby() along with mapping dictionaries specifying the color for each email category (yes vs. no).
We iterate through the groups and use the scatter() function to plot each subset.
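A sketch of that approach; the specific colors are assumptions, with black open circles for the no-email group to match the discussion of the log-scale plot below:

# Edge color for each email category (illustrative mapping dictionary)
edge_mapper = {'yes': 'g', 'no': 'k'}

fig, ax = plt.subplots()
for name, group in cust_df.groupby('email'):
    ax.scatter(x=group.store_spend, y=group.online_spend,
               facecolors='none', edgecolors=edge_mapper[name], s=8,
               label='email on file: ' + name)
ax.set_xlabel('Prior 12 months in-store sales ($)')
ax.set_ylabel('Prior 12 months online sales ($)')
ax.legend()
plt.show()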

We see that it is still difficult to see whether there is a different relationship between in-store and online purchases for those with and without emails on file, because of the heavy skew in sales figures.

IV. c. Plotting on a Log Scale

A common solution for such scatterplots with skewed data is to plot the data on a logarithmic scale.

For cust_df, because both online and in-store sales are skewed, we use a log scale for both axes:
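A sketch of the log-scale version, reusing the edge_mapper colors from the previous sketch; adding 1 keeps customers with zero spend visible on the log axes:

fig, ax = plt.subplots()
for name, group in cust_df.groupby('email'):
    ax.scatter(x=group.store_spend + 1, y=group.online_spend + 1,
               facecolors='none', edgecolors=edge_mapper[name], s=8,
               label='email on file: ' + name)
ax.set_xscale('log')   # log scale on both axes spreads out the skewed data
ax.set_yscale('log')
ax.set_xlabel('Prior 12 months in-store sales ($, +1)')
ax.set_ylabel('Prior 12 months online sales ($, +1)')
ax.legend()
plt.show()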

Observations from the plot:
1. There are a large number of customers with no sales (the points at x = 1 or y = 1, which correspond to zero sales because we added 1).
2. It now appears that there is little or no association between online and in-store sales;
3. the scatterplot among customers who purchase in both channels shows no pattern. Thus, there is no evidence here to suggest that online sales have cannibalized in-store sales.
4. Customers with no email address on file appear to show slightly lower online sales than those with addresses; there are somewhat more black circles in the lower half of the plot than in the upper half.
5. If we have been sending email promotions to customers, then this suggests that the promotions might be working. An experiment to confirm that hypothesis could be an appropriate next step.

What we have been doing until now is guessing that certain relationships exist and analyzing their strength with the help of code and visualization.

Going forward, in part 5 we will see how, instead of plotting several things individually and ending up with many separate charts, we can create a single graphic that consists of multiple plots.

