Understanding Marketing Analytics in Python. [Part 1] Simulating Data. With example and code.

Published in

Data At The Core !

5 min read · Sep 11, 2023

This is part 1 of the series on Marketing Analytics. Have a look at the introduction to the entire series, with details of each part, here.

When learning Data Analytics with Python, it's best to use data that we know well, that does not change without our knowledge, and that we can alter whenever we want to see the effect on the analysis output.

Listed below are the steps to create your own simulated data, with example and code.

About our Data

Our first dataset represents observations of total sales by week for two competing products at a chain of stores.

Creating an empty data structure:

We begin by creating a data structure that will hold the data, a simulation of sales for the two products in 20 stores over 2 years, with price and promotion status.

Step 1. We first set a few constants: the number of stores and the number of weeks of data for each store.

Step 2. Create an empty dataframe with the required shape.

Step 3. Check the dataframe that has been created.
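Steps 1 to 3 can be sketched as follows. This is a minimal sketch, not necessarily the exact code behind this article: the variable names (k_stores, k_weeks, n_rows) and the exact column names for sales, price and promotion are assumptions, and np.empty is just one way to allocate a placeholder frame.

```python
import numpy as np
import pandas as pd

# Step 1: constants (20 stores, 2 years of weekly data)
k_stores = 20
k_weeks = 104

# Assumed column names, based on the columns named in this article
columns = ["store_num", "year", "week",
           "p1_sales", "p2_sales",
           "p1_price", "p2_price",
           "p1_promo", "p2_promo",
           "country"]
n_rows = k_stores * k_weeks

# Step 2: allocate the frame; np.empty leaves arbitrary (junk or zero) values
store_sales = pd.DataFrame(np.empty(shape=(n_rows, len(columns))),
                           columns=columns)

# Step 3: check what was created
print(store_sales.shape)
print(store_sales.head())
```

The values printed by head() are meaningless at this point; they are overwritten in the steps that follow.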

Creating Empty data structure with junk/zero values

Creating Store IDs

Step 4. We create a set of store “numbers” or “IDs,” which serve to identify each store:
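A minimal sketch of this step is shown here. The starting ID of 101 is an arbitrary assumption; any set of 20 unique labels would work.

```python
k_stores = 20

# Assumption: store IDs run consecutively from 101
store_numbers = list(range(101, 101 + k_stores))
print(store_numbers)
```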

Assign a country to each store:

Step 5. We create a dictionary that maps each store number to the country of that store.
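One way to build such a dictionary is with zip. The country codes and the number of stores per country below are invented for illustration only.

```python
k_stores = 20
store_numbers = list(range(101, 101 + k_stores))  # assumed store IDs

# Hypothetical assignment: one country label per store, 20 in total
countries = (["USA"] * 6 + ["DEU"] * 5 + ["GBR"] * 3 +
             ["BRA"] * 2 + ["JPN"] * 2 + ["AUS"] * 1 + ["CHN"] * 1)

# Pair each store number with its country
store_country = dict(zip(store_numbers, countries))
print(store_country[101], store_country[120])
```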

Fill the dataframe values column by column:

Step 6. Start filling in the store_sales dataframe: the ‘store_num’, ‘year’, ‘week’ and ‘country’ columns.

Step 7. Change data types to categorical wherever needed.
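The four identifying columns can be filled in a vectorized way, as sketched below. This assumes 52 weeks per year, the store IDs and country mapping used for illustration above, and np.repeat / np.tile instead of explicit loops; a nested loop over stores, years and weeks would give the same result.

```python
import numpy as np
import pandas as pd

k_stores, k_weeks = 20, 104
n_rows = k_stores * k_weeks
store_numbers = list(range(101, 101 + k_stores))  # assumed store IDs
countries = (["USA"] * 6 + ["DEU"] * 5 + ["GBR"] * 3 +
             ["BRA"] * 2 + ["JPN"] * 2 + ["AUS"] * 1 + ["CHN"] * 1)  # hypothetical
store_country = dict(zip(store_numbers, countries))

columns = ["store_num", "year", "week", "p1_sales", "p2_sales",
           "p1_price", "p2_price", "p1_promo", "p2_promo", "country"]
store_sales = pd.DataFrame(np.zeros((n_rows, len(columns))), columns=columns)

# Each store number is repeated for all of its 104 weekly rows
store_sales["store_num"] = np.repeat(store_numbers, k_weeks)
# Within each store: 52 rows of year 1, then 52 rows of year 2
store_sales["year"] = np.tile(np.repeat([1, 2], 52), k_stores)
# Weeks 1..52, restarted for every store-year
store_sales["week"] = np.tile(np.arange(1, 53), 2 * k_stores)
# Look up the country from the store number
store_sales["country"] = store_sales["store_num"].map(store_country)

print(store_sales.head())
```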

The types of all the variables in our dataframe were dictated by the input data, so country values have been stored as str. However, country labels are actually discrete values and not just arbitrary text, so it is better to represent country explicitly as a categorical variable.

Similarly, store_num is a label, not a number as such.

By converting those variables to categorical types, we ensure they will be treated as categorical in subsequent analyses such as regression models. It is good practice to set variable types correctly as they are created; this will help you to avoid errors later.
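As a small illustration of the conversion, the sketch below uses a tiny stand-in frame rather than the full store_sales data; astype("category") is one common way to do this in pandas.

```python
import pandas as pd

# Tiny stand-in for store_sales, just to show the type change
store_sales = pd.DataFrame({"store_num": [101, 101, 102],
                            "country": ["USA", "USA", "DEU"]})

# Convert the label-like columns to pandas' categorical dtype
store_sales["store_num"] = store_sales["store_num"].astype("category")
store_sales["country"] = store_sales["country"].astype("category")

print(store_sales.dtypes)
```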

Storing data values in the dataframe by simulating data points

Step 8. We complete store_sales with random data for the store-by-week observations in the three remaining groups of columns (sales, price, and promotional status) for the two competing products, p1 and p2.

Step 8.a. Set the random number generation seed.

Step 8.b. Draw from the binomial distribution for the promo columns (0 or 1), using np.random.binomial.

Step 8.c. Set the price for the p1 and p2 product columns, using np.random.choice.
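Steps 8.a and 8.b might look like the sketch below. The seed value and the promotion probabilities (10% and 15%) are arbitrary assumptions chosen for illustration.

```python
import numpy as np

n_rows = 20 * 104  # stores x weeks

# Step 8.a: fix the seed so the "random" data is reproducible
np.random.seed(37204)  # arbitrary seed value

# Step 8.b: one Bernoulli draw (binomial with n=1) per row gives 0 or 1
p1_promo = np.random.binomial(n=1, p=0.10, size=n_rows)  # assumed 10% chance
p2_promo = np.random.binomial(n=1, p=0.15, size=n_rows)  # assumed 15% chance

print(p1_promo[:10], p1_promo.mean())
```

Re-running the block with the same seed reproduces exactly the same promotion flags, which is what makes the simulated data stable between sessions.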

We suppose that each product is sold at one of five distinct price points ranging from $2.19 to $3.19 overall. We randomly draw a price for each week by defining a vector with the five price points and using np.random.choice(a, size, replace) to draw from it as many times as we have rows of data (size=n_rows). The five prices are sampled many times, so we sample with replacement (replace=True, which is the default so we don’t write it):
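A sketch of the price draw is given here. Only the overall range ($2.19 to $3.19) comes from the text; the intermediate price points and the seed are assumptions.

```python
import numpy as np

np.random.seed(37204)  # arbitrary seed
n_rows = 20 * 104

# Five price points per product; intermediate values are illustrative
p1_points = [2.19, 2.29, 2.49, 2.79, 2.99]
p2_points = [2.29, 2.49, 2.59, 2.99, 3.19]

# replace=True is the default, so every row gets an independent draw
p1_price = np.random.choice(p1_points, size=n_rows)
p2_price = np.random.choice(p2_points, size=n_rows)

print(p1_price[:5], p2_price[:5])
```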

Step 8.d. Simulate the sales figures for each week, using the Poisson distribution.

Item sales are in unit counts, so we use the Poisson distribution to generate count data: np.random.poisson(lam, size), where size is the number of draws and lam represents lambda, the defining parameter of the Poisson distribution.
Lambda represents the expected, or mean, value of units per week.
We draw a random Poisson count for each row (size=n_rows), and set the mean sales (lam) of Product 1 to be higher than that of Product 2:
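The Poisson draw can be sketched as below. The means of 120 and 100 units per week are assumed values; the only requirement stated in the text is that Product 1 has the higher mean.

```python
import numpy as np

np.random.seed(37204)  # arbitrary seed
n_rows = 20 * 104

# lam is the expected number of units sold per week (assumed values)
sales_p1 = np.random.poisson(lam=120, size=n_rows)
sales_p2 = np.random.poisson(lam=100, size=n_rows)

print(sales_p1[:5], sales_p1.mean(), sales_p2.mean())
```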

Next, we scale those counts up or down according to the relative prices. Price effects often follow a logarithmic function rather than a linear function, so we use np.log(price) here:
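A sketch of the scaling step, under the same assumed price points and Poisson means as before: each product's raw count is multiplied by the ratio of the competitor's log price to its own log price.

```python
import numpy as np

np.random.seed(37204)  # arbitrary seed
n_rows = 20 * 104

# Assumed inputs: prices and raw Poisson counts for each row
p1_price = np.random.choice([2.19, 2.29, 2.49, 2.79, 2.99], size=n_rows)
p2_price = np.random.choice([2.29, 2.49, 2.59, 2.99, 3.19], size=n_rows)
raw_p1 = np.random.poisson(lam=120, size=n_rows)
raw_p2 = np.random.poisson(lam=100, size=n_rows)

log_p1 = np.log(p1_price)
log_p2 = np.log(p2_price)

# A product sells more when its log price is below the competitor's
sales_p1 = raw_p1 * log_p2 / log_p1
sales_p2 = raw_p2 * log_p1 / log_p2

print(sales_p1[:5])
```

The results are no longer integers at this stage; they are rounded down to whole units at the end.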

Assumptions:
1. We have assumed that sales vary as the inverse ratio of the log prices. That is, sales of Product 1 go up to the degree that the log(price) of Product 1 is lower than the log(price) of Product 2.
2. We assume that sales get a 30% or 40% lift when each product is promoted in store. We multiply the promotional status vector (which comprises all {0, 1} values) by 0.3 or 0.4 respectively, add 1, and then multiply the sales vector by the result, so that unpromoted weeks keep their sales unchanged.

We use the floor() function to drop fractional values and ensure integer counts for weekly unit sales, and put those values into the dataframe:
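The promotion lift and the rounding can be illustrated on a few hand-picked rows. The input numbers below are made up so the arithmetic is easy to follow; in the full simulation the same two lines would be applied to the complete sales and promo vectors.

```python
import numpy as np
import pandas as pd

# Made-up scaled sales and promo flags for three weeks
sales_p1 = np.array([100.5, 120.9, 80.2])
sales_p2 = np.array([90.7, 60.0, 110.3])
p1_promo = np.array([0, 1, 0])
p2_promo = np.array([1, 0, 1])

store_sales = pd.DataFrame({"p1_promo": p1_promo, "p2_promo": p2_promo})

# (1 + promo * lift) is 1.0 without promotion, 1.3 or 1.4 with it
store_sales["p1_sales"] = np.floor(sales_p1 * (1 + p1_promo * 0.3))
store_sales["p2_sales"] = np.floor(sales_p2 * (1 + p2_promo * 0.4))

print(store_sales)
```

For example, the second week of Product 1 is promoted, so 120.9 becomes 120.9 × 1.3 = 157.17, which floor() turns into 157 units.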

We now have our final data ready for analysis.

We shall look into subsequent steps in following parts of this series.

Part 2 shows how to summarize and inspect individual variables and the entire dataframe, using several functions and use cases with examples and code.

Written by Kamna Sinha