This is part 2 of the series on Marketing Analytics, have a look at the entire series introduction with details of each part here.

Summarizing Data [ image — multiple sources ]

Observations may comprise either discrete data that occurs at specific levels or continuous data with many possible values.

Continuing from Part 1 of this series, we will now explore the data that we created in previous part of this series.

  1. The groupby() function — usecase 1, 2, 3
  2. Discrete Variables — usecase 4, 5, 6, 7
  3. Continuous Variables — usecase 8,9,10,11
  4. Summarizing Dataframes — using describe() and apply() — usecase 12,13,14,15,16,17 , 18

lets start by looking at the sample of our created dataframe with its rows and columns :

sample dataframe : store_sales

The groupby() function

groupby() is a method on pandas dataframes. The by argument specifies the column by which to group, for example store_num:

That function returns a SeriesGroupBy object, which can be saved to a variable or acted upon directly. That object contains each other column within the dataframe, which can be accessed via dot notation.
Pandas analytical methods can then be applied to that group, such as mean(), sum(), etc.

Usecase 1 : calculating mean sales by store :

we group by store number , as had been assigned by us when we were creating store data and find out the mean of sales for product p1 for each store separately.

Usecase 2: calculating mean by more than one factor [ Multilevel groupings] :

To group it by more than one factor, use a list of factors.

Using unstack() for multilevel groupings : for better formatted result which can be saved into another object.

Usecase 3: computing the total (sum()) sales of P1 by country:

Discrete Variables

A basic way to describe discrete data is with frequency counts.
The value_counts() method will count the observed prevalence of each value that occurs in a variable.

Usecase 4: count how many times product 1 was observed to be on sale at each price point:

value_counts method with and without normalize variable

We can even visualize the above using a bar plot [ more on visualization in later parts of this series]

Observation : product was on sale at each price point roughly the same number of times

Usecase 5: how often each product was promoted at each price point

The pandas.crosstab() function produces counts of observations at each level for two variables, i.e. a two-way cross tab:

Usecase 6: mean price by product, promotion status, and country:

usecase 7 : to find out % times a product was promoted at each price point

we can compute the exact fraction of times product 1 is on promotion at each price point if we assign the result to a variable and then divide it by the total sales by price in p1_table_0: [ expected answer is 10% because thats what we introduced when we created this data]
The div() method applies element-wise division between the series of counts in p1_table_0 and each row of p1_table_1

Continuous Variables

Distribution functions that operate on a numeric vector

usecase 8 : find all distribution values in case of continuous variables

min, max, mean, median and variance calculation examples
standard deviation, mean absolute deviation and quantile values example

usecase 9 : finding other quantiles

Change the q= argument in quantile() to find other quantiles:
for eg for every 10th percentile, we put the numbers as a range using arange function we find every 10th percentile by creating a sequence using numpy.arange(start, stop, step) to yield the vector 0, 0.1, 0.2 … 1.0.

arange function used instead of range since this returns decimal values, range only returns int values

usecase 10 : IQR

For skewed and asymmetric distributions that are common in marketing, such as unit sales or household income, the arithmetic mean() and standard deviation std() may be misleading; in those cases, the median() and interquartile range (IQR, the range of the middle 50% of data) are often more useful to summarize a distribution. Pandas does not have a
built-in IQR function, but we can create one and apply it to our data:

creating our own iqr function

usecase 11 :
assemble these summary statistics into a dataframe that is easier to read than the one-line-at-a-time output above.

pd.DataFrame([[store_sales.p1_sales.median(),store_sales.p2_sales.median()],
[iqr(store_sales.p1_sales),iqr(store_sales.p2_sales)]],
index=['Median sales', 'IQR'],
columns=['p1_sales', 'p2_sales'])
#We name the columns and rows, and fill in the cells with function values:

Summarizing Dataframes

usecase 12 : preliminary inspection of a dataframe or other object — describe() function

marketing analytics tip :

how to figure our errors in data using describe :
1. Check the min and max for outliers or miskeyed data
2. the mean and 50% (median) are reasonable and similar to one another[i.e. if thats expected]

usecase 13 : finding the mean on all numeric columns (columns 3–8): — using .iloc()

usecase 14 : use the axis argument to run the function across rows rather than columns (the default is columns: axis=0):

usecase 15 : applying our own defined functions on dataframe columns [ eg iqr() function defined earlier] — using apply() method.

apply — runs any function that you specify on each of the rows (when axis=1) and/or columns (the default, or when axis=0) of a dataframe.

apply() function

usecase 16 : anonymous functions or lambda functions used on dataframes

This analysis shows that the mean of p1_sales and the mean of p2_sales are larger than the median by about three sales per week, which suggests there is a right-hand tail to the distribution. That is, there are some weeks with very high sales that pull the mean up.

usecase 17 : apply() on series objects

where the function is applied element-wise rather than to the series as a whole: [ we are categorizing sales value as ‘high’ or ‘low’ using condition within the lambda function.

usecase 18 : using apply with iqr function

pd.DataFrame([store_sales[['p1_sales', 'p2_sales']].median(),
store_sales[['p1_sales', 'p2_sales']].apply(iqr)],
index=['Median sales', 'IQR'])

This concludes our most commonly used functions and usecases with examples for inspecting and summarizing data before moving ahead with analysis.

We will next see visualization of the given dataset using matplotlib and other libraries in coming part of the series in Part 3.

--

--