Understanding Marketing Analytics in Python. [Part 2] Summarizing Data — Variables and Dataframes. With examples and code.
This is Part 2 of the series on Marketing Analytics. Have a look at the introduction to the entire series, with details of each part, here.
Observations may comprise either discrete data that occurs at specific levels or continuous data with many possible values.
Continuing from Part 1 of this series, we will now explore the data that we created in the previous part.
- The groupby() function — usecases 1, 2, 3
- Discrete Variables — usecases 4, 5, 6, 7
- Continuous Variables — usecases 8, 9, 10, 11
- Summarizing Dataframes — using describe() and apply() — usecases 12 to 18
Let's start by looking at a sample of the dataframe we created, with its rows and columns:
The groupby() function
groupby() is a method on pandas dataframes. The by argument specifies the column by which to group, for example store_num:
That call returns a DataFrameGroupBy object, which can be saved to a variable or acted upon directly. It contains every other column of the dataframe; a single grouped column can be accessed via dot notation, which yields a SeriesGroupBy object.
Pandas analytical methods can then be applied to that group, such as mean(), sum(), etc.
Usecase 1: calculating mean sales by store:
We group by store number, which we assigned when creating the store data, and find the mean of sales of product p1 for each store separately.
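A minimal sketch of this grouping, using a toy dataframe in place of the store_sales data built in Part 1 (the column names store_num and p1_sales, and all values, are assumed for illustration):

```python
import pandas as pd

# Toy stand-in for the store_sales dataframe created in Part 1
store_sales = pd.DataFrame({
    'store_num': [101, 101, 102, 102],
    'p1_sales': [127, 133, 121, 135],
})

# Mean sales of product p1 for each store separately
mean_by_store = store_sales.groupby('store_num').p1_sales.mean()
print(mean_by_store)
```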
Usecase 2: calculating mean by more than one factor [multilevel groupings]:
To group by more than one factor, pass a list of factors.
Using unstack() for multilevel groupings: it gives a better-formatted result, which can be saved into another object.
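A sketch of both steps on a toy frame standing in for store_sales (the year column and all values are assumed for illustration):

```python
import pandas as pd

# Toy stand-in for store_sales
store_sales = pd.DataFrame({
    'store_num': [101, 101, 102, 102],
    'year': [1, 2, 1, 2],
    'p1_sales': [127, 133, 121, 135],
})

# Pass a list of columns to group by more than one factor
by_store_year = store_sales.groupby(['store_num', 'year']).p1_sales.mean()

# unstack() pivots the inner grouping level into columns for a tidier table
wide = by_store_year.unstack()
print(wide)
```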
Usecase 3: computing the total (sum()) sales of P1 by country:
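A sketch with toy values (the country column is from Part 1; the data here is invented):

```python
import pandas as pd

store_sales = pd.DataFrame({
    'country': ['US', 'US', 'DE', 'DE'],
    'p1_sales': [127, 133, 121, 135],
})

# Total p1 sales within each country
total_by_country = store_sales.groupby('country').p1_sales.sum()
```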
Discrete Variables
A basic way to describe discrete data is with frequency counts.
The value_counts() method will count the observed prevalence of each value that occurs in a variable.
Usecase 4: count how many times product 1 was observed to be on sale at each price point:
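A minimal sketch, assuming a p1_price column holding the weekly price of product 1 (toy values):

```python
import pandas as pd

store_sales = pd.DataFrame({
    'p1_price': [2.29, 2.49, 2.29, 2.79, 2.29, 2.49],
})

# Frequency count of each observed price point
price_counts = store_sales.p1_price.value_counts()
```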
We can even visualize the above using a bar plot [more on visualization in later parts of this series].
Usecase 5: how often each product was promoted at each price point
The pandas.crosstab() function produces counts of observations at each level for two variables, i.e. a two-way cross tab:
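A sketch of the two-way cross tab, assuming a p1_promo flag column (1 when product 1 was on promotion that week; toy values):

```python
import pandas as pd

store_sales = pd.DataFrame({
    'p1_price': [2.29, 2.29, 2.29, 2.49, 2.49, 2.49],
    'p1_promo': [0, 0, 1, 0, 1, 1],
})

# Counts of observations at each promotion status x price point combination
promo_by_price = pd.crosstab(store_sales.p1_promo, store_sales.p1_price)
```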
Usecase 6: mean price by product, promotion status, and country:
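A sketch for product 1 only, grouping by two of the factors (column names and values assumed for illustration; the full version would repeat this per product and include all factors):

```python
import pandas as pd

store_sales = pd.DataFrame({
    'country': ['US', 'US', 'DE', 'DE'],
    'p1_promo': [0, 1, 0, 1],
    'p1_price': [2.49, 2.29, 2.79, 2.49],
})

# Mean price of product 1 by promotion status within each country
mean_price = store_sales.groupby(['country', 'p1_promo']).p1_price.mean()
```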
Usecase 7: finding the percentage of times a product was promoted at each price point
We can compute the exact fraction of times product 1 is on promotion at each price point if we assign the cross tab to a variable and then divide it by the total counts per price in p1_table_0. [The expected answer is 10%, because that is the promotion rate we introduced when we created this data.]
The div() method applies element-wise division between the series of counts in p1_table_0 and each row of p1_table_1.
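A sketch of that division on toy data (I am assuming p1_table_0 holds the total counts per price and p1_table_1 the promotion-by-price cross tab, matching the description above; in this toy data the promotion rate is 25% and 50% rather than the 10% of the real data):

```python
import pandas as pd

store_sales = pd.DataFrame({
    'p1_price': [2.29, 2.29, 2.29, 2.29, 2.49, 2.49, 2.49, 2.49],
    'p1_promo': [0, 0, 0, 1, 0, 0, 1, 1],
})

p1_table_0 = store_sales.p1_price.value_counts()                      # total weeks at each price
p1_table_1 = pd.crosstab(store_sales.p1_promo, store_sales.p1_price)  # promo status x price
frac_promoted = p1_table_1.div(p1_table_0)  # fraction of weeks on promotion per price point
```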
Continuous Variables
Usecase 8: finding the common distribution values for continuous variables
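A sketch of the usual summary statistics on a toy sales series (values invented):

```python
import pandas as pd

store_sales = pd.DataFrame({'p1_sales': [113, 121, 127, 133, 160]})

sales = store_sales.p1_sales
print(sales.min(), sales.max())          # extremes
print(sales.mean(), sales.median())      # central tendency
print(sales.var(), sales.std())          # spread
print(sales.quantile(q=[0.25, 0.5, 0.75]))  # quartiles
```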
Usecase 9: finding other quantiles
Change the q= argument in quantile() to find other quantiles:
For example, to get every 10th percentile we pass a sequence of quantiles created with numpy.arange(start, stop, step), yielding the vector 0, 0.1, 0.2 … 1.0.
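A sketch using a toy series of 101 evenly spaced values, so the deciles are easy to verify by eye:

```python
import numpy as np
import pandas as pd

store_sales = pd.DataFrame({'p1_sales': range(100, 201)})

# arange's stop is exclusive, so 1.1 is used to include the 1.0 quantile
deciles = store_sales.p1_sales.quantile(q=np.arange(0, 1.1, 0.1))
print(deciles)
```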
Usecase 10: IQR (interquartile range)
For skewed and asymmetric distributions that are common in marketing, such as unit sales or household income, the arithmetic mean() and standard deviation std() may be misleading; in those cases, the median() and interquartile range (IQR, the range of the middle 50% of data) are often more useful to summarize a distribution. Pandas does not have a built-in IQR function, but we can create one and apply it to our data:
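One way to write such a helper (a sketch; the toy data is invented):

```python
import pandas as pd

def iqr(x):
    """Interquartile range: the spread of the middle 50% of the data."""
    return x.quantile(0.75) - x.quantile(0.25)

store_sales = pd.DataFrame({'p1_sales': [100, 120, 130, 140, 200]})
print(iqr(store_sales.p1_sales))
```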
Usecase 11: assemble these summary statistics into a dataframe that is easier to read than the one-line-at-a-time output above.
# We name the rows and columns, and fill the cells with the summary values:
pd.DataFrame([[store_sales.p1_sales.median(), store_sales.p2_sales.median()],
              [iqr(store_sales.p1_sales), iqr(store_sales.p2_sales)]],
             index=['Median sales', 'IQR'],
             columns=['p1_sales', 'p2_sales'])
Summarizing Dataframes
Usecase 12: preliminary inspection of a dataframe or other object — the describe() function
Marketing analytics tip: how to spot errors in data using describe():
1. Check the min and max for outliers or miskeyed data.
2. Check that the mean and the 50% (median) are reasonable and similar to one another [i.e. if that's expected].
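A sketch of both checks on a toy frame, where one value (1330) has been deliberately miskeyed:

```python
import pandas as pd

store_sales = pd.DataFrame({
    'p1_sales': [121, 127, 133, 1330],  # 1330: deliberately miskeyed
    'p1_price': [2.29, 2.49, 2.29, 2.49],
})

summary = store_sales.describe()
print(summary)
# The max of p1_sales (1330) is far from the median (130), flagging a likely data error
```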
Usecase 13: finding the mean of all numeric columns (columns 3–8) — using .iloc
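A sketch on a small toy frame with one non-numeric column followed by numeric ones; on the full dataframe from Part 1 the positional slice would be .iloc[:, 3:9]:

```python
import pandas as pd

store_sales = pd.DataFrame({
    'country': ['US', 'DE', 'US'],
    'p1_sales': [127, 121, 133],
    'p2_sales': [106, 113, 99],
})

# Positional slice of the numeric columns, then the mean of each column
col_means = store_sales.iloc[:, 1:3].mean()
```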
Usecase 14: use the axis argument to run the function across rows rather than columns (the default is columns: axis=0):
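The same mean computed row-wise, on toy numbers:

```python
import pandas as pd

store_sales = pd.DataFrame({
    'p1_sales': [127, 121, 133],
    'p2_sales': [106, 113, 99],
})

# axis=1 computes the mean across each row instead of down each column
row_means = store_sales.mean(axis=1)
```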
Usecase 15: applying our own user-defined functions to dataframe columns [e.g. the iqr() function defined earlier] — using the apply() method.
apply() runs any function that you specify on each of the rows (when axis=1) or columns (the default, or when axis=0) of a dataframe.
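A sketch applying the iqr() helper column-wise (toy values):

```python
import pandas as pd

def iqr(x):
    # Interquartile range, as defined earlier in this part
    return x.quantile(0.75) - x.quantile(0.25)

store_sales = pd.DataFrame({
    'p1_sales': [100, 120, 130, 140, 200],
    'p2_sales': [90, 100, 110, 120, 130],
})

# apply() runs iqr once per column (the default, axis=0)
col_iqr = store_sales[['p1_sales', 'p2_sales']].apply(iqr)
```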
Usecase 16: anonymous (lambda) functions used on dataframes
This analysis shows that the mean of p1_sales and the mean of p2_sales are larger than the median by about three sales per week, which suggests there is a right-hand tail to the distribution. That is, there are some weeks with very high sales that pull the mean up.
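The one-line check behind that conclusion can be sketched with a lambda (toy numbers here; the roughly-three-sales gap reported above comes from the original data):

```python
import pandas as pd

store_sales = pd.DataFrame({
    'p1_sales': [121, 127, 133, 160],  # one very high week pulls the mean up
    'p2_sales': [99, 106, 113, 150],
})

# Mean minus median per column; a positive gap suggests a right-hand tail
gap = store_sales[['p1_sales', 'p2_sales']].apply(lambda x: x.mean() - x.median())
```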
Usecase 17: apply() on series objects
Here the function is applied element-wise rather than to the series as a whole. [We categorize each sales value as 'high' or 'low' using a condition inside the lambda function.]
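A sketch of that element-wise labeling (the 130 threshold is an assumed illustrative cut-off, not one from the original data):

```python
import pandas as pd

store_sales = pd.DataFrame({'p1_sales': [121, 135, 128, 150]})

# On a Series, apply() calls the lambda once per element
labels = store_sales.p1_sales.apply(lambda x: 'high' if x > 130 else 'low')
```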
Usecase 18: using apply() with the iqr function
pd.DataFrame([store_sales[['p1_sales', 'p2_sales']].median(),
              store_sales[['p1_sales', 'p2_sales']].apply(iqr)],
             index=['Median sales', 'IQR'])
This concludes our tour of the most commonly used functions and usecases, with examples, for inspecting and summarizing data before moving ahead with analysis.
Next, in Part 3 of the series, we will visualize this dataset using matplotlib and other libraries.