Understanding Dynamic Pricing Using Basic Statistics

Ogulcan Ertunc

Published in Analytics Vidhya

Apr 26, 2021 · 7 min read

Don’t explain to me, show the code

You can access the GitHub repo here.

First of all, what is dynamic pricing?

By definition, dynamic pricing is a pricing strategy in which prices change in response to real-time supply and demand. Looking more closely, it means pricing a product or service by taking different market parameters into account. Airline ticket prices are a clear everyday example: seasonality, the departure time of the flight, the percentage of reservations that result in a check-in, and the number of seats left on the plane are among these parameters.

So we look at the conditions to determine the price. At peak times, you want to keep prices high to make more money, because people will want to buy the product or service regardless. During off-peak times, you want to lower the price to stimulate demand. This is not a difficult concept. The challenge is knowing how far to raise or lower prices before you see diminishing returns.

If we treat dynamic pricing as a system, the goal is to find the ideal price for each situation, one that increases profit and maximizes revenue, and to use algorithms for the predictions that are difficult to make by hand.

Traditional Pricing

How is traditional pricing, the alternative to dynamic pricing, done?

Traditional pricing typically relies on past performance, compiled from reviews over a period of time (such as 3–6 months). This is a slow process that relies heavily on indicators that may or may not hold in the future. If market conditions cause pricing errors, you are likely to lose revenue and will not be able to turn around quickly.

When a change of course is needed, many businesses rely on gut feelings or other emotional responses to handle it. While this may work for those with a lot of experience in the industry, it is not an effective long-term strategy.

So how can we use the confidence interval, with some statistical knowledge?

In dynamic pricing, we can think of the confidence interval as a range between two numbers that is likely to contain the true value of the population parameter, here the mean price.

There are a few simple steps we should follow when calculating confidence intervals:

Finding n, mean and standard deviation
Deciding the confidence level (such as 95% or 99%)
Looking up the corresponding value in the Z table
Calculating the confidence interval using the previous steps
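
The four steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the article's own code: the function name `z_confidence_interval` and the `payments` values are made up for the example, and `NormalDist().inv_cdf` stands in for the Z table lookup.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds of a z-based confidence interval for the mean."""
    # Step 1: n, mean and standard deviation
    n = len(sample)
    sample_mean = mean(sample)
    sample_std = stdev(sample)  # sample standard deviation (n - 1 denominator)
    # Steps 2-3: z value for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Step 4: mean plus/minus the margin of error
    margin = z * sample_std / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical amounts paid for one item (made-up numbers, not from the dataset)
payments = [30, 45, 38, 35, 40, 33, 42, 37, 36, 39]
low, high = z_confidence_interval(payments, confidence=0.95)
print(f"95% CI for the mean price: ({low:.2f}, {high:.2f})")
```

For small samples a t-based interval is usually the safer choice, since it is somewhat wider than the z-based one.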

Now let’s consider a simple business problem with this knowledge:

Our business problem:

A game company gave gift coins to its users for purchasing items in a game.
Using these virtual coins, users buy various tools for their characters.
The game company did not set a price for an item and let users buy it at whatever price they wanted.
For example, users buy the item named shield by paying the amount they see fit.
In other words, one user can pay 30 units of the virtual money given to them, and another user can pay 45 units.
Therefore, users can buy this item for an amount they can afford to pay.

What needs to be solved:

Does the price of the item differ by category? Express it statistically.
Depending on the answer to the first question, what should the item cost? Explain why.
The price should be “movable” (flexible): provide decision support for the pricing strategy.
Simulate item purchases and income for possible price changes.

1. Importing Required Libraries, Functions, and Preparing Data

import pandas as pd
import itertools
import statsmodels.stats.api as sms
from scipy.stats import shapiro
import scipy.stats as stats
pd.set_option('display.max_columns', None)

# useful functions (their bodies are not shown in this article) #
# def replace_with_thresholds(dataframe, col_name): ...
# def outlier_thresholds(dataframe, col_name): ...
# def check_df(dataframe): ...

# reading the data
df = pd.read_csv("Medium_article/pricing.csv", sep=";")
df.head()
df.isna().sum()

Our variables are category_id and price, and there are no missing values in the data.

When we inspect the dataset with the check_df function, we can see a huge difference between the 95% and 99% levels of price.

This might be part of the natural structure of the data, but in case it reflects outliers, I flagged the extreme values as outliers and removed them from the data.
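
The body of `outlier_thresholds` is not shown in this article, so the following is only a guess at what such a helper could look like. It assumes the common interquartile range rule with a multiplier of 1.5; the actual helper may use different quantiles. The sketch relies only on `dataframe[col_name]` being iterable, so it runs on a plain dict of lists as well as on a pandas DataFrame. The `toy` data is made up.

```python
from statistics import quantiles


def outlier_thresholds(dataframe, col_name, k=1.5):
    """Return (low_limit, up_limit) using the interquartile range rule.

    Assumption: quartiles at 25% and 75% and a multiplier k of 1.5.
    """
    values = list(dataframe[col_name])
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices with one extreme value
toy = {"price": [30, 32, 35, 36, 38, 40, 41, 45, 400]}
low, up = outlier_thresholds(toy, "price")
kept = [p for p in toy["price"] if low <= p <= up]
print(f"Low Limit: {low}  Up Limit: {up}")
print(f"Kept {len(kept)} of {len(toy['price'])} values")
```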

# Threshold values are determined for the price variable.
low, up = outlier_thresholds(df,"price")
print(f'Low Limit: {low}  Up Limit: {up}')

# Outlier values need to be removed.
def has_outliers(dataframe, numeric_columns):
    for col in numeric_columns:
        low_limit, up_limit = outlier_thresholds(dataframe, col)
        # rows whose value falls outside the [low_limit, up_limit] range
        outlier_mask = (dataframe[col] > up_limit) | (dataframe[col] < low_limit)
        if outlier_mask.any():
            print(col, ":", outlier_mask.sum(), "outliers")

has_outliers(df, ["price"])

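def remove_outliers(dataframe, numeric_columns):
    # filter cumulatively so that every listed column is applied, not only the last one
    dataframe_without_outliers = dataframe
    for variable in numeric_columns:
        low_limit, up_limit = outlier_thresholds(dataframe, variable)
        column = dataframe_without_outliers[variable]
        dataframe_without_outliers = dataframe_without_outliers[~((column < low_limit) | (column > up_limit))]
    return dataframe_without_outliers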

df = remove_outliers(df, ["price"])

check_df(df)
df.groupby("category_id").agg({"price": "mean"}).reset_index()

The category means do not look very different, so we can apply A/B testing to check whether the differences are statistically significant.

2. A/B Test

# 1.Checking Assumptions
# 1.1 Normal Distribution
# 1.2 Homogeneity of Variance

# 1.1 Normal Distribution
# H0: There is no statistically significant difference between sample distribution and theoretical normal distribution
# H1: There is statistically significant difference between sample distribution and theoretical normal distribution

print("Shapiro-Wilk Test Result \n")
for x in df["category_id"].unique():
    test_statistic, pvalue = shapiro(df.loc[df["category_id"] == x, "price"])
    # reject normality at the 5% significance level
    decision = "H0 is rejected" if pvalue < 0.05 else "H0 is not rejected"
    print(f'{x}:')
    print('Test statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue), decision)

The normality assumption is not satisfied, so we apply a non-parametric method.

# 2.Implementing Hypothesis
# H0: the two categories have the same price distribution
# every pair of categories is compared once
groups = list(itertools.combinations(df["category_id"].unique(), 2))

result = []
print("Mann-Whitney U Test Result ")
for x in groups:
    # mannwhitneyu is called from scipy.stats directly;
    # two-sided: we test for a difference in either direction
    test_statistic, pvalue = stats.mannwhitneyu(df.loc[df["category_id"] == x[0], "price"],
                                                df.loc[df["category_id"] == x[1], "price"],
                                                alternative="two-sided")
    decision = "H0 is rejected" if pvalue < 0.05 else "H0 is not rejected"
    result.append((x[0], x[1], decision))
    print('\n', "{0} - {1} ".format(x[0], x[1]))
    print('Test statistic= %.4f, p-value= %.4f' % (test_statistic, pvalue), decision)

result_df = pd.DataFrame(result, columns=["Category 1", "Category 2", "H0"])
result_df
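
To make the pairwise test less of a black box, here is a rough standard-library sketch of what a Mann-Whitney U test computes: it ranks the pooled values and checks whether one group's ranks are unusually low or high. It is a simplification that uses the normal approximation without tie or continuity correction, so its p-values will not match SciPy's exactly. The helper names and the toy groups are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U statistic of x, approximate two-sided p-value)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p_value


# Made-up prices for two hypothetical categories
group_a = [30, 32, 35, 36, 38, 40, 41, 45]
group_b = [50, 52, 55, 57, 60, 61, 63, 65]
u, p = mann_whitney_u(group_a, group_b)
print(f"U = {u}, p-value = {p:.4f}")
```

Because the test works on ranks, it compares the overall price distributions of two categories rather than their means directly.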

3. Problems

# Does the price of the item differ by category?
result_df[result_df["H0"] == "H0 is not rejected"]

For 5 pairs of categories, there is no statistically significant difference in price.

result_df[result_df["H0"] == "H0 is rejected"]

For 10 pairs of categories, there is a statistically significant difference in price.

# What should the item cost?
# The average of the 4 statistically similar categories will be the price we set.
signif_cat = [361254, 874521, 675201, 201436]
# mean price of each selected category (avoids shadowing the built-in sum)
category_means = [df.loc[df["category_id"] == i, "price"].mean() for i in signif_cat]
PRICE = sum(category_means) / len(category_means)

print("PRICE : %.4f" % PRICE)

# Flexible Price Range
# We collect the prices of the 4 categories that were selected for pricing
prices = df.loc[df["category_id"].isin(signif_cat), "price"].tolist()

# t-based confidence interval for the mean price
print(f'Flexible Price Range: {sms.DescrStatsW(prices).tconfint_mean()}')

Simulation for Item Purchases

We will calculate the income that can be obtained at the minimum and maximum values of the confidence interval and at the price we set.

1. For the minimum price in the confidence interval

# Simulation 
# 1- Price:36.7109597897918
# for minimum price in confidence interval
freq = len(df[df["price"] >= 36.7109597897918])
# number of sales equal to or greater than this price
income = freq * 36.7109597897918
print(f'Income: {income}')

2. For the decided price

# Price:37.0924
freq = len(df[df["price"] >= 37.0924])
# number of sales equal to or greater than this price
income = freq * 37.0924
print(f'Income: {income}')

3. For maximum price in the confidence interval

# Price:38.17576299427283
freq = len(df[df["price"] >= 38.17576299427283])
# number of sales equal to or greater than this price
income = freq * 38.17576299427283
print(f'Income: {income}')
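
The three simulations above follow the same pattern, so they can be folded into one small helper. This is a sketch under the same assumption the simulations make: every user who previously paid at least the candidate price would still buy at that price. The function name `simulate_income` and the `paid` values are made up; the candidate prices are rounded versions of the three prices used above.

```python
def simulate_income(paid_prices, candidate_price):
    """Return (number of buyers, income) at candidate_price.

    Assumption: a user buys only if they previously paid at least candidate_price.
    """
    buyers = sum(1 for p in paid_prices if p >= candidate_price)
    return buyers, buyers * candidate_price


# Made-up amounts paid by users (not from the dataset)
paid = [30, 32, 35, 36, 37, 38, 40, 41, 45, 50]
# Rounded lower bound, decided price and upper bound from the simulations above
candidates = [36.71, 37.09, 38.18]

for price in candidates:
    buyers, income = simulate_income(paid, price)
    print(f"Price: {price}  Buyers: {buyers}  Income: {income:.2f}")
```

With the real data, passing `df["price"]` as `paid_prices` should give the same counts as the `len(df[df["price"] >= ...])` expressions above, since iterating over a pandas Series yields its values.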