The Multiple Comparisons Problem

Using Northwind DataBase to understand the problem and how to avoid it

Published in

Analytics Vidhya

5 min readDec 14, 2019

Imagine the following scenario: you are a data scientist who is asked to assist a company gain better insight into different aspects of their business. This can include employee performance, product sales, and customer demographics. Your job is to perform experiments, learn as much as you can from the results and make business recommendations.

Usually, this is done by defining several useful metrics or segmentation. Both strategies help you learn more about the business. In the case of segmentation, you may run statistical analyses to discover significant findings. This means digging deeper to gain better insights. However, this comes at a high cost because the more comparisons you make, the more likely it is to draw an incorrect conclusion. This is a common mistake known as the multiple comparisons problem. In this post, I will highlight how this problem can affect your statistical analyses and a technique to correct it.

Northwind DataBase

For this project, I worked with the Northwind Company database to derive some valuable business insights. The Northwind database is a sample database created by Microsoft for a fictitious company that specializes in food and beverage distribution. The database contains a range of information on the company’s activities including its customers, their products, their orders, and customer demographics. I was interested in gaining an understanding of how discounts affected orders. So, one of the questions I decided to ask was:

How do discounts affect order quantities?

This question highlights an example of multiple comparisons because you need to compare the different discount levels and perform the same analysis several times.

Stating the Hypothesis

To answer this question, I first need to create two hypotheses: the null hypothesis and the alternative hypothesis.

Null hypothesis: the discount level does not affect the number of orders made

Alternative hypothesis: the discount level does affect the number of orders made (increases order quantities or decreases them)

The alpha level (the probability of rejecting the null hypothesis when it is true) was set to 0.05.

Obtaining Data

Now that we have set the hypotheses and the alpha value, we can extract the relevant data from the schema below:

https://raw.githubusercontent.com/learn-co-students/dsc-mod-3-project-online-ds-pt-041519/master/Northwind_ERD_updated.png

To answer the question, I needed to extract the product id, discounts, unit prices, and quantities from the order detail table from the schema above

#import relevant libraries
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
from statsmodels.graphics.factorplots import interaction_plot
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison
from statsmodels.stats.power import TTestIndPower, TTestPower
import warnings
warnings.filterwarnings(‘ignore’)
import scipy as sp#connect to database
conn = sqlite3.connect('Northwind_small.sqlite')
c = conn.cursor()#create list of tables
c.execute("""SELECT name from sqlite_master WHERE type = 'table';""")
tables = c.fetchall()
tables = [i[0] for i in tables]c.execute("""SELECT ProductId, UnitPrice, Quantity, Discount FROM OrderDetail""")
discounts = pd.DataFrame(c.fetchall())
discounts.columns = [column[0] for column in c.description]
discounts.head()

Next, I needed to check for the presence of a discount:

query = 'DiscBool != 0' #removing the non-discounted orders
discounted = discounts.query(query)
discounted.head(3)

Because the question pertains to the discount level, I check the number of discount levels. On close examination, I found most of the discounts were in intervals of 0.05 from 0 to 25. These order quantities were kept and the other smaller discount levels were dropped.

Exploring the Data

Now, that the relevant data has been extracted and grouped correctly, I can start exploring it. The code block below shows graphs that plots the number of orders and average order size between discounted orders and orders that were not discounted:

#Viewing difference between discount levels against number of orders and average order size visually
#fig, axes = plt.subplot(1, 2, figsize = (12,5))
fig, axes = plt.subplots(2, 1, figsize=(12,5))
sns.set_style('whitegrid')
#Grouping dataframe
discounted.groupby(['Discount'])['Quantity'].count().plot(kind = 'barh', ax = axes[0], color=['#E78AC3', '#A6D854', '#FFD92F', '#E5C494', '#B3B3B3', '#66C2A5'])
discounted.groupby(['Discount'])['Quantity'].mean().plot(kind = 'barh', ax = axes[1], color=['#E78AC3', '#A6D854', '#FFD92F', '#E5C494', '#B3B3B3', '#66C2A5'])
#Plot 1 labels
axes[0].set_title('Number of Orders per Discount Level', fontweight = 'bold', fontsize = 13)
axes[0].set_xlabel('Order Size', fontweight = 'semibold')
axes[0].set_ylabel('')
axes[0].set_yticklabels(['0.05', '0.10', '0.15', '0.20', '0.25'])#Plot 2 labels
axes[1].set_title('Average Order Size by Discount Level', fontweight = 'bold', fontsize = 13)
axes[1].set_xlabel('Average Order Size', fontweight = 'semibold')
axes[1].set_ylabel('')
axes[1].set_yticklabels(['0.05', '0.10', '0.15', '0.20', '0.25'])
fig.subplots_adjust(hspace = .75);
fig.savefig('discountlevels.png', bbox_inches = 'tight')

Figure 2: Graphs for different discount levels

The graphs show more items that aren’t discounted have a higher number of orders but discounted items have a larger average order size.

Hypothesis Testing

To answer the question of whether discounts affect order quantities, ANOVA was used. ANOVA or analysis of variance generates a statistical analysis for multiple groups.

#fitting using statsmodels ols
lm = ols('Quantity ~ C(Discount)', discounted).fit()
lm.summary()

Figure 3: ANOVA table for discount levels

The table above shows the p-values for the various discount levels. The next step would be performing a Tukey test. Tukey test is a single step statistic test used to find means that are significantly different from each other. For multiple comparisons, we will need to create data frames for the varying discount levels.

#ANOVA analysis to test multiple comparisons test
sm.stats.anova_lm(lm, typ = 2)

The resulting table gives us all the combination pairs of discount levels to compare, with the mean difference and whether this mean difference rejected the null hypothesis or failed to reject the null hypothesis. In the case of the Northwind project, I found that for the most part, the discount level that the product was priced at didn’t affect order quantities. There is a possibility that these discounts were negatively affecting revenues.

Conclusion

The multiple comparison problem is an important issue to keep in mind when performing statistical analysis or examining studies. Because if we look hard enough, we may see correlations where they do not exist. Humans are very adept at finding patterns and can make the mistake of finding meaning in what is random noise. From a business standpoint, being aware of this means analyzing claims more closely and making more rational decisions.

The Multiple Comparisons Problem

Using Northwind DataBase to understand the problem and how to avoid it

How do discounts affect order quantities?

Written by Hazel Donaldson