Hypothesis Testing with the Northwind Database

Allison Kelly
Analytics Vidhya
Published in
4 min readSep 2, 2019

Imagine you work for Northwind Traders, a fictitious international specialty food supplier, as a stockist. It is your job to make sure that you have enough stock available to fulfill each order and that products do not sit on your shelves for too long, possibly spoiling or going stale. Running out of products could cause your clients to switch to a competitor, and food closing in on its expiration date will be put on sale, initiating a lower profit margin for Northwind.

Luckily, you have three years of order history at your disposal in your database, so it’s time to get to work.

First, we must define our hypotheses.

𝐻₀ The day of the week has no effect on order quantity.
H₁ The day of the week will cause a difference in order quantity.

We will attempt to reject the null hypothesis that order quantity does not differ day to day. This is a two-tailed T-test that will test whether the sample mean is significantly less OR more than the population mean. With a significance level of .05, the sample mean must be in the top or bottom 2.5% of the probability distribution.

Now, let’s get our data.

(You can download the Northwind Database here)

We’ll be using both SQLite3 to grab our data and Python to manipulate it. Here is the ERD for the Northwind Database:

We’re most interested in looking at the products that were ordered, the date of order, and the quantity of the order.

cur.execute("""Select od.ProductId, od.Quantity, o.OrderDate
FROM OrderDetail AS od
JOIN 'Order' as o
ON o.Id = od.OrderID;""")
order_df = pd.DataFrame(cur.fetchall())
order_df.columns = [i[0] for i in cur.description]
order_df.head()

As you can see, the OrderDate column does not include the day of the week. The code below changes the column from a string to a datetime format and adds a new column containing the day that the product was ordered.

order_df['OrderDate'] = pd.to_datetime(order_df.OrderDate)
order_df['day_of_week'] = order_df['OrderDate'].dt.day_name()

Viola! Now we have our days of the week. But before we get ahead of ourselves, we need to make sure the data follows a normal distribution. (Hint: it doesn’t.) The following visualizations show a before and after the Quantity column was log transformed with the following code:

order_df.Quantity = order_df.Quantity.apply(np.log)

Now that our data is normalized, we need to look at the averages of order quantity per day of week. To do this, we’ll create a pivot table and flatten it into a dataframe.

weekday_quantity_pivot = pd.pivot_table(order_df, 
index="ProductId",
columns='day_of_week',
values="Quantity",
fill_value=0.0,
margins=True)
weekday_quantity = pd.DataFrame(weekday_quantity_pivot.to_records())
weekday_quantity.drop(77, axis=0, inplace=True)
weekday_quantity.head()

The ‘All’ column shows the average order quantity across all orders, no matter which day, and luckily, as we conduct our analysis, the ‘All’ column will end up as our population mean.

Time to test!

We’ll conduct a one-sample T-test using stats.ttest_1samp. Because it is a two-tailed test (we want to know if you should order more or less for those days) the p-value will need to be halved.

pop_means = weekday_quantity.All # Population mean
alpha = .05 # Significance level
p_vals = []
d_o_w = []
significant_or_not = []
difference = []
for day in order_days:
day_order = order_df.loc[order_df["day_of_week"] == day] #sort by day
for i in np.arange(1,78): # Sort by product ID
pop_mean = pop_means[i-1] #Population mean series begins at 0
a = day_order.loc[day_order["ProductId"]==i].Quantity
difference.append(a.mean()-pop_mean)
T, two_tailed_p = stats.ttest_1samp(a=a, popmean=pop_mean)
p_vals.append(round(two_tailed_p,2))
d_o_w.append(day)
if alpha/2 > two_tailed_p:
significant_or_not.append(True)
else:
significant_or_not.append(False)

results_df = pd.DataFrame([p_vals, d_o_w, significant_or_not, difference]).transpose()
results_df.columns = ['PVal', "Day", "RejectNull", "Difference"]

After some reindexing and sorting by day, we have our results. This dataframe demonstrates ONLY the products on days that had a p-value less than our threshold, signifying statistical significance. The difference column is a difference of the sample mean minus the population mean. For example, product 2, the average order size is larger than the sample size, so as a stockist, you can reduce the amount in stock on Mondays to combat a surplus of product that may go bad, but the stock should increase for product 10 on Mondays, since the average order is less than the sample size.

And so…

With this information, you will secure your place in Northwind Traders history as the stockist that reduced waste and improved client relationships, increasing profit. But the work is not done yet. What about an increase in sales around the holidays? What about seasonally? What about the region in which the products are ordered? A data scientists work is never done.

--

--

Allison Kelly
Analytics Vidhya

Growth Strategy @ Botmock :: Combining the logic of data science to the fluidity of marketing