Hypothesis Testing using Northwind Database

Published in

Data By Matt

7 min readDec 2, 2018

For the Module 2 Project, I had to work with the Northwind database from Microsoft in order to test various hypotheses. This database has data on product sales of a fictitious specialty foods company. Before working on the hypotheses, I first had to import the database using SQL and then save each of the tables as a Pandas dataframe.

The starter question was,

“Does discount amount have a statistically significant effect on the quantity of product in an order? If so, at what level(s) of discount?”.

The first part of the question was relatively straightforward. I simply had to test the hypothesis that having any discount at all would have a significant effect on the quantity ordered. To do this, I had to use the ‘order_details’ table. I then created a new column that had a value of 1 if there was any discount applied and a 0 if there was no discount. And next I ran a two-tailed t-test and got a significant p-value as the result. This answered the first part of the question.

For the second part of the question I first looked at the value counts of the ‘discount’ column. I noticed that there were a couple discount values that had very few occurrences but otherwise the discounts fell into 6 buckets: 0, 0.05, 0.10, 0.15, 0.20, and 0.25. Since the discount values with low counts were all close to 0.05, I changed their value to 0.05 so that they would be in that bucket.

With the buckets all set up, I fit a linear model with OLS and then outputted a one-way ANOVA table, using ‘discount’ as a categorical variable and ‘quantity’ as the target variable. The ANOVA gave an insignificant p-value as a result and thus I could not reject the null hypothesis for this part of the question and instead had to accept the alternative hypothesis that the quantity ordered does not vary significantly across the different discount levels.

This result is somewhat surprising as my initial thought was that the higher the discount, the more product would be ordered. This could ultimately could prove to be true, given more data.

I now had to ask a question of my own and wanted to follow-up with the topic of discounts. I noticed that there was a category given to each product. I thus decided to ask the question,

“Are there certain products that are better to discount than others in order to increase the quantity ordered?”.

To answer this question I had to merge the tables ‘order_details’ and ‘products’ and then merge that table with the ‘categories’ table in order to have the variable ‘CategoryId’ for each order. Since I was just concerned with the whether there was a difference across categories when a discount was applied, I had to do what I did with the first question and modify the ‘discount’ column to have values of 1 if there is a discount and 0 if there is not.

With my variables set up, I then decided to run a two-way ANOVA as both of my independent variables were categorical. In the formula of the OLS regression I selected the ‘Discount’ and ‘CategoryName’ as an interaction.

The result of the two-way ANOVA gave an insignificant p-value for the interaction variable. This means I failed to reject the null hypothesis and that there is not significant evidence of the categories of product responding to having a discount in different ways.

Looking at the ‘order_details’ table I noticed there was a ‘UnitPrice’ variable. This led me to ask the question,

“Is there an optimal discount level based on the price of the product?”.

Initially I thought that maybe discounting expensive products more would lead to a higher order quantity compared with lower priced items as more money would be saved by the customer.

To test this hypothesis, I had to put the discount values into buckets again by setting those outlier values to 0.05. Since this time I was using a categorical variable, ‘Discount’, and a continuous variable, ‘UnitPrice’, I decided to use ANCOVA. To do this I ran an OLS regression including the interaction term of ‘Discount’ as a categorical variable and ‘UnitPrice’.

The results of the regression gave insignificant p-values for all of the interaction terms, for each level of ‘Discount’. This surprised me as I would have thought that there was a significant interaction at one of the ‘Discount’ levels at the very least.

I now decided to shift focus a little and look at the employees that facilitated the sales. I wondered if maybe different employees sold more on average per order than other employees and thought this might be interesting to the company. Thus my next question was,

“Do some employees perform better in terms of average order price than others?”.

I noticed that the ‘orders’ table had the variable ‘EmployeeId’ so I merged the ‘order_details’ table with it. Now I needed to calculate the sale price of each order. I did this by multiplying the ‘UnitPrice’ and ‘Discount’ columns and then subtracting that from the product of the ‘UnitPrice’ and ‘Quantity’ columns and putting that into a column ‘Sale’.

I then decided to use a one-way ANOVA as I had one categorical variable, ‘EmployeeId’. The results showed a significant p-value. This told me that there was some significant difference between the average order sale amount between the different employees. However, what it did not tell me, was where that difference existed.

To investigate further where that difference or differences lied, I decided to use Tukey’s Honestly Significant Difference or HSD to compare all the employees against each other. The results of the HSD test showed no significant result for any of the comparisons. This was definitely surprising as I thought that since the ANOVA showed signficant results, that there had to be some employee that performed better than other.

Upon further research I realized that the reason I didn’t get a significant result from the Tukey’s HSD test was because the test accounts for keeping the family-wise error rate at the alpha value of 0.05. I tested increasing the FWER and saw that once it got high enough, around 0.4, some of the comparisons showed significance. Ultimately, the results of the ANOVA and Tukey’s HSD test told me that the employees do perform differently but that we can’t say that any one employee outperforms another in a statistically significant way.

To follow up with the previous question, I wondered if maybe some of the employees were specialists and performed better selling certain types of products. This led me to ask the question,

“Are some employees better at others at selling a certain category of product?”.

For this question I had to perform a merger of 4 different tables, ‘order_details’, ‘orders’, ‘products’, and ‘categories’. I also again had to create a ‘Sale’ column with the sale price of each order after discounts. Since I had two categorical variables, ‘CategoryName’ and ‘EmployeeId’, I decided to run a two-way ANOVA with ‘Sale’ as the target variable.

The results of the ANOVA gave an insignificant p-value for the interaction term which made me fail to reject the null hypothesis. It suggested that there was no significant difference in the average sale price for each employee across all categories. This was interesting as viewing a bar graph of the average sale amounts for each employee and each category of product seemed to show lots of variance. I ran a comparison using a Tukey’s HSD test just to double check my result and it confirmed it as no comparisons were significant.

Finally I decided to ask one last question. Since I had worked a lot with discounts and categories in this project I wondered what type of products were discounted the most. I thought this would be interesting for the company as they might not realize where they lose the most money from giving discounts. Thus, the question I asked was,

“For which categories of product do customers save the most money?”.

To answer this question I had to merge the tables ‘order_details’, ‘products’, and ‘categories’. I also had to create a column called ‘Saved’ which was calculating by multiplying the ‘Discount’ column by the ‘UnitPrice’ column. I then ran a one-way ANOVA with ‘CategoryName’ as a categorical variable and ‘Saved’ as the target variable.

The result of the ANOVA gave a very low, significant p-value which told me that there was definitely some significant difference in the average amount saved per order for the different categories of product. I now had to find where that difference was and again chose the Tukey’s HSD test to find it.

The comparison test results showed that ‘Meat/Poultry’ products were more heavily discounted than all the other categories. This information could be useful then for the company because they know that their discounts save a lot of money for orders of meat/poultry products. The company could use that information to maybe reduce discounts given for that category or to embrace that fact and advertise more heavily that they offer great discounts on meat/poultry products.

This project enforced many of the topics I had studied in Module 2. I had to work a lot with ANOVA’s and hypothesis testing so I got to practice choosing the best statistical tests to choose for many different situations. It also made me do some further research and find methods such as Tukey’s HSD. Although the database used was for a fictitious company, the questions asked and conclusion drawn would be relevant to any real company looking to investigate their sales data.

Hypothesis Testing using Northwind Database

Written by Matt