Q#89: Candy production increase

Candy Island <- Where candy is made.

The following dataset shows the U.S. candy industry’s ‘industrial production index’ (how the index is constructed is not relevant to the question).

Given the above data, determine if the production in 2015 is significantly higher than in 2016.

TRY IT YOURSELF

ANSWER

“Did we improve this year?” More likely than not, you will hear some variation of this question in your data science journey, since most businesses are focused on the bottom line (hooray capitalism!). There are many ways to tackle this type of question, but all of them involve some data manipulation and statistics. Since we are Data Scientists, we will tackle it with Python and a t-test.

To determine if the candy production in 2015 is significantly higher than in 2016, we formulate the null and alternative hypotheses for the t-test. Because the question asks about one direction only (higher), the alternative is one-sided:

  • Null Hypothesis (H0): The mean candy production index in 2015 is equal to the mean candy production index in 2016.
  • Alternative Hypothesis (Ha): The mean candy production index in 2015 is higher than the mean candy production index in 2016.
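
To make the test concrete, the hypotheses above can be sketched as a one-sided Welch's t-test computed by hand. This is a minimal illustration, not the article's solution: the function name welch_one_sided and the two lists of monthly values are made up for the example and are not the candy dataset.

```python
import numpy as np
from scipy import stats


def welch_one_sided(a, b):
    """Welch's t-test for Ha: mean(a) > mean(b). Returns (t, df, p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean (sample variance divided by n)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    # Difference in means, scaled by its estimated standard error
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Upper-tail probability, because Ha says the first mean is larger
    p = stats.t.sf(t, df)
    return t, df, p


# Made-up monthly index values, for illustration only (not the real data)
year_a = [110.2, 108.5, 104.1, 101.3, 100.8, 102.6,
          103.9, 105.0, 109.7, 118.4, 119.9, 117.3]
year_b = [109.1, 107.9, 103.2, 100.5, 101.4, 101.9,
          104.2, 104.8, 108.3, 117.6, 118.8, 116.9]

t, df, p = welch_one_sided(year_a, year_b)
print(f"t = {t:.3f}, df = {df:.1f}, one-sided p = {p:.3f}")
```

A large positive t with a small p would support Ha; a t near zero, or a negative one, would not.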

To test our hypothesis, we will use the ttest_ind function from the scipy.stats library, which compares the means of two independent samples. Passing equal_var=False runs Welch's t-test, which does not assume that the two years have the same variance.

Here's how we can do it:

import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset from the provided URL
url = "https://raw.githubusercontent.com/erood/interviewqs.com_code_snippets/master/Datasets/candy_production.csv"
data = pd.read_csv(url)

# Convert 'observation_date' to datetime so rows can be filtered by year
data['observation_date'] = pd.to_datetime(data['observation_date'])

# Filter data for years 2015 and 2016
data_2015 = data[data['observation_date'].dt.year == 2015]
data_2016 = data[data['observation_date'].dt.year == 2016]

# Extract candy production values for each year
production_2015 = data_2015['IPG3113N']
production_2016 = data_2016['IPG3113N']

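# Perform a one-sided Welch's t-test (Ha: 2015 mean > 2016 mean).
# The `alternative` argument requires SciPy 1.6 or newer.
t_stat, p_value = ttest_ind(production_2015, production_2016, equal_var=False, alternative='greater')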

# Define significance level
alpha = 0.05

# Print the results
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

if p_value < alpha:
    print("Reject the null hypothesis. Candy production in 2015 is significantly higher than in 2016.")
else:
    print("Fail to reject the null hypothesis. There is no evidence that candy production in 2015 was significantly higher than in 2016.")

In our case the p-value was greater than alpha, so we fail to reject the null hypothesis: the data do not show that candy production in 2015 was significantly higher than in 2016.
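
One detail worth keeping in mind: the question is directional (is 2015 higher than 2016?), while ttest_ind returns a two-sided p-value by default. SciPy 1.6 and newer accept alternative='greater' for this; on older versions the one-sided p-value can be derived from the two-sided one. The sketch below shows that conversion under stated assumptions: the helper name one_sided_p_greater and the sample values are hypothetical and are not the candy data.

```python
from scipy.stats import ttest_ind


def one_sided_p_greater(a, b):
    """p-value for Ha: mean(a) > mean(b), derived from a two-sided Welch test."""
    t_stat, p_two_sided = ttest_ind(a, b, equal_var=False)
    # If t points in the direction of Ha, halve the two-sided p-value;
    # otherwise the one-sided p-value is the complement of that half.
    if t_stat > 0:
        p_one_sided = p_two_sided / 2
    else:
        p_one_sided = 1 - p_two_sided / 2
    return t_stat, p_one_sided


# Made-up monthly index values, for illustration only (not the real data)
higher = [112.0, 109.4, 105.3, 102.1, 101.7, 103.0,
          104.6, 106.2, 110.5, 119.1, 120.4, 118.0]
lower = [104.9, 103.1, 99.8, 97.2, 96.5, 98.0,
         99.4, 100.7, 104.3, 112.6, 113.9, 111.2]

t_stat, p = one_sided_p_greater(higher, lower)
print(f"t-statistic: {t_stat:.3f}, one-sided p-value: {p:.4f}")
```

Swapping the two arguments flips the sign of the t-statistic and turns p into 1 - p, which is a quick way to check that the direction of the test matches the hypothesis.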
