Lab 1 (Case Study): Time Series Analysis using electricity production data

Dr. Vipul Joshi
Time Series Using Python
4 min read · Apr 22, 2024

Previous Articles to go through before lab:

Part 1: https://medium.com/time-series-using-python/introduction-to-time-series-analysis-part-1-3ed1ce9cac98

Part 1.1: https://medium.com/time-series-using-python/introduction-to-time-series-analysis-part-1-1-stationarity-0e239c9cc039

Part 2: https://medium.com/time-series-using-python/introduction-to-time-series-part-2-forecasting-ba0cdb9107a1

Electricity Production Data

Link: https://www.kaggle.com/datasets/shenba/time-series-datasets

  1. Download the dataset from the link above
  2. Extract the archive on your local system
  3. Upload Electric_Production.csv to the notebook
  4. Once uploaded, copy the file path

Let’s explore the data a little: look at the first few entries, make sure there are no nulls, check its summary statistics, and then plot it.

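import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.graphics.tsaplots import plot_acf

df = pd.read_csv("/content/Electric_Production.csv", delimiter=",", index_col="DATE", parse_dates=["DATE"])
df.head()
df.isnull().sum()  # count nulls per column; 0 means no nulls

#Output:
electricity_production    0
dtype: int64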

df.plot(figsize = (15,8))
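The null check and the summary statistics mentioned above can be sketched on a tiny stand-in table. This is a minimal illustration, assuming a CSV with the same two columns; the three rows are made up and are not the electricity data. It also shows why masking with `df[df.isnull()]` and then summing reports 0.0 whether or not nulls exist, while `isnull().sum()` counts them.

```python
import io

import pandas as pd

# Made-up stand-in for Electric_Production.csv (values are NOT real data).
csv_text = """DATE,electricity_production
2000-01-01,70.0
2000-02-01,
2000-03-01,60.0
"""
demo = pd.read_csv(io.StringIO(csv_text), index_col="DATE", parse_dates=["DATE"])

# Number of missing values per column.
null_counts = demo.isnull().sum()

# Masking keeps only the cells that ARE null, and sum() skips NaN,
# so this is 0.0 even though one value is missing.
masked_sum = demo[demo.isnull()].sum()

# Count, mean, std, min, quartiles and max of the non-null values.
summary = demo.describe()

print(null_counts)
print(masked_sum)
print(summary)
```

On the real file, the same two calls (`df.isnull().sum()` and `df.describe()`) would cover the "no nulls" and "stats" steps.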

The data appears to have a seasonal/cyclical pattern and some trend. The plan from here:

  1. Split the data into training and test sets.
  2. Check the stationarity of the training data; if it is non-stationary, difference it to make it stationary.
  3. Find the lag order at which the AIC is best.
  4. Predict the future.

df_train = df[0 : int(0.8*len(df))]
df_test = df[int(0.8*len(df)) : ]
len(df_train), len(df_test)
#Output:
#(317, 80)
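The split is positional rather than random, because a time series must keep its order: the model trains on the earlier 80% and is tested on the later 20%. A minimal sketch of the same cut on a plain list follows; the helper name `chronological_split` is hypothetical, and the length 397 is simply 317 + 80 from the output above.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split an ordered sequence into (train, test) without shuffling."""
    cut = int(train_fraction * len(rows))
    return rows[:cut], rows[cut:]


# Stand-in for the 397 time-ordered rows of the dataset.
rows = list(range(397))
train, test = chronological_split(rows)

print(len(train), len(test))
# Every training row precedes every test row.
print(train[-1] < test[0])
```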

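# Apply differencing and perform tests
# perform_tests is a helper that is not defined in this lab; it is assumed to be defined beforehand
results = {}
lags = np.arange(1, 15)
for lag in lags:
    lagged_data = df_train.diff(lag).dropna()  # diff() leaves NaNs in the first `lag` rows
    results[f'Diff_{lag}'] = perform_tests(lagged_data)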

# Let's see whether the time series became stationary after differencing

results_df = pd.DataFrame(results).T
results_df

# First difference to remove trend
training_diff = df_train.diff(1).dropna()
training_diff.head()

# Re-test for stationarity
result_diff = adfuller(training_diff)
print('ADF Statistic: %f' % result_diff[0])
print('p-value: %f' % result_diff[1])
print('Critical Values:')
for key, value in result_diff[4].items():
    print('\t%s: %.3f' % (key, value))

########################Output#########################
ADF Statistic: -6.352998
p-value: 0.000000
Critical Values:
1%: -3.452
5%: -2.871
10%: -2.572
########################################################
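To read this output: the ADF test is left-tailed, so the null hypothesis of a unit root (non-stationarity) is rejected when the statistic is below the critical value. A minimal sketch using the numbers printed above; the helper name `rejected_levels` is hypothetical.

```python
def rejected_levels(adf_statistic, critical_values):
    """Return the significance levels at which the unit-root null is rejected."""
    # Left-tailed test: reject when the statistic is more negative than the critical value.
    return [level for level, cv in critical_values.items() if adf_statistic < cv]


# Values copied from the output above.
adf_statistic = -6.352998
critical_values = {"1%": -3.452, "5%": -2.871, "10%": -2.572}

levels = rejected_levels(adf_statistic, critical_values)
print(levels)  # ['1%', '5%', '10%']
```

Since -6.35 is below even the 1% critical value, the first-differenced training data can be treated as stationary.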

training_diff.plot()

# Step 2: Determine Lag Order
# Plot the Autocorrelation Function (ACF)
# Pass the axes to plot_acf so that no extra empty figure is created
fig, ax = plt.subplots()
plot_acf(training_diff, alpha=0.05, ax=ax, title='ACF for Electricity Production')
plt.show()
test_lags = np.arange(1, 100)
test_aic = []
# Fit an AR model for each candidate lag order from 1 to 99 and record its AIC
for lag in test_lags:
    model = AutoReg(training_diff, lags=lag)
    model_fit = model.fit()
    test_aic.append(model_fit.aic)

plt.plot(test_lags, test_aic)
plt.xlabel("Lag Order")
plt.ylabel("AIC Values")
plt.grid()

From the graph above, the AIC drops from about 3000 to about 1200 by lag 20, after which any further drop is negligible.
So, let's use lag = 20 for forecasting.
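Reading the elbow off the plot by eye can also be approximated in code. The sketch below is one possible heuristic, not the method used in this lab: it returns the smallest lag after which no later lag improves the AIC by more than a chosen tolerance. The function name, the `min_gain` parameter, and the AIC numbers are all made up for illustration.

```python
def elbow_lag(lags, aic_values, min_gain=5.0):
    """Hypothetical rule: first lag whose AIC is within min_gain of the best later AIC."""
    for i, lag in enumerate(lags):
        best_from_here = min(aic_values[i:])
        if aic_values[i] - best_from_here <= min_gain:
            return lag
    return lags[-1]


# Made-up AIC curve (NOT the lab's results): steep drop, then a plateau.
lags = [1, 2, 3, 4, 5, 6, 7, 8]
aic_values = [3000.0, 2600.0, 2100.0, 1500.0, 1210.0, 1200.0, 1199.5, 1199.8]

chosen = elbow_lag(lags, aic_values)
print(chosen)  # 6
```

With the lab's own lists, the call would be `elbow_lag(list(test_lags), test_aic)`, with `min_gain` tuned to what counts as a negligible drop.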

lag = 20
model = AutoReg(training_diff, lags=lag)
model_fit = model.fit()
# Forecast one differenced value for each observation in the test period
forecast = model_fit.predict(start=len(training_diff), end=len(training_diff) + len(df_test) - 1)

# Reintegrating the forecast:
# Start with the last known value from the original non-differenced data
last_original_value = df_train.electricity_production.iloc[-1]

# Reintegrate the forecasted differences with a running (cumulative) sum
forecast_values = last_original_value + np.cumsum(np.asarray(forecast))

# Plotting the original data and the forecasted values
plt.figure(figsize=(20, 5))
plt.plot(df_train.index, df_train.electricity_production, label='Training Data')
plt.plot(df_test.index, df_test.electricity_production, label='Testing Data')
plt.plot(df_test.index, forecast_values, label='Forecasted Data at Lag Order = ' + str(lag), color='red', linestyle='--')
plt.title('Original Data and Forecasted Data')
plt.ylabel('electricity_production')
plt.legend()
plt.show()
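The reintegration step works because the model forecasts differences, not levels: each forecasted level is the last known training value plus the running sum of all forecasted differences up to that point. A minimal sketch with made-up numbers follows; the helper name `reintegrate` is hypothetical.

```python
from itertools import accumulate


def reintegrate(last_value, forecast_diffs):
    """Turn forecasted first differences back into levels with a running sum."""
    return [last_value + total for total in accumulate(forecast_diffs)]


# Made-up numbers: last observed level and three forecasted differences.
last_value = 100.0
forecast_diffs = [2.0, -1.0, 3.5]

levels = reintegrate(last_value, forecast_diffs)
print(levels)  # [102.0, 101.0, 104.5]

# Same result as summing each prefix separately, but in a single pass.
prefix_sums = [last_value + sum(forecast_diffs[: i + 1]) for i in range(len(forecast_diffs))]
print(levels == prefix_sums)  # True
```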
# Repeat the forecast with a smaller lag order (lag = 5) for comparison
lag = 5
model = AutoReg(training_diff, lags=lag)
model_fit = model.fit()
forecast = model_fit.predict(start=len(training_diff), end=len(training_diff) + len(df_test) - 1)

# Reintegrating the forecast:
# Start with the last known value from the original non-differenced data
last_original_value = df_train.electricity_production.iloc[-1]

# Reintegrate the forecasted differences with a running (cumulative) sum
forecast_values = last_original_value + np.cumsum(np.asarray(forecast))

# Plotting the original data and the forecasted values
plt.figure(figsize=(20, 5))
plt.plot(df_train.index, df_train.electricity_production, label='Training Data')
plt.plot(df_test.index, df_test.electricity_production, label='Testing Data')
plt.plot(df_test.index, forecast_values, label='Forecasted Data at Lag Order = ' + str(lag), color='red')
plt.title('Original Data and Forecasted Data')
plt.ylabel('electricity_production')
plt.legend()
plt.show()
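The two plots compare the lag orders visually. One way to put a number on that comparison, offered here only as a sketch and not something computed in this lab, is the root mean squared error (RMSE) of each forecast against the test data. The values below are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up test values and two made-up forecasts (NOT the lab's results).
actual = [100.0, 102.0, 101.0, 105.0]
forecast_a = [101.0, 101.0, 102.0, 104.0]  # off by 1 everywhere
forecast_b = [100.0, 100.0, 100.0, 100.0]  # flat forecast

rmse_a = rmse(actual, forecast_a)
rmse_b = rmse(actual, forecast_b)
print(rmse_a, rmse_b)
```

In the lab's setting, `actual` would be `df_test.electricity_production` and the two forecasts would be the reintegrated `forecast_values` for lag 20 and lag 5; the lower RMSE indicates the closer fit on the test period.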

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

In the following post, we will pick up from here and perform further analysis:

https://medium.com/time-series-using-python/introduction-to-time-series-part-2-1-arima-models-forecasting-2c36b464ba5d
