Time Series Analysis — Energy consumption in Barcelona

Published in

LatinXinAI

7 min readDec 11, 2023

Nowadays studying Data Science and developing personal projects is easier than ever. There are many websites with free and interesting datasets out there. This is just one of many examples of that. On the Barcelona government website, I found this dataset that gives us a very good starting point for a Time Series analysis.

BTW, if you want to see a version with more code and interactable plots, you can go to my website.

Data cleaning and preparation

The pandas library has many functionalities to work with time series. But to really make the most of these capabilities, it's essential to have a date-time index. That's what we're doing in the code cell below. I won't go into much detail here. We're just renaming columns and preparing the date to be a useful DateTimeIndex.

import pandas as pd
import numpy as np
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
df = pd.read_csv('data_raw.csv', thousands='.')
# Drop empty column
df.drop(columns=['Unnamed: 1'], inplace=True)
# Format year and months
df.Periodo = df.Periodo.apply(lambda x: x.split()[-1] if 'Año' in x else x)
df = df.iloc[4:,].reset_index(drop=True)
months = ['Enero', 'Febrero', 'Marzo', 'Abril', 'Mayo', 'Junio', 'Julio', 'Agosto', 'Septiembre', 'Octubre', 'Noviembre', 'Diciembre']
d = dict(zip(months, np.arange(1,13)))
df['Periodo'] = df.Periodo.str.strip().replace(d)
def format_date(row):
    if isinstance(row['Periodo'], str):
        return row['Periodo']
    year_index = row.name - row['Periodo'] 
    year = df.loc[year_index, 'Periodo']
    return '1-'+str(row['Periodo']) + '-' + year
df['Periodo'] = df.apply(format_date, axis=1)
df = df[df['Periodo'].str.contains('-')]
# Change column names
renamed = dict(zip(df.columns, ['period', 'total', 'domestic', 'low_voltage', 'high_voltage']))
df.rename(columns=renamed, inplace=True)
# Set datetime index
df['period'] = pd.to_datetime(df['period'], dayfirst=True)
df.set_index('period', inplace=True)
# Keep only the years where we have months
df = df['2002-01-01':]
df.head()

Start Exploratory Data Analysis

Let’s start with plotting all the data to see how it looks.

The peak in February 2004 is explained in the original source of the data:

Low voltage gives error due to problems with the S.C.E. In march the excess gets compensated.

To balance it, let’s set the average of both months to these entries. The good thing about using a date as an index is that we can select the entries based on the date. Here we’re selecting the values for february and march.

feb = df.loc['2004-02-01',]
mar = df.loc['2004-03-01',]
df.loc['2004-02-01':'2004-03-01','low_voltage'] = (feb['low_voltage'] + mar['low_voltage']) / 2
df.loc['2004-02-01':'2004-03-01','total'] = (feb['total'] + mar['total']) / 2
df.loc['2004-02-01':'2004-03-01']

Long-term trends — rolling average

Another advantage is the ability to resample the dataset very quickly. Here, for example, we can easily get the sum of the consumption along each year. This reveals that there’s been a reduction in domestic energy consumption in recent years, along with a slight increase in non-domestic consumption.

data = df.loc[:,['domestic', 'low_voltage', 'high_voltage']].resample('A').sum()

Adding all the values in the year gives a good estimate of the trend. But a better estimate is the rolling averages. Rolling statistics are often used to smooth out short-term fluctuations and highlight long-term trends or patterns in the data.

data = df.loc[:,['domestic', 'low_voltage', 'high_voltage']].rolling(12).mean()

By making the rolling average of the past 12 months, we can see more clearly the monthly trends. The effects of big macroeconomic events such as the 2008 debt crisis and the COVID pandemic are also very obvious.

Intra-year variability — detrending

Now we want to see if there are months where the energy consumption is higher than other months. We could plot the mean of each month, but here we are plotting all the years together to see how all the trend flows. By subtracting the rolling mean from the original time series, we remove the long-term patterns, hence — we are removing inter-year variability. Then we have to transform the years into columns and keep each month as a separate row. Pandas pivot_table method is particularly useful to do that.

In general, we see that in the summer months there’s a decrease in domestic energy consumption and an increase in the rest of low voltage (corresponding to small businesses) and high voltage.

In green we’ve marked the year 2009, when the energy in Barcelona moved to another company. In red we’ve highlighted the year 2020, where we can clearly see the effect of the lockdown.

sub_df = df.loc['2003':'2021', [column]]
rolling = sub_df.rolling(12).mean()
df_detrend = sub_df - rolling
month = df_detrend.index.month
year = df_detrend.index.year
df_detrend['Month'] = month
df_detrend['Year'] = year
df_detrend = pd.pivot_table(df_detrend, index='Month', columns='Year', values=column)

The valley seen in 2009 is explained by this:

(5) As of July 2009, a new company Endesa Energía XXI S.L. was created, which it inherited from the old company Endesa Distribución S.L. all clients up to a power of 10 KW, the rest have gone to Mercado Libre.
(6) As of September 2009, the energy billed by the new company Endesa Energía XXI S.L. is included.

Periodicity — First-order differencing

Time series normally have two basic components: trends and seasonality. Until now, we’ve taken a look into the trends, both long and short-term. To look at seasonality, we can apply differencing. This computes the difference between consecutive observations in the time series, and helps in focusing on the changes within each season. First-order differencing has many applications in different fields, like Heart Rate Variability analysis

In our case, it helps to remove the long term trends and to see the periodicity of the data, as we see in the plot. We can clearly appreciate “waves”, or a frequency pattern, in the energy consumption.

diff = df[['domestic', 'low_voltage', 'high_voltage']].diff()

Correlation

When we first explored the data, we saw that, in the long term, the domestic consumption was decreasing while the high voltage consumption was increasing. We should expect, then some correlation between the two variables. But when we compute the correlations, we don’t see very high values (left heatmap). This is due to the fact that correlation calculated with the unprocessed data is a measure that doesn’t distinguish between trend and seasonal components explicitly. In other words, while there may be a negative correlation between two variables in the long term, the positive seasonal correlation (right heatmap) would cancel the first.

But these values aren’t still correct. A quick look at the first order differencing plot (above) reveals us that the data has a lot of noise. To really see the correlation between the components, we have to decompose the data.

general_correlation = df.iloc[:,1:].corr()
seasonal_corr = df.diff().iloc[1:,1:].corr()

Decomposition

Apart from trend and seasonality, there’s also a lot of noise and other short-term fluctuations. While trends correspond to the long-term direction of the data and seasonality captures repeating patterns, residual represents the remaining variability. Any decent forecast task that we want to do has to take these two components (and the residuals) into consideration. By understanding the trends and seasonality inherent in a time series, we can build more accurate and reliable forecasting models.

The python library statsmodels has a very useful method to do this. Now, the seasonal plot makes much more sense than before

from statsmodels.tsa.seasonal import seasonal_decompose
decompositions = {column: seasonal_decompose(df.loc[:,column]) for column in df.columns[1:]}

We can recalculate the correlations, this time with more accurate values. Now we see something that makes sense. There’s a negative correlation between the domestic energy consumption and the rest (corresponding to small business and industry). On the trend component, this measure is quite low, but it’s much higher in the seasonal component, as we expected when we plotted the intra-year variability.

Next steps

The next thing to do is prepare the data to do a forecast model. We’ll see that in the next part of the article. Or, if you can’t wait, you can go to my website, where you’ll find the specific code that I wrote for the plots too!

Do you identify as Latinx and are working in artificial intelligence or know someone who is Latinx and is working in artificial intelligence?

Get listed on our directory and become a member of our member’s forum: https://forum.latinxinai.org/
Become a writer for the LatinX in AI Publication by emailing us at publication@latinxinai.org
Learn more on our website: http://www.latinxinai.org/

Don’t forget to hit the 👏 below to help support our community — it means a lot!