Optimising energy efficiency: Predicting energy consumption using ANN on energy consumption and weather forecast data

Lian Peng Cheng
14 min read · Jul 8, 2020
Photo of power line from https://www.investopedia.com/top-utilities-stocks-4582243


With increasing climate change awareness, the efficient usage of energy has come under the spotlight for many green initiatives. Energy wastage is costly for utility companies: when the energy consumed by households and offices is lower than what utility companies produce daily, the surplus production is wasted. This inefficient energy production increases the amount of harmful greenhouse gases released into the atmosphere.

Singapore consumes a significant amount of energy and the number is growing year on year (Figure 1). The Energy Market Authority proposed an Energy Efficiency Grant Scheme for Power Generation Companies to encourage companies to invest in energy-efficient equipment or technologies that can improve overall generation efficiency and reduce carbon emissions for each unit of electricity generated. One of the solutions proposed is the adoption of machine learning to forecast energy consumption.

By using machine learning to predict energy consumption, utility companies will be able to optimise their operations and improve energy efficiency. Therefore, accurately predicting energy consumption will benefit utility companies, building owners and consumers while reducing the impact on the environment.

Figure 1. Yearly energy consumption by sector in Singapore


The objective of our project is to build a model to forecast the energy consumption of a particular building based on past energy consumption recorded at 15-minute intervals and the temperature forecasts from four different weather stations located near the building. An effective model will be able to forecast 24 hours into the future (T+96, i.e. 96 steps of 15 minutes) and provide accurate predictions for energy consumption. Using a Multi-Layer Perceptron Network, our team will attempt to create an energy consumption prediction model with high accuracy.

Exploratory Data Analysis

Energy Consumption Dataset

With the energy consumption dataset provided courtesy of Schneider Electric, we performed an exploratory data analysis to gain insight into the seasonal decomposition of the energy consumption. The dataset contains energy consumption data in 15-minute intervals from 24 July 2014 to 26 May 2016 (Figure 2).

Figure 2. Energy consumption against time
Figure 3. Seasonal decomposition of energy consumption time series

From the graph and seasonal decomposition (Figure 3), we observed a clear yearly cycle and monthly seasonality. However, the energy consumption for 2014 bucks the trend as it appears to be much lower than the average. One possible reason could be that the building was not at full occupancy during that period. To determine if the consumption data in this period was relevant, we decided to perform further analysis.

Upon further investigations, we realised that there was a pickup in energy consumption from 23/11/2014. We did an analysis of the average amount of energy consumption from 23/11/2014 to 31/12/2014 and the same period in 2015:

It could be seen that the average consumption for this date range in 2014 was much lower than that of 2015. As we require our model to be forward-looking, the data in 2014 may not be relevant in predicting the future trend. Hence, we concluded that the energy consumption in 2014 is an outlier and removed it from the dataset.

As our project aims to predict consumption 24 hours (T+96) ahead, we also did an analysis of consumption on a weekly basis. We sampled one week from the consumption dataset and created a 24-hour window to separate the consumption data by the day of the week. True to our expectations, energy consumption was much higher on weekdays than on weekends (Figure 4). We could also see another seasonality trend within the day, as energy consumption peaked in the late morning and late afternoon with a drop in between. This could be because the building is a factory, where energy consumption increases during working hours and falls during lunchtime when operations are temporarily halted.

Figure 4. Weekly energy consumption against time

In addition, the dates of the public holidays (Figure 5) were also provided in the dataset. We found that energy consumption is much lower on public holidays (Figure 6), which is to be expected. The consumption pattern is also similar to that on weekends.

Figure 5. The dates of public holiday from 2014 to 2015
Figure 6. Energy consumption difference between a public holiday and normal workday

Temperature Forecast Dataset

Temperature forecasts in degrees Celsius were also provided from four different weather stations located in the vicinity of the building. The four weather stations, named wx1, wx2, wx3 and wx4, are numbered in order of their distance from the building. We plotted the temperature forecast datasets against the energy consumption for the entire date range. We observed that the temperature follows a yearly seasonality. In general, there is an inverse relationship between the temperature and the energy consumption of the building. This could be due to the use of heaters in cold weather, which increases energy demand.

Figure 7. Temperature forecast against the consumption of three stations

As the date range of the data from weather station 4 falls outside of the date range of the energy consumption data, we will not consider data from wx4 since it would offer little predictive value in our model.

Data Pre-processing

Missing Data

While analysing the data, we found multiple gaps in both the energy consumption and weather forecast datasets, where long periods of data are missing. To deal with the missing data, we analysed the biggest data gap for each dataset and found the following:

Figure 8. Largest data gap for each dataset

We cleaned up the energy consumption dataset by aligning it to a fixed 15-minute grid, so that consecutive data points are exactly 15 minutes apart. For missing data, we took the previous week's data as a close substitute. For example, if the data for a Monday is missing, it is filled with the previous Monday's consumption. This applies to any number of consecutive missing values.

However, for public holidays, we take the closest Sunday value as a proxy. The reason for taking previous values is to prevent data leakage, which can happen if linear interpolation is used. Likewise, we aim to maintain the consistency of the consumption pattern by using the same day of the week and replicating the low consumption for public holidays.
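The previous-week fill described above can be sketched as follows. This is a minimal illustration, assuming the series has already been aligned to a regular 15-minute grid with gaps marked as NaN; the function name and the `steps_per_week` parameter are hypothetical, and the public-holiday rule (closest Sunday) is left out for brevity.

```python
import numpy as np

STEPS_PER_WEEK = 7 * 96  # 96 fifteen-minute steps per day


def fill_from_previous_week(values, steps_per_week=STEPS_PER_WEEK):
    """Fill NaN gaps with the value observed exactly one week earlier."""
    filled = np.asarray(values, dtype=float).copy()
    for i in range(steps_per_week, len(filled)):
        if np.isnan(filled[i]):
            # The source value may itself have been filled earlier in the
            # loop, so gaps longer than a week chain back week by week.
            filled[i] = filled[i - steps_per_week]
    return filled
```

Because only earlier values are copied forward, no information from after the gap enters the filled series, which is the property the text relies on to avoid leakage.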

Next, we merged the temperature forecasts from weather stations wx1, wx2 and wx3 with the energy consumption data. We filled the empty rows using 4-day linear interpolation. Since all the forecast values would be available ahead of T+0, there is little to no risk of data leakage.
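A rough sketch of the interpolation step is shown below. It assumes that "4 days" is the longest gap that gets interpolated (384 steps of 15 minutes), which is our reading rather than something stated in the project, and that gaps touching the start or end of the series are left empty. The function name and `max_gap` parameter are hypothetical.

```python
import numpy as np


def interpolate_gaps(values, max_gap=4 * 96):
    """Linearly interpolate NaN runs of at most max_gap steps."""
    values = np.asarray(values, dtype=float)
    known = ~np.isnan(values)
    if not known.any():
        return values.copy()
    idx = np.arange(len(values))
    filled = values.copy()
    filled[~known] = np.interp(idx[~known], idx[known], values[known])

    # Undo the fill for runs that are too long or have no neighbour on one side.
    start = 0
    while start < len(values):
        if known[start]:
            start += 1
            continue
        end = start
        while end < len(values) and not known[end]:
            end += 1
        touches_edge = start == 0 or end == len(values)
        if touches_edge or end - start > max_gap:
            filled[start:end] = np.nan
        start = end
    return filled
```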

Feature Selection

Our initial selections of features are as follows:

  • Historical consumption
  • Temperature forecast by 3 weather stations
  • Month (encoded)
  • Workday/Non-workday
  • Cyclical sine function using hour: SIN( 3 x PI x (hour/24) )

Figure 9. Correlation matrix on our selected features

Aside from the consumption and temperature forecast datasets, we looked into other features that could potentially improve our model’s predictive ability using Spearman’s correlation. We decided to combine weekends and public holidays into one category, defined as is_workday = 0, since their consumption patterns are fairly similar.

However, hour (encoded) had very little correlation with consumption. We looked for another approach to capture the swing between working hours and non-working hours. Using the sine function on the hour, we were able to replicate a cyclical trend, which has a higher correlation with consumption.
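The two calendar features can be computed as in the sketch below. The sine term follows the formula listed in the feature selection, SIN( 3 x PI x (hour/24) ); the function names and the `holidays` argument (a set of dates) are hypothetical and only for illustration.

```python
import math
from datetime import datetime, date


def hour_sine(ts):
    """Cyclical hour feature, using the formula from the feature list."""
    return math.sin(3 * math.pi * ts.hour / 24)


def is_workday(ts, holidays):
    """Return 0 for weekends and public holidays, 1 otherwise."""
    if ts.weekday() >= 5 or ts.date() in holidays:
        return 0
    return 1


# Example usage with one hypothetical public holiday.
holidays = {date(2015, 1, 1)}
features = (hour_sine(datetime(2015, 1, 5, 9)), is_workday(datetime(2015, 1, 5, 9), holidays))
```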

Feature Normalisation

The data is split into training and test datasets in a 70:30 ratio before model training commences. The split happens before normalisation; otherwise, the normalisation parameters could be derived from the test dataset, causing data leakage. The test dataset should always be kept “hidden” from the model.

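We then normalised the values of each of our features in our preprocessing step with a min-max scaler fitted on the training data, using the formula below:

x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum of the feature in the training set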

Data normalisation ensures that all inputs have a similar magnitude of about 1. Otherwise, large inputs will dominate learning and make it harder for the model to learn. The normalisation technique is fixed at the beginning so that it does not vary between training, testing and deployment.

Figure 10: Consumption histogram for the year 2015

Min-max scaling is our preferred choice over other scaling techniques as we want to keep the normalised values between 0 and 1 so that the model can learn better. Consumption analysis also showed that the data has consistent “boundaries” at the maximum and minimum; hence, any outliers will be scaled to values greater than 1 and the model will learn to ignore them. Lastly, the distribution of consumption is not Gaussian, so standardisation is not justified.
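The split-then-scale order can be illustrated with a short sketch. It assumes a chronological 70:30 split (the text does not say how the rows were divided) and uses hypothetical function names; the key point is that the minimum and maximum come from the training rows only.

```python
import numpy as np


def split_ordered(data, train_fraction=0.7):
    """Split rows in time order; the first 70% become the training set."""
    n_train = int(round(len(data) * train_fraction))
    return data[:n_train], data[n_train:]


def fit_min_max(train):
    """Per-feature minimum and maximum, computed on training rows only."""
    return train.min(axis=0), train.max(axis=0)


def apply_min_max(data, lo, hi):
    """Scale with the training parameters; unseen extremes can exceed [0, 1]."""
    return (data - lo) / (hi - lo)


data = np.arange(10.0).reshape(-1, 1)  # toy single-feature dataset
train, test = split_ordered(data)
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
```

In this toy example the test values lie above the training maximum, so they scale to values greater than 1, which matches the behaviour described for outliers.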


We set the benchmark as persistence. Persistence is when the predicted T+96 value is the same as the value at current T+0. Using RMSE loss, we calculate the persistence to beat as 0.014842.
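For reference, a persistence benchmark of this kind can be computed as in the sketch below, assuming a regularly spaced, already normalised consumption series; the function name is hypothetical.

```python
import numpy as np


def persistence_rmse(series, horizon=96):
    """RMSE of predicting the value at T+horizon with the value at T+0."""
    series = np.asarray(series, dtype=float)
    prediction = series[:-horizon]  # value observed at T+0
    actual = series[horizon:]       # value that occurs at T+horizon
    return float(np.sqrt(np.mean((prediction - actual) ** 2)))
```

A perfectly daily-periodic series would score 0 here, so the benchmark mainly measures how much consumption changes from one day to the same time on the next.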

Naive Model

Initially, we used a simple 4-layer network with just the naive window. After training with the Adam solver and varying the network size, we arrived at a best test loss of 0.01387 with a network size of 32, which beat persistence.

The loss graphs and predictions from the model are as shown:

Our naive model managed to beat persistence. However, there were still significant gaps between the training and test losses, which shows that the model is not able to generalise to data which it has not seen before.

Using the model for predictions, we see wide dispersions in the test and actual scatterplots. We also see a peak in the lagged correlations graph at T+96, with a secondary peak at T+0. The features can still be fine-tuned to improve the lag and gap between train and test losses.
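One way to produce a lagged correlation curve like the one discussed above is sketched below. It assumes the predictions are aligned with the actual values they target, so a peak at lag 96 would mean the model is largely echoing the consumption from 24 hours earlier, while a peak at lag 0 means no lag. The function name is hypothetical.

```python
import numpy as np


def lagged_correlation(actual, predicted, max_lag):
    """Correlation between predicted[t] and actual[t - lag] for each lag."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    correlations = []
    for lag in range(max_lag + 1):
        past_actual = actual[: len(actual) - lag]
        aligned_pred = predicted[lag:]
        correlations.append(np.corrcoef(past_actual, aligned_pred)[0, 1])
    return np.array(correlations)
```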

Neural Network Methodology

Feature Engineering

Since the simple model still can be improved, we went back and tried to fine-tune our features. We incorporated three main ideas as our features, namely calendar-based features, forecasted temperatures and past energy consumption.

Figure 11. Experimented features and the respective improvements

Firstly, we used the day of the week and month as our calendar-based inputs. As observed from the data analysis, is_workday has predictive value since consumption tends to be lower on weekends than on weekdays. Similarly, hourly consumption differs within a day due to standard working hours.

Next, we used the T+96 and the T+96:0 MEAN temperature forecast from wx1, wx2 and wx3 as our temperature forecast inputs. The temperature forecast from each station is a separate feature, to account for potential differences due to the varying distances of the weather stations. Temperature also accounts for seasonal change within a year from summer to winter. This could be why the encoded month was not useful and only added noise.

Lastly, we used 24 hours of historical energy consumption as our past energy consumption inputs, together with the daily (T0:-96) MEAN and SD. The mean of daily energy consumption provides a baseline prediction while the standard deviation tracks the degree of fluctuation from the mean. The minimum and maximum of the 6-day (T0:-576) historical consumption were also used to capture the “boundaries” during the week.

We also included the daily difference, momentum (first-order difference) and force (second-order difference) with an 8-hour window (32 lags) to keep track of the difference and rate of change of energy consumption over the time period. Using difference, momentum and force as features allows the model to be fed information on how the data changes over time, that is, the rate of change of consumption for each 8-hour interval.
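The statistical features above can be sketched as follows. The exact definitions are our interpretation: the daily difference is taken against the value 96 steps earlier, momentum is the change over the last 32 steps, and force is the change in momentum between the last two 32-step windows. The function name and dictionary keys are hypothetical.

```python
import numpy as np


def window_features(history, day=96, week=576, lag=32):
    """Summary features from a consumption history ending at T+0."""
    h = np.asarray(history, dtype=float)
    last_day = h[-day:]
    momentum = h[-1] - h[-1 - lag]                   # first-order difference
    previous_momentum = h[-1 - lag] - h[-1 - 2 * lag]
    return {
        "day_mean": float(last_day.mean()),
        "day_sd": float(last_day.std()),
        "week_min": float(h[-week:].min()),          # 6-day lower boundary
        "week_max": float(h[-week:].max()),          # 6-day upper boundary
        "daily_diff": float(h[-1] - h[-1 - day]),
        "momentum": float(momentum),
        "force": float(momentum - previous_momentum),  # second-order difference
    }
```

The history passed in needs at least `week` points (and more than `day` and `2 * lag` points) for the indexing to be valid.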

We trained the model with our new input variables. With the new features, we achieved a test loss of 0.009526, which still beats persistence. The gap between train and test loss has reduced. Lag has also improved and the secondary peak at T+0 has become more pronounced.

Fine-tuning the Model

Once we were satisfied with our features, we began to fine-tune the configurations in our neural network, namely using dropout, regularisation, input scaling, clamping, autoencoder, squared perceptron, momentum and force losses.

Dropout is a technique where randomly selected neurons are temporarily ignored during training, and their weights are not updated for that iteration. This reduces co-dependency between neurons, as the remaining neurons have to take on more responsibility, which helps to prevent overfitting.
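As an illustration, dropout on a layer's activations can be written in a few lines. This sketch uses the common "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged; that rescaling is an assumption on our part, and the function name is hypothetical.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Rescale the kept activations so the expected sum stays the same.
    return activations * keep_mask / (1.0 - drop_prob)
```

At prediction time `training=False` is passed, so every neuron contributes and no rescaling is needed.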

Regularisation adds a term to the loss that penalises large weights in the neurons, which decorrelates the neural network and helps to prevent overfitting.
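A minimal sketch of such a penalty is shown below, assuming an L2 (sum of squared weights) form scaled by a weight decay factor. The text does not state which form of penalty was used, so this is only one plausible choice, and the function name is hypothetical.

```python
import numpy as np


def regularised_loss(data_loss, weights, weight_decay=1e-4):
    """Add an L2 penalty on all weight arrays to the data loss."""
    penalty = sum(float(np.sum(w ** 2)) for w in weights)
    return data_loss + weight_decay * penalty
```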

One way to reduce noise is to use input scaling. We add a scaling factor (lambda) to each input, and the network learns these scaling factors. Unimportant inputs will have a small scaling factor and be ignored. Outliers caused by very large inputs are also suppressed by a tanh activation function (clamp).
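The combination of a learnable scaling factor and a tanh clamp can be sketched as below. Here `lam` stands for the per-input scaling factors; in a real network they would be trained together with the other weights, whereas this hypothetical function only shows the forward computation.

```python
import numpy as np


def scale_and_clamp(x, lam):
    """Scale each input by its own factor, then clamp with tanh to (-1, 1)."""
    return np.tanh(lam * x)


x = np.array([0.5, 100.0, 0.8])
lam = np.array([1.0, 1.0, 0.0])  # third input is effectively switched off
clamped = scale_and_clamp(x, lam)
```

The very large second input ends up close to 1 rather than 100, and the third input contributes nothing because its scaling factor is zero.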

A neural network typically does not learn well from high-dimensional input. An autoencoder forces a compressed representation of the original input. The reduced input allows the model to learn from a smaller yet accurate representation, hence preventing overfitting.
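The forward pass of a simple autoencoder can be sketched as follows. The single-layer encoder and decoder, the tanh activation and the bottleneck width of 16 are all illustrative assumptions; the project's actual architecture is not described in enough detail to reproduce.

```python
import numpy as np


def autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Compress x to a narrow code, then reconstruct it."""
    code = np.tanh(x @ w_enc + b_enc)     # compressed representation
    reconstruction = code @ w_dec + b_dec
    return code, reconstruction


def reconstruction_loss(x, reconstruction):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((x - reconstruction) ** 2))


rng = np.random.default_rng(0)
n_inputs, n_code = 114, 16  # 16 is a hypothetical bottleneck width
x = rng.normal(size=(5, n_inputs))
w_enc = rng.normal(scale=0.1, size=(n_inputs, n_code))
w_dec = rng.normal(scale=0.1, size=(n_code, n_inputs))
code, recon = autoencoder_forward(x, w_enc, np.zeros(n_code), w_dec, np.zeros(n_inputs))
```

Training would adjust the weights to minimise the reconstruction loss, after which the downstream network can take `code` as its input instead of the full vector.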

We noticed that a larger nn-size tends to have lower test losses. However, using a large nn-size results in considerably longer training time. Hence, our group concluded that 32 is the optimal nn-size for balancing test loss against training time.

During our feature engineering phase, we noticed that there are gaps between the test and training losses. We also noticed a V-shape in the test losses graph, and training was stopped early before 1,000 iterations. Both were signs of overfitting.

To reduce overfitting, we decided to select a high weight decay of 0.0001 and dropout-prob of 0.1. With this configuration, the model will tend to train for 3,000–4,000 iterations, with slightly better training and test graphs.

As our input size of 114 is rather large, we used an autoencoder with a bottleneck size of 1.5 to reduce the dimensions of our input vector. This enables the model to learn better with a lower-dimensional input vector.

The best results were:
Test loss: 0.00925955
nn-size: 32
Weight decay: 0.0001
Dropout probability: 0.1
Bottleneck size: 1.5
Solver: Adam

Next, we used squared perceptrons, which introduce squared input variables (x²) into the neuron. This enables faster training and better learning compared to ordinary perceptrons.
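One possible form of a squared perceptron layer is sketched below: each neuron receives both x and x², each with its own weights. The separate weight matrix for the squared term and the tanh activation are assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np


def squared_perceptron_layer(x, w_linear, w_squared, bias):
    """Layer whose neurons see both x and x squared."""
    pre_activation = x @ w_linear + (x ** 2) @ w_squared + bias
    return np.tanh(pre_activation)
```

Setting `w_squared` to zeros recovers an ordinary perceptron layer, so the squared term only adds capacity.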

Lastly, we added momentum and force losses, which are the first and second-order difference of predicted versus actual, into our main RMSE loss to reduce lag. The final results are shown below:
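A combined loss of this kind might look like the sketch below. It assumes the differences are taken along the time axis of the prediction and target series and that the three terms are simply added with weights; the weights `w_momentum` and `w_force` and the function names are hypothetical.

```python
import numpy as np


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


def combined_loss(predicted, actual, w_momentum=1.0, w_force=1.0):
    """RMSE plus RMSE of first- and second-order differences over time."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    base = rmse(predicted, actual)
    momentum = rmse(np.diff(predicted, n=1), np.diff(actual, n=1))
    force = rmse(np.diff(predicted, n=2), np.diff(actual, n=2))
    return base + w_momentum * momentum + w_force * force
```

A prediction with a constant offset is penalised only by the base term, while a prediction whose shape differs from the target (for example a lagged copy) is also penalised by the difference terms.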

Test loss: 0.00982457
nn-size: 32
Weight decay: 0.0001
Dropout probability: 0.1
Bottleneck size: 1.5
Solver: Adam

Although the test loss has increased, it is still well below persistence. More importantly, the primary peak for lagged correlation is now around T+0, indicating that the model is able to predict the energy consumption at T+96 with zero lag. The test prediction graph also shows that the model’s predicted value is not too far off from the actual.

Implementation Plan

We plan to create a model pipeline from the data source API (e.g. weather forecast station) to the ETL algorithm, which automatically performs data cleaning and manipulation before being fed into the model. The model will then predict energy consumption on a daily basis.

The model prediction has to be validated on a regular basis in order to assess the model’s predictive power. In the event of changing consumption trend, the model’s prediction will not be accurate, hence further fine-tuning may be required and training has to be done on a newer dataset.

However, the model pipeline has to be balanced against other factors such as cost. Feeding data into the model daily can be costly. For example, temperature sensors have to be maintained frequently, and a crucial feature may become unavailable in the event of a failure. Similarly, running the prediction model daily will cost the company time and money for computational power. Instead, the model could predict a range of time steps in one run rather than just T+96, to reduce these costs.

Conclusion and Review

Overall, it was a tough but interesting challenge. We did not manage to eliminate the gap between the training and test losses, and our scatterplots were still not tightly fitted. These problems arose because we were not able to find strong features as inputs to the model. Despite these issues, we managed to improve significantly on the persistence benchmark through trial and error.

Selecting and engineering features was probably the biggest challenge in this project. We initially tried to include the trends which we observed in the dataset as our features and add statistical moments to artificially create more features. This led to large input sizes and long training times. Hence, we had to tweak the model such that we retain only the important features while keeping the test losses low.

Our model certainly has its limitations. It is trained for a single building at a particular geographical location. The features used, especially the temperature forecast, may not be as effective or valid in other areas. In a region with a tropical climate like Singapore, where temperatures do not fluctuate much across the year, our model may not be able to predict energy consumption accurately. Likewise, more data can be obtained externally and used to train the model. Two years of data is insufficient to fully capture monthly and yearly trends.

We can improve our model by doing cross-validation, allowing us to tune hyperparameters with only the original training set. This allows us to keep the test set as a truly unseen dataset for selection of the final model. We can also consider implementing more complex sinusoidal functions that match the energy consumption in a 24-hour period more closely. Other external data like humidity and wind could also be used to improve the model.

Nonetheless, we managed to create a model that is fairly accurate in predicting energy consumption for the building in the problem statement.

Written by: Chia Xun Ming, Claire Tan, Finney Neo, Lian Peng Cheng
