# Maximizing energy trading profits: Predicting energy consumption using advanced neural network (Machine Learning — Deep Learning)

# Introduction

Over the years, machine learning has proven effective at increasing companies’ profits, because its forecasts reveal trends that benefit business processes. As vast amounts of data are generated daily, finding underlying trends through accurate forecasting and process automation has evolved from a “want” to a “need”. This has led to the proliferation of machine learning across all industries, evident in the soaring demand for data scientists and the widespread implementation of machine learning models to keep up with the ever-increasing data.

Deep learning, an extension of machine learning, generates consistently accurate forecasts. This outstanding performance motivated Ai4Impact, with SGInnovate as the Deep Tech Partner, to organize the Deep Learning Datathon 2020.

**The Team**

Our team is called “JJJY”, a combination of our initials. The team consists of me (Lee Kahhe), Leow Yong Sen, Goh Jia Jun, and John Vijay Balasupramaniam. We are all first-year undergraduates from the National University of Singapore (NUS) and Singapore Management University (SMU).

# Objectives

***Problem Statement:** Playing the role of an energy trader, develop an online deep learning algorithm for a T+18 hour energy forecast and maximize profits from wind energy farm operators in France.*

In this article on our energy trading model, we showcase our approach and walk step by step through building, testing, and fitting a deep learning model that maximizes profit using AutoCaffe.

# The Dataset

There are 3 main datasets: Wind Energy Production and 2 x Wind Forecasts.

The wind energy production data comes from Réseau de transport d’électricité (RTE), the French energy transmission authority, and is denoted as “energy-ile-de-france”.

Wind Forecasts: The wind data comes from 2 different wind models. Each model provides forecasts for 8 locations in the Ile-de-France region, and each forecast consists of 2 variables: wind speed (in m/s) and wind direction as a bearing (degrees North). Thus, there are 32 (8 x 2 x 2) forecasts in all. The data is provided by Terra Weather.

All data provided and generated in AutoCaffe run from 01 Jan 2017 to the present and are standardized as values measured at hourly intervals.

# Exploratory Data Analysis and Visualization

Here, we use Python 3 with the Pandas, Matplotlib, and Seaborn libraries to conduct data analysis and visualization.

We first looked for the correlation of each forecast with wind energy production by plotting a Seaborn heatmap of correlations for all the data (refer to Figure A). From this visualization, it is clear that wind direction has a very low correlation with wind energy (coloured dark purple and black), whereas wind speed has a higher correlation with wind energy (coloured orange to peach).

Let’s take a closer look at the exact values of correlation for each forecast using Python 3’s Pandas.

*Note: Forecast model B data are denoted with a “-b” suffix after the name of the data (e.g. speed-angerville-2-b)*

From the result in Figure B, it is clear that the wind directions have an extremely low correlation with wind energy. For wind speed, even the lowest forecast has a correlation of 0.76 with wind energy, and the highest reaches 0.81 (“speed-angerville-2-b”). This supports our earlier observation that wind direction has very little correlation with wind energy, whereas wind speed is strongly correlated with it.

We then tested all the forecasts against the forecast with the highest correlation from the previous result (“speed-angerville-2-b”), as shown in Figure C. This again indicated that wind direction will not be as useful as wind speed for predicting wind energy.
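The correlation ranking described above can be reproduced with a few lines of Pandas. The following is a minimal sketch on synthetic data, not the competition dataset: the toy relationship between speed and energy and the column name `direction-angerville-2-b` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly data (NOT the competition dataset).
rng = np.random.default_rng(0)
n_hours = 2000
speed = rng.gamma(shape=2.0, scale=3.0, size=n_hours)           # wind speed in m/s
direction = rng.uniform(0.0, 360.0, size=n_hours)               # bearing in degrees
energy = 100.0 * speed + rng.normal(0.0, 300.0, size=n_hours)   # toy energy signal

df = pd.DataFrame({
    "energy-ile-de-france": energy,
    "speed-angerville-2-b": speed,
    "direction-angerville-2-b": direction,   # hypothetical column name
})

# Pearson correlation of every forecast column against wind energy,
# sorted so the most strongly correlated forecast comes first.
corr_with_energy = (
    df.corr()["energy-ile-de-france"]
      .drop("energy-ile-de-france")
      .sort_values(ascending=False)
)
print(corr_with_energy)
```

Passing `df.corr()` to `seaborn.heatmap` would give a heatmap in the style of Figure A.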

# Data Analysis — Time Series: Detecting Trend (Seasonality)

Moving forward, we observed and made preliminary deductions about the time series in wind energy. All graphs shown are Wind energy (Y-Axis) against time in hours (X-Axis). Using Python’s Matplotlib library, we plotted graphs of yearly records, “1 month”, “1 week”, “Weekly in 1 month”, “3 months” for time series analysis. In addition, we used a logarithmic scale to better observe changes between large numbers for the graphs of “Weekly in 1 month” and “3 months”.

After visualizing the results using time series graphs, we can conclude a few things:

1) From Figure F(b), there is a substantial peak of wind energy in the middle of the week: the amount of wind energy is above average for these periods in each of the 4 weeks.

2) From Figure E, our feature transformation needs to consider using MIN, MAX, MEAN, SD (spikes), RANGE (accompanies max/min), DIFF, and SKEWNESS (spikes).

3) From Figure E, certain weeks have values lower than the average line, indicating that the interval appears to be shorter than a week. We therefore believe that we should test for values greater than or equal to 24.

Due to time constraints, we did not take into account the data’s max, min, mean, standard deviation, and difference in this analysis, which could have revealed more insightful trends.

# First Trial (Default Test)

We first ran a test on the basic settings that AutoCaffe has configured along with some slight changes:

- With persistence-rmse of 0.4447 to beat

**Data.m**

- All datasets: 32 columns in total (only the 1st and 2nd columns normalized)

**Prep.m**

- Only the first 2 columns are normalized

**Pre.m**

- A lead time of T+18 for both Columns A and B
- Features: A:0:-4 MEAN and A:0:-4 SD

**Config.m**

- “Adam” as the solver type
- 10000 iterations, with 10 repeats to verify the consistency and integrity of the result
- Neural network size (nn-size) set to test {{ 8 16 32 }}

**Network.m**

- 5 layer network with 2/3 reduction in network size for each layer, with Difference network
- Using Rectified Linear (ReLU) unit as the activation function for the inner product as the output layer

# Evaluation of Result for the First Trial

From the results above, we can conclude a few things:

- Based on training and test loss, we did not beat persistence: the test loss was recorded at 0.469429, with an nn-size of 128. In addition, our training loss is greater than our test loss, with a gap between them. This indicates that we need to improve both our features and our network.
- Looking at the predictions, the data points are quite scattered, in line with our high test loss.
- The lagged correlation on the test set also peaks at 17, which calls for testing dropout between the layers of the network.
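For readers unfamiliar with the persistence baseline mentioned above: a persistence forecast simply assumes that the value 18 hours from now equals the latest known value. The sketch below shows one way to compute its RMSE on a toy series. It is an illustration under that assumption, not AutoCaffe's own calculation, and the function name `persistence_rmse` is hypothetical.

```python
import numpy as np

def persistence_rmse(series, lead=18):
    """RMSE of forecasting series[t + lead] with the last known value series[t]."""
    series = np.asarray(series, dtype=float)
    forecast = series[:-lead]   # value known at time T
    actual = series[lead:]      # value realised at T + lead
    return float(np.sqrt(np.mean((forecast - actual) ** 2)))

# Toy series with a 24-hour cycle (made-up data).
hours = np.arange(24 * 30)
toy = np.sin(2 * np.pi * hours / 24)

print(persistence_rmse(toy, lead=18))   # large error: 18 h is out of phase
print(persistence_rmse(toy, lead=24))   # near zero: 24 h matches the cycle
```

A model is only useful if its test loss is lower than this kind of naive baseline.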

Even though the evaluation turned out to be quite bad, on the bright side, the default model gave us a net profit of 1e⁹.

Since we now have a clearer picture of where the model stands and what profit the default setting yields, let us present our approach to bringing the net profit even higher!

# Our Approach

*Note: If you’re wondering where our trial-and-error testing happens, not to worry: at the end of each section, before moving on to the next segment, we have included a note (“Trial and Error”) describing the options (limited to our knowledge) that we tested and the corresponding results.*

# Building Data Extractions (data.m)

Referring back to our data analysis, you may have noticed two things. First, the wind direction forecasts are candidates for exclusion from our data extraction, since they have very little correlation with wind energy. Second, most of the wind speed forecasts from forecast model B (denoted with the suffix “-b”) have the highest correlations with wind energy.

Since the wind speed in forecast model B has a higher correlation with wind energy, we initially decided to use wind speed in forecast model B for our forecast model. However, keeping in mind that,

“correlation does not imply causation”

We went on to conduct extensive testing and found that wind speed in forecast model A, which has a lower correlation, actually gave us a better test loss. As a result, we use all the wind speed data and 2 of the direction datasets ($direction-parc-du-gatinais and $direction-arville) from forecast model A instead of forecast model B (i.e. only using locations without the ‘-b’ suffix).

As a result, the training and test losses reduced significantly, and the predictions formed a more linear pattern rather than being scattered.
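The selection rule described above (keep the target, every model A speed forecast, and two model A direction forecasts) can be written as a simple name filter. This is a sketch only: the column list is abbreviated and partly hypothetical, since only the names quoted in this article are known to match the real dataset, and `select_columns` is an illustrative helper rather than part of AutoCaffe.

```python
# Abbreviated, partly hypothetical column list for illustration.
columns = [
    "energy-ile-de-france",
    "speed-angerville-2", "speed-angerville-2-b",
    "direction-angerville-2", "direction-angerville-2-b",
    "speed-arville", "speed-arville-b",
    "direction-arville", "direction-arville-b",
    "speed-parc-du-gatinais", "speed-parc-du-gatinais-b",
    "direction-parc-du-gatinais", "direction-parc-du-gatinais-b",
]

# The two model A direction forecasts that were kept.
KEPT_DIRECTIONS = {"direction-parc-du-gatinais", "direction-arville"}

def select_columns(names):
    """Keep the target, all model A speeds and the two chosen model A directions."""
    selected = []
    for name in names:
        if name.endswith("-b"):
            continue  # drop everything from forecast model B
        if name.startswith("direction-") and name not in KEPT_DIRECTIONS:
            continue  # drop the remaining direction forecasts
        selected.append(name)
    return selected

selected = select_columns(columns)
print(selected)
```

Applied to the full dataset, this rule would leave 11 columns: the energy series, 8 speed forecasts, and 2 direction forecasts.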

**Trial and Error**

Our team also tried decremental and incremental selections of the forecasts to be extracted.

Since we have 2 wind forecast models (with 8 x speed and 8 x direction datasets from each model), we split the data into four segments:

- First Segment: 8 x speed of the wind from forecast model A
- Second Segment: 8 x direction of the wind from forecast model A
- Third Segment: 8 x speed of the wind from forecast model B
- Fourth Segment: 8 x direction of the wind from forecast model B

We then proceeded as follows:

- We tested each segment incrementally and noticed that forecast model A (i.e. the First and Second Segments) produced a better result.
- We continued to split each segment further and selectively tested each forecast. We discovered that using all the speed data and 2 of the direction datasets ($direction-parc-du-gâtinais and $direction-arville) from weather forecast A minimized both training and test losses significantly.

# Feature Engineering (prep.m & pre.m)

**prep.m**

Instead of normalizing just the first 2 columns of data, we now normalized all the forecast data that we extracted, using the following formula:

**Data Transformations (pre.m)**

Extensive testing was also done here, taking into account the insights from our initial data analysis, before we arrived at the following settings, which produced the best result:

- We have used MEAN, SD based on range 0:-40 for all Columns (Excluding Column B)
- Used a lead time of T + 18 for our Columns A, G, H, I, J, K
- Used a DIFF based on range 0:-40 for Columns A and I

This gave us a test loss of only 0.215182, which also beat the persistence root mean squared error (RMSE).
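To make the feature settings above more concrete, here is a rough Pandas equivalent of rolling MEAN, SD, and DIFF features with a T+18 target. It is a sketch under stated assumptions: we read the range “0:-40” as the current hour plus the 40 hours before it (a 41-sample window) and DIFF as the current value minus the value 40 hours earlier. The helper `build_features` is hypothetical and is not how AutoCaffe implements these transformations.

```python
import numpy as np
import pandas as pd

LEAD = 18     # forecast horizon in hours (T+18)
WINDOW = 41   # assumed meaning of "0:-40": the current hour plus the 40 before it

def build_features(series: pd.Series, lead: int = LEAD, window: int = WINDOW) -> pd.DataFrame:
    """Rolling MEAN, SD and DIFF features plus the target shifted by the lead time."""
    features = pd.DataFrame({
        "value": series,
        "mean": series.rolling(window).mean(),
        "sd": series.rolling(window).std(),
        "diff": series - series.shift(window - 1),   # now minus 40 hours ago
        "target": series.shift(-lead),               # value 18 hours ahead
    })
    # Rows without a full look-back window or without a future target are dropped.
    return features.dropna()

# Toy series (made-up data): a simple ramp, one value per hour.
energy = pd.Series(np.arange(100, dtype=float))
table = build_features(energy)
print(table.head())
```

Each row pairs what is known at hour T with the value to be predicted at T+18, which is the shape a supervised model needs.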

**Trial and Error**

We also tried a different set of features consisting of MAX, MIN, RANGE, and SKEWNESS. However, the results were slightly worse in terms of training and test loss: the best configuration gave a test loss of around 0.3 and above. This still beat persistence, but we aimed to reduce the test loss further, which led us to the configuration above.

# Defining the Network (config.m & network.m)

*Note: we have split our data into 70% training and 30% test set.*
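As an illustration of a 70/30 split, the sketch below cuts a time-ordered sequence so that the earlier 70% is used for training and the later 30% for testing. Whether AutoCaffe splits chronologically is an assumption on our part, and `chronological_split` is a hypothetical helper.

```python
def chronological_split(rows, train_percent=70):
    """Split time-ordered rows into an earlier training part and a later test part."""
    cut = len(rows) * train_percent // 100   # integer arithmetic avoids float rounding
    return rows[:cut], rows[cut:]

# Toy example: 1000 hourly rows, represented by their index.
hours = list(range(1000))
train, test = chronological_split(hours)
print(len(train), len(test))
```

Keeping the test set strictly later than the training set avoids leaking future information into training, which matters for forecasting problems.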

**Config.m**

- Repeats set at 25 to verify the consistency and integrity of the result.
- We settled on an nn-size of {{ 32 64 }}, since these gave the lowest training and test losses.
- Adam as the solver.
- The input size is set at 38, given the number of features engineered in the previous sections.

**Network.m**

- We continued to use the default setting of a 5 layer network with 2/3 reduction in network size for each layer, as well as the ReLU unit as the activation function for the inner product as the output layer.
- We also included “L2” regularization and set the weight_decay to 0.000005 (5e-6).
- Dropout layers with probabilities of 0.00001 and 0.00005 between the first and second layers and between the second and third layers, respectively.
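To give a feel for what a “2/3 reduction in network size for each layer” implies, the sketch below lists the layer widths for the two nn-size values we tested. The exact rounding rule, and whether the 5 layers include the output layer, are assumptions; AutoCaffe may compute these differently.

```python
def layer_sizes(nn_size, n_layers=5, ratio=2 / 3):
    """Width of each layer when every layer is `ratio` times the previous one."""
    # Rounding to the nearest integer is an assumption made for this sketch.
    return [max(1, round(nn_size * ratio ** depth)) for depth in range(n_layers)]

for nn_size in (32, 64):
    print(nn_size, layer_sizes(nn_size))
```

The widths shrink quickly, so even the nn-size of 64 ends in a narrow final hidden layer.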

**Trial and Error**

- To perform gradient descent, we tested 2 types of solvers: Adam and Stochastic Gradient Descent (SGD). SGD turned out to have much higher training and test losses than Adam.
- We also ran tests on various dropout probabilities {{ 0.05 0.1 0.3 }} and noticed that the lower the dropout probability, the lower the training and test losses. We went further and manually inserted dropout layers between the network layers, with dropout probabilities as low as 0.00005. It turned out that our settings in network.m produced not only better training and test losses, but also a better net profit, with the lagged correlation peak remaining at 0.
- After testing various nn-size values ( {{ 16 32 64 128 }} ), we found that an nn-size of 32 or 64 gave the best result.
- Repeats of testing are set to 25 (2 nn-size x 25 = 50 runs), given the cap of 50 runs that AutoCaffe set for each team during this competition.
- For the output layer, we also tested the Hyperbolic Tangent activation function (Tanh). However, the result wasn’t as good as with ReLU.
- We tested several weight_decay values, and the net profit was slightly higher with a setting of 5e-6.

# Evaluation of our Model

As seen in Figure O, both our training and test losses are low, plateauing at about 0.05 and 0.215 respectively. We achieved this through trial and error, searching for the lowest local minimum using gradient descent. Our low test loss was also due to early stopping, which we incorporated in the model: it monitors the test loss and stops training whenever the test loss starts to increase. Early stopping prevented over-training due to memorization and sped up training. As such, the resulting low test loss of 0.215 was consistent and did not increase near the end of training. This suggests that our model is able to predict novel data accurately throughout most iterations.
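The early stopping idea described above can be sketched in a few lines. This is a generic illustration, not AutoCaffe's implementation: the `patience` parameter (how many non-improving evaluations to tolerate before stopping) and the loss values are assumptions made for the example.

```python
def early_stopping_index(test_losses, patience=3):
    """Return (index, loss) of the best checkpoint, stopping once the test loss
    has failed to improve for `patience` consecutive evaluations."""
    best_loss = float("inf")
    best_index = -1
    stale = 0
    for index, loss in enumerate(test_losses):
        if loss < best_loss:
            best_loss, best_index, stale = loss, index, 0
        else:
            stale += 1
            if stale >= patience:
                break   # stop training; later evaluations are never run
    return best_index, best_loss

# Made-up test losses: improvement, then a slow rise.
losses = [0.47, 0.35, 0.26, 0.22, 0.215, 0.218, 0.221, 0.23, 0.21]
print(early_stopping_index(losses))
```

With `patience=1`, training stops the first time the test loss fails to improve, which is the strictest form of the rule.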

In Figure P, we can observe a more distinct linear pattern in the scatter plots of actual vs training/test predictions compared to the default setup shown previously in Figure I. This is consistent with our final model’s test loss (0.215) being less than half that of our first trial model, so the predictions of our final model are closer to the actual values than those of the trial model.

Finally, in Figure Q, our lagged correlation is much better than in the first trial (Figure J): our final model’s peak is around 1, much closer to 0 than the first trial’s peak of around 17. Although the shape of the lagged correlation doesn’t look perfect, it is definitely smoother than with the default setup.
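For intuition on why the position of the lagged correlation peak matters, the sketch below correlates a forecast with the actual series shifted by 0 to 24 hours. A forecast that merely repeats old values peaks at a large lag, while a good forecast peaks at or near 0. The function and the synthetic data are illustrative assumptions and may differ from how AutoCaffe draws its lagged correlation plot.

```python
import numpy as np

def lagged_correlation(predicted, actual, max_lag=24):
    """Correlation between predicted[t] and actual[t - lag] for lag = 0..max_lag."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    n = len(actual)
    return np.array([
        np.corrcoef(predicted[lag:], actual[:n - lag])[0, 1]
        for lag in range(max_lag + 1)
    ])

rng = np.random.default_rng(1)
actual = rng.normal(size=500)   # made-up series

# A "stale" forecast that simply repeats the value from 18 hours earlier.
stale_forecast = np.concatenate([np.zeros(18), actual[:-18]])

stale_peak = int(np.argmax(lagged_correlation(stale_forecast, actual)))
perfect_peak = int(np.argmax(lagged_correlation(actual, actual)))
print(stale_peak, perfect_peak)
```

A peak close to the lead time suggests the model is mostly echoing the latest known value, which is what we saw in the first trial.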

**Trade.m**

Since we normalized the data earlier, we de-normalized the values before executing the trades.
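The round trip between normalized and original values can be sketched as follows. Note the assumption: this example uses z-score standardization (subtract the mean, divide by the standard deviation), which may not be the exact formula applied in prep.m, and all function names and numbers are made up for illustration.

```python
import numpy as np

def fit_scaler(values):
    """Return the mean and standard deviation used for scaling (assumed z-score)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std())

def normalize(values, mean, std):
    return (np.asarray(values, dtype=float) - mean) / std

def denormalize(values, mean, std):
    return np.asarray(values, dtype=float) * std + mean

energy = np.array([1200.0, 800.0, 1500.0, 300.0])   # made-up energy values
mean, std = fit_scaler(energy)

scaled = normalize(energy, mean, std)        # what the network trains on
restored = denormalize(scaled, mean, std)    # what the trading step needs
print(scaled, restored)
```

The important point is that the same mean and standard deviation must be reused when converting predictions back, otherwise the traded quantities would be on the wrong scale.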

From Figure S, we can clearly see that our model not only achieves a consistently accurate forecast, which generates Sale and Cash at a positive level, but also keeps Wasted low and Penalty at zero.

In detail, our model started off with just 100,000 dollars and closed with a net profit of 15,821,632.67 dollars after around 1 year and 1 month, which is a pretty decent profit! That is a difference of around 5 million dollars compared to the First Trial.

**Live Deployment Result**

Our model was put to the test over a period of 1 week (22nd July 2020 00:00 UTC to 28th July 2020 23:00 UTC). We are extremely grateful and happy that, out of 48 teams, our model yielded the third highest profit!

# Further Improvements

During the live deployment, our model kept overestimating the energy generated when the actual energy generated was lower than average. We tried to figure out why our model was doing badly, especially during periods of low energy generation. What we found could be useful for improving our model.

Trying to learn from our model’s forecasts, we plotted the following graph.

The graph shows our forecast in orange and the actual energy generation in blue. Below it, we plotted humidity to find out whether there was any correlation between the energy forecast and humidity. Humidity is also usually tied to the weather, so out of curiosity we decided to analyze the weather as well. Here is what we found.


The humid air implies lower density resulting in lower power from a wind turbine.

— Danook, S. H., Jassim, K. J., & Hussein, A. M. (2019). The impact of humidity on performance of wind turbine. *Case Studies in Thermal Engineering*, *14*, 100456. https://doi.org/10.1016/j.csite.2019.100456

Hence, humidity and weather appear to play a crucial role in undermining our model. We could learn from this by including weather and humidity data in the future to make our model more robust.

# Disclaimer of Our Model

Our model, accurate as it is at predicting and maximizing profit from energy prices, is imperfect and has its limitations. There are various approaches that we did not explore due to our lack of experience with different statistical methods, and these could have improved our test loss. They include feature engineering and data fusion techniques (e.g. interpolation, spatial averages, alignment, aggregation, temporal averages). Beyond the default averaging and interpolation, additional statistical methods such as Subset Selection, Shrinkage, Non-linear models, and Tree-based methods could have yielded a lower test loss. Since we were still relatively inexperienced with neural networks, we felt it was best to keep to what we knew and make our neural network as simple as possible, without including too many statistical methods that might compromise our model. As a result, we acknowledge that our model can still be improved once we have gained more experience and knowledge of the aforementioned methods.

In our feature engineering, we could use additional features from external datasets, such as rain and temperature forecasts, to improve the relevance of the model. These extra datasets could be collected using data extraction techniques such as web scraping and API extraction. Due to our lack of experience and time, we chose to keep our model as simple as possible and not to include additional features, since when we did so we were not able to achieve a lower test loss. With a little more time and experience, we have no doubt that our model could be further improved by including additional features from external datasets.

# Conclusion

As Arthur Ashe once said, “Success is a journey, not a destination. The doing is often more important than the outcome.” Similarly, since the competition started on 29 June 2020, this journey has been fulfilling, as we have learned a vast array of deep learning techniques from Arnold and his team at Ai4Impact. Through our “doing”, we have learned and discussed various deep learning techniques as a team, explored many ideas that led us to run extensive concurrent tests, and brainstormed many alternative techniques that could improve our model if given the time. As such, not only did we forge a friendship, but the experience also sparked our interest in continuing to learn more about machine learning (including deep learning). In the future, we hope to apply our deep learning techniques in whichever organization we join, be it private or government, to achieve better predictive analysis and improve organizational efficiency.

Thank you for reading this article! Feel free to provide any feedback/comments or to connect with us on LinkedIn:

- Leow Yong Sen: www.linkedin.com/in/yongsenleow
- Goh Jia Jun: https://www.linkedin.com/in/goh-jia-jun-587980182
- John Vijay Balasupramaniam: https://www.linkedin.com/in/john-vj
- Lee Kahhe: https://www.linkedin.com/in/jacobkhlee

*Written by: **Kahhe Lee**, Leow Yong Sen, Goh Jia Jun, and John Vijay Balasupramaniam.*