Predicting Counterfactual Building Energy Consumption

Mohammed Zoher
Published in CodeX · 20 min read · Sep 8, 2021

Introduction

In recent years there has been an increasing emphasis on reducing the energy consumption and emissions of buildings. Minimizing energy consumption not only helps financially but also has a significant positive impact on the environment. As investments are made to cut energy usage, the question remains: how do we measure their benefits?

Business Problem

The problem we are trying to solve not only has financial significance but can lead to significant environmental benefits as well. As environmental awareness has grown, businesses are actively making efforts to reduce their carbon footprint, and one such effort is reducing the energy they consume. Commercial buildings are being retrofitted to ensure efficient energy consumption. However, to quantify the value these investments generate, we need to be able to compare post-retrofit energy consumption levels against the counterfactual levels (without retrofit investments). This is important not only to identify the effectiveness of investments, but also to raise awareness about the potential savings that can be generated.

In the past, researchers have tried various approaches to estimate these counterfactual consumption values, but these are limited by meter type or only apply to specific buildings. Hence, the business problem we are trying to solve is to create models that can predict counterfactual energy consumption values irrespective of the meter type or building type. In a nutshell, our challenge is to build a scalable solution that is indifferent to building type or meter type.

Source of Data

The dataset we will be working with was collected by ASHRAE, the American Society of Heating, Refrigerating and Air-Conditioning Engineers. ASHRAE was founded in 1894 with the objective of advancing technology in the fields of heating, ventilation, air conditioning, and refrigeration. The energy (meter reading) dataset was collected over a period of three years, from January 1, 2016 to December 31, 2018, from 16 sites located around the world. Corresponding weather data for each of these sites has also been provided.

Data Description

The data has been divided into three linked tables, each of which is described in detail below.

Meter Readings

This table contains the hourly meter readings for a given building and meter type. These meter readings are our target variables or what our machine learning algorithms ultimately need to predict. There are four distinct meter types for which we need to predict energy consumption levels: electricity, chilled water, steam, hot water.

Building Metadata

The meter readings have been collected from 1449 buildings located across the world. For each building we have been provided with its primary use, area in square feet, year of construction, and floor count.

Weather Data

The 1449 buildings are grouped into 16 sites based on their location. Hourly weather logs, such as air temperature and cloud coverage, have been provided for these sites.

Data Tables and Connections

ML Problem Formulation

Now that we have understood the overall business objective and the dataset, we need to translate it into a data science objective. Using historical energy consumption data, we need to predict future counterfactual energy consumption levels. Since the target variable (energy consumption) is real-valued, in ML terms we have a time-series forecasting or regression problem. And since, apart from historical load data, we also have additional data sources in the form of building metadata and weather data, more precisely we have a multivariate regression problem.

Model Evaluation

For any data science problem we need to be able to objectively evaluate the performance of our machine learning models. Hence, we need to clearly define a performance metric suitable for our problem.

The performance metric that we will use to evaluate and measure our models is Root Mean Squared Logarithmic Error, which is mathematically defined as follows:

RMSLE = sqrt( (1/n) * Σ_i ( log(p_i + 1) - log(a_i + 1) )^2 )

where p_i is the predicted and a_i the actual meter reading.

As can be seen from the equation, in comparison to simple RMSE we first apply the natural logarithm and then compute the squared difference between actual and predicted values. We propose RMSLE instead of plain RMSE because our target variable (energy consumption) is skewed, with possibly very large non-negative values. Applying the log transformation ensures that the metric is not dominated by large outlier values. Additionally, RMSLE measures relative rather than absolute error, which is what we want given the wide range of scales in the data.
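As a quick sketch, the metric above can be computed with NumPy (the function name here is ours):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error.

    log1p(x) = log(1 + x) keeps the metric defined for zero readings
    and dampens the influence of very large consumption values.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```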

Exploratory Data Analysis

Before jumping into solving our problem with complex machine learning algorithms, it is crucial to get a deep understanding into the data available and identify correlations between feature variables and our target variable.

We will focus our EDA efforts with three broad objectives in mind:

  • Data Understanding
  • Identifying Anomalies
  • Relationship between dependent variables and meter readings

Data Understanding

Under this section, our objective will be to understand the different data tables that have been provided and get a sense and feel of the different raw features available under each of these data tables.

Let’s first take a look at the different meter types and analyze our target variable (meter readings).

Count of Records by Meter Type

The above plot depicts the count of records by meter type. Most of the observations pertain to meter id 0: out of roughly 20 million meter readings, around 12 million (approx. 60%) relate to meter id 0 alone. Meter id 1 has the second highest count at around 4 million (approx. 20%), and the remaining two meter types make up the rest of the observations.

Meter Reading Distributions for Building ID 1232
Meter Reading Distributions for Building ID 1294

The above plots depict the distribution of meter readings for a given building id, with each subplot showing the energy consumption distribution for one meter type. We have randomly chosen two buildings (which are installed with all four meter types) to analyze the distributions. A careful analysis reveals that almost all plots contain a number of meter readings with zero values. Another observation is that, for all meter types, we see a range of normal readings along with a few extreme values; in other words, the distributions are right-skewed.

Now we will explore the air temperature time series from our weather data.

Average Temperature trend at Site 9
Average Temperature trend at Site 8

These line plots illustrate the trends of daily average temperature for two randomly chosen sites. From the above plots, we can observe seasonal trends based on the average daily temperature. Both the plots show lower temperatures around the months of Jan-Mar. From the month of May, we can observe that the temperatures continue to increase up until around July and they start falling again from September. Intriguingly, when I plotted the same information for all remaining sites, a similar pattern was observed. Based on this we can conclude that all of our sites and consequently all of our buildings are located in the Northern Hemisphere.

Now we will highlight a couple of features available in our building metadata: square feet and primary use.

Count of Buildings by Primary Use

The bar graph shows the building counts by their primary use. As observed, roughly 35–40% of the buildings are primarily used for educational purposes. After education, the second most common primary use is “Office”, which represents approximately 20% of all buildings. As most of the buildings are either educational institutions or office spaces, it would be important to consider factors such as time of day, day type, and public holidays while selecting features to predict the target variable.

Distribution of Building Square Feet

The above plot depicts the distribution of building area in square feet. The distribution appears to be right-skewed, with most buildings having an area of less than 200k square feet and only a small proportion spread across a large area. Although most of the buildings have similar area coverage, we still observe some variation in the distribution. And because energy requirements are largely affected by a building’s size, we need to account for this while defining our features.

Identifying Anomalies

Under this section we will focus our efforts on finding any anomalies in our data which we might need to correct during data preprocessing.

Meter Readings for Building ID 320 and Meter ID 0
Meter Readings for Building ID 103 and Meter ID 1
Meter Readings for Building ID 3 and Meter ID 0

The above plots are time series plots filtered by building id and a specific meter type. We have tried to plot samples of each of the different abnormalities observed throughout the dataset. In the first plot we can observe a sudden negative spike. The second plot is the exact opposite: constant values followed by a sudden positive spike. The third plot represents a scenario prevalent in the data, where we have constant readings of zero for a long period of time. Because the meter readings are the target variable we need to predict, it is crucial to detect such anomalies across the dataset and remove them.

Several competition participants identified that, for a few sites in the weather data, the timestamps were in UTC and not local time. To analyze this further, we will plot, for a few site ids, the average temperature at each hour of the day across the whole year.

Average Temperature by Hour Site 2
Average Temperature by Hour Site 13

The above plots, filtered by site id, show the average temperature at a given hour across the observed time period. In both plots, the temperature peaks at around 2100 hours (9pm) and drops at around 1200 hours (noon). This is counter-intuitive: we generally expect the temperature to peak around noon and be lower at night. Based on these observations, we can conclude that the timestamps in the weather data are in UTC, not local time.
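This check can be reproduced with a short pandas sketch (column names follow the competition data; `hourly_peak` is a helper we introduce for illustration):

```python
import pandas as pd

def hourly_peak(weather: pd.DataFrame, site_id: int) -> int:
    """Hour of day with the highest average air temperature for a site.
    A peak far from early afternoon suggests the timestamps are in UTC."""
    sub = weather[weather["site_id"] == site_id]
    return int(sub.groupby(sub["timestamp"].dt.hour)["air_temperature"]
               .mean().idxmax())
```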

Relationship between Features and Target Variable

Under this section our objective will be to decode the relationship between our various features and target variable (meter reading). All analysis and exploration will be focused on finding features which are correlated or affect energy consumption.

Let’s begin by looking at how datetime features such as hour of day and type of day affect energy consumption.

Energy Consumption by day of week

The bar graph illustrates the difference between energy consumption on weekdays and weekends, with each subplot filtered by meter id. For meter types 0, 1 and 2 we can clearly observe that the median consumption on weekdays is higher than on weekends. For meter type 3 the median consumption looks similar on weekends and weekdays, but a closer look reveals that consumption is slightly higher on weekdays.

Energy Consumption by hour of day

These plots depict the energy consumption patterns by hour of day, with each subplot filtered by meter type. One common observation across all meter types is that energy consumption is comparatively lower during the night and higher during the daytime. Meter ids 0 and 1 show an increase in energy consumption starting from around 10am, reaching a peak by afternoon. On the other hand, meter ids 2 and 3 reach their peak between 6am and 8am and then start falling by 9am. According to the data description, meter ids 0, 1, 2 and 3 represent electricity, chilled water, steam and hot water consumption respectively. In this sense the observations seem intuitive: we would expect chilled-water-related energy consumption to peak during afternoon hours and fall off during the night, and this is exactly the behavior observed in the plot for meter id 1.

Next we will move on to identifying correlation between our weather features and target variable.

Weather Correlation Heatmap for Building ID 1296
Weather Correlation Heatmap for Building ID 1293

These heat-maps depict the correlation between weather variables and our target variable, meter reading. We randomly selected two buildings which have all four meter types and measured the correlation coefficient between meter readings and weather variables. For both buildings, air temperature and dew temperature have a very strong correlation with meter ids 1, 2 and 3. Both are positively correlated with readings from meter id 1; since this meter relates to chilled water, this makes intuitive sense, as higher temperatures mean increased cooling requirements. On the contrary, air temperature and dew temperature have a strong negative correlation with readings from meter ids 2 and 3, indicating that as the temperature drops, energy consumption increases, which again makes sense as meters 2 and 3 are related to heating requirements. Wind direction, wind speed and sea level pressure show weak correlations with meter readings, and the other variables have correlation values around zero.
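A minimal pandas sketch of this correlation analysis might look as follows (the function and the reduced column list are illustrative):

```python
import pandas as pd

def weather_target_corr(df: pd.DataFrame, building_id: int, meter: int) -> pd.Series:
    """Pearson correlation of each weather variable with the meter
    reading, for one building and one meter type."""
    sub = df[(df["building_id"] == building_id) & (df["meter"] == meter)]
    cols = ["air_temperature", "dew_temperature", "meter_reading"]
    return sub[cols].corr()["meter_reading"].drop("meter_reading")
```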

Now we will focus on understanding how some of the building attributes can impact energy consumption patterns.

Scatter Plot: Building Square Feet vs Energy Consumption

These scatter plots try to represent correlation between building square feet and energy consumption for a given meter type. For each of the plots, we can observe that in most cases as the building area or square feet increases, we see an increase in the median consumption by the building. There is not an extremely strong correlation between building area and energy consumption, but we can generally say that as building area increases, we can expect an increase in energy consumption.

Energy Consumption by Building Primary Use

These plots compare energy consumption by the primary use of a building. A quick look at the plot for each meter type reveals that energy consumption varies greatly by building type. For meter id 0, buildings with a primary use of healthcare have the highest median daily energy consumption. Similarly, meter ids 1, 2 and 3 show the highest consumption for the building types utility, services and public services respectively.

Data Processing and Cleaning

Now that we have a good understanding of any inconsistencies in the data, we can work on cleaning and processing them. In this section we will focus on four broad activities:

  • Weather Data — Time Zone Offset
  • Cleaning Constant Zero Values
  • Cleaning Abnormal Spikes
  • Missing Value Imputation

Weather Data — Time Zone Offset

As discussed during the EDA phase, for a few site ids in the weather data we need to offset the time zones. The table below provides the offset to be applied for each site.

Time Zone Offset
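A sketch of how such an offset table could be applied with pandas; note the offset values below are illustrative placeholders, not the actual per-site offsets:

```python
import pandas as pd

# Hours behind UTC for each site_id (values here are ILLUSTRATIVE
# placeholders, not the offsets derived in the analysis above).
SITE_TZ_OFFSET = {0: 5, 1: 0, 2: 8}

def localize_weather(weather: pd.DataFrame) -> pd.DataFrame:
    """Shift each site's UTC timestamps back to local time."""
    weather = weather.copy()
    offsets = weather["site_id"].map(SITE_TZ_OFFSET).fillna(0)
    weather["timestamp"] = weather["timestamp"] - pd.to_timedelta(offsets, unit="h")
    return weather
```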

Cleaning Constant Zero Values

During our EDA, we observed meter readings with constant zero values over long periods of time. These could be either because the corresponding system has been shut down (for seasonal reasons) or simply anomalies that we need to clean up. To remove the unjustified zero readings we will take the following approach:

  • For meter id 0, remove all those rows where there are zero readings for 48 or more consecutive hours irrespective of the season. Since meter id 0 is related to electricity consumption we expect that it should be functional throughout the year.
  • For meter id 1 (chilled water), remove all those rows where there are zero readings for 48 or more consecutive hours except during winter (Dec, Jan, Feb, Mar).
  • For meter id 2 (steam) & meter id 3 (hot water), remove all those rows where there are zero readings for 48 or more consecutive hours except during summer (Jun, Jul, Aug, Sep).
Function for cleaning unjustified zero readings
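A possible pandas implementation of the rules above (a sketch; it assumes hourly rows and the competition's column names):

```python
import pandas as pd

WINTER = {12, 1, 2, 3}   # months exempt for chilled water (meter 1)
SUMMER = {6, 7, 8, 9}    # months exempt for steam / hot water (meters 2, 3)

def drop_long_zero_runs(df: pd.DataFrame, min_hours: int = 48) -> pd.DataFrame:
    """Drop readings in zero-runs of >= min_hours consecutive hours,
    unless the run falls in a season where a shutdown is plausible."""
    df = df.sort_values(["building_id", "meter", "timestamp"]).copy()
    is_zero = df["meter_reading"].eq(0)
    # A new run starts whenever the zero-flag or the (building, meter) changes
    new_run = ((is_zero != is_zero.shift()) |
               (df["building_id"] != df["building_id"].shift()) |
               (df["meter"] != df["meter"].shift()))
    run_id = new_run.cumsum()
    run_len = df.groupby(run_id)["meter_reading"].transform("size")
    month = df["timestamp"].dt.month
    exempt = ((df["meter"].eq(1) & month.isin(WINTER)) |
              (df["meter"].isin([2, 3]) & month.isin(SUMMER)))
    bad = is_zero & run_len.ge(min_hours) & ~exempt
    return df[~bad]
```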

Cleaning Abnormal Spikes

In addition to the constant zero values, during our EDA we also observed abnormal positive and negative spikes. It is important to clean these out so that they do not affect our model training process. To identify these abnormal spikes we will create groups based on building_id, meter, and season. Within each group, we will calculate z-scores for the meter readings and remove any reading whose absolute z-score is 4 or more, i.e., any reading more than 4 standard deviations away from the group mean.

Since all our buildings are in the Northern Hemisphere, we will define the seasons as below:

Seasons Definition
Function to identify abnormal spikes
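A sketch of this z-score based filtering with pandas, assuming the same month-to-season buckets used for the zero-value cleaning:

```python
import pandas as pd

# Month -> season buckets (assumed; mirroring the zero-cleaning logic)
SEASON = {12: "winter", 1: "winter", 2: "winter", 3: "winter",
          4: "spring", 5: "spring",
          6: "summer", 7: "summer", 8: "summer", 9: "summer",
          10: "autumn", 11: "autumn"}

def drop_spikes(df: pd.DataFrame, z_max: float = 4.0) -> pd.DataFrame:
    """Remove readings more than z_max standard deviations away from
    the mean of their (building_id, meter, season) group."""
    df = df.copy()
    df["season"] = df["timestamp"].dt.month.map(SEASON)
    grp = df.groupby(["building_id", "meter", "season"])["meter_reading"]
    z = (df["meter_reading"] - grp.transform("mean")) / grp.transform("std")
    keep = ~(z.abs() >= z_max)  # NaN z-scores (constant groups) are kept
    return df.loc[keep].drop(columns="season")
```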

Missing Value Imputation

During our data understanding phase, we found out that most of our weather variables have missing observations or records. We will be imputing these missing values using a simple linear interpolation strategy.

Function to impute missing weather variables using linear interpolation
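A minimal sketch of the interpolation step with pandas (the caller passes the list of weather columns to fill):

```python
import pandas as pd

def interpolate_weather(weather: pd.DataFrame, cols) -> pd.DataFrame:
    """Linearly interpolate missing weather values within each site,
    filling leading/trailing gaps with the nearest observed value."""
    weather = weather.sort_values(["site_id", "timestamp"]).copy()
    weather[cols] = (weather.groupby("site_id")[cols]
                     .transform(lambda s: s.interpolate(method="linear",
                                                        limit_direction="both")))
    return weather
```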

Feature Engineering

The most crucial element of solving any machine learning problem is to be able to engineer features which are highly correlated with our target variable. We will apply our learnings from the third phase of EDA (where we explored the relationship between raw features and target variables) while creating new features.

Temperature Differencing Features

Apart from the raw weather features that we have already retained, we will be creating two additional lag-based features for air temperature, as defined below.

Function to create first and second order temperature differencing
  1. first_order_temp: Here we will be performing first order differencing of air temperatures for each site. In other words this means the difference in air temperature between the current and previous timestamp.
  2. second_order_temp: Here we will be performing second order differencing of air temperatures for each site. In other words this means the difference of the first order air temperature difference between the current and previous timestamp.
First Order Differencing
Second Order Differencing
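The two differencing features can be sketched with pandas `groupby(...).diff()` (the function name is ours):

```python
import pandas as pd

def add_temp_diffs(weather: pd.DataFrame) -> pd.DataFrame:
    """First- and second-order air temperature differences per site."""
    weather = weather.sort_values(["site_id", "timestamp"]).copy()
    # Difference between the current and previous timestamp, per site
    weather["first_order_temp"] = weather.groupby("site_id")["air_temperature"].diff()
    # Difference of the first-order differences, per site
    weather["second_order_temp"] = weather.groupby("site_id")["first_order_temp"].diff()
    return weather
```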

Datetime Features

In this section we will be working on engineering features out of the timestamps. To retain the cyclic nature of datetime features, we will mostly represent them in two dimensions using sine and cosine transformations. It is important to do so because if we consider time-based features in their sequential order, we would not be able to reflect the true relationships between different values. For example, intuitively 11pm and 12am (midnight) should be represented close to each other because we know that 12am occurs right after 11pm. However, when represented as sequential numbers (23 and 0), this relationship is not preserved.

A mathematical explanation and visual representation of the same can be seen in the image below.

Hour Data Transformation

As quite evident from the above plot, by transforming hourly data into two dimensions we are able to preserve the relationship between the various hours in a day. For example, 12 am is now adjacent to both 11 pm as well as 1 am.

Month Data Transformation

Similarly, these transformations can also be applied to days of the week and months, because they follow the same cyclical patterns. An analogous transformation for months is shown in the figure above.

We will be creating the following features from the timestamp column:

  1. hour_x: First dimension of hour feature will be represented using sine transformation
  2. hour_y: Second dimension of hour feature will be represented using cosine transformation
  3. day_x: First dimension of day feature will be represented using sine transformation
  4. day_y: Second dimension of day feature will be represented using cosine transformation
  5. dayofweek_x: First dimension of day of week feature will be represented using sine transformation
  6. dayofweek_y: Second dimension of day of week feature will be represented using cosine transformation
  7. month_x: First dimension of month feature will be represented using sine transformation
  8. month_y: Second dimension of month feature will be represented using cosine transformation
  9. is_weekend: Binary feature to represent whether the observation falls on a weekend (tagged as 1) or a weekday (tagged as 0)
Datetime features sine and cosine transformation
Functions to create datetime features
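A compact sketch of these transformations (we use each unit's maximum value as its period; the helper name is ours):

```python
import numpy as np
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sine/cosine encoding of cyclic datetime units, plus is_weekend."""
    df = df.copy()
    ts = df["timestamp"]
    for name, values, period in [("hour", ts.dt.hour, 24),
                                 ("day", ts.dt.day, 31),
                                 ("dayofweek", ts.dt.dayofweek, 7),
                                 ("month", ts.dt.month, 12)]:
        df[f"{name}_x"] = np.sin(2 * np.pi * values / period)
        df[f"{name}_y"] = np.cos(2 * np.pi * values / period)
    df["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    return df
```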

Building Features

Now that we have engineered the required datetime features, we will work on the building metadata features. Before doing so, we will first split our training data into four parts based on meter id, i.e., we will train four separate models, one per meter type. While splitting the data, we will also transform our target variable (meter readings) using the log function.

Divide data based on meter type and applying log transformations to meter readings
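A sketch of the split-and-transform step; we use `log1p` here so that zero readings remain defined (an assumption about the exact transform used):

```python
import numpy as np
import pandas as pd

def split_by_meter(train: pd.DataFrame) -> dict:
    """One DataFrame per meter type, with log-transformed target."""
    out = {}
    for meter_id, part in train.groupby("meter"):
        part = part.copy()
        # log1p = log(1 + x), defined for zero meter readings
        part["meter_reading"] = np.log1p(part["meter_reading"])
        out[meter_id] = part
    return out
```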

Building metadata mostly involves categorical features like building id, site id, and primary use. In order to encode these categorical features we will be following the leave one out target encoding strategy.

Leave One Out Encoding is similar to mean target encoding, wherein we encode each category by the mean of the target variable within that category. However, an issue with naïve mean target encoding is the possibility of data leakage and overfitting. Hence, in Leave One Out Encoding, for a given observation we replace the category by the mean of the target variable over all other observations in that category, ignoring the given observation itself while computing the mean. It can be mathematically expressed as:

enc_i = ( Σ_{j≠i, c_j=c_i} y_j ) / (n_{c_i} - 1)

where c_i is the category of observation i and n_{c_i} is the number of observations in that category.

Additionally to make it more robust to overfitting and data leakage, we also add gaussian noise to the target encoding value.

Leave One Out Encoding example

To implement our leave one out encoding strategy we will be taking the help of the Category Encoders library.

Function to encode categorical features
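For intuition, leave-one-out encoding can also be written directly in pandas, without the Category Encoders library (a simplified sketch; singleton categories fall back to the global mean):

```python
import numpy as np
import pandas as pd

def loo_encode(df: pd.DataFrame, col: str, target: str,
               noise_std: float = 0.0, seed: int = 0) -> pd.Series:
    """Leave-one-out target encoding of one categorical column.

    Each row is encoded as the mean of the target over all *other*
    rows in the same category; optional gaussian noise further guards
    against leakage and overfitting.
    """
    grp = df.groupby(col)[target]
    total, count = grp.transform("sum"), grp.transform("count")
    enc = (total - df[target]) / (count - 1)
    # Singleton categories have no "other" rows; use the global mean
    enc = enc.where(count > 1, df[target].mean())
    if noise_std:
        rng = np.random.default_rng(seed)
        enc = enc + rng.normal(0.0, noise_std, size=len(enc))
    return enc
```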

Modelling

Now that we have analyzed and understood the data, performed the required data processing/cleaning and engineered features from the raw data, we can move forward with the modelling phase.

As we have already log transformed the target variable, in order to evaluate the performance of our models we will be using the RMSE metric.

For each machine learning algorithm that we experiment with, we will train four separate models, one per meter type. A visual representation of the modelling strategy is shown below.

Baseline Model

Because our performance metric is a squared error, which has no fixed upper limit, we first build a simple baseline model that predicts the mean of the train meter readings for every test observation. We can then compare the performance of our actual machine learning models against this baseline. Only if a machine learning model performs better than this simple baseline can we conclude that it has learned some patterns from the data.

Code Snippet for baseline predictions
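The baseline amounts to a one-liner (sketch):

```python
import numpy as np

def baseline_predict(train_target, n_test: int) -> np.ndarray:
    """Predict the train-set mean for every test observation."""
    return np.full(n_test, np.mean(train_target))
```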

Uploading the baseline predictions on Kaggle resulted in the below performance on unseen test datasets.

Baseline Model Performance

Single Models

Under this section we will be training two simple machine learning models as follows:

  1. Linear Regression (SGD)
  2. Decision Tree Regressor

For each of these models, we will tune the hyperparameters using a 5-fold grid search cross-validation strategy.

Function for hyperparameter tuning
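A sketch of the tuning step with scikit-learn's `GridSearchCV` (the grid below is illustrative, not the one actually searched):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

def tune_sgd(X, y):
    """5-fold grid search over an illustrative SGD hyperparameter grid."""
    grid = {"alpha": [1e-4, 1e-3, 1e-2], "penalty": ["l2", "l1"]}
    search = GridSearchCV(SGDRegressor(random_state=0), grid,
                          scoring="neg_root_mean_squared_error", cv=5)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```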

After training the models on the best hyperparameters found during cross validation, we measured the performance on Kaggle’s test dataset and results are shown below.

Linear Regression (SGD) Performance
Decision Tree Regressor Performance

Ensemble Models

Now that we have experimented with simple single models, we will focus on training ensemble-based models. The modelling strategy remains the same, i.e., a separate model per meter type. However, one aspect we will change is the hyperparameter optimization. Because our dataset is very large, GridSearchCV would be too computationally expensive for complex ensemble models. Hence we will tune hyperparameters using a technique known as Bayesian optimization.

Where GridSearchCV and RandomizedSearchCV fall short is that they do not link previous iterations (of hyperparameter choices) with future iterations, so they may spend a lot of time evaluating bad hyperparameters. This is where Bayesian optimization comes in: it uses past evaluations to focus on the most promising hyperparameters, and as a result can save considerable time compared to grid-based techniques by concentrating on the region that matters.

To tune hyperparameters using Bayesian optimization we will rely on a library called Hyperopt.

Under this section we will be experimenting with the following well-known ensemble algorithms:

  1. Random Forest Regressor
  2. XGBoost Regressor
  3. LGBM Regressor
Objective Function for each algorithm to apply bayesian optimization

As seen from the above code snippet, we have defined a separate objective function for each of our ensemble algorithms. The goal of Bayesian optimization is to search through the parameter grid and find the parameters which minimize the output of these functions. In our case, the score to minimize is the root mean squared error described earlier.

Once we have found the best parameters for each of our models, we use them to train each of the ensemble models mentioned earlier. The performance of these models on unseen data can be seen below.

Random Forest Model Performance
XGBoost Model Performance
LGBM Model Performance

Stacking Model

A stacking model uses the predictions of multiple base models to predict the target variable. In other words, the predictions of all base models are used as features on which a meta-model is trained. The below architecture diagram and mathematical expression represents how a stacking model works.

Architecture of Stacking Model
Prediction Equation of stacking model

For our stacking architecture, we will use the ensemble models trained earlier (Random Forest, XGBoost and LGBM) as the base models, and LGBM again as our meta model, as shown below.

Stacking of Ensemble Models
Preparing data for stacking model
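A minimal sketch of assembling the meta-model's features (a simplified scheme; the original pipeline reused its already-tuned RF/XGBoost/LGBM models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def make_stack_features(base_models, X_train, y_train, X_test):
    """Fit each base model and stack its predictions column-wise to
    build the meta-model's train and test feature matrices."""
    train_cols, test_cols = [], []
    for model in base_models:
        model.fit(X_train, y_train)
        train_cols.append(model.predict(X_train))
        test_cols.append(model.predict(X_test))
    return np.column_stack(train_cols), np.column_stack(test_cols)
```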

The hyperparameters of the meta regressor, i.e., LGBM, are again tuned using Bayesian optimization. The performance of the stacking model on unseen data can be observed below.

Stacking Model Performance

Summarizing Results

The performance on unseen data of all models that we have experimented with has been summarized in the below table.

+-------------------------+---------------+--------------+
| Model | Private Score | Public Score |
+-------------------------+---------------+--------------+
| Baseline | 2.355 | 2.089 |
| Linear Regression | 1.437 | 1.196 |
| Decision Tree Regressor | 1.365 | 1.195 |
| Random Forest Regressor | 1.381 | 1.206 |
| XGBoost Regressor | 1.325 | 1.142 |
| LGBM Regressor | 1.315 | 1.135 |
| Stacking Model | 1.341 | 1.169 |
+-------------------------+---------------+--------------+

It can be clearly observed from the above table that all our machine learning models have better results than the baseline predictions. Based on this we can conclude that all of our models have been able to learn some patterns or correlations in the energy consumption data.

Overall, LightGBM has the best performance among all models. The performance of the stacking model is slightly worse than that of the standalone LGBM regressor.

Future Work


We can further improve the performance metric by trying the following additional ideas:

  • In our feature engineering section, we can create enhanced weather-based features, such as relative humidity, which have an effect on energy consumption levels.
  • We have currently only looked at approaches involving machine learning algorithms. Additionally, we can look at exploring various deep learning approaches.
  • Currently, for each machine learning algorithm we train only 4 models (one per meter type). We can explore training separate models for each distinct building in our dataset.
