Building Energy Usage Prediction

A meandering exploration in time series data using Facebook Prophet

Konrad Siebor
The Startup
8 min read · Feb 25, 2020


Intro

Data is being generated at an unprecedented rate. For data scientists, this represents an excellent opportunity to uncover insights that would have otherwise been overlooked. In this project, I decided to challenge myself by looking into a massive time-series data set. I’ve always been interested in the energy sector, whether this involves emerging innovations or established industries, so I began my data search there.

Data Set

I found an interesting data set on Kaggle provided by the American Society of Heating, Refrigerating, and Air-Conditioning Engineers (they’re a real hit at parties). The general purpose of this data is to predict the energy usage of a building given its unique metadata along with the associated weather of the area. The data was split into three CSV files: one for the building energy usage over time, a second containing unique features of the various buildings, and a final file containing weather data over the course of a year.

Merging and Clean-up

Let’s dive deeper into the data set. My ultimate goal is to predict existing (and potentially future) energy usage based on both weather and building features. The first problem we run into is that the relevant data is stratified across three CSV files. One contains the values to be predicted, while the other two contain wildly different types of contextual features. They do, however, contain commonalities that can be used to link them together.

An artist’s rendition of the overall data structure

We can visualize the overall structure of the data in the above image. The data set contains 1449 buildings (blue squares) located within a region (grey blob). These buildings are scattered across 15 distinct site locations (green circles). Each building has a unique 0–1448 identifier along with a 0–14 site Id. In addition to containing the energy meter reading for every building on an hourly basis, the data set also contains the hourly weather data for each site over the course of a year.

The data set is preposterously large; however, we’re going to make it even larger. After importing the CSV files into pandas data frames, let’s merge the meter reading data (the feature we are trying to predict) with the building feature set, the common key being the building id. Now, we can merge this data set with the weather data; however, we need to ensure that each building gets the weather associated with the correct site id, and that the weather report corresponds to the time of the meter reading. This is why we merge on both the ‘site_id’ and ‘timestamp’ features.
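A rough sketch of the merge in pandas, assuming the file and column names from the Kaggle competition (‘train.csv’, ‘building_metadata.csv’, ‘weather_train.csv’, keyed on ‘building_id’, ‘site_id’, and ‘timestamp’):

```python
import pandas as pd

# File names as provided by the ASHRAE Kaggle competition (assumed here)
meter = pd.read_csv("train.csv", parse_dates=["timestamp"])
buildings = pd.read_csv("building_metadata.csv")
weather = pd.read_csv("weather_train.csv", parse_dates=["timestamp"])

# Attach each building's metadata to its meter readings via the shared building_id
df = meter.merge(buildings, on="building_id", how="left")

# Attach the weather report recorded at that building's site for that hour
df = df.merge(weather, on=["site_id", "timestamp"], how="left")

print(df.shape)  # on the order of 20 million rows
```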

So what is the final output of these operations? A data frame with over 20,000,000 rows that contains the meter reading and weather report for every building for every hour over the course of an entire year.

This data set has some serious issues, however, namely copious amounts of nonexistent, or null, values.

An avalanche of null values, stratified by feature column

This will require some work. Let’s take a look at each column in turn. The features ‘year_built’ and ‘floor_count’ both have more null values than actual values, a staggering 12 million and 16 million respectively. I doubt there is any reasonable way to clean these columns: there is simply too much missing data. As such, we can simply drop them.

‘Air_temperature’, ‘dew_temperature’, and ‘wind_speed’ all look promising. The vast majority of the rows contain valid values: it would be a waste to throw them out. Instead, let’s drop only the rows with missing values, which should have a negligible impact on the overall data frame size.

Now for the middle children. What can we do with ‘cloud_coverage’, ‘precip_depth_1_hr’, ‘sea_level_pressure’, and ‘wind_direction’? Rather than simply dropping these values, we can replace them with the mean of the other values in the column. While this may not be ideal, it preserves the trove of rich, quantitative data embedded in the columns. This is not quite possible for the wind direction feature, which is represented as a compass bearing. While there may be a way of assigning an “average” direction, it did not seem appropriate in this case, so the column was simply dropped.
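In pandas, the clean-up described above might look roughly like this (column names taken from the competition data):

```python
# Mostly empty: drop these columns entirely
df = df.drop(columns=["year_built", "floor_count"])

# Mostly complete: drop the few rows that are missing these values
df = df.dropna(subset=["air_temperature", "dew_temperature", "wind_speed"])

# The middle children: impute missing values with the column mean
for col in ["cloud_coverage", "precip_depth_1_hr", "sea_level_pressure"]:
    df[col] = df[col].fillna(df[col].mean())

# Wind direction is a compass bearing, so a simple mean is meaningless; drop it
df = df.drop(columns=["wind_direction"])
```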

Visualizations

Now that we have the data processed, let’s create some all-important visualizations. Enjoy the montage:

Relative building square footage distribution: Skewed right in the parlance of statistics
Building count by primary use: What the US educational system lacks in quality, it makes up for in quantity
Meter reading vs air temperature: Let’s all pretend we can see that slightly positive, direct relationship
Meter reading vs square feet: Only the happy few can live in grandeur
Temperature over time: Who bothers labeling axes anyways?
Average building energy usage for location site 3: March 13th and November 6th were days to remember

Modeling

With the visualizations out of the way, let’s turn to the most exciting component of the process: producing a predictive model. Since we have the data formatted almost perfectly, let’s feed it into a model from scikit-learn, a popular Python machine learning library. First, we need to transform the categorical feature ‘primary_use’ into numerical values using dummy variables (an automated form of one-hot encoding).
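With pandas, the dummy-variable step is a one-liner:

```python
# Expand the categorical primary_use column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["primary_use"])
```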

Now, let’s use a straightforward multivariate linear regression because we’re dealing with only quantitative data. We can use a train-test split to break the data set into two sets: one that will fine-tune the model parameters through training and a second set onto which the model will be applied to test the accuracy of its predictions. Notice that certain features are dropped, such as the actual meter reading (the model can’t know the values it is trying to predict), the timestamp (the linear regression model does not understand time series structure), and the relevant IDs. In other words, the model is learning purely from the building features and the weather at the time of the meter reading.

Now all we need to do is input the training data into the model and output our predictions.
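A minimal sketch of the split, fit, and evaluation, assuming the merged and encoded data frame from earlier:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# The target is the meter reading; drop it, the timestamp, and the IDs from the features
X = df.drop(columns=["meter_reading", "timestamp", "building_id", "site_id"])
y = df["meter_reading"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(model.score(X_test, y_test))  # R^2 on the held-out test set
```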

Let’s take a look at our performance:

Dismal performance metrics

Absolute garbage. Only ~24% of the total variance in the data is explained by the model. This is a properly dismal result. We could try another, more advanced supervised learning technique such as an MLPRegressor; however, there would be no point. Apart from taking a ridiculously long time to train (a simple lambda expression over a column leads to memory errors due to the sheer size of the data set), this approach is ultimately futile.

When initially diving into this data set, I am ashamed to say that this was my intended approach. Use the timestamps to merge the data sets into a mega data frame, ditch the time column, and pump the whole thing into a standard supervised model that outputs a continuous predicted value. Unfortunately, this simply cannot work. The data is too erratic, interconnected, and, most importantly, time sensitive. In other words, I had forgotten that this was time series data. The relative temporal positioning of the data is its most important feature. The past and current values for building energy usage directly inform future values.

Fundamentally, this requires a different approach.

Time Series Modeling

Rather than using a generic regressor, let’s try a model specifically built to tackle time series data. Writing a custom implementation would be quite a feat; fortunately, the bright people at a company called Facebook have created an open-source, additive time series forecasting model called Prophet. This model finds patterns in the data across a series of timescales (an overall trend plus yearly, weekly, and daily seasonality) to predict values into the future.

Let’s test out the model first by simply feeding it the daily temperature data and extrapolating one year into the future.
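A sketch of that first experiment, assuming the hourly weather for one (arbitrarily chosen) site is resampled to daily averages; Prophet expects a two-column frame named ‘ds’ and ‘y’:

```python
from fbprophet import Prophet

# Daily average air temperature for a single site, renamed to Prophet's expected columns
daily_temp = (
    weather[weather["site_id"] == 3]
    .set_index("timestamp")["air_temperature"]
    .resample("D").mean()
    .reset_index()
    .rename(columns={"timestamp": "ds", "air_temperature": "y"})
)

m = Prophet()
m.fit(daily_temp)

# Extend the timeline one year past the end of the data and forecast
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
```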

We can now graph these results to better visualize them:
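Prophet ships with a built-in plotting helper for exactly this:

```python
import matplotlib.pyplot as plt

# Plot the observed points, the forecast line, and its uncertainty band
fig = m.plot(forecast)
plt.show()
```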

Air temperature predictions

There’s a lot happening in this graph, so let’s break it down. The black dots represent the actual temperature values in the data set’s time interval. The blue line represents the value predicted by the model. The light blue region represents the upper and lower bounds of the uncertainty the model has assigned to these values. Overall, it seems like the model is doing a good job. It appears to have captured the overall temperature profile associated with the changing of the seasons. Of course, there is no way of verifying the validity of those predictions without the associated data. Furthermore, as fbprophet is a pre-built model, its inner workings are a bit of a black box. That being said, its ease of use means that it serves as an excellent starting point for this type of forecasting.

Let’s use this same technique to predict the average energy usage of all the buildings in site 3 two years into the future (when visualizing the data, an individual building’s energy usage proved too erratic to yield a nice continuous prediction).
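The same recipe applies to the aggregated meter readings; a sketch, assuming the merged frame ‘df’ from earlier:

```python
# Average hourly meter reading across all buildings at site 3, resampled to daily means
site3 = (
    df[df["site_id"] == 3]
    .groupby("timestamp")["meter_reading"].mean()
    .resample("D").mean()
    .reset_index()
    .rename(columns={"timestamp": "ds", "meter_reading": "y"})
)

m = Prophet()
m.fit(site3)

# Forecast two years (730 days) beyond the end of the data
forecast = m.predict(m.make_future_dataframe(periods=730))
m.plot(forecast)
```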

Site 3 average building energy usage predictions

Again, visually the model seems to be doing a good job. Overall, it predicts that the energy usage will follow the same overarching oscillating pattern; however, the local maxima will steadily decrease.

Conclusion

How can we expand on these predictions? How about using the weather patterns along with the existing meter readings to predict future meter readings? In this case, the time-series weather data would be known as an added regressor or exogenous variable. It provides additional context for the model to fine-tune its predictions. fbprophet’s added regressors must themselves be known over the forecast horizon; the model will not jointly forecast more than one unknown variable in parallel. You can, however, use known values to augment the prediction of another unknown value.

For example, we know both the weather and energy usage data over the course of the year. Rather than using both features to predict the energy output for the subsequent year, you can use the first half of the energy data along with the full yearly weather report to predict the second half of the year’s energy usage.
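One way to sketch that idea uses Prophet’s add_regressor, which attaches a known exogenous series to the model. The frame ‘daily’ below is hypothetical: daily site averages with columns ‘ds’, ‘y’ (meter reading), and ‘air_temperature’, over a training year assumed to be 2016.

```python
# Hypothetical frame 'daily': daily average energy usage and air temperature for one site
first_half = daily[daily["ds"] < "2016-07-01"]
second_half = daily[daily["ds"] >= "2016-07-01"]

m = Prophet()
m.add_regressor("air_temperature")  # weather as an exogenous variable
m.fit(first_half)  # train on the first half of the year's energy data

# The forecast frame must carry the known regressor values for the second half
future = second_half[["ds", "air_temperature"]]
forecast = m.predict(future)
```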

This seems like a promising direction; however, as I am a full-time college student, I will put this idea on hold for now in order to return to work I should actually be doing. Check back for a future edit...
