Predicting household energy consumption to identify efficiency opportunities

Helping Americans conserve household energy and save money

This blog was written by Aditya Bhide, Alberto De La Barra, Andy Chen, Yusuke Nobuta, and Ziad El Assal as part of the “Analytics in Action” course at Columbia Business School.

Climate change is the most critical problem currently facing humanity. Collectively, we emit about 34 billion tons of carbon dioxide into the atmosphere each year. Of this, about 7%, or 2.4 billion tons, is attributed to heating, cooling, and refrigeration. While the world is moving toward electrification and clean energy, renewables still need technology breakthroughs before they can be relied on in all weather conditions. Improving energy efficiency is therefore key to combating climate change.

Sealed is a company that helps tackle this problem by making American households more energy efficient. It takes a home’s characteristics and energy history data to create a custom model for the house that predicts energy savings. It then deploys a home upgrade plan consisting of insulation and HVAC upgrades, charging the customers only when energy savings are realized.

Currently, Sealed cannot market its product to households without energy history data. These are usually families that have just bought or moved into a house and do not have the previous owners' electricity or gas bills. As a team of engineering and MBA students at Columbia Business School, we designed a solution to this problem through the Analytics in Action course. We built a model that is trained on house characteristics (such as square footage, number of occupants, and number of rooms) together with historical energy bills, and that then predicts energy consumption for any household using house characteristics as the only input.

Let’s dig further into the problem.

A home’s annual energy usage is divided into 2 types:

1) Shoulder Usage: Consumption that would occur irrespective of outside temperature;

2) Weather Usage: Consumption either for heating or cooling driven by outside temperature.

To quantify weather usage, we use Heating Degree Days (HDD), which measure how cold it was over a period of days compared to a standard temperature (65°F in the US). Equivalently, HDD tell us by how many degrees we would need to raise the temperature by heating to reach the standard temperature (hence heating degree days). For example, a temperature of 35°F over two days represents (65°F − 35°F) × 2 days = 60 HDD.

We also use Cooling Degree Days (CDD), which follow the same idea as HDD but quantify how hot it was compared to the standard temperature.
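The two definitions can be sketched in a few lines of Python. This is only an illustration under the assumption that we have one mean outdoor temperature per day; the function names and the `BASE_TEMP_F` constant are ours, not part of Sealed's tooling.

```python
BASE_TEMP_F = 65.0  # standard base temperature used in the US


def heating_degree_days(daily_mean_temps_f, base=BASE_TEMP_F):
    """Sum, over days, of how many degrees each day fell below the base."""
    return sum(max(base - t, 0.0) for t in daily_mean_temps_f)


def cooling_degree_days(daily_mean_temps_f, base=BASE_TEMP_F):
    """Sum, over days, of how many degrees each day rose above the base."""
    return sum(max(t - base, 0.0) for t in daily_mean_temps_f)


# Two days at 35°F, as in the worked example: (65 - 35) * 2 = 60 HDD
print(heating_degree_days([35, 35]))
print(cooling_degree_days([35, 35]))
```

Days warmer than the base contribute nothing to HDD, and days colder than the base contribute nothing to CDD, which is why each term is clipped at zero.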

It is critical for Sealed to quantify a house’s energy consumption given a certain HDD. But how do we do that?

Empirically, a home's energy usage is linearly related to HDD:

Energy Usage = α + β × HDD

The β coefficient is the one of interest. It is specific to each house and it answers the following question: if the outside temperature decreases by 1 degree during 1 day, by how much will the heating energy consumption increase?

The α coefficient represents the shoulder usage of each house.

Then let’s take a house’s energy bills and run a linear regression of the energy usage history against the HDD history, and we’re done, right?

Well, in some cases, yes. But the energy consumption history is not always available. Think about a family that has just moved into a new house and wants to upgrade it right away.

Therefore, the goal is to estimate β without energy usage history, using only house characteristics. These include square footage, number of bedrooms, thermostat temperature, age of the house, etc.

The data we have includes:

  • House characteristics
  • House energy bills
  • History of past temperatures by ZIP code

We built a two-stage model.

Stage 1: Determine the β coefficient for every house in the data set that has an energy usage history. For that, we only use HDD and energy consumption history. We now have a table that looks like this.
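As a rough illustration of Stage 1, the per-house regression might be sketched as below. The house IDs and bill values are made up for the example, and `fit_house` is a hypothetical helper; the real pipeline may differ.

```python
import numpy as np


def fit_house(hdd, usage):
    """Least-squares fit of usage = alpha + beta * hdd for one house."""
    hdd = np.asarray(hdd, dtype=float)
    usage = np.asarray(usage, dtype=float)
    # polyfit returns coefficients from highest degree down: slope, then intercept
    beta, alpha = np.polyfit(hdd, usage, deg=1)
    return float(alpha), float(beta)


# Hypothetical billing periods per house: (HDD in each period, energy used in each period)
bills = {
    "house_a": ([100, 400, 800, 1000], [60, 120, 200, 240]),
    "house_b": ([150, 500, 900], [95, 200, 320]),
}

# One (alpha, beta) pair per house: the input table for Stage 2
table = {house: fit_house(hdd, usage) for house, (hdd, usage) in bills.items()}
for house, (alpha, beta) in table.items():
    print(f"{house}: alpha={alpha:.1f}, beta={beta:.3f}")
```

Each house contributes a single row, so a house with only a handful of bills yields a noisy β.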

Stage 2: Use this dataset to reach our final objective: predict β for houses without energy usage history, using only home characteristics.

β = f(Home Characteristics)

We first fit both stages with simple linear regression. The first stage had an R-squared of 0.75, but it ran a separate regression for each customer, and some customers had as few as 10 data points, so we judged it necessary to run the regressions on more data points. The second stage had an R-squared of 0, which suggests that linear regression is too simple for the second stage and that a more complex model is needed.

For the first stage, we considered clustering customers (houses) with similar characteristics so that each regression would draw on more data points than the per-house linear regression did. Clustering could be based on geography (by ZIP code) or on house characteristics such as the number of rooms or occupants, but we decided to use the value of β obtained in the first stage as the clustering criterion.

The higher the value of β, the less efficiently the house uses heat, and the lower the value, the more efficiently it does, so we reasoned that β itself could serve as an indicator of house size and insulation efficiency.

The question is then what value should serve as the basis for clustering. To answer it, we used a Bayesian hierarchical model (BHM). This model can cluster α and β together in a way that keeps them from becoming unreasonably high.
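To give a feel for why pooling houses together tames extreme estimates, here is a simplified stand-in: a normal-normal empirical-Bayes shrinkage, not a full Bayesian hierarchical model and not the model we actually fitted. The per-house β values and standard errors are invented, and `partial_pool` is a hypothetical helper.

```python
import numpy as np


def partial_pool(estimates, std_errors):
    """Shrink noisy per-house estimates toward a precision-weighted group mean.

    Simplified empirical-Bayes approximation of hierarchical pooling.
    """
    est = np.asarray(estimates, dtype=float)
    se2 = np.asarray(std_errors, dtype=float) ** 2
    mu = np.average(est, weights=1.0 / se2)  # group-level mean
    # Method-of-moments estimate of between-house variance, floored above zero
    tau2 = max(np.var(est, ddof=1) - se2.mean(), 1e-12)
    weight = tau2 / (tau2 + se2)  # near 1: trust the house; near 0: trust the group
    return mu + weight * (est - mu)


# Hypothetical OLS betas; the third house has few bills, hence a large standard error
ols_betas = [0.20, 0.35, -0.10, 0.25]
ols_ses = [0.02, 0.03, 0.30, 0.02]
pooled = partial_pool(ols_betas, ols_ses)
for raw, shrunk in zip(ols_betas, pooled):
    print(f"OLS beta {raw:+.2f} -> pooled beta {shrunk:+.2f}")
```

In this sketch the poorly measured house is pulled strongly toward the group, while well-measured houses barely move. Shrinkage alone does not guarantee positive values; it only illustrates the pooling idea.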

While OLS produced some negative values of α and β, the BHM estimated positive α and β, as shown below.

For the second stage, we examined combinations of methods such as XGBoost, Bagging, and Random Forest, checking that each model remained valid across different train-test splits. Bagging with XGBoost of Random Forest turned out to be the best model.

As a result, we built the best-performing model in two stages: clustering with BHM, followed by Bagging with XGBoost of Random Forest. (For example, the average R-squared was 0.25 for gas usage.)
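The validation idea, scoring a bagged model on several random train-test splits and averaging the R-squared, can be sketched as follows. To keep the example dependency-free it uses a plain least-squares base learner as a stand-in for XGBoost or Random Forest, and the data is synthetic; all function names here are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares weights for y ~ [1, X] @ w (stand-in base learner)."""
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict_linear(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w


def bagged_predict(X_train, y_train, X_test, n_models, rng):
    """Bagging: fit each model on a bootstrap resample, then average predictions."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
        preds.append(predict_linear(fit_linear(X_train[idx], y_train[idx]), X_test))
    return np.mean(preds, axis=0)


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def mean_r2_over_splits(X, y, n_splits=10, test_frac=0.25, n_models=25, seed=0):
    """Average held-out R-squared over several random train-test splits."""
    rng = np.random.default_rng(seed)
    n_test = int(len(X) * test_frac)
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(len(X))
        test, train = perm[:n_test], perm[n_test:]
        y_hat = bagged_predict(X[train], y[train], X[test], n_models, rng)
        scores.append(r_squared(y[test], y_hat))
    return float(np.mean(scores)), scores


# Synthetic stand-in data: columns play the role of square footage (thousands) and bedrooms
data_rng = np.random.default_rng(1)
X = data_rng.uniform([0.8, 1], [4.0, 6], size=(200, 2))
beta = 0.05 + 0.08 * X[:, 0] + 0.01 * X[:, 1] + data_rng.normal(0, 0.03, size=200)

mean_r2, split_scores = mean_r2_over_splits(X, beta)
print(f"mean R^2 over {len(split_scores)} splits: {mean_r2:.2f}")
```

Reporting the average over splits, rather than a single split, guards against a model that only looks good on one lucky partition of the data.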

Using this model, it is now possible to estimate the energy usage of customers for whom no historical usage data is available, allowing Sealed to approach customers that it could not approach before.
