Improving Energy Efficiency on Penn’s Campus


And How Big Data + Machine Learning Can Help

By: Alex Waegel, Hassan Hammoud, Stephen Rothstein

Intro

Buildings account for roughly 30% of US carbon emissions through their heating and electricity consumption, making them one of the largest contributors to climate change. In the past few years this issue has come front and center, with greater importance placed on energy efficiency in both new builds and existing buildings. New regulations in 2019 and 2020 have catalyzed the energy efficiency industry. Examples include Local Law 97 in New York, which requires all buildings over 25k sf to cut emissions 40% by 2030, and the European “Renovation Wave” regulations, which require a 60% reduction in carbon emissions from buildings over the next decade (along with an 18% reduction in heating and cooling demand). Clearly, this topic is timely. Because buildings generate large amounts of data, they are also strong candidates for machine learning aimed at managing energy usage more efficiently. Our group explored this thesis using historical energy datasets for over 100 buildings on campus at the University of Pennsylvania.

Code for this analysis can be found on GitHub

Buildings contribute 28% of carbon emissions in the US

What We Want to Accomplish

Our goal is to predict a building’s energy consumption given the outdoor weather conditions. We do this by preprocessing and analyzing a data set of over 100 buildings on Penn’s campus, removing outlier data, and selecting the 10 most complete building datasets available. Specifically, we take the electricity, chilled water, and steam consumption for each building, indexed by time features (hour, date, year), and join them with our historical weather features (temperature, rain, wind, relative humidity) over the building’s lifetime. With that, we can predict future energy consumption from a given weather forecast. This can show us when buildings are not performing in line with how they have trended historically (fault detection). It can also help us find future energy saving opportunities when combined with what we know about the habits of a building’s occupants (e.g. when do they like to come to the building, when do they work from home, what is their desired work temperature, etc.). We ran 60 models in total for this project: a linear regression model on 10 buildings across 3 different outputs (electricity, chilled water, and steam) and a random forest model on the same 10 buildings across the same 3 outputs.

Over the long run, this type of energy analysis can really move the needle toward reducing carbon emissions across the building sector and the world at large.

Photo Cred: Greentech Media — Operational Efficiency: A Hidden Energy Efficiency Opportunity for Commercial Buildings

Overview of Data

The raw input for this project is energy data from the University of Pennsylvania that was gathered by hand by graduate students at the Center for Environmental Building and Design. The data could only be accessed by downloading Excel files, each containing a single day of energy data in 15-minute increments. Each daily file contains the consumption data for one of Electricity, Steam, or Chilled Water, along with a timestamp, for all the buildings on campus that receive that energy carrier. Not all buildings have steam or chilled water, and some buildings may be serviced through another building for one carrier and thus not be metered for it, while still reporting on the others. These daily files were combined into monthly files, leaving in the headers for each daily segment, with segments separated by an empty row. The preprocessing consists of taking this raw data and creating separate .csv files with the data for each building. Following that stage, the individual files are scanned for outliers, which are removed. The final stage of the preprocessing is to join the outlier-free data with weather data using the timestamp. The weather data used was purchased by the Center for Environmental Building and Design. It is, however, in hourly increments, so the 15-minute energy data had to be resampled to hourly means before the join.
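The last two preprocessing steps (resampling to hourly means and joining on the timestamp) can be sketched with pandas as follows. This is a minimal illustration on synthetic numbers, not the project's actual code; the column names `electricity_kw` and `temperature_c` are hypothetical.

```python
import pandas as pd

# Synthetic 15-minute electricity readings for one building (hypothetical values).
idx_15min = pd.date_range("2019-01-01 00:00", periods=8, freq="15min")
energy = pd.DataFrame(
    {"electricity_kw": [10.0, 12.0, 14.0, 16.0, 20.0, 20.0, 24.0, 24.0]},
    index=idx_15min,
)

# Synthetic hourly weather observations (hypothetical values).
idx_hourly = pd.date_range("2019-01-01 00:00", periods=2, freq="h")
weather = pd.DataFrame({"temperature_c": [1.5, 2.0]}, index=idx_hourly)

# Average the four 15-minute readings inside each hour.
energy_hourly = energy.resample("h").mean()

# Join on the shared timestamp index, keeping only hours present in both tables.
joined = energy_hourly.join(weather, how="inner")
print(joined)
```

An inner join drops any hour that is missing from either table, which is one simple way to keep the model inputs aligned.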

These preprocessing steps produced separate data frames for over 100 buildings on Penn’s campus, each with the features shown in the example dataframe below from the Annenberg building, one of the ten buildings ultimately chosen for further analysis due to the high quality of its dataset.

Preprocessed dataframe for the Annenberg building, one of the ten buildings which was ultimately chosen for further analysis due to the high quality of the dataset.

Exploratory Data Insights/Charts and Visualizations

Now that the data has been preprocessed, some exploration can take place. This section examines the energy data and the weather data individually, as well as combined. The energy data covers over 100 buildings, and even after preprocessing not all of them are suitable. The primary reason is missing data. The meters at Penn are imperfect and, especially in the early years of a building’s life, could be missing weeks or months of data at a time. Further, the naming convention of the meters changes over time as buildings are renamed or repurposed, leaving some meters missing years’ worth of data. In some instances, one of the three meters may be identified by a different building name than the others, so the data was not joined. Finally, in some cases where one building receives its energy through a parent building, the child building’s energy was not metered separately until partway through the timespan, causing a sudden drop in the parent’s readings when it is removed. By converting the preprocessed data into time series visuals, we were able to identify the buildings with insufficient data more easily.

The above image represents 3 of the 100+ buildings analyzed. The first column has electric data, the second column has chilled water data, and the third has steam data. The building in the first row has high quality raw data, the building in the second row is an example of a building dominated by a few outliers, and the building in the final row has significant missing data.
This image shows the same 3 buildings after the outliers have been removed. As can be seen, the second building looks much better after outlier removal, while the third building, although improved, is still missing too much data to be useful.
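The article does not say which rule was used to flag outliers. As one common possibility, the sketch below applies the interquartile range (IQR) rule to a single meter's readings. The function name `remove_outliers_iqr`, the multiplier `k=1.5`, and the readings themselves are assumptions made for illustration.

```python
import pandas as pd

def remove_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Drop readings more than k * IQR outside the first and third quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only readings inside [lower, upper]; the rest are treated as outliers.
    return series[series.between(lower, upper)]

# Synthetic meter readings with one obvious spike (hypothetical values).
readings = pd.Series([10.0, 11.0, 12.0, 11.0, 10.0, 500.0, 12.0, 11.0])
cleaned = remove_outliers_iqr(readings)
print(cleaned)
```

A single global threshold like this can also discard legitimate seasonal peaks, so in practice the bounds might be computed per month or on a rolling window instead.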

After analyzing our time series data and adjusting for outliers, we chose 10 buildings on the Penn campus that we felt had the data quality requisite for us to develop our models. The 10 buildings are: (1) The Annenberg Center, (2) Blockley Hall, (3) BRB1-Stellar Chance, (4) Charles Addams Hall, (5) Fisher Fine Arts and Duhring Wing, (6) Grad Towers B- Sansom West, (7) Johnson Pavilion, (8) Penn Museum, (9) Singh Nanotechnology, and (10) Vagelos Labs.

The top graphic below shows the energy consumption patterns over time for our highlighted example, the Annenberg building. We can see how electricity (left column), chilled water (middle column), and steam (right column) consumption vary over time and how much energy the building tends to consume on average in a given month. This can show us when a building is not performing in line with how it has trended historically and can help us spot opportunities for future savings. The bottom graphic shows the weather conditions over time that are joined with the building data.

Energy consumption time-series for the Annenberg building
Weather time series data for Philadelphia, which is joined with our building feature data

Model Description

As you can see from the plots above, the data is seasonal, especially temperature. The first step in model-building is to account for that seasonality. Below is a visual of how our temperature data looks before and after the seasonal adjustment.

Temperature data before (left) and after (right) adjusting for seasonality
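The article does not spell out how the seasonal adjustment was done. One simple approach, shown here purely as an assumption, is to subtract each calendar month's mean from the observations in that month. The helper name `deseasonalize` and the synthetic temperature series are hypothetical.

```python
import numpy as np
import pandas as pd

def deseasonalize(series: pd.Series) -> pd.Series:
    """Subtract the mean of each calendar month from that month's observations."""
    monthly_mean = series.groupby(series.index.month).transform("mean")
    return series - monthly_mean

# Two years of synthetic daily temperatures: a yearly sine cycle plus noise.
rng = np.random.default_rng(0)
days = pd.date_range("2018-01-01", "2019-12-31", freq="D")
day_of_year = days.dayofyear.to_numpy()
seasonal_cycle = 12.0 + 12.0 * np.sin(2 * np.pi * (day_of_year - 105) / 365.25)
temperature = pd.Series(seasonal_cycle + rng.normal(0, 2, len(days)), index=days)

adjusted = deseasonalize(temperature)
print(temperature.std(), adjusted.std())
```

After the adjustment every calendar month averages to zero, so what remains is the deviation from typical conditions for that time of year.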

Linear Regression

The first model we ran was linear regression on each of the 10 buildings, across our 3 desired outputs (electricity, steam, and chilled water). These models did not perform well, as indicated by their low explained variance scores. Below are the results of the linear regression models for the Annenberg building. We initially thought a potential fix would be PCA (principal component analysis), a dimensionality-reduction method that would let us limit multicollinearity in our data and keep the components that explain the most variance. However, after studying the covariance plots for our features (reflected below), it became clear that there is very little collinearity among our model features, as evidenced by the fact that of the 42 total correlations only 3 are greater than .3 (1). PCA was ultimately not performed for this reason.

Linear Regression model results for Electricity, Chilled Water, and Steam for the Annenberg building
Covariance matrix used to assess effectiveness of PCA
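A collinearity check of this kind can be sketched by counting how many feature pairs have an absolute correlation above a threshold. The feature names and synthetic data below are hypothetical; only the 0.3 threshold comes from the text above.

```python
import numpy as np
import pandas as pd

def count_strong_correlations(features: pd.DataFrame, threshold: float = 0.3):
    """Return (number of feature pairs with |r| > threshold, total number of pairs)."""
    corr = features.corr().to_numpy()
    # Look only above the diagonal so each pair of features is counted once.
    upper = np.triu_indices_from(corr, k=1)
    pair_values = np.abs(corr[upper])
    return int((pair_values > threshold).sum()), int(pair_values.size)

# Synthetic features: humidity is built to track temperature, the rest are independent.
rng = np.random.default_rng(1)
n = 2000
temperature = rng.normal(0, 1, n)
features = pd.DataFrame({
    "temperature": temperature,
    "humidity": -0.8 * temperature + 0.5 * rng.normal(0, 1, n),
    "wind": rng.normal(0, 1, n),
    "rain": rng.normal(0, 1, n),
})

strong, total = count_strong_correlations(features)
print(f"{strong} of {total} feature pairs exceed |r| = 0.3")
```

When only a handful of pairs cross the threshold, as the text reports, there is little redundancy for PCA to remove.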

Random Forest Model

We were optimistic that a random forest model could work well for this problem because several features are at play and individual trees can split on them in many different ways, so averaging across trees could yield strong predictive power. Since random forests are scale-invariant and we are not using PCA, best practice is to use the unscaled data with only the deseasonalization transformation applied. With the outputs left unscaled, we can interpret the Mean Absolute Percentage Error as how far off, in percentage terms, the model's predictions are on average. Below are our results for the Annenberg building Random Forest models across electricity, chilled water, and steam.

Electricity Random Forest Model for the Annenberg building
Chilled Water Random Forest Model for the Annenberg building
Steam Random Forest Model for the Annenberg building

Results/Interpretation

First, we can see that the R² and accuracy for the random forests are high. While a random forest is more of a black box than a single decision tree, we can still see its predictive power. Since this was not captured in the regression models, we conclude that energy usage may be better modeled as a set of conditions than as directly correlated variables. For example, electricity usage may only peak when the outside temperature is above or below a certain level, rather than following a linear relationship. This matches the intuition that heating and cooling systems may be triggered by external weather monitors. For example, the university only turns on heating or cooling for dorm buildings once the average temperature reaches a certain level.
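The threshold intuition can be illustrated on synthetic data with scikit-learn. In this sketch the load stays flat inside a comfort band and rises only outside it; the 10 to 22 °C band, the coefficients, and the noise level are invented for illustration and are not taken from the Penn data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
temperature = rng.uniform(-5, 35, size=2000)  # synthetic outdoor temperature, deg C

# Hypothetical load: a flat base load that rises only outside a 10-22 C comfort band.
heating = np.clip(10 - temperature, 0, None)
cooling = np.clip(temperature - 22, 0, None)
load = 50 + 3 * heating + 3 * cooling + rng.normal(0, 2, size=temperature.size)

X = temperature.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, load, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(f"linear R^2: {r2_linear:.2f}, random forest R^2: {r2_forest:.2f}")
```

A single straight line cannot rise on both sides of the comfort band, so it explains little of the variance here, while the forest's splits on temperature can capture each regime separately.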

Also, the MAPE (mean absolute percentage error) and MSE for all these models are low, even though the outputs are not scaled down. This implies that these forests can model the outputs with decent predictive power. An additional thought is that a few outliers produce large residuals, which drag down the R² while barely affecting the MAPE, and the deseasonalized graph does show some large outliers to support that idea.
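To see why one outlier can hurt R² far more than MAPE, here is a small numeric sketch on made-up numbers. The helper names `mape` and `r_squared` and all values are hypothetical; the point is only that R² is built from squared residuals while MAPE averages relative errors.

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, returned as a fraction (0.01 means 1%)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Synthetic hourly load with a daily cycle, and predictions that are off by 1 unit.
t = np.arange(200)
y_true = 100 + 10 * np.sin(2 * np.pi * t / 24)
y_pred = y_true + np.where(t % 2 == 0, 1.0, -1.0)

# Same predictions, but one metered reading spikes to 400 (an outlier).
y_outlier = y_true.copy()
y_outlier[100] = 400.0

r2_clean, mape_clean = r_squared(y_true, y_pred), mape(y_true, y_pred)
r2_out, mape_out = r_squared(y_outlier, y_pred), mape(y_outlier, y_pred)
print(f"clean:   R^2={r2_clean:.2f}, MAPE={mape_clean:.3f}")
print(f"outlier: R^2={r2_out:.2f}, MAPE={mape_out:.3f}")
```

The single bad reading dominates the sum of squared residuals and collapses R², while it is only one term out of 200 in the MAPE average.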

From the data, we see that the Johnson Pavilion and Museum steam models have a particularly high RMSE value. One conclusion we can draw is that it may be useful to have the steam gauges checked and/or to recommission the steam valves in these buildings. It is possible they are inefficient, faulty, or too old to function properly, which is leading to such an error in the model!

Random Forest Model (Penn Museum)
Random Forest Model (Johnson Pavilion)

Ethical Considerations

COVID-19 has shed light on how indoor building conditions affect the health and well-being of occupants. This is true for indoor air quality and ventilation as well as for building comfort and temperature levels. Humans clearly need to have their buildings heated during Philadelphia winters, but at the same time building heat is one of the top global contributors to emissions. So it is ethically imperative that if we are going to heat our buildings, we do so in the most efficient manner possible. This means that it is unacceptable to have a steam heater that is past its useful life and using more energy than it should. We have shown that data science and machine learning are relatively inexpensive and highly accessible to every building owner and manager, and that these technologies should be deployed to mitigate the ethical and environmental damage caused by wasteful buildings. They can also save money in the long run, making these types of data analyses on buildings a no-brainer!

Conclusions

Overall, we learned a lot from this model-building process and demonstrated that weather data is useful for predicting energy usage in buildings across campus. Many third-party companies that specialize in doing just this have emerged over the past decade. Of course, aggregating the data remains the key obstacle to overcome. As part of the mass-retrofit initiative that will need to take place to meet regulations, low-cost sensors will need to be deployed throughout the existing building stock, and over time this will help generate the data needed to improve predictions for energy efficiency at scale. Our analysis might have been improved in three ways. First, the data for all 100 buildings could have been complete and gone further back in time. Second, comprehensive building occupant data would have enabled us to make more concrete recommendations to tenants on how they could help reduce emissions. Third, it would be interesting to compare the 10 buildings in our data set to buildings of comparable age and quality across cities in the northeast. This would help us learn whether other faults are at play in how the buildings are being managed.
