What do party schools and energy efficiency have in common? A lot when it comes to identifying commercial real estate ripe for renovation.

David Anderson
Feb 25 · 8 min read

This is the second in a four-part series exploring ways that machine learning can help drive insights from public data on building energy use.

It’s that wonderful time of year when high school seniors are getting those fancy letters from colleges and are getting ready to make the Big Decision. When it comes to choosing a college, we all know that choosing the best school is all about finding the right balance between personal and academic growth.

Okay…now that I've written that last statement with a semi-straight face, I can say that the most important factor for most rising freshmen in this process is which school will be the most exciting to attend. And I really don’t blame them — we need to make going into lots of debt fun again.

With that in mind, it’d be helpful to have a way to figure out which school is really a party school. Various publications have rankings and our friends have their Instagram posts, but anecdotes alone don’t count as analytics. You can make #lablife appear more exciting than it is with the right filter, regardless of university. The ultimate question is: Can you objectively define (not just describe) what a party school is?

This scenario is one that comes up a lot in life. It’s the “I can’t define it, but I can identify it when I see it” scenario. In these cases, there is usually a set of features we don’t explicitly pay attention to when they are presented one by one, but when we observe them together, we are able to put a label on the combination.

For party schools, most folks attempting to define one may say something along the lines of “has a good social scene.” The odds of hearing anyone mention bars or shot glasses sold per capita as criteria are likely low, even if most schools that individuals label as the “party” ones can be reasonably identified using just those criteria.

The practice of finding this latent set of characteristics is a critical part of what’s called feature engineering in the data science space. Feature engineering is a fancy way of saying that you systematically take lots of variables and create the weights and/or combinations of them that matter for classifying or predicting an outcome.
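To make the idea of “weights and/or combinations” concrete, here is a minimal sketch that folds two variables into a single composite score. Everything in it is an assumption made for illustration: the school names, the per-capita numbers, and the weights are invented, and in practice the weights would be learned from data rather than hand-picked.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to z-scores so variables with different units are comparable."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical schools and made-up per-capita figures (not real data).
schools = ["A", "B", "C"]
bars_per_capita = [1.0, 2.0, 3.0]
shot_glasses_per_capita = [10.0, 30.0, 20.0]

# Hypothetical hand-picked weights for the two variables.
weights = (0.7, 0.3)

z_bars = standardize(bars_per_capita)
z_shots = standardize(shot_glasses_per_capita)

# The engineered feature: one weighted combination of the two raw variables.
party_score = [
    weights[0] * b + weights[1] * s for b, s in zip(z_bars, z_shots)
]

# Rank schools from highest to lowest composite score.
ranking = sorted(zip(schools, party_score), key=lambda pair: pair[1], reverse=True)
```

Standardizing first keeps a variable measured in large units from dominating the composite simply because its numbers are bigger.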

In the energy efficiency world, the party school question gets re-formed as “Can I identify a building as an energy hog?” Even though public data on buildings’ energy use (e.g. Energy Star scores) exist and are becoming more transparent, as we saw in the last post on building data in Atlanta, only a small percentage of buildings actually have that data. To answer this question, we need to do two things that seem counter-intuitive:

  1. Add more variables to the building data we do have
  2. Reduce those variables to the set of them that matters

This post will focus on the first of those two exercises.

Some tools that are useful for connecting the dots

Beefing up our energy efficiency data with demographic variables is a very useful exercise that, fortunately, can be done for almost no cost.

One of the best tools for attaching large demographic datasets to your own data is Google’s BigQuery platform. There are lots of useful primers on BigQuery, but at a high level, it lets you easily get demographic, financial, and other data from its public datasets at the geographic level you are interested in. These public datasets include the Census Bureau’s 2017 American Community Survey (ACS). Most importantly, all the public datasets fall into a free tier of usage, which means you get up to 1 terabyte of data processing each month for free.¹

As discussed in an earlier post, there are 419 buildings that reported data in 2018 under the Atlanta reporting mandate. To get demographics for the zip codes where these buildings are located, we first need to extract the zip codes from the addresses downloaded from Atlanta’s building efficiency data website.² Once that cleanup is done, we’re ready to upload the cleaned data to BigQuery.
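For readers who prefer to script the zip code cleanup, here is a minimal Python sketch. It assumes each address ends with a 5-digit zip code (optionally in ZIP+4 form) and returns None for addresses that have none, so malformed rows can be reviewed by hand. The function name and the sample addresses are hypothetical, not rows from the Atlanta dataset.

```python
import re

# A 5-digit zip (optionally ZIP+4) at the very end of the address.
# The lookbehind keeps the pattern from matching the tail of a longer number.
ZIP_AT_END = re.compile(r"(?<!\d)(\d{5})(?:-\d{4})?\s*$")

def extract_zip(address):
    """Return the trailing 5-digit zip code, or None if the address has none."""
    match = ZIP_AT_END.search(address)
    return match.group(1) if match else None

# Hypothetical sample addresses for illustration only.
addresses = [
    "123 Example St NW, Atlanta, GA 30303",
    "456 Sample Ave, Atlanta, GA 30308-1234",
    "789 Missing Zip Rd, Atlanta, GA",
]
zips = [extract_zip(a) for a in addresses]
```

Keeping the zip code as a string at this stage preserves any leading zeros. Converting it to a number only makes sense if the table you join against stores it that way.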

Since we only care about the Atlanta zip codes that contain buildings in the efficiency database, you can use the SQL query below in the BigQuery editor to retrieve the 2017 ACS data for only those zip codes.

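-- Return each ACS row once, even when several buildings share a zip code
SELECT DISTINCT acs_data.*
FROM
  `bigquery-public-data.census_bureau_acs.zip_codes_2017_5yr` acs_data,
  `atl_energy_score_data` bldg_data
-- Cast geo_id so it compares against the numeric zip_code column
WHERE bldg_data.zip_code = CAST(acs_data.geo_id AS NUMERIC)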

Note: atl_energy_score_data is the name I gave to the uploaded Atlanta data, so you will need to replace it with your own file name.
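If you want to check the join logic before running anything in BigQuery, the same idea can be tried locally. The sketch below uses Python’s built-in sqlite3 module with two tiny in-memory tables. The table layouts, the median_income column, and all values are placeholders I made up, and SQLite’s SQL dialect differs slightly from BigQuery’s, so treat it as an illustration of the filtering behavior rather than a drop-in replacement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-ins for the ACS table and the uploaded building table.
conn.executescript("""
    CREATE TABLE acs_data (geo_id TEXT, median_income REAL);
    CREATE TABLE bldg_data (building TEXT, zip_code INTEGER);
    INSERT INTO acs_data VALUES
        ('30303', 40000), ('30308', 70000), ('30309', 90000);
    INSERT INTO bldg_data VALUES
        ('A', 30303), ('B', 30303), ('C', 30308);
""")

# Keep only ACS rows whose zip code appears in the building table.
# DISTINCT collapses the duplicate match created by buildings A and B.
rows = conn.execute("""
    SELECT DISTINCT acs_data.*
    FROM acs_data
    JOIN bldg_data
      ON bldg_data.zip_code = CAST(acs_data.geo_id AS INTEGER)
    ORDER BY acs_data.geo_id
""").fetchall()

conn.close()
```

Zip code 30309 drops out because no building sits in it, and 30303 appears once even though two buildings share it.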

Once you’ve got your filtered table, you can extract whatever demographics you like into a convenient spreadsheet or into a data visualization tool like Google DataStudio (integrates seamlessly with BigQuery) or Microsoft’s PowerBI.³

So what did we find out?

High-level visuals reveal some interesting data at the building level. The 419 buildings that submitted 2018 data represent 100 million square feet (msf) of space, the vast majority of which is in office, mixed use, or lodging buildings. Of this 100 msf, only 68 msf actually reported an Energy Star Score.

Data Dashboard for Building Data

A natural starting point would be to see if there are any meaningful differences between the ~70% of buildings that do have a score and the ~30% that don’t have one.

Data Dashboard for Building Data: With and Without Energy Star Score

Looking at just office and lodging, it turns out that both sets were similar in age, but buildings without an Energy Star score tend to be smaller.
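A comparison like this one can be sketched in a few lines of Python: split the buildings by whether they reported a score, then compare medians. The field names and every value below are hypothetical placeholders, not figures from the Atlanta data.

```python
from statistics import median

# Hypothetical building records (made-up values for illustration only).
buildings = [
    {"type": "Office", "year_built": 1985, "sq_ft": 250_000, "energy_star_score": 82},
    {"type": "Lodging", "year_built": 1990, "sq_ft": 400_000, "energy_star_score": 65},
    {"type": "Office", "year_built": 1988, "sq_ft": 300_000, "energy_star_score": 75},
    {"type": "Office", "year_built": 1986, "sq_ft": 90_000, "energy_star_score": None},
    {"type": "Lodging", "year_built": 1991, "sq_ft": 120_000, "energy_star_score": None},
    {"type": "Retail", "year_built": 2001, "sq_ft": 50_000, "energy_star_score": None},
]

def summarize(records, property_types=("Office", "Lodging")):
    """Median age and size for buildings with and without a reported score."""
    groups = {"with_score": [], "without_score": []}
    for b in records:
        if b["type"] not in property_types:
            continue  # limit the comparison to the chosen property types
        has_score = b["energy_star_score"] is not None
        groups["with_score" if has_score else "without_score"].append(b)
    return {
        key: {
            "count": len(rows),
            "median_year_built": median(r["year_built"] for r in rows),
            "median_sq_ft": median(r["sq_ft"] for r in rows),
        }
        for key, rows in groups.items()
        if rows
    }

summary = summarize(buildings)
```

Medians are used here instead of means so that one very large tower does not skew the size comparison.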

Building data gets us part of the way there, but what makes things more interesting is seeing whether the demographic data we pulled using BigQuery can tell us more.

Fortunately, it does.

The table and chart below show key zip-code metrics I pulled using BigQuery, broken out by property type:

In general, office, multifamily, and lodging buildings tend to be in higher-income areas.

Data Dashboard for Demographic Data

Even when we separate buildings with and without Energy Star scores within those sectors, buildings which don’t have scores tend to be in areas where socio-economic conditions are worse, commute times for residents are longer, housing is more expensive, and unemployment is higher.

Data Dashboard for Demographic Data: With and Without Energy Star Score

To help visualize this, take a look at the map below showing median Energy Star score (represented by height) and median HH income (represented by color). Flat dots don’t have an Energy Star Score and are concentrated in the southwest quadrant of the city.

Map of 2018 Atlanta Building Efficiency reporting data: Median Energy Star score (represented by height) and 2017 ACS Median HH income (represented by color)

This isn’t really surprising if you know the neighborhoods and office submarkets of Atlanta, but what is pretty surprising is that when you limit the universe of buildings with Energy Star data to those that don’t score so well, the demographics converge! Specifically, for office buildings with Energy Star scores below 70 (FYI: a score of 75 is required to be certified as an Energy Star building), the demographics are strikingly similar to those buildings that don’t have data!
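The three-way comparison described above (no score, score below 70, score of 70 or higher) can be sketched as follows. The cutoff of 70 comes from the text. The field names and income values are invented purely to show the mechanics and should not be read as results.

```python
from statistics import median

SCORE_CUTOFF = 70  # threshold used in the comparison above

def score_band(score):
    """Assign a building to one of three bands based on its reported score."""
    if score is None:
        return "no_score"
    return "below_70" if score < SCORE_CUTOFF else "70_and_above"

# Hypothetical office buildings with the median income of their zip code.
offices = [
    {"score": None, "zip_median_income": 38_000},
    {"score": None, "zip_median_income": 42_000},
    {"score": 55, "zip_median_income": 41_000},
    {"score": 69, "zip_median_income": 45_000},
    {"score": 70, "zip_median_income": 78_000},
    {"score": 85, "zip_median_income": 82_000},
    {"score": 91, "zip_median_income": 90_000},
]

def median_income_by_band(records):
    """Median zip-level income for each score band."""
    bands = {}
    for r in records:
        bands.setdefault(score_band(r["score"]), []).append(r["zip_median_income"])
    return {band: median(values) for band, values in bands.items()}

income_by_band = median_income_by_band(offices)
```

The same grouping works for any other zip-level metric, such as commute time or unemployment, by swapping the field being summarized.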

Data Dashboard for Building Data: Energy Star Scores Below/Above 70

How does this data help us become better investors?

Unsurprisingly, the act of energy efficiency reporting reflects a self-selection bias: “A” students have always liked to let others know their grades. The good news is that adding a wider set of variables may present a way to overcome this.

It could be that the demographic variables for neighborhoods with non-reporting buildings simply capture the indifference of tenants and landlords toward energy efficiency in those buildings. On the flip side, demographics could be a proxy for neighborhoods where the typical landlord hasn’t taken steps to make energy-efficient properties a meaningful part of the inventory tenants can even choose from.

These are two sides of the same coin, but the major takeaway is that data on commutes and income may be a contender when it comes to flagging buildings without reported Energy Star data as ones with higher renovation potential.

Whether these buildings are good candidates for value-add renovations is a function of submarket conditions and other factors, but like every kid on Halloween quickly learns over the years, having data on which are the best doors to knock on ultimately leads to more candy.

While high-level findings like these are evident just by using tables and maps, we are far from being able to say we’ve found a predictive mousetrap. Unearthing more complex relationships is impossible without the type of feature engineering tools data science can offer. Existing efficiency data by itself needs to be more complete and has limited ability today to inform which of these data points are useful for predicting a building’s score.

To get to that point, we need to first see whether other variables beyond just “has/doesn’t have Energy Star score data” help surface interesting observations. For that we need a tool called Principal Components Analysis (PCA). In the next blog post, we will use PCA to distill all the demographic variables down to the few composites that matter (the quintessential 80–20 rule).

Until then, let me know if you run across a “shot glass per capita” dataset I can use.


1. To put that into context, the Hubble Space Telescope generates 10 terabytes of data each year. Unless your real estate investment market selection strategy involves acquiring galaxies, BigQuery’s free tier is probably sufficient for most real estate applications.

2. You can use a regular expression such as re.search(r'\d{5}$', address) in a Python script, or a similar expression in Google’s DataStudio, to automate this process, but I chose to simply use the text-to-columns feature in Excel in this case, since some malformed addresses didn’t have zip codes at all.

3. I’ve saved the SQL queries I created to extract the socio-economic, housing market and labor market metrics.

Written by David Anderson, founder of BuildPayer. Long-time real estate professional with passion for sustainability. Brooklyn-born and raised, but couldn’t even win a tickle fight.

Building Data Needs Love Too: a blog about how data science can make real estate more sustainable
