Sequencing the DNA of Real Estate: An AI-Driven Approach for Comparing Assets

Jul 16, 2018

During the course of a dinner with the real estate vanguard, you’re bound to hear about the importance of location, location, location. It won’t be long before size and age of the asset come up, either.

Indeed, the analysis of a potential real estate deal usually starts with establishing a peer group: properties of similar size and age, all in the same area as the subject property.

The logic behind comp analysis is that similar properties should behave the same: if the building across the street sold for $X, and your building is roughly the same as that building (similar size, year built, rents, etc.), then your building should sell for roughly the same price, with some adjustments where needed.

Sequencing Real Estate DNA: 3 views of geo-spatial data layers over San Francisco. Source: Skyline AI

Should we really consider properties to be good comparables only if they are located in the same area?

What if they are located in different states? Can they be compared?

Consider two rental properties, one in Atlanta and one in Dallas. The Atlanta asset comprises 200 units and the Dallas property is 25% smaller, at 150 units.

Both properties share the same property manager. They are both 95% occupied by software engineers who commute to work for 20–30 minutes using their private car. The tenants spend one hour per week on average jogging in the park next to their apartment complex. Both properties are the only garden-style multifamily assets in their area.

When two properties appear to have the same “property DNA,” AI-driven analysis can come up with a great deal of insights concerning the future performance of one asset based on the performance of the other.

For example, what if it turned out that making the Atlanta property dog-friendly, while replacing its old and abandoned business center with a pet store, significantly increased the property’s online “traction”? Remember, the property in question happens to be 95% occupied by software engineers, who like to express their views online. After the change, the property’s review score jumped from 3 to 5 stars on Google Maps, Yelp and ApartmentRatings.com, and this in turn helped maintain occupancy through rent increases.

Effective rent by property for June 2018 with 12-month change indication. The radius of the circle at lower zoom levels and the height of the stack at high zoom levels represent effective rent; colour is the change over 12 months.

Investors would benefit from having this kind of knowledge when considering the Dallas property, right? Well, the difficulty, of course, is in applying knowledge from hidden, deep correlations across tens of thousands of attributes when considering asset similarity. Fortunately, this is exactly where certain types of Artificial Intelligence algorithms may be applied to create enormous value beyond the reach of any human being, even if that human is equipped with multiple asset reports from traditional commercial real estate analytics companies.

Unsupervised Learning Algorithms: Using Artificial Intelligence for Clustering Comps

Machine learning, a fundamental concept of AI research since the field’s inception, is the study of computer algorithms that improve automatically through experience. Machine learning models are trained using datasets of historical data with the aim of achieving certain tasks: prediction of values, classification, anomaly detection, and more.

Sale price (dollars per unit) by Virtual Neighbourhood

A subclass of machine learning is the set of unsupervised learning algorithms. In contrast to supervised learning algorithms, which learn from historical datasets that are “tagged” according to some logic or target, unsupervised learning algorithms learn from “unlabelled” data. That is to say, the model is asked to perform tasks over the data without having any a priori knowledge about it. Instead, the algorithm finds correlations between the features comprising the dataset: topographical properties, cash flow data, transaction history, financing terms, data about restaurants and bars, workplaces, commute times, and more. Understanding these types of deep relationships between so many factors over a period of decades is beyond the processing ability of even the most capable human analyst. Moreover, these correlations often go against so-called “common sense,” and so they are never sought after.

Skyline AI Virtual Neighbourhoods

Skyline AI uses the most comprehensive commercial real estate data set in the industry, mining data from over 100 different sources, analyzing over 10,000 different attributes on each asset for the last 50 years. Powered by natural language processing and high-performance data infrastructure, all data is compiled into one large “data lake,” and then cross-validated to make sure the data used is accurate.

This enables our data science teams to design and train an ensemble of machine learning models, including unsupervised ones, in order to form “Virtual Neighbourhoods” — clusters of properties deemed similar according to thousands of different signals in the data, some of which are represented by deeply hidden correlations, and may be used in advanced comp analysis.
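To make the idea of clustering properties into similarity groups concrete, here is a minimal sketch using plain k-means on three made-up attributes. It is not Skyline AI's method: the property names, the attribute values, the choice of k-means, and the helper names `standardize` and `kmeans` are all assumptions for illustration, and a real system would work with far more signals.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance so no attribute dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

# Hypothetical properties: [units, average commute in minutes, occupancy %].
names = ["Atlanta garden-style", "Dallas garden-style", "Atlanta high-rise",
         "Dallas high-rise", "Houston garden-style", "Houston high-rise"]
features = [[200, 25, 95], [150, 28, 95], [420, 10, 88],
            [450, 12, 90], [180, 22, 96], [400, 8, 87]]

labels = kmeans(standardize(features), k=2)
for name, label in zip(names, labels):
    print(f"cluster {label}: {name}")
```

Note that city is not an input here, so properties in different states can land in the same cluster when their other attributes line up, which is the point of the Atlanta and Dallas example earlier.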

Say that we are looking at a multifamily property in the Atlanta-Fulton submarket. We would like to construct a peer group and perform rent comp analysis to assess the property’s current condition when it comes to rent and value-add potential.

Following a traditional comp analysis, we let an experienced human real estate analyst source the comps. The analysis ends up containing 36 comparable assets, all within a radius of 8 miles from the property, roughly from the same property class (from B- to B++), built within a certain range of years (1999–2018), with certain occupancy levels, and so on.

Rule-based peer group construction as performed by a human analyst: 36 comps
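The rule-based approach can be sketched as a simple filter. The thresholds below mirror the ones mentioned in the text (8-mile radius, class range, 1999–2018), reading the class range as B- to B++, which is an assumption. The `Property` type, the property names, and the coordinates are hypothetical and only serve the illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Property:
    name: str
    lat: float
    lon: float
    asset_class: str
    year_built: int

def miles_between(a, b):
    """Great-circle distance in miles between two properties (haversine formula)."""
    earth_radius_miles = 3958.8
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(h))

def rule_based_comps(subject, candidates, max_miles=8.0,
                     classes=("B-", "B", "B+", "B++"), years=(1999, 2018)):
    """Keep candidates that pass every hand-written rule."""
    return [c for c in candidates
            if miles_between(subject, c) <= max_miles
            and c.asset_class in classes
            and years[0] <= c.year_built <= years[1]]

# Hypothetical subject property and candidate pool.
subject = Property("Subject", 33.7490, -84.3880, "B+", 2008)
candidates = [
    Property("Midtown Flats", 33.7800, -84.3850, "B", 2005),       # passes all rules
    Property("Perimeter Lofts", 33.9200, -84.3400, "B+", 2010),    # too far away
    Property("Old Fourth Court", 33.7600, -84.3700, "B", 1985),    # too old
    Property("Buckhead Tower", 33.8000, -84.3900, "A", 2015),      # wrong class
    Property("Grant Park Commons", 33.7350, -84.3700, "B-", 2001), # passes all rules
    Property("Dallas Gardens", 32.7767, -96.7970, "B+", 2008),     # different state
]

comps = rule_based_comps(subject, candidates)
print([c.name for c in comps])
```

Every rule is a hard cut-off chosen by the analyst, so a property in another state is excluded no matter how similar it is in every other respect.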

The comparison in this case is for two-bedroom units, and we can see two charts showing us where our property is located compared to its peer group. On both charts, we construct a grid system showing us the location of our property (the yellow dot) vs. its peer group (the blue dots), where the left chart shows rent per unit vs year built and the right one rent per square foot vs year built.

But where would we expect our property to be located based on those comps? One way to visualize this is by leveraging linear regression. During this process, our previous observations, like the rent values of the comps, are plotted against some other dimension (for example, the year built). Then, using an iterative algorithm called gradient descent, we fit a line through those samples (“the fitted line”) such that an error function is minimized (the error representing the distance between our line and the peer group rents).

Once the regression is done, we draw a vertical line from the yellow dot (representing our property) to the fitted line; the point where the two meet is where we’d “expect” our property to be positioned based on the comps.
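The fitting procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the comp years and rents are invented, the error function is taken to be mean squared error, and `fit_line_gd` is a hypothetical helper name. The year is standardized before the descent because raw values around 2000 would make the iteration unstable at this learning rate.

```python
def fit_line_gd(xs, ys, lr=0.1, steps=2000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
    zs = [(x - mean_x) / std_x for x in xs]  # standardized inputs
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [w * z + b - y for z, y in zip(zs, ys)]
        grad_w = 2 / n * sum(e * z for e, z in zip(errors, zs))
        grad_b = 2 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    # Convert the parameters back to the original (unscaled) x axis.
    slope = w / std_x
    intercept = b - slope * mean_x
    return slope, intercept

# Hypothetical peer group: year built and two-bedroom rent per unit.
comp_years = [1999, 2003, 2006, 2010, 2014, 2018]
comp_rents = [1180, 1250, 1285, 1360, 1400, 1475]

slope, intercept = fit_line_gd(comp_years, comp_rents)

# Hypothetical subject property (the "yellow dot").
subject_year, subject_rent = 2010, 1420
expected_rent = slope * subject_year + intercept
gap = subject_rent - expected_rent
print(f"expected rent: {expected_rent:.0f}, actual: {subject_rent}, gap: {gap:+.0f}")
```

A positive gap means the subject sits above the fitted line, that is, it rents for more than its comps would suggest; a negative gap means the opposite.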

Following this type of traditional comp analysis, one may conclude that the asset is a bit more expensive than its average peer. But is that really the case?

The iterative process of fitting a line to a set of observations using gradient descent

Eliminating Cognitive Bias

If we let the clustering algorithm do the work instead of a human analyst, we eliminate any potential for cognitive bias and let the data do the work:

In this case, machine learning discovered roughly 10x more similar properties for the comp analysis. Using the enhanced comp set, it turns out that our asset (yellow dot) is actually below the fitted line (shown in green). That is to say, the asset’s rent per two-bedroom unit is actually below the value expected from its peers.

In this example we can see that by leveraging an enhanced comp set constructed by the AI, we were able to reach a fundamentally different conclusion: an asset that initially appeared expensive to rent was actually renting below the level of its peers.

Using Virtual Neighbourhoods for more accurate Property Value Prediction

The usefulness of Skyline AI’s Virtual Neighbourhoods doesn’t end at comp analysis. Virtual Neighbourhoods may also be used in another interesting fashion: improving the accuracy of predictions for current and future property values and rents.

Predictions of current and future property values and rents are made using supervised learning algorithms. Supervised learning is the machine learning task of learning a function that maps an input to an output based on labeled training data consisting of a set of training examples. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Once trained, the model can estimate the labels of unseen instances. In our example, the labels could be transaction values (a continuous target, which makes this a regression task rather than a classification one), and the training data comprises all of the data related to each transaction.

How can we generate more quality inputs to improve our prediction model?

One of the most significant factors in the model’s ability to generalize from the training data is the quality of the features comprising that data. For example, when we try to predict an asset’s value, it’s vital that we have quality features of the property such as the year built, condition, location, etc. And the more of these, the merrier.

Using supervised learning models: predicting asset values based on labeled-training examples

So How Can We Generate More High Quality Features to Improve Our Supervised Model?

As it turns out, Skyline AI’s Virtual Neighbourhoods can assist with this process. That is because the newly generated clusters may be used as additional data points to enrich the data set used in the supervised models. This process is commonly referred to as Feature Engineering.

In this case, we are using the unsupervised learning models to distill a new feature, the Virtual Neighbourhood, out of the existing data.

Using the results of the Unsupervised Learning Clustering model to enhance the Supervised Learning Neural Network models

Injecting the property’s “Virtual Neighbourhood” as an additional new feature acts to significantly increase the model’s prediction accuracy, reducing its RMSE and improving other accuracy benchmarks.
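The effect of adding a cluster label as a feature can be illustrated on a toy problem. The sketch below is only a stand-in: it uses a linear model trained by gradient descent rather than a neural network, the cluster IDs are hard-coded as if they came from an earlier clustering step, and all values and helper names (`fit_linear_gd`, `rmse`) are made up. It also scores on the training data, which is enough to show the mechanism but is not how accuracy would be validated in practice.

```python
def fit_linear_gd(rows, targets, lr=0.1, steps=3000):
    """Fit a linear model by gradient descent on mean squared error; returns a predictor."""
    n, d = len(rows), len(rows[0])
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 or 1.0 for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        errors = [sum(w * z for w, z in zip(weights, r)) + bias - t
                  for r, t in zip(scaled, targets)]
        for j in range(d):
            weights[j] -= lr * 2 / n * sum(e * r[j] for e, r in zip(errors, scaled))
        bias -= lr * 2 / n * sum(errors)

    def predict(row):
        return bias + sum(w * (v - m) / s for w, v, m, s in zip(weights, row, means, stds))
    return predict

def rmse(predict, rows, targets):
    """Root mean squared error of a predictor over a data set."""
    return (sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)) ** 0.5

# Hypothetical training data: year built, cluster ID from a prior
# clustering step, and sale price per unit (in $ thousands).
year_built = [1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013]
cluster_id = [0, 1, 1, 0, 1, 0, 0, 1]
price = [101, 132, 139, 112, 145, 122, 123, 158]

baseline_rows = [[y] for y in year_built]
enriched_rows = [[y, c] for y, c in zip(year_built, cluster_id)]

rmse_baseline = rmse(fit_linear_gd(baseline_rows, price), baseline_rows, price)
rmse_enriched = rmse(fit_linear_gd(enriched_rows, price), enriched_rows, price)
print(f"RMSE without cluster feature: {rmse_baseline:.2f}")
print(f"RMSE with cluster feature:    {rmse_enriched:.2f}")
```

With only two clusters, a single 0/1 column is enough. With more clusters, the label would typically be one-hot encoded so the model does not read an ordering into the cluster IDs.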

The DNA of Real Estate

Real estate properties and their prices are complex objects. For this reason, as we’ve seen in the peer groups example, traditional methods of using just a few obvious intuitive indicators do not always work very well.

The prices of real estate involve complex interactions among many different input features which behave differently in different parts of the input space. Letting an AI system that combines supervised and unsupervised models learn the nature of these inputs and interactions may produce far better results across multiple domains in real estate investment analysis.

Over the last few years, we’ve seen AI disrupt a number of traditional industries, and the real estate market should be no different. We believe that the power of Skyline AI’s technology to understand vast amounts of data that affect real estate behaviour will unlock billions of dollars in untapped value.

By Or Hiltch, Co-Founder & CTO