Using and Understanding Machine Learning (ML) Models

Part 2 of 2

Ola Zytek
Data to AI Lab | MIT
8 min read · Apr 26, 2024


In Part 1 of this tutorial, we learned how to prepare data for ML and trained a model to predict house prices in Ames, Iowa.

Here in Part 2, we will learn how to use this model to generate house price predictions. In addition, we’ll learn how to get ML explanations, which allow us to see where the model’s predictions came from and to understand the housing market better as a whole.

Sample Houses

Let’s say we’ve got a list of 10 houses, newly on the market, that we want to look at, understand, and get price predictions for. We can load these in from Pyreal:
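The loading step might look like the following sketch; the sample-application module path and its arguments are assumptions and may differ across Pyreal versions:

```python
# Hypothetical sketch: load the 10 sample houses bundled with Pyreal's
# Ames Housing sample application (module path and argument names are
# assumptions; check the Pyreal docs for your version).
from pyreal.sample_applications import ames_housing

houses = ames_housing.load_data(n_rows=10)
houses.head()
```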

A sample of our data looks like this:

The primary differences between this data and the training data from Part 1 of this tutorial are that a) we don’t yet know the ground-truth sale price for these houses (since they haven’t been sold yet), and b) we have an “Address” column (for this tutorial, these are fake addresses).

Getting Started with Pyreal

Pyreal is a Python library that makes it easy to use and understand ML models and their predictions. You can install Pyreal from pip with pip install pyreal, or by following the installation instructions here.

Most of Pyreal’s functionality is encapsulated in a RealApp object, which is instantiated with a few inputs, all of which are discussed in Part 1 of this tutorial:

  1. The training data (X) and targets (y)
  2. The trained ML model
  3. Any transformers needed to transform the data from its original state to the state expected by the model, as described in Part 1 of this guide. In this case, this includes a custom imputer, a one-hot encoder, and a standardizer.

Additionally, we can make the predictions and explanations we will generate easier to read and use with:

  1. A dictionary of feature descriptions, which links the feature names used in data columns to readily-understandable descriptions. For example, for Ames Housing, we have entries such as {"LotFrontage": "Linear feet of street connected to property", "LotArea": "Lot size in square feet", ...}
  2. A function that puts the model prediction in a readable format, such as converting numeric values to formatted dollar amounts (e.g. 12343.922 to $12,343.92).
  3. If the input data has an ID column (such as our Address column), we can pass this in as well, which will allow us to access outputs later via these IDs.
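As a concrete illustration of items 1 and 2 above, the helpers might look like this (the names are our own, not fixed by Pyreal):

```python
# Illustrative helpers for the readability options above
# (names are our own choices, not required by Pyreal).

# Item 1: map raw column names to readable descriptions.
feature_descriptions = {
    "LotFrontage": "Linear feet of street connected to property",
    "LotArea": "Lot size in square feet",
}

# Item 2: format a numeric prediction as a dollar amount,
# e.g. 12343.922 -> "$12,343.92".
def format_dollars(prediction):
    return f"${prediction:,.2f}"

print(format_dollars(12343.922))  # $12,343.92
```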

Putting all this together for our housing dataset, we can initialize our RealApp with:
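A sketch of that initialization is below; it reuses the model, data, and transformers built in Part 1, and the keyword names (transformers, feature_descriptions, id_column) are assumptions to check against the Pyreal docs for your version:

```python
# Sketch: build a RealApp from the Part 1 objects (keyword argument
# names are assumptions and may differ across Pyreal versions).
from pyreal import RealApp

realapp = RealApp(
    model,                           # trained model from Part 1
    X_train,                         # training data, original format
    y_train,                         # sale-price targets
    transformers=transformers,       # imputer, one-hot encoder, standardizer
    feature_descriptions=feature_descriptions,
    id_column="Address",
)
```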

Predicting

The most basic yet important functionality of an ML model is making predictions. In this case, we want to predict the price of a new house we are interested in. To do this, we will use the RealApp’s predict() function.

This function allows you to get model predictions without having to worry about running the transformers yourself — you pass in your original data, and the RealApp handles the rest.

Let’s investigate one of our sample houses — 3880 Hazelwood Ave.
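Continuing with the realapp and houses objects from above, the call might look like this sketch (keying predictions by the Address ID column is an assumption about the return format):

```python
# Sketch: predict prices for all sample houses, then look up one house
# by its Address ID (return format is an assumption; Pyreal applies
# the transformers to the original-format data for us).
predictions = realapp.predict(houses)
print(predictions["3880 Hazelwood Ave."])
```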

Inputting this first house of interest, we can see that it’s predicted to sell for $172,878.25. But why? The predicted price alone might be helpful if we are blindly and rapidly selling houses, or to get a quick and rough estimate, but in the real world, we often want more information.

For example, we might ask questions like:

  1. What information about the house (features) contributed to this prediction of a sale price around $170,000?
  2. What features, in general, does the model consider most important for predicting house prices?
  3. How does the model use specific features, like the “Above ground living space” feature? Are bigger houses always predicted to be more expensive? Are there diminishing returns on house size?
  4. Have we seen houses in the past that were similar to this one, and what did they sell for?

To address these questions, we can use ML explanations, which can be accessed easily using Pyreal.

What features contributed to the house price prediction?

To answer the first question — what information about the house has contributed to this prediction — we can use what’s called a feature contribution explanation. These explanations tell you how much each feature, in dollars, contributed positively or negatively to the final house price prediction.

We generate these explanations with the corresponding function in our RealApp object. The output of this function gives us a DataFrame with feature names, values, and their contribution to the prediction, which we can then visualize using Pyreal’s visualize module:

The result looks something like this (you may see differences due to randomness in the model training and explanation generation processes):

Red bars (to the left) represent features that reduced the house price prediction. Blue bars (to the right) represent features that increased the house price prediction. The x-axis shows the amount each feature contributed, in dollars.

We can see that the house’s price prediction was significantly reduced because of its size and material rating, by around $15,000 and $10,000, respectively. On the other hand, its large basement, garden walls, and overall condition increased its prediction. Note that these contributions are applied to the average house price across all houses in the training dataset (in this case, around $180,000). Summing together this average price with the contributions of every feature will give you the predicted price for the house.

It looks like the model put a lot of stock in the house size for this house. This seems to make sense, but we can’t be sure if there was something particular about this house’s size that makes it important, or if house size is always important for the model.

We can use another type of explanation to see which features the model finds important in general, giving us a better sense of how our model works and what correlations it has found in the housing market.

What features are most important for house prices in general?

We can see what features the model considers to be most important for house price predictions in general with a feature importance explanation. It’s worth noting that these features may not necessarily be causally linked to house prices — they are just the features the model uses the most, and could merely be correlated.

Once again, we use the corresponding produce and visualize functions to see our explanation:
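A sketch of this global explanation follows; as before, the function names are assumptions based on the Pyreal docs (no input houses are needed, since the explanation describes the model as a whole):

```python
# Sketch: global feature importance for the model
# (function names are assumptions; check your Pyreal version).
from pyreal.visualize import feature_bar_plot

importance = realapp.produce_feature_importance()
feature_bar_plot(importance)
```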

This will give us a plot like:

Importance values can be thought of as unitless relative importance scores, and are always positive. The larger the bar, the more important the model considers that feature to be, relative to other features.

Here we see some features similar to the ones specifically important for our house on Hazelwood Ave, but also some new ones. The overall quality of materials and the size of the house and basement appear to be quite important to the model for predicting house prices. Additionally, the model seems to consider whether and when the house was remodeled. Here, we can start to get a hint of what the market values overall.

But from this explanation alone, we still don’t fully understand how the model uses individual features. We can see that house size (living area) is important — but does this mean that bigger houses sell for more? At what size does the size of the house start increasing the price (relative to the average house price) instead of reducing it? Are there diminishing returns on house size?

We can use another explanation to better understand how the model uses specific features, and answer these questions.

How does the model use specific features?

Knowing that a feature is important isn’t the most useful information unless we understand how the model uses that feature in general. Did the “Above ground living space” feature contribute negatively for 3880 Hazelwood Ave. because that’s a small size, or because it’s a large size and the model actually thinks bigger houses should sell for less? Are there diminishing returns on house size?

To answer these questions, we can aggregate feature contribution explanations across the full training dataset:
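A sketch of that aggregation follows; both feature_scatter_plot and the “Gr Liv Area” column name are assumptions that may differ in your Pyreal version and dataset:

```python
# Sketch: contributions for every house in the training data, then a
# scatter of feature value vs. contribution for one feature (function
# and column names are assumptions; check your Pyreal version).
from pyreal.visualize import feature_scatter_plot

contributions = realapp.produce_feature_contributions(X_train)
feature_scatter_plot(contributions, "Gr Liv Area")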

This gives us:

Each dot in this plot represents one house from the training dataset. The x-axis lists the possible feature values for the selected feature, while the y-axis shows the contribution that value had to the model’s prediction for that house.

We can see that larger above-ground living areas do indeed increase the model’s predictions, while smaller areas decrease it, with a transition point happening around 1,700 square feet — 3880 Hazelwood Ave., with a size of around 1,200 square feet, is smaller than average. At larger sizes, the model is also less consistent with how much change in price it attributes to size, as indicated by the larger spread of points toward the right side of the plot.

We can use this same method to investigate a different feature, the overall quality of materials:
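That might look like this sketch (again, the visualization function and the “Overall Qual” column name are assumptions):

```python
# Sketch: the same aggregated-contribution view for the overall
# material quality feature (names are assumptions).
from pyreal.visualize import feature_scatter_plot

contributions = realapp.produce_feature_contributions(X_train)
feature_scatter_plot(contributions, "Overall Qual")
```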

Which gives us:

Again, we can see that higher-quality materials lead to higher predicted house prices (which makes sense), with the model starting to consider the material “better than average” at a score of 7. Among very low material ratings (1–4), there is little difference in the degree to which the model considers them important, while at high ratings this feature alone can raise the predicted house price by almost $100,000.

Let’s revisit our house on Hazelwood Ave to address the final question we suggested — what happened in other cases similar to this one?

Which houses are similar to this one, and what did they sell for?

Another way to better understand how a house sits within the larger housing market context is to look at similar houses and see how much they sold for. A house is considered “similar” by an ML model if it has similar feature values — similar size, location, utilities, etc.

You can get a list of similar houses, as well as the amount they sold for, by using the produce_similar_examples function:
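A sketch of that call follows; the num_examples argument and the structure of the return value (feature values plus sale prices per house ID) are assumptions about the Pyreal API:

```python
# Sketch: retrieve the training houses most similar to our target
# house, along with their actual sale prices (argument names and
# return structure are assumptions; check your Pyreal version).
similar = realapp.produce_similar_examples(houses, num_examples=3)

print(similar["3880 Hazelwood Ave."]["X"])  # feature values of similar houses
print(similar["3880 Hazelwood Ave."]["y"])  # their actual sale prices
```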

The resulting table shows a few houses similar to the one at 3880 Hazelwood Ave., while highlighting the differences.

For example, house 664 is similar in most ways to 3880 Hazelwood Ave., and sold for around $165,000 — very similar to our model’s predicted price of around $170,000. A few differences could lead to different prices, however, such as 3880 Hazelwood Ave. having a slightly smaller (but more regularly shaped) lot and being made from different materials.

Final Thoughts and Next Steps

In this two-part tutorial, we went over the basics of preparing an ML model and using Pyreal to make and understand its predictions.

ML models can be powerful assets in many domains, if you have the tools to use and understand their predictions. For next steps, I recommend taking a look at some of the documentation linked throughout these guides for further ML model and explanation options.
