Using and Understanding Machine Learning (ML) Models

Part 1 of 2

Ola Zytek
Data to AI Lab | MIT
7 min read · Apr 26, 2024


The Farmhouse at Iowa State University in Ames. Photo by Alan on Flickr

Machine learning (ML) is being used in an increasingly wide range of domains, including healthcare, finance, social welfare, and energy. ML models look at data and make predictions about the future — but getting the greatest benefit from ML involves looking at more than just predictions. ML explanations offer insights into how a machine learning model “thinks” about the world.

In this series of tutorials, we will talk about how to harness ML and ML explainability to understand a domain and make better decisions. In this post, Part 1, we will go over the basics of preparing an ML model to predict house prices. In Part 2, we will talk about how to use and understand that ML model.

All code used in this tutorial can be found in our accompanying Colab notebook. The code uses Pyreal, a Python package that makes it easy to use and understand ML models.

If you are following along with your own code, you can install Pyreal from pip with pip install pyreal, or by following the installation instructions here.

Predicting House Prices

The housing market is a complex system that affects many people, from prospective homebuyers, sellers, and real estate agents to homeowners and renters. Hundreds of subtle factors combine to determine the ultimate selling price of a house, including location, size, utilities, rooms, and accessibility. The Ames Housing Dataset, for instance, gives values for 79 such factors for almost 3,000 houses sold in Ames, Iowa from 2006 to 2010.

As a human, looking at this amount of information is a daunting task. But with a little help from machine learning (ML), it’s possible to glean a whole lot of useful information from this data. ML models find patterns and correlations between different factors — or features — to make predictions about one or more target variables — in this case, house price.

With the right tools, ML can not only make predictions about new data (for instance, predicting the price of a new house that was not included in this dataset) — it can also offer detailed explanations of any correlations that the model found between different types of information, answering questions like “Why is this house so expensive?” and “What factors did the housing market value at this time?”

Loading in the Data

Let’s start by getting familiar with the data we will be using. We will load in the Ames Housing Dataset and take a look at a subset of it.

Traditionally in ML, we refer to the input data as X (that is, the features that the model will use to make predictions) and the target values as y (that is, the values that the model will be predicting — in this case, sale price). We will use this set of houses with known sale prices to predict the sale prices of future houses.
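If you are writing the code yourself rather than using the notebook, loading the data might look roughly like the sketch below. The CSV file name is an assumption; SalePrice is the target column in the Ames data, but the accompanying notebook provides its own loading helper.

```python
import pandas as pd

# Load the Ames Housing data from a local CSV (file name is an assumption;
# the accompanying notebook handles loading for you)
data = pd.read_csv("ames_housing.csv")

# X holds the input features, y holds the target values (sale prices)
X = data.drop(columns=["SalePrice"])
y = data["SalePrice"]
```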

Our input data X is stored in a pandas DataFrame. DataFrames store information in tables of named columns and rows. In our case, each row refers to a single house, and each column refers to a feature: a single piece of information about the house, such as its lot size, number of bedrooms, or location.

For example, we can take a look at a small snippet of our data table X.
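A plain pandas call is enough to pull up such a snippet; the handful of columns selected here is just a few of the 79, chosen for illustration:

```python
# Look at a few example features for the first three houses
X[["LotFrontage", "LotArea", "LotShape", "Utilities"]].head(3)
```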

While you probably don’t understand all the features just by looking at them (at a later step, we’ll add in feature descriptions to fix this), you can get a sense of the kind of information we are working with. For example, we can see how much of the lot touches public land (LotFrontage), the total size of the lot (LotArea), the rough shape of the lot (LotShape), and the utilities available (Utilities). If you run the code yourself, you can look at all of the 79 features we have for each house.

Next we have y, a pandas Series containing what we call the target feature or ground-truth values — in other words, the value that we want our ML model to predict. In this case, each entry in y is the actual price that the house represented in the corresponding row in X sold for. We can see that the three houses in our sample table above sold for $129,000, $118,000, and $129,500, respectively.

Usually when training ML models, we split our data into a training set and a testing set. The training set is used to train the model, and the testing set is used to check the model’s performance. This allows us to see how well the model generalizes to unseen data. Scikit-learn (sklearn) offers a convenient helper function to split our data into these two sets.
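A split along these lines might look like the following; the 80/20 ratio and random seed are illustrative choices, not necessarily those used in the notebook.

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the houses as a test set for evaluating the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```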

Now we have our training and testing data. We are almost ready to start training a model — but first, we need to talk about transforming data.

Transforming Data

Most ML models require data to be in a specific format. For example, many cannot take in categorical features (such as our MSZoning feature above) directly — they require all data to be numeric. Or they can’t handle missing data (such as in our Alley feature). Many also will not perform well if features are on very different numeric scales (for example, the LotArea values may significantly outweigh the LotFrontage values).

To address these kinds of issues, we transform the data. Here, we will introduce three common data transformation types:

Imputing replaces missing values (None or N/A) with reasonable replacement values, such as the most common or average (mean) value in the dataset. In the case of the Ames Housing dataset, we can usually guess the meaning of a missing value — for example, a “basement size” value of N/A usually means there is no basement, and therefore the most reasonable value to impute would be 0.
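As a quick illustration of that kind of domain-informed imputation (TotalBsmtSF is the basement-area feature in the Ames data; treat this as a sketch rather than the notebook's exact code):

```python
# Domain-informed imputation: a missing basement size most likely means
# "no basement", so fill it with 0 rather than the dataset mean
X_train["TotalBsmtSF"] = X_train["TotalBsmtSF"].fillna(0)
```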

One-Hot Encoding transforms categorical features into numeric features. In this process, we replace a single categorical column with one column per possible value. For each row, the column corresponding to that row’s value is set to True and all others to False (and then we represent these values numerically as True=1, False=0).

For example, our Street variable can take two values — Grvl (gravel) or Pave (paved). After one-hot encoding, this single column becomes two columns, Street_Grvl and Street_Pave.
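We can see the effect on a toy version of the column, using pandas’ get_dummies as a stand-in for the encoder we will use later:

```python
import pandas as pd

# A toy Street column with the two possible values
street = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# One-hot encode it: one column per value, 1 where the value matches
pd.get_dummies(street, dtype=int)
#    Street_Grvl  Street_Pave
# 0            0            1
# 1            1            0
# 2            0            1
```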

Standardization scales numeric features to have a mean of 0 and variance of 1, effectively allowing all numeric features to be represented on the same scale.

All three of these transformers can be easily instantiated in Pyreal from the transformers module. For imputing, instead of using a default imputer (which imputes categorical features with the most common value and numeric features with the average value), we will use our hand-made AmesHousingImputer, which factors in our understanding of the domain. For example, this custom transformer will impute BasementSize feature values of N/A with 0, as this usually means there is no basement.

Initializing the transformers is straightforward with Pyreal. All transformers take an optional columns parameter, which specifies which columns to transform; in this case, we want to impute all features, one-hot encode the categorical features, and then standardize all features (which will all be numeric after one-hot encoding).

Once initialized, we fit our transformers. Most transformers need to be fit to training data, for example to determine the possible categorical feature values or the feature ranges and mean values. Typically, we fit our transformers to our training data only, and then transform our testing and input data with these pre-fit transformers.
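Since the exact Pyreal transformer classes and the AmesHousingImputer code aren’t reproduced here, the following is a rough equivalent sketch using scikit-learn’s transformers as stand-ins; Pyreal’s transformers follow the same initialize-then-fit pattern, with the custom imputer taking the place of the generic ones below.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate the categorical and numeric columns
categorical_cols = X_train.select_dtypes(include="object").columns
numeric_cols = X_train.select_dtypes(exclude="object").columns

# Impute and one-hot encode categorical features;
# impute and standardize numeric features
preprocessor = ColumnTransformer([
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
])

# Fit the transformers on the training data only
preprocessor.fit(X_train)
```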

Training an ML Model

We are now ready to train our ML model!

There are many Python libraries out there for training ML models, such as scikit-learn, XGBoost, LightGBM, PyTorch, and TensorFlow.

For this exercise, we will use XGBoost, a popular and powerful library that offers classifiers and regressors using the gradient boosting framework. It is an effective choice for many ML use cases.

XGBoost models require data to be numeric (so one-hot encoding is required), but can handle missing and non-standardized data natively. For simplicity, for this tutorial we will use the full transformer set introduced earlier.

Training an ML model is simple — we transform our data, fit the model to the training data, and then score it on the testing data. We can improve the performance of the model by experimenting with different hyperparameters — that is, the tuning values used for training. Our sample code contains some options we’ve found to work well on this dataset.
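Put together, that loop might look like the sketch below; the hyperparameter values shown are placeholders, and the notebook lists the options we found to work well on this dataset.

```python
from xgboost import XGBRegressor

# Transform the training and testing data with the pre-fit transformers
X_train_t = preprocessor.transform(X_train)
X_test_t = preprocessor.transform(X_test)

# Train an XGBoost regressor (hyperparameter values here are placeholders)
model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
model.fit(X_train_t, y_train)

# Score the model on the held-out test set (returns R² for regressors)
print(model.score(X_test_t, y_test))
```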

The R² score (or coefficient of determination) shows how accurately the model fit the data and predicted the unseen house prices in the testing dataset. Values closer to 1 mean a better fit; our model scores about 0.92.

You can now make predictions on data by using the model’s .predict function on the data after running it through your transformers.
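For example, continuing the sketch above (X_new stands for a hypothetical DataFrame of new houses with the same feature columns as X):

```python
# Predict sale prices for new, unseen houses
# (X_new is a hypothetical DataFrame with the same feature columns as X)
predicted_prices = model.predict(preprocessor.transform(X_new))
```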

We now know how much money the model predicts these houses will sell for — but we don’t know how it came up with these numbers. It would be hard to use these predictions alone to price a real house — what if there are factors the model missed, or something unreasonable about its logic? Also, if we stick to these single predictions, we don’t learn anything about the housing market as a whole.

In Part 2 of this tutorial, we will go through using Pyreal — a Python library that makes it easy not only to use your ML model, but to understand it and its predictions.
