Yet Another Boston Housing Data Example

Using the SimplML macOS hybrid modeler

NTTP
Feb 28, 2024
Boston. Or so says Clifford. Cliff for short? Photo by Clifford on Unsplash

When learning a new tool or method for forecasting, it is sometimes useful to read in an old, familiar dataset and see how the new tool performs versus methods you already know. For this example use case of our SimplML app, we take the classic Boston housing price data set, which is used in many machine learning and regression examples here on Medium and elsewhere (just search for “Boston housing” or “Boston house”), and see what happens:

https://www.kaggle.com/datasets/vikrishnan/boston-house-prices/data

The macOS app SimplML can be obtained on the Mac App Store (free for now!):

https://apps.apple.com/us/app/simplml/id6449885047?mt=12

The original data appears to date from 1978, well into the Fortran era (variable names of six characters or fewer, and all caps at that), and it is formatted accordingly, with each data point split across two lines; so we pre-processed the data into a CSV file, one data point per line, per modern usage.
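For readers who want to reproduce the pre-processing step, here is a minimal Python sketch of the conversion. The raw file name (housing.data) and the exact whitespace-delimited, two-lines-per-record layout are assumptions about the download; adjust as needed:

# Sketch: convert the Fortran-style two-lines-per-record file into a
# one-row-per-record CSV. File names and layout are assumptions.
import csv

COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

with open("housing.data") as f:
    tokens = f.read().split()  # flatten all whitespace-separated values

assert len(tokens) % len(COLUMNS) == 0, "unexpected token count"

with open("boston.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for i in range(0, len(tokens), len(COLUMNS)):
        writer.writerow(tokens[i:i + len(COLUMNS)])

Because the splitter ignores line breaks entirely, it doesn’t matter how the records were wrapped across lines in the original file.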

We won’t re-describe the variables in detail, since this has been done many times elsewhere, but we have 13 candidate predictors of the final column MEDV (median housing price), two of which look like they can be treated as categorical variables: CHAS (1 = property tract bounds the Charles River, 0 if not) and RAD (index of accessibility to radial highways). An index sounds like a categorical variable, so we will treat it as such without looking too deeply into the concept behind that variable… since this is just a quick test.
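SimplML handles the categorical flag internally (next step); for anyone following along in Python instead, the equivalent hand-rolled step would be one-hot encoding. A sketch, assuming the boston.csv file produced above:

# Sketch: one-hot encode CHAS and RAD so downstream regressors treat
# them as categories rather than as ordered numbers.
import pandas as pd

df = pd.read_csv("boston.csv")
df = pd.get_dummies(df, columns=["CHAS", "RAD"])
print(df.filter(like="RAD_").head())  # one indicator column per RAD level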

First we read in the CSV file using the Select CSV file to model button, then we flip the switches to indicate which variables are X predictors, and which variable is the Y value to forecast (MEDV).

Then we flip the categorical switches on for CHAS and RAD.

CHAS and RAD variables, showing categorical switches flipped on (right).

Just so we are all on the same page, we press the reset button to set the modeler to its defaults, and then we set both the max and the starting number of basis functions to 400. Since our data set has 506 points and we withhold 20% of those for testing (our default setup), our training point count is about 405, so 400 starting basis functions should give us a pretty thorough basis-function search space.

This small data set would not scale well across multiple CPU cores, due to the thread overhead in SimplML’s parallelization approach, so we leave the CPU core count at 1.

We want to use the default forward stepwise matrix method here, so we just press solve.

First trial results:

First trial results. Red mark indicating where we leave the CPU count at 1. Red dot plot is the 20% withheld test set.

Train AdjR2 90.3% RMS error 2.79
Test R2 77.7% RMS error 4.14

We pay more attention to adjusted R2 for training (instead of regular R2) because adjusted R2 corrects for the number of basis functions in the model… hence the name.
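For reference, the standard adjusted R2 formula is:

AdjR2 = 1 − (1 − R2) × (n − 1) / (n − p − 1)

where n is the number of training points and p is the number of fitted terms; we assume SimplML plugs its basis-function count in for p, consistent with the behavior described above. Adding terms that do not genuinely improve R2 pushes the adjusted value down.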

The training metrics being notably better than the test metrics suggests overfitting, so we slightly tune the function width hyperparameter, changing it from 2 to 1. This makes each basis function narrower: more localized around its center, less global.
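SimplML’s exact basis-function form is covered in the white paper linked near the end of this article; purely for intuition, picture a Gaussian-style radial basis function (an illustrative assumption, not necessarily the app’s exact form):

phi(x) = exp( −‖x − c‖² / w² )

Halving the width w from 2 to 1 quadruples the magnitude of the exponent at any given distance from the center c, so each function’s influence falls off much faster.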

Second trial, red mark indicating where we adjusted the basis function width to 1. Dot plot is from the TEST data set. On the dot plot, we no longer see any predicted values going below zero (as in the last screenshot), so that’s good.

Train AdjR2 87.2% RMS error 3.21
Test R2 79.1% RMS error 3.92

A slight improvement in the out-of-sample (OOS) test.

Next, we remove a couple of variables that look to be low contributors according to SimplML’s variable sensitivity estimates (CHAS and INDUS), and we now get:

By setting the left switch to none for INDUS and CHAS, we remove those variables from the model. We must re-solve the model after doing this.
Results for the model with the CHAS and INDUS variables removed. This report indicates that some additional variables might be removed if we feel like it (RAD, ZN, and AGE, for example).

Train AdjR2 88.2% RMS error 3.14
Test R2 83.2% RMS error 3.50
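Those sensitivity estimates are internal to SimplML. A rough do-it-yourself analogue is permutation importance: shuffle one column of the test set at a time and see how much the test score drops. A sketch, with the model class and file name as stand-in assumptions rather than anything SimplML does internally:

# Sketch: permutation importance as a rough analogue of variable
# sensitivity. GradientBoostingRegressor and boston.csv are stand-ins.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("boston.csv")
X, y = df.drop(columns=["MEDV"]), df["MEDV"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=0)

# Variables whose shuffling barely hurts the test score are
# candidates for removal.
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda t: t[1]):
    print(f"{name:8s} {score:+.4f}")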

This final model has 16 total basis functions (reported on the lower right of the green output area), of which 1 is an ordinary linear term and 13 are directional, leaving 2 radial functions. Each of these functions has only one weight to solve for, so this is a 16-weight model plus a constant (and of course we should count our single width hyperparameter, which we slightly tuned). Still, 16 basis functions (one weight each) to model 11 variables and about 400 points is fairly compact in the ML space.

Same as above but now showing “of fit” dot plot on upper left. Use “cycle plots” button [red mark above] to cycle through the plots.

We can then use the 2D slice feature along with the variable sliders to visualize the output of the model at different parameter settings. The exact MEDV forecast value at given slider settings shows up at the top of the user interface.

Cross-section slice of the model at a point, looking along the NOX variable. This is not the main effect of the NOX variable but a true slice through the model, with all other parameters set as positioned on the sliders. The output Y value (MEDV) at these slider settings, 30.2428, appears on the last line of the scrolling colorized variable rows in the screenshot above; be careful not to read too much into the six figures shown, as the model is not accurate to six figures. The gentle curvature shows the nonlinear nature of the model.
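The same kind of cross-section can be sketched for any fitted model outside the app by fixing all predictors at chosen “slider” values and sweeping one of them; the model class and file name below are again illustrative assumptions:

# Sketch: a 1-D slice through a fitted model, sweeping NOX while
# holding every other predictor at its median ("slider" settings).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("boston.csv")
X, y = df.drop(columns=["MEDV"]), df["MEDV"]
model = GradientBoostingRegressor(random_state=0).fit(X, y)

base = X.median()  # hold all predictors at their medians
grid = np.linspace(X["NOX"].min(), X["NOX"].max(), 50)
slice_rows = pd.DataFrame([base] * len(grid))
slice_rows["NOX"] = grid  # sweep NOX only

preds = model.predict(slice_rows)
for nox, p in zip(grid[::10], preds[::10]):
    print(f"NOX={nox:.3f} -> MEDV={p:.2f}")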

Additionally, we can read in a new table of data to evaluate, using the Select CSV File To Evaluate option on the upper left of the app. After selecting the file, the results are automatically computed and saved in tabular form to the output folder, which can be accessed via the export folder button.

Export folder button. This folder contains intermediate results and dot plot results also, in HTML form. Double click on an HTML plot file to read it into your default browser.

Here we just re-read the original data file to show how to do it:

The other way to evaluate the model, aside from manipulating the sliders, is to read in a whole set of data points with the same columns as the training data. Here we just read in the original data set again. The screenshot above shows the first points of the results.

Additional model tuning may be possible by adjusting hyperparameters further and/or eliminating additional variables. More theory on the hyperparameters and the model itself is in our formal white paper:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4484861

Summary

This simple nonlinear regression example shows how we can quickly create a model to fit nonlinear data in a “no code” manner using the SimplML app, and validate it forthwith.

Poor man’s cross validation

After finding a model you like, you can set the randomizer switch from direct to cached:

New randomizer switch (added Feb 2024), upper right.

With the switch at direct, the results come out the same each time you re-press solve without changing the model settings, because the random values used for test point selection and initial basis function selection are seeded with a constant and computed on demand. With the switch set to cached, the random values are not reseeded each trial, so you get a different selection of test points each time, along with (likely) different initial basis function candidates. Hence, by re-running the same model several times with this switch set to cached, you can get an idea of how stable the model is, from a cross-validation point of view, by examining the test data output metrics on each trial.
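The same idea, sometimes called repeated random holdout or Monte Carlo cross-validation, looks like this outside the app; the model class and file name remain stand-in assumptions:

# Sketch: re-draw the 20% test split each trial (no fixed seed),
# analogous to re-pressing solve with the switch set to cached.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("boston.csv")
X, y = df.drop(columns=["MEDV"]), df["MEDV"]

for trial in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20)
    model = GradientBoostingRegressor().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"trial {trial}: R2={r2_score(y_te, pred):.3f}, RMSE={rmse:.3f}")

Stable test metrics across trials suggest the model is not leaning on one lucky split.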

Further reading

Some cool Python examples using this dataset:

If you scan through these, you can see that our OOS test data metrics are not too bad by comparison. Be sure to compare the RMSE and R2 of the test-sample evaluation; R2 and RMS of fit can of course yield false positives. SimplML also reports max errors and other metrics that we did not discuss in this article. It looks like those examples withhold 30% of the data for testing, so be cautious about direct comparisons until you re-run this in SimplML with 30% of the data withheld.

The author of one of those examples suggests that, of his tests, XGBoost gave the best OOS results, copied and pasted here (arrows added):

R^2: 0.8494894736313225 <---
Adjusted R^2: 0.8353109457849979
MAE: 2.4509708843733136
MSE: 15.716320042597493
RMSE: 3.9643814199188117 <---
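As a quick sanity check on the quoted numbers, RMSE is just the square root of MSE: sqrt(15.7163) ≈ 3.9644, matching the RMSE line above.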

Compare this to our OOS R2 (with 20% of the data withheld) at 83.2% and RMS error of 3.50.

It is likely that both our model and the above Python author’s models could be tuned further; these are just walk-through examples. We don’t adjust the R2 of the test data results, because it seems a little odd to do so: AdjR2 is usually reported “of fit,” and with a low number of test data points, AdjR2 would, we think, be artificially low. Also, it is uncertain (to us, at this time) exactly how the above Python code adjusts R2 for the various model types.
