Towards Machine Learning in Pharo: Visualizing Linear Regression
A small tutorial that will teach you how to fit a regression line to Boston Housing data and visualize it with Roassal3
This is a small tutorial on how to estimate house prices in Pharo using the linear regression model from PolyMath. We will then visualize the data points together with the regression line using the new charting capabilities of Roassal3.
The main purpose of this blog post is to demonstrate the new charting functionality of Roassal3 that was introduced yesterday. The visualization that we will build is not very pretty, but it will give you a taste of the amazing things that we will be able to do in the near future.
Pharo is a pure object-oriented programming language and a powerful environment, focused on simplicity and immediate feedback (think IDE and OS rolled into one). If you are new to Pharo, you can install it by following the instructions on https://pharo.org/download.
After you have installed and opened your Pharo image, open Playground (Ctrl+OW) and execute (select it and press Ctrl+D) the following Metacello script to install the Datasets library. This library will allow you to download different datasets (including the Boston Housing dataset, which we will be using in this tutorial) and load them into your image as DataFrame objects:
Now run this script to install PolyMath. It is a library for scientific computing in Pharo; we will use its linear regression model.
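A Metacello load script for PolyMath generally has the shape below. This is only a sketch: the repository location is an assumption based on the project's GitHub organization, and it loads the current development version rather than the exact version this post was written against.

```smalltalk
"Sketch: the repository location is an assumption, not taken from this post"
Metacello new
    repository: 'github://PolyMathOrg/PolyMath';
    baseline: 'PolyMath';
    load.
```

Once loading finishes, the PMLinearRegression class used later in this tutorial should be available in the image.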
Finally, execute this script to install Roassal3. The features that I will be showing you were added several hours ago. There is a strong probability that the API will change in the following days, so to make sure that everything works at the time you read this post, you will have to load Roassal3 at a specific commit.
The changes that I am presenting in this blog post are not integrated into the master branch of Roassal3 yet, so you will have to load them separately. To do that, open Iceberg (Ctrl+OI), click on Roassal3, find the Roassal3-Matplotlib package and press Load.
Loading Boston Housing Dataset
To load the Boston Housing dataset, simply run:
boston := Datasets loadBoston.
This will give you a DataFrame with 14 columns:
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000's
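To check that the dataset arrived as described, you can ask the DataFrame about its shape and column names. This is a minimal sketch that assumes the DataFrame messages numberOfRows, numberOfColumns, columnNames and head: are available in the version of the library you loaded.

```smalltalk
"Load the dataset and look at its shape"
boston := Datasets loadBoston.
boston numberOfRows.
boston numberOfColumns.    "should match the 14 columns listed above"

"Names of all columns, then the first five rows"
boston columnNames.
boston head: 5.
```

Select each line and inspect it (Ctrl+I) to see the result.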
We will be using columns RM and MEDV to study the relation between the average number of rooms and the price of a house.
rooms := boston column: 'RM'.
price := boston column: 'MEDV'.
Fitting a Line to Data with PolyMath
The idea behind a simple univariate linear regression is the following:
- We are given a collection of points (x, y) (in our case, x is the number of rooms and y is the price), and
- We need to find two values k and b such that the line y = kx + b is the best approximation of all given points. Those values are called “slope” and “intercept”.
More specifically, if we want to use slope k and intercept b to predict yᵢ (price) for every xᵢ (number of rooms) as
yᵢ´ = kxᵢ + b
we need to select k and b in such a way that the sum of squared differences between the real yᵢ and the predicted yᵢ´ is the smallest:
Σᵢ (yᵢ − yᵢ´)² → min
I will not go into details of how this is done (but I encourage you to read about different ways to compute a linear regression; it’s very interesting). We can find the slope and intercept for given points using the PMLinearRegression class from PolyMath. To do that, we simply add points to it and extract the values of slope and intercept:
regression := PMLinearRegression new.
1 to: rooms size do: [ :i |
    point := (rooms at: i) @ (price at: i).
    regression add: point ].
k := regression slope.
b := regression intercept.
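For intuition, here is a minimal sketch of one standard way to get these two numbers by hand: the closed-form ordinary least squares solution, k = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b = ȳ − k·x̄, where x̄ and ȳ are the means. It is not necessarily how PMLinearRegression computes them internally. The three points are made up so that they lie exactly on y = 2x + 1, and all variable names are illustrative.

```smalltalk
"Made-up points lying exactly on y = 2x + 1 (not Boston Housing data)"
xs := #(1 2 3).
ys := #(3 5 7).

"Means of x and y"
meanX := (xs inject: 0 into: [ :sum :each | sum + each ]) / xs size.
meanY := (ys inject: 0 into: [ :sum :each | sum + each ]) / ys size.

"Sum of products of deviations, and sum of squared deviations of x"
crossDeviation := 0.
squaredDeviation := 0.
1 to: xs size do: [ :i |
    crossDeviation := crossDeviation + (((xs at: i) - meanX) * ((ys at: i) - meanY)).
    squaredDeviation := squaredDeviation + ((xs at: i) - meanX) squared ].

"Closed-form least squares estimates"
manualSlope := crossDeviation / squaredDeviation.
manualIntercept := meanY - (manualSlope * meanX).

"The same points fed to PolyMath, using only the messages shown above"
check := PMLinearRegression new.
1 to: xs size do: [ :i | check add: (xs at: i) @ (ys at: i) ].
{ manualSlope. manualIntercept. check slope. check intercept }.
```

Inspecting the last expression should show the manual values and the PolyMath values agreeing, up to floating-point rounding.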
Now we can build a prediction for the price of a building based on the number of its rooms:
predictedPrices := k * rooms + b.
These predictions are all on the same line, which is called the “regression line”.
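To get a feeling for how well a line fits, we can measure the quantity that the regression minimizes. The sketch below defines a hypothetical helper block, meanSquaredError (not part of PolyMath), that averages the squared differences between real and predicted values. It is demonstrated on made-up numbers.

```smalltalk
"Mean squared error of the line y = slope * x + intercept over paired collections"
meanSquaredError := [ :xs :ys :slope :intercept |
    | sum |
    sum := 0.
    1 to: xs size do: [ :i |
        | residual |
        "Difference between the real value and the value predicted by the line"
        residual := (ys at: i) - ((slope * (xs at: i)) + intercept).
        sum := sum + residual squared ].
    sum / xs size ].

"Made-up data: two points on y = 2x + 1 and one point that is off by 1"
meanSquaredError value: #(1 2 3) value: #(3 5 8) value: 2 value: 1.    "(1/3)"
```

With the variables from this tutorial, the same block could be called as meanSquaredError value: rooms value: price value: k value: b.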
Plotting the Regression Line with Roassal3
Finally, we can build a chart using the Roassal3 visualization library that will show us all our points together with the regression line fitted to them.
We start by creating an empty chart:
chart := RSChart new.
Now we create a scatterplot of points:
points := RSScatterPlot new
    x: rooms
    y: price.
We also create a line:
regressionLine := RSLinePlot new
    x: rooms
    y: predictedPrices.
Then we add our scatterplot and line to the chart:
chart addPlot: points.
chart addPlot: regressionLine.
We give our chart a custom title and custom labels for both of its axes:
chart
    title: 'Boston Housing';
    xlabel: 'Number of rooms';
    ylabel: 'Price'.
Now we can see the chart by selecting the word chart and inspecting it (Ctrl+I or Ctrl+G).