# Towards Machine Learning in Pharo: Visualizing Linear Regression

## A small tutorial that will teach you how to fit a regression line to Boston Housing data and visualize it with Roassal3

This is a small tutorial on how to estimate the prices of houses in Pharo using the linear regression model from PolyMath. We will then visualize the data points together with the regression line using the new charting capabilities of Roassal3.

The main purpose of this blog post is to demonstrate the new charting functionality of Roassal3 that was introduced yesterday. The visualization that we will build is not very pretty, but it will give you a taste of the amazing things that we will be able to do in the near future.

# Installation

Pharo is a pure object-oriented programming language *and* a powerful environment, focused on simplicity and immediate feedback (think IDE and OS rolled into one). If you are new to Pharo, you can install it by following the instructions on https://pharo.org/download.

After you have installed and opened your Pharo image, open Playground (Ctrl+OW) and execute the following Metacello script (select it and press Ctrl+D) to install the Datasets library. This library will allow you to download different datasets (including the Boston Housing dataset, which we will be using in this tutorial) and load them into your image as DataFrame objects:

```smalltalk
Metacello new
    baseline: 'Datasets';
    repository: 'github://PharoAI/Datasets';
    load.
```

Now run this script to install PolyMath. It is a library for scientific computing in Pharo which contains a `PMLinearRegression` class:

```smalltalk
Metacello new
    repository: 'github://PolyMathOrg/PolyMath:v1.0.1/src';
    baseline: 'PolyMath';
    load.
```

Finally, execute this script to install Roassal3. The features that I will be showing you were added several hours ago. There is a strong probability that the API will change in the following days, so to make sure that everything works at the time you read this post, you will have to load Roassal3 at a specific commit, `b9fa9e1`:

```smalltalk
Metacello new
    baseline: 'Roassal3';
    repository: 'github://ObjectProfile/Roassal3:b9fa9e1';
    load.
```

The changes that I am presenting in this blog post are not integrated into the master branch of Roassal3 yet, so you will have to load them separately. To do that, open Iceberg (Ctrl+OI), click on Roassal3, find the `Roassal3-Matplotlib` package and press Load.

# Loading Boston Housing Dataset

To load the Boston Housing dataset, simply run:

```smalltalk
boston := Datasets loadBoston.
```

This will give you a DataFrame with 14 columns:

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000's

We will be using columns `RM` and `MEDV` to study the relation between the average number of rooms and the price of a house.

```smalltalk
rooms := boston column: 'RM'.
price := boston column: 'MEDV'.
```

# Fitting a Line to Data with PolyMath

The idea behind a simple univariate linear regression is the following:

- We are given a collection of points `(x, y)`; in our case, `x` is the number of rooms and `y` is the price.
- We need to find two values `k` and `b` such that the line `y = kx + b` is the best approximation of all given points. Those values are called the *“slope”* and the *“intercept”*.

More specifically, if we want to use slope `k` and intercept `b` to predict `yᵢ` (price) for every `xᵢ` (number of rooms) as

`yᵢ´ = kxᵢ + b`

we need to select `k` and `b` in such a way that the sum of squared differences between the real `yᵢ` and the predicted `yᵢ´` is the smallest:

`Σᵢ (yᵢ - yᵢ´)² → min`

I will not go into the details of how this is done (but I encourage you to read about the different ways to compute a linear regression; it’s very interesting). We can find the slope and intercept for given points using the `PMLinearRegression` class from PolyMath. To do that, we simply add points to it and extract the values of slope and intercept:

```smalltalk
regression := PMLinearRegression new.

1 to: rooms size do: [ :i |
    point := (rooms at: i) @ (price at: i).
    regression add: point ].

k := regression slope.
b := regression intercept.
```
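If you are curious what such a fit boils down to, here is a minimal sketch of the textbook closed-form least-squares formulas, where the slope is `Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²` and the intercept is `ȳ - slope · x̄`. This is not necessarily how `PMLinearRegression` computes its result internally. The data is made up (it is not the Boston Housing data), and all variable names are illustrative assumptions chosen so that they do not overwrite `k` and `b` in your Playground:

```smalltalk
"Made-up sample that lies exactly on the line y = 2x + 1 (an assumption for illustration)."
xs := #(1 2 3 4 5).
ys := #(3 5 7 9 11).
n := xs size.

"Mean of each coordinate."
meanX := (xs inject: 0 into: [ :sum :each | sum + each ]) / n.
meanY := (ys inject: 0 into: [ :sum :each | sum + each ]) / n.

"Accumulate the numerator and denominator of the slope formula."
numerator := 0.
denominator := 0.
1 to: n do: [ :i |
    numerator := numerator + (((xs at: i) - meanX) * ((ys at: i) - meanY)).
    denominator := denominator + ((xs at: i) - meanX) squared ].

sampleSlope := numerator / denominator.
sampleIntercept := meanY - (sampleSlope * meanX).
```

Because the sample points lie exactly on a line, inspecting `sampleSlope` and `sampleIntercept` should give back the slope and intercept of that line.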

Now we can build a prediction for the price of a building based on the number of its rooms:

```smalltalk
predictedPrices := k * rooms + b.
```

These predictions are all on the same line, which is called the *“regression line”*.
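One detail worth keeping in mind when writing such expressions: Pharo evaluates binary operators strictly from left to right, without the usual arithmetic precedence. So `k * rooms + b` means `(k * rooms) + b`, but `b + k * rooms` would mean `(b + k) * rooms`. The sketch below shows the difference on made-up numbers using plain arrays and `collect:`, and also computes the sum of squared differences that the regression minimizes. All names and values here are illustrative assumptions:

```smalltalk
"Made-up sample lying close to the line y = 2x + 1."
sampleX := #(1 2 3 4).
sampleY := #(3 5 8 9).
slope := 2.
intercept := 1.

"Binary operators run left to right: (slope * x) + intercept."
predicted := sampleX collect: [ :x | slope * x + intercept ].

"Written the other way round, this computes (intercept + slope) * x instead."
wrongOrder := sampleX collect: [ :x | intercept + slope * x ].

"Sum of squared differences between the real and the predicted values."
squaredError := 0.
sampleY with: predicted do: [ :real :guess |
    squaredError := squaredError + (real - guess) squared ].
```

Here `predicted` is `#(3 5 7 9)`, `wrongOrder` is `#(3 6 9 12)`, and `squaredError` is `1`, because only the third point is off the line, by exactly one unit.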

# Plotting the Regression Line with Roassal3

Finally, we can build a chart using the Roassal3 visualization library that will show us all our points and the line that goes through them.

We start by creating an empty chart:

```smalltalk
chart := RSChart new.
```

Now we create a scatterplot of points:

```smalltalk
points := RSScatterPlot new
    x: rooms
    y: price.
```

We also create a line:

```smalltalk
regressionLine := RSLinePlot new
    x: rooms
    y: predictedPrices.
```

Then we add our scatterplot and line to the chart:

```smalltalk
chart
    addPlot: points;
    addPlot: regressionLine.
```

We give our chart a custom title and custom labels for both of its axes:

```smalltalk
chart
    title: 'Boston Housing';
    xlabel: 'Number of rooms';
    ylabel: 'Price'.
```

Now we can see the chart by selecting the word `chart` and inspecting it (Ctrl+I or Ctrl+G).