CoreML — Boston Prices exploration

In the previous post of this series we described some of the basics of linear regression, one of the most well-known models in machine learning. We saw that we can relate the values of the input parameters x_i to the target variable y to be predicted. In this post we are going to create a linear regression model to predict the price of houses in Boston (based on valuations from the 1970s). The dataset provides attributes such as the per-capita crime rate (CRIM), the proportion of non-retail business acres in the town (INDUS), the proportion of owner-occupied units built before 1940 (AGE) and the average number of rooms per dwelling (RM), as well as the target we want to predict: the median value of owner-occupied homes in $1000s (MEDV).
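As a reminder of the previous post, such a model relates the inputs to the target through a linear combination; in standard notation (the symbols β_i and ε are the usual choices, not necessarily those of the earlier post) it reads:

```latex
y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon
```

where the β_i are the coefficients learnt from the data and ε is the error term.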

Let us start by exploring the data. We are going to use Scikit-learn and, fortunately, the dataset ships with the module. The input variables are exposed by the data attribute and the prices by the target attribute. We are going to load the input variables into the dataframe boston_df and the prices into the array y:

import pandas as pd
from sklearn import datasets

# Load the Boston house-prices dataset bundled with Scikit-learn
boston = datasets.load_boston()
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names
y = boston.target

We are going to build our model using only a limited number of inputs. In this case let us pay attention to the average number of rooms and the crime rate:

X = boston_df[['CRIM', 'RM']].copy()  # copy to avoid modifying a view of boston_df
X.columns = ['Crime', 'Rooms']
X.describe()

The description of these two attributes is as follows:

            Crime       Rooms
count  506.000000  506.000000
mean     3.593761    6.284634
std      8.596783    0.702617
min      0.006320    3.561000
25%      0.082045    5.885500
50%      0.256510    6.208500
75%      3.647423    6.623500
max     88.976200    8.780000

As we can see, the minimum number of rooms is 3.56 and the maximum is 8.78, whereas for the crime rate the minimum is 0.006 and the maximum is 88.98, although the median is only 0.26. We will use some of these values to define the ranges offered to our users when making predictions.
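For instance, rounding the observed extremes outwards gives sensible input ranges. A minimal sketch (the helper ui_range and the exact rounding are design choices of this example, not part of the post's code):

```python
import math

# Summary statistics taken from the describe() output above
stats = {"Crime": {"min": 0.006320, "max": 88.976200},
         "Rooms": {"min": 3.561000, "max": 8.780000}}

def ui_range(lo, hi):
    """Round outwards to whole numbers so the range covers all observed values."""
    return math.floor(lo), math.ceil(hi)

ranges = {name: ui_range(s["min"], s["max"]) for name, s in stats.items()}
print(ranges)  # {'Crime': (0, 89), 'Rooms': (3, 9)}
```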

Finally, let us visualise the data:
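A simple way to do this is with a scatter plot of price against each of the two chosen attributes. A minimal sketch with matplotlib (the stand-in data below is hypothetical and only mimics the column names used above; in practice X and y come from the dataset loaded earlier):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in values spanning the observed ranges;
# replace with the X and y loaded from the Boston dataset.
X = pd.DataFrame({"Crime": [0.006, 0.26, 3.65, 88.98],
                  "Rooms": [3.56, 6.21, 6.62, 8.78]})
y = [50.0, 21.2, 17.1, 5.0]

# One scatter plot per attribute, sharing the price axis label
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, X.columns):
    ax.scatter(X[col], y)
    ax.set_xlabel(col)
    ax.set_ylabel("Median value (MEDV, $1000s)")
fig.tight_layout()
fig.savefig("boston_scatter.png")
```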

We shall bear these values in mind when building our regression model in subsequent posts.

You can look at the code (in development) on my GitHub site here.


Originally published at Quantum Tunnel Website.
