Implement Linear Regression on Boston Housing Dataset by PyTorch

Treee July
Published in Analytics Vidhya
4 min read · Nov 8, 2019

This article shares one way to implement linear regression on a real dataset, covering data loading, data analysis, splitting the dataset, and building the regression itself. To help you learn PyTorch well, I’ll implement the regression in PyTorch and show you the charm of its forward and backward passes.

This story assumes that readers are already familiar with the principle of linear regression: you should understand what W and b mean in the equation Y = XW + b and how they can be solved for. It also helps to understand gradient descent, which can be used to solve the problem, and the mean squared error (MSE), which is used to evaluate the regression’s performance.

Boston Housing Dataset processing

The Boston Housing Dataset was collected by the U.S. Census Service and concerns housing in the area of Boston, Massachusetts.

Packages we need

We use the datasets module built into sklearn to load the housing dataset, and pandas to process it.

Peek dataset

The dataset we loaded is formatted as a dict-like object, so we can find out which fields it has by calling its .keys() method.

As we can see, there are five fields:

  1. data: the feature values, which are what we focus on.
  2. target: the house prices, which are what we need to predict.
  3. feature_names: the names of the features, giving the meaning of each column.
  4. DESCR: the description of this dataset.
  5. filename: the path where this dataset is stored.
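The original loading snippet is not reproduced here, so the following is only a minimal sketch. Note that load_boston was removed in scikit-learn 1.2, so the import only works on older versions. The fallback branch is my own assumption: it builds a random stand-in with the same layout (506 rows, 13 features) so the rest of the code can still be tried out.

```python
import numpy as np

try:
    # Only available in scikit-learn versions before 1.2.
    from sklearn.datasets import load_boston
    boston = load_boston()
except ImportError:
    # Assumption: a random stand-in with the same layout as the real dataset.
    # The values are NOT real housing data.
    rng = np.random.default_rng(0)
    boston = {
        "data": rng.normal(size=(506, 13)),
        "target": rng.normal(size=506),
        "feature_names": np.array([
            "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
        ]),
        "DESCR": "synthetic stand-in for the Boston Housing Dataset",
        "filename": "(none)",
    }

# List the available fields, then check the array sizes.
print(boston.keys())
print(boston["data"].shape, boston["target"].shape)
```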

Next, look at the size of the dataset.

Size of the dataset

Preprocessing

First, load the data into a pandas DataFrame. A DataFrame can be thought of as a labeled table with rows and columns; here we use it as a two-dimensional matrix.

For easy viewing, we map the feature names to the columns of the DataFrame. Then, after adding a ‘Price’ column to the data, we peek at the first 5 rows with .head().
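As a rough sketch of this step, the snippet below builds a small DataFrame, names its columns, and appends a Price column. The arrays are random stand-ins for the dataset’s data and target fields, and only three of the 13 feature names are used to keep the output short; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feature_names = ["CRIM", "RM", "LSTAT"]   # three of the dataset's 13 column names
data = rng.normal(size=(6, 3))            # stand-in for the "data" field
target = rng.normal(size=6)               # stand-in for the "target" field

# Map the feature names to the columns, then add the prices as a new column.
df = pd.DataFrame(data, columns=feature_names)
df["Price"] = target

print(df.head())   # first 5 rows
print(df.shape)    # (rows, columns)
```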

Check the summary statistics of the data with .describe().

df.describe()

It can be seen that the features have very different value ranges, so we need to standardize them. Suppose each feature has a mean μ and a standard deviation σ over the whole dataset. We subtract μ from each value of the feature and then divide the result by σ to get the standardized value: x' = (x - μ) / σ.

A lambda expression is used to keep the code short.
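Here is a minimal sketch of standardizing with a lambda, using a tiny hand-made DataFrame. I assume that only the feature columns are standardized and that Price is left in its original units; the original snippet may have handled this differently. Keep in mind that pandas’ .std() computes the sample standard deviation (ddof=1).

```python
import pandas as pd

df = pd.DataFrame({
    "RM": [6.0, 7.0, 8.0],
    "TAX": [200.0, 300.0, 400.0],
    "Price": [20.0, 30.0, 40.0],
})

# Assumption: standardize every column except the target.
features = df.columns.drop("Price")
df[features] = df[features].apply(lambda x: (x - x.mean()) / x.std())

print(df)
```

After this step each feature column has mean 0 and standard deviation 1, while Price is unchanged.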

Split training data and testing data

First, convert the data to NumPy arrays.

Then, divide the data into a training set and a testing set.

We’ll get the following result.
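The splitting code itself is not shown above, so here is one possible sketch using plain NumPy. The 80/20 ratio, the random seed, and the random stand-in arrays are all assumptions on my part.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))   # stand-in for the standardized features
Y = rng.normal(size=(506, 1))    # stand-in for the prices, kept as a column

# Shuffle the row indices, then cut them at 80% (the ratio is an assumption).
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))

X_train, X_test = X[idx[:n_train]], X[idx[n_train:]]
Y_train, Y_test = Y[idx[:n_train]], Y[idx[n_train:]]

print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```

Keeping Y as a column of shape (n, 1) matches the output shape of a one-output linear layer, which avoids accidental broadcasting when the loss is computed later.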

Construct Linear Regression by PyTorch

Import PyTorch first.

Here I use version 1.3.0 on my computer.

Data processing

Convert the data to tensors, the data structure that PyTorch operates on.
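A minimal sketch of the conversion, with small random arrays standing in for the training data (an assumption). NumPy arrays are float64 by default, while nn.Linear creates float32 parameters by default, so I cast explicitly.

```python
import numpy as np
import torch

print(torch.__version__)

rng = np.random.default_rng(0)
X_train_np = rng.normal(size=(4, 13))   # stand-in features, dtype float64
Y_train_np = rng.normal(size=(4, 1))    # stand-in prices, dtype float64

# Cast to float32 so the tensors match the default dtype of the layer weights.
X_train = torch.tensor(X_train_np, dtype=torch.float32)
Y_train = torch.tensor(Y_train_np, dtype=torch.float32)

print(X_train.dtype, X_train.shape, Y_train.shape)
```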

Construct the neural network

We use nn.Sequential to define a neural network with a single layer.

nn.Linear takes two required arguments, in_features and out_features, which are the dimension of the input and the dimension of the output respectively. (An optional bias argument defaults to True.)

The parameters don’t need to be initialized by hand in our example because nn.Linear does it automatically.
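As a sketch, a one-layer network for this dataset could look like the following. The input size of 13 comes from the dataset’s 13 features, and the output size of 1 is the predicted price; the batch of zeros is just a dummy input to check the shapes.

```python
import torch
from torch import nn

# One linear layer: 13 input features -> 1 output (the price).
net = nn.Sequential(nn.Linear(13, 1))

# The weight and bias already exist and are initialized automatically.
print(net[0].weight.shape, net[0].bias.shape)

# A dummy batch of 5 samples produces 5 predictions, one per row.
out = net(torch.zeros(5, 13))
print(out.shape)
```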

The usage of DataLoader

PyTorch provides DataLoader, which returns an iterable that walks through the training data batch by batch. It’s easy to use. Let’s start by constructing a Dataset from our tensors.

datasets = torch.utils.data.TensorDataset(X_train, Y_train)

Then, generate a DataLoader from this Dataset.

train_iter = torch.utils.data.DataLoader(datasets, batch_size=10, shuffle=True)

batch_size is the number of samples in each returned batch. If shuffle is True, the data is reshuffled at every epoch, so the samples come back in random order.
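To see what the loader actually yields, here is a small self-contained sketch with 25 random samples (the sample count is an assumption chosen so that the last batch is smaller than the others).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

X_train = torch.randn(25, 13)   # stand-in features
Y_train = torch.randn(25, 1)    # stand-in prices

dataset = TensorDataset(X_train, Y_train)
train_iter = DataLoader(dataset, batch_size=10, shuffle=True)

# Each iteration yields one (features, targets) pair of tensors.
sizes = [len(x) for x, y in train_iter]
print(sizes)   # [10, 10, 5]: the last batch holds the leftover samples
```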

Loss function and optimizer

We must define a loss function before training the neural network. Here we use the mean squared error (MSE).

Mean Square Error
loss = torch.nn.MSELoss()
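With its default settings, MSELoss averages the squared differences between predictions and targets, that is, MSE = (1/n) Σ (ŷ_i - y_i)². The toy values below are made up purely to check this by hand.

```python
import torch

loss = torch.nn.MSELoss()

pred = torch.tensor([[2.0], [4.0]])   # made-up predictions
true = torch.tensor([[1.0], [2.0]])   # made-up targets

# Manual computation: ((2-1)^2 + (4-2)^2) / 2 = (1 + 4) / 2 = 2.5
manual = ((pred - true) ** 2).mean()

print(loss(pred, true).item(), manual.item())   # 2.5 2.5
```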

After that, we optimize the neural network with stochastic gradient descent.

optimizer = torch.optim.SGD(net.parameters(), lr=0.05)

Here 0.05 is the learning rate.
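To make the role of the learning rate concrete, here is a toy sketch of a single SGD step on one parameter. The parameter w and the loss w² are made up for illustration and are not part of the regression model.

```python
import torch

# A single trainable parameter, starting at 1.0.
w = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.SGD([w], lr=0.05)

toy_loss = (w ** 2).sum()   # d(toy_loss)/dw = 2 * w = 2.0
toy_loss.backward()         # fills w.grad with the gradient
optimizer.step()            # w <- w - lr * grad = 1.0 - 0.05 * 2.0

print(w.grad.item(), w.item())   # the gradient is 2.0, and w is now about 0.9
```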

Training and evaluation

Now, let’s start training.

Train the training set for 5 epochs. The training process is roughly as follows.

  1. Load a batch of data.
  2. Predict on the batch with net.
  3. Calculate the loss from the predicted values and the true values.
  4. Clear the gradients left over from the previous step.
  5. Backpropagate the loss.
  6. Update the parameters with the optimizer.
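The steps above can be sketched as the loop below. It is only an illustration: the training data is synthetic (random features with a known linear relationship plus a little noise), and the seed and variable names are my own assumptions. The batch size, learning rate, and number of epochs follow the values used in this article.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the training set (an assumption, not the real data).
X_train = torch.randn(400, 13)
true_w = torch.randn(13, 1)
Y_train = X_train @ true_w + 0.1 * torch.randn(400, 1)

net = nn.Sequential(nn.Linear(13, 1))
loss = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)
train_iter = DataLoader(TensorDataset(X_train, Y_train), batch_size=10, shuffle=True)

num_epochs = 5
for epoch in range(num_epochs):
    for x, y in train_iter:              # 1. load a batch
        output = net(x)                  # 2. predict
        batch_loss = loss(output, y)     # 3. compute the loss
        optimizer.zero_grad()            # 4. clear old gradients
        batch_loss.backward()            # 5. backpropagate
        optimizer.step()                 # 6. update the parameters
    # Report the loss on the whole training set without tracking gradients.
    with torch.no_grad():
        epoch_loss = loss(net(X_train), Y_train).item()
    print(f"epoch {epoch + 1}, loss: {epoch_loss:.4f}")
```

Calling zero_grad() before backward() matters because PyTorch accumulates gradients by default; without it, each step would add to the gradients of the previous batch.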

The following content will be displayed after training.

Training process

Now, let’s check its performance on the testing dataset.

print(loss(net(X_test), Y_test).item())
Loss value

It is not much different from the loss on the training set.

We can also look at the prediction for a single sample.

Watch one sample
Output
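As a rough sketch of evaluating and inspecting one sample, the snippet below uses a tiny network whose weights I set by hand, so the numbers can be verified on paper. The two-feature model and the test values are made up; with the real model you would use the trained net and the real test tensors instead.

```python
import torch
from torch import nn

# Tiny stand-in model with hand-set parameters: y = 1*x1 + 2*x2 + 0.5
net = nn.Sequential(nn.Linear(2, 1))
with torch.no_grad():
    net[0].weight.copy_(torch.tensor([[1.0, 2.0]]))
    net[0].bias.fill_(0.5)

X_test = torch.tensor([[1.0, 1.0], [2.0, 0.0]])   # made-up test features
Y_test = torch.tensor([[3.0], [3.0]])             # made-up test prices

loss = nn.MSELoss()
with torch.no_grad():                 # no gradients are needed for evaluation
    preds = net(X_test)               # predictions are 3.5 and 2.5
    test_loss = loss(preds, Y_test).item()

print(test_loss)                          # (0.5^2 + 0.5^2) / 2 = 0.25
print(preds[0].item(), Y_test[0].item())  # prediction vs. true value for one sample
```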

Treee July
Analytics Vidhya

Graduate student at Chongqing University, majoring in software engineering.