What impacts Boston Housing Prices

Published in Dev Diaries, 11 min read, Jun 21, 2020

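This time I work with an existing dataset, Boston housing, to get hands-on with supervised learning, specifically regression, since the target is a continuous price. The data is split into a training set and a testing set. From the training data, the feature columns serve as x and the house-price column serves as y. We fit the relationship between x and y, then validate it on the testing data to check whether the chosen house features can predict prices accurately.

This dataset ships with scikit-learn and is documented on its official website. In 2013, Boston rolled out policies aimed at stabilizing housing prices; if you are interested, the radio segment credited below covers the topic.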

Credits: http://www.wbur.org/radioboston/2013/09/18/bostons-housing-challenge

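The analysis follows these steps:

Define the problem and inspect the data, clean the data, explore and visualize the data, and train the model.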

So let’s get started.

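Loading the data

First, import the packages we need:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Jupyter magic: render plots inline in the notebook
%matplotlib inline

Next, load scikit-learn's Boston housing dataset:

# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release.
from sklearn.datasets import load_boston

boston_dataset = load_boston()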

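Once loaded, the dataset is a dictionary-like object, similar in shape to JSON. The following code lists its keys:

print(boston_dataset.keys())
# output: dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

What each key holds:

data: the information for each house
target: the price of each house
feature_names: the names of the house features
DESCR: a description of the dataset

For a detailed description of every column, run:

print(boston_dataset.DESCR)

The detailed description of each column is as follows: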

:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

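Since we want to predict house prices, the target variable is MEDV, and the remaining columns are the feature variables.

Next, convert the data into a pd.DataFrame:

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()

This dataframe contains only the feature columns, so add the target column as well:

boston['MEDV'] = boston_dataset.target

The dataframe is now complete.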

Data preprocessing

Before going any further, check for missing values:

boston.isnull().sum()

The result shows no missing values.

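Data exploration

Next, a few simple visualizations help reveal how the variables relate to one another.

The distribution of the house-price variable MEDV is roughly bell-shaped, close to a normal distribution:

# Set plot styling once through seaborn; here only the figure size is set
sns.set(rc={'figure.figsize': (10, 10)})

# Plot the distribution of the house price, MEDV
# (distplot is deprecated since seaborn 0.11; sns.histplot(boston['MEDV'], kde=True) is a close replacement)
sns.distplot(boston['MEDV'])
plt.show()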

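Next, look at the relationships between the variables. The correlation coefficients show which feature variables correlate strongly with the target variable:

correlation_matrix = boston.corr().round(2)
# annot=True writes the value into each cell
sns.heatmap(data=correlation_matrix, annot=True)

Here we can see:

The variables most strongly correlated with MEDV (house price) are LSTAT (percentage of lower-status population) and RM (average number of rooms per dwelling).
In addition, DIS (distance to Boston employment centres) and AGE (share of units built before 1940), as well as INDUS (proportion of non-retail business land) and ZN (proportion of residential land zoned for large lots), are two pairs of variables that are strongly correlated with each other, which raises a multicollinearity problem. When building other models later, avoid using both variables of a pair together.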
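
For intuition, each number in the heatmap is a Pearson correlation coefficient, which is the default for `DataFrame.corr()`. The sketch below computes it by hand in plain Python. The function name `pearson_r` and the toy values are made up for illustration and are not taken from the Boston dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, NOT taken from the Boston dataset.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.0, 21.0, 30.0, 41.0]
lower_status = [30.0, 22.0, 15.0, 9.0, 4.0]

r_rooms = pearson_r(rooms, price)         # positive: price rises with rooms
r_lstat = pearson_r(lower_status, price)  # negative: price falls as this rises
print(round(r_rooms, 2), round(r_lstat, 2))
```

A value near +1 or -1 means a strong linear relationship, which is why a strongly negative coefficient such as the one for LSTAT is just as useful for prediction as a strongly positive one.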

So for now, LSTAT and RM can be used to build a model that predicts MEDV. Plotting each of them against the house price again shows that both have a roughly linear relationship with it:

# Set the size of the whole figure
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = boston['MEDV']

for i, col in enumerate(features):
    # Layout: 1 row, 2 columns, nth plot, so the two charts sit side by side in Jupyter
    plt.subplot(1, len(features), i + 1)
    # Add the data column to the plot
    x = boston[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')

Preparing the training data

Use np.c_ to combine the LSTAT and RM columns and assign the result to X, then assign the MEDV column to Y:

X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns=['LSTAT', 'RM'])
Y = boston['MEDV']

Split the data into training data (80%) and testing data (20%):

# train_test_split
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

# Use .shape to inspect the resulting splits (rows, columns)
print(X_train.shape)  # (404, 2)
print(X_test.shape)   # (102, 2)
print(Y_train.shape)  # (404,)
print(Y_test.shape)   # (102,)
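
To make the 80/20 split less of a black box, here is a minimal sketch of the idea in plain Python: shuffle the row indices with a fixed seed, then cut off the test portion. The helper `split_indices` is hypothetical, it will not reproduce the exact rows that scikit-learn picks, and rounding the test size up is an assumption chosen to match the shapes printed above.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=5):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # reproducible shuffle
    n_test = math.ceil(n_rows * test_size)  # test set size, rounded up
    return indices[n_test:], indices[:n_test]

# 506 rows in total, matching 404 + 102 from the shapes above.
train_idx, test_idx = split_indices(506)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=5`: the split is random, but rerunning the notebook gives the same split every time.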

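Building the model

Create a LinearRegression object and fit it on the training features and the training target. Then feed the testing features into the fitted model to get predicted target values. Finally, compare these predictions with the testing target (the actual values) by computing the R2 score:

# Modeling
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# Fit the linear model on the training data
reg.fit(X_train, Y_train)

# Predict with the linear model
reg.predict(X_test)

# The actual values are Y_test; score the predictions against them
print('R2: ', reg.score(X_test, Y_test))

The resulting R2 score is:

R2:  0.6628996975186954

The R2 score tells us how much of the variation in the target variable the feature variables explain; the closer it is to 1, the better the fit. Here it is about 66%, which is a fairly good level of explanation for a model with only two features.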
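
For reference, `reg.score` on a regression model returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand; the function name `r_squared` and the four sample values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, NOT taken from the Boston dataset.
y_true = [20.0, 25.0, 30.0, 35.0]
y_pred = [22.0, 24.0, 29.0, 37.0]

r2 = r_squared(y_true, y_pred)
print(round(r2, 2))
```

A model that always predicts the mean of the actual values scores 0, and a perfect model scores 1, so 0.66 sits about two thirds of the way between those two baselines.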

If we plot the predicted target values against the testing target values as a scatter plot, the points fall close to a straight line with slope 1:

# plotting the y_test vs y_pred
Y_pred = reg.predict(X_test)
plt.scatter(Y_pred, Y_test)
plt.xlabel('Y_pred')
plt.ylabel('Y_test')
plt.show()

We can also read off the model's intercept and coefficients:

reg.intercept_
# output: 2.7362403426066138

coeff_df = pd.DataFrame(reg.coef_, X_train.columns, columns=['Coefficient'])
coeff_df
# output:
#        Coefficient
# LSTAT    -0.717230
# RM        4.589388
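
For intuition about where an intercept and two coefficients come from, here is a minimal sketch of an ordinary least squares fit with two features, written in plain Python. It is not scikit-learn's implementation; the helper `fit_two_feature_ols` and the toy data are hypothetical, with the toy target built from known coefficients so the fit can be checked.

```python
def fit_two_feature_ols(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 ** 2  # zero if the features are perfectly collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Toy data built from known coefficients: y = 2 - 0.5*x1 + 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [2 - 0.5 * a + 3 * b for a, b in zip(x1, x2)]

b0, b1, b2 = fit_two_feature_ols(x1, x2, y)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

The `det` term also hints at why multicollinearity matters: when two features are almost perfectly correlated, it gets close to zero and the coefficients become unstable.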

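The fitted relationship is:

MEDV = 2.74 + (-0.717230) * LSTAT + 4.589388 * RM + error

Conclusion

We used LSTAT (percentage of lower-status population) and RM (average number of rooms per dwelling) to predict MEDV (house price, in $1000s) with multiple linear regression.
Holding the other variable constant, when LSTAT increases by 1 unit, MEDV drops by about 0.72 units. Likewise, when RM increases by 1 unit, MEDV rises by about 4.59 units. Per unit, then, the number of rooms appears to move the price more than the lower-status percentage does, although the two variables are measured on different scales.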
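
As a quick check of this interpretation, the sketch below plugs numbers into the fitted equation, using the intercept and coefficients printed earlier. The function `predict_medv` and the input values (LSTAT = 10, RM = 6) are hypothetical and chosen only for illustration.

```python
# Intercept and coefficients as printed by the fitted model above.
INTERCEPT = 2.7362403426066138
COEF_LSTAT = -0.717230
COEF_RM = 4.589388

def predict_medv(lstat, rm):
    """Predicted MEDV (in $1000s) from the fitted two-feature equation."""
    return INTERCEPT + COEF_LSTAT * lstat + COEF_RM * rm

# Hypothetical inputs, chosen only for illustration.
base = predict_medv(lstat=10.0, rm=6.0)

# Change one feature by 1 unit while holding the other fixed.
more_lstat = predict_medv(11.0, 6.0) - base
more_rooms = predict_medv(10.0, 7.0) - base
print(round(base, 2), round(more_lstat, 2), round(more_rooms, 2))
```

The base prediction comes out at roughly 23.1, or about $23,100, and each one-unit change shifts it by exactly the matching coefficient, which is what "holding the other variable constant" means in a linear model.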

And that's it!

That's all for this post. I hope it helps!

Wishing everyone a wonderful day :)

Inspired by https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155