Boston Housing: Prediction of House Price

Harsh
3 min read · May 28, 2018

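The Boston Housing dataset contains house prices for various areas of Boston. Along with the price (MEDV, the median value of owner-occupied homes in $1000's), the dataset also provides information such as the average number of rooms per dwelling (RM) and the percentage of lower-status population (LSTAT).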

Loading the Boston Housing dataset: head and tail views

Tail view of Dataset
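
The loading code itself did not survive in this post, so the following is only a minimal sketch of the head and tail lookup. It assumes the data sits in a CSV with the column names used later in the post (RM, LSTAT, MEDV); the in-memory CSV and all of its values are made up for illustration.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file; only the column names
# (RM, LSTAT, MEDV) are taken from the post.
csv_text = """RM,LSTAT,MEDV
5.0,20.0,15.0
5.5,18.0,17.0
6.0,15.0,20.0
6.5,12.0,25.0
7.0,10.0,30.0
8.0,5.0,45.0
"""

# With a real file this would be pd.read_csv("some_path.csv") (hypothetical path).
df = pd.read_csv(io.StringIO(csv_text))

head_rows = df.head(3)  # first three rows
tail_rows = df.tail(3)  # last three rows
print(head_rows)
print(tail_rows)
```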

EDA: Exploratory Data Analysis

Summary

Summary of Boston Dataset
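
A summary table like this is typically produced with pandas' `describe()`. The sketch below assumes that is what was used here; the small frame holds made-up values, not the real Boston data.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "MEDV": [15.0, 20.0, 30.0, 45.0],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)
```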

Correlation

Heatmap
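
The correlation matrix and its heatmap can be sketched as follows. This assumes the Pearson correlation from `DataFrame.corr()` and seaborn's `heatmap`; the data frame is made up, and `plot_heatmap` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Made-up values for illustration only; LSTAT falls as RM rises.
df = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "LSTAT": [20.0, 15.0, 10.0, 5.0],
    "MEDV": [15.0, 20.0, 30.0, 45.0],
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()
print(corr)


def plot_heatmap(corr_matrix):
    """Draw an annotated heatmap of a correlation matrix (hypothetical helper)."""
    # Imported here so the correlation step works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()
```

Calling `plot_heatmap(corr)` draws the figure. Values near +1 or -1 mark the features most strongly related to MEDV.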

Linear Regression

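import matplotlib.pyplot as plt
import seaborn as sns
# Regress median home value (MEDV) on average rooms per dwelling (RM).
X = df['RM'].values.reshape(-1, 1)
y = df['MEDV'].values
plt.figure(figsize=(12, 10))
sns.regplot(x=X.ravel(), y=y, robust=True)  # robust fit needs statsmodels
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show()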

# seaborn 0.9 renamed the `size` argument to `height`
sns.jointplot(x='RM', y='MEDV', data=df, kind='reg', height=10)
plt.show()

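from sklearn.linear_model import LinearRegression

# Repeat the fit with LSTAT (% lower status of the population) as the predictor.
X = df['LSTAT'].values.reshape(-1, 1)
y = df['MEDV'].values
model = LinearRegression()
model.fit(X, y)
plt.figure(figsize=(12, 10))
sns.regplot(x=X.ravel(), y=y)
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show()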

# seaborn 0.9 renamed the `size` argument to `height`
sns.jointplot(x='LSTAT', y='MEDV', data=df, kind='reg', height=10)
plt.show()
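
The plots show the fitted line but not its parameters. As a minimal sketch of how to read them from a fitted `LinearRegression`, the example below uses synthetic, noise-free data generated from y = 2x + 1 (an assumption made purely for illustration) so the recovered slope and intercept are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data from y = 2x + 1 (not the Boston data).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # change in y per unit change in x
intercept = model.intercept_  # predicted y at x = 0
prediction = model.predict(np.array([[6.0]]))[0]

print(slope, intercept, prediction)
```

On the real data, `model.coef_[0]` after fitting on RM would be the estimated change in MEDV (in $1000's) per additional room.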

Robust Regression

import numpy as np
from sklearn.linear_model import RANSACRegressor

# The first RANSAC fit uses RM as the predictor, matching the plot's x-axis.
X = df['RM'].values.reshape(-1, 1)
y = df['MEDV'].values
ransac = RANSACRegressor()
ransac.fit(X, y)
# Notebook output (parameter names vary across scikit-learn versions):
# RANSACRegressor(base_estimator=None, is_data_valid=None, is_model_valid=None,
#                 loss='absolute_loss', max_skips=inf, max_trials=100,
#                 min_samples=None, random_state=None, residual_metric=None,
#                 residual_threshold=None, stop_n_inliers=inf, stop_probability=0.99,
#                 stop_score=inf)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
# Evaluate the fitted line over room counts 3 to 9.
line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))
# plot
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10));
plt.scatter(X[inlier_mask], y[inlier_mask],
c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper left')
plt.show()
X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
line_X = np.arange(0, 40, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10));
plt.scatter(X[inlier_mask], y[inlier_mask],
c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.show()
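
To see what the inlier mask means, here is a small sketch on synthetic data: points on the line y = 3x + 2 with three of them shifted far off the line. The data, the `residual_threshold=1.0` setting and the fixed `random_state` are all assumptions chosen so the result is easy to verify. They are not the settings used on the Boston data above.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic data: 20 points on y = 3x + 2, three pushed far off the line.
X = np.arange(0, 20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0
y[[4, 11, 17]] += 100.0  # gross outliers

# Points whose absolute residual exceeds residual_threshold count as outliers.
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0)
ransac.fit(X, y)

inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

# The final model is refit on the inliers only.
slope = ransac.estimator_.coef_[0]
intercept = ransac.estimator_.intercept_

print(np.where(outlier_mask)[0], slope, intercept)
```

Because the final estimator is fit on the inliers alone, the three shifted points do not pull the line toward them, which is the behaviour the scatter plots above illustrate.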

Performance Evaluation of Regression Model

Method 1: Residual Analysis

# y_train_pred and y_test_pred are the model's predictions on the training and
# test splits; y_train and y_test are the corresponding true MEDV values.
plt.figure(figsize=(12, 8))
plt.scatter(y_train_pred, y_train_pred - y_train, c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test, c='red', marker='*', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='k')
plt.xlim([-10, 50])
plt.show()
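
The residual plot relies on `y_train_pred`, `y_test_pred`, `y_train` and `y_test`, whose construction is not shown in the post. One plausible way to obtain them is sketched below. The synthetic data, the 80/20 split and the `random_state` are assumptions for illustration, not the author's actual setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic single-feature data with noise (not the Boston data).
rng = np.random.RandomState(0)
X = rng.uniform(3, 9, size=(100, 1))
y = 9.0 * X.ravel() - 34.0 + rng.normal(0, 2.0, size=100)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Residuals as plotted above: prediction minus true value.
train_residuals = y_train_pred - y_train
test_residuals = y_test_pred - y_test
print(train_residuals.mean(), test_residuals.mean())
```

For a well-behaved linear fit, the residuals should scatter around zero with no visible pattern. A curve or funnel shape in the plot suggests the linear model is missing structure in the data.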

Method 2: Mean Squared Error (MSE)
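
MSE is the average of the squared differences between true and predicted values: MSE = (1/n) Σ (y_i − ŷ_i)². A minimal sketch with made-up numbers, cross-checked against scikit-learn's `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Average of squared errors, computed by hand
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same quantity from scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Comparing the MSE on the training split with the MSE on the test split is a common check: a much larger test MSE points to overfitting.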

Method 3: Coefficient of Determination

SSE: Sum of squared errors

SST: Total sum of squares
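
With these two quantities, the coefficient of determination is R² = 1 − SSE / SST, where SSE = Σ (y_i − ŷ_i)² and SST = Σ (y_i − ȳ)². A minimal sketch with made-up numbers, cross-checked against scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

sse = np.sum((y_true - y_pred) ** 2)         # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - sse / sst

# Same quantity from scikit-learn
r2_sklearn = r2_score(y_true, y_pred)

print(sse, sst, r2_manual, r2_sklearn)
```

An R² of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of y.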

Thank you

Harsh

Data Scientist|Machine Learning|Python|R|MYSQL•|AI-NLP