Boston Housing: Prediction of House Price

Harsh

3 min read · May 28, 2018

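The Boston Housing Dataset contains the prices of houses in various places in Boston. Alongside price (MEDV, the median value of owner-occupied homes in $1000's), the dataset also provides information such as the average number of rooms per dwelling (RM) and the percentage of lower-status population (LSTAT).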

Loading the Boston Housing dataset (head and tail lookup)

EDA: Exploratory Data Analysis

Summary

Correlation
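
To make the correlation step concrete, here is a small dependency-free sketch of the Pearson coefficient, which is what pandas' `df.corr()` computes by default. The helper name `pearson_r` and the three short lists are assumptions for illustration only; the numbers are made-up toy values, not rows from the Boston data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two columns around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up toy values (NOT taken from the Boston data)
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]       # stands in for RM
lstat = [30.0, 22.0, 15.0, 9.0, 4.0]    # stands in for LSTAT
price = [12.0, 17.0, 21.0, 30.0, 41.0]  # stands in for MEDV

r_rooms = pearson_r(rooms, price)  # positive: more rooms, higher price
r_lstat = pearson_r(lstat, price)  # negative: higher LSTAT, lower price
print(round(r_rooms, 3), round(r_lstat, 3))
```

A coefficient near +1 or -1 suggests a strong linear relationship, which is why RM and LSTAT are natural candidates for the single-feature regressions that follow.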

Heatmap

Linear Regression

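# Assumed setup (not shown in the original): RM as the feature, MEDV as the target
X = df['RM'].values.reshape(-1, 1)
y = df['MEDV'].values
plt.figure(figsize=(12, 10))
# robust=True fits a robust regression (requires statsmodels)
sns.regplot(x=X.ravel(), y=y, robust=True)
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show()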

# `size` was renamed to `height` in seaborn 0.9
sns.jointplot(x='RM', y='MEDV', data=df, kind='reg', height=10)
plt.show()

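from sklearn.linear_model import LinearRegression

# Assumption: `model` is an ordinary LinearRegression (its creation is not shown in the original)
model = LinearRegression()
X = df['LSTAT'].values.reshape(-1, 1)
y = df['MEDV'].values
model.fit(X, y)
plt.figure(figsize=(12, 10))
sns.regplot(x=X.ravel(), y=y)
plt.xlabel('% Lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show()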

# `size` was renamed to `height` in seaborn 0.9
sns.jointplot(x='LSTAT', y='MEDV', data=df, kind='reg', height=10)
plt.show()
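
For a single feature, the straight line that an ordinary least-squares regression draws has a simple closed form. The sketch below spells it out in plain Python; `fit_line` is a hypothetical helper and the five data points are made-up toy values, not Boston rows.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy values (NOT taken from the Boston data)
lstat = [5.0, 10.0, 15.0, 20.0, 25.0]
medv = [33.0, 26.0, 21.0, 14.0, 11.0]

slope, intercept = fit_line(lstat, medv)
print(slope, intercept)
```

A negative slope here mirrors the downward trend in the LSTAT plot: as the share of lower-status population rises, the median home value falls.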

Robust Regression

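import numpy as np
from sklearn.linear_model import RANSACRegressor

# Switch back to RM as the feature; the plot below is labelled with rooms per dwelling
X = df['RM'].values.reshape(-1, 1)
y = df['MEDV'].values
ransac = RANSACRegressor()
ransac.fit(X, y)
# Notebook output of the fit (parameter names as printed by the scikit-learn version used at the time):
# RANSACRegressor(base_estimator=None, is_data_valid=None, is_model_valid=None,
#         loss='absolute_loss', max_skips=inf, max_trials=100,
#         min_samples=None, random_state=None, residual_metric=None,
#         residual_threshold=None, stop_n_inliers=inf, stop_probability=0.99,
#         stop_score=inf)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
# x values over which to draw the fitted line (RM ranges roughly from 3 to 9)
line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))
# plot
sns.set(style='darkgrid', context='notebook')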
plt.figure(figsize=(12,10));
plt.scatter(X[inlier_mask], y[inlier_mask], 
            c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper left')
plt.show()

X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
line_X = np.arange(0, 40, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10));
plt.scatter(X[inlier_mask], y[inlier_mask], 
            c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.show()
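
To see roughly what RANSAC does under the hood, here is a minimal plain-Python sketch for a single feature: sample two points at random, draw the line through them, count the points within a threshold of that line, keep the best consensus set, and refit on it. This is a simplified illustration, not scikit-learn's implementation; `ransac_line`, `fit_line`, the fixed `threshold` and the toy data are all assumptions (scikit-learn, for instance, defaults the threshold to the median absolute deviation of `y`).

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(xs, ys, threshold, n_trials=200, seed=0):
    """Simplified RANSAC for a line; returns (slope, intercept, inlier_mask)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best_mask, best_count = None, 0
    for _ in range(n_trials):
        i, j = rng.sample(range(len(xs)), 2)
        if xs[i] == xs[j]:
            continue  # vertical line, cannot be written as y = a*x + b
        # Candidate model: the line through the two sampled points
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        mask = [abs(y - (slope * x + intercept)) <= threshold
                for x, y in zip(xs, ys)]
        if sum(mask) > best_count:
            best_mask, best_count = mask, sum(mask)
    if best_mask is None:
        raise ValueError("no valid sample found")
    # Final model: least-squares refit on the largest consensus set
    in_x = [x for x, keep in zip(xs, best_mask) if keep]
    in_y = [y for y, keep in zip(ys, best_mask) if keep]
    slope, intercept = fit_line(in_x, in_y)
    return slope, intercept, best_mask


# Made-up toy data: eight points on y = 2x + 1 plus two gross outliers
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs[:8]] + [40.0, -20.0]

slope, intercept, inlier_mask = ransac_line(xs, ys, threshold=1.0)
outlier_mask = [not keep for keep in inlier_mask]
print(slope, intercept, outlier_mask)
```

Because the final line is fitted only on the consensus set, the two outliers have no pull on the slope, which is the behaviour the inlier/outlier scatter plots above visualise.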

Performance Evaluation of Regression Model

Method 1: Residual Analysis

plt.figure(figsize=(12, 8))
plt.scatter(y_train_pred, y_train_pred - y_train, c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test, c='red', marker='*', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='k')
plt.xlim([-10, 50])
plt.show()
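
The residual plot needs predictions for both a training and a test set. As a rough sketch of where those quantities could come from, the plain-Python example below fits a line on a small training set, predicts on both sets, and computes residuals with the same sign convention as the plot (predicted minus actual). The helpers `fit_line`, `predict` and `residuals` and all numbers are assumptions for illustration; they are not the article's train/test split.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(xs, slope, intercept):
    """Apply the fitted line to each x value."""
    return [slope * x + intercept for x in xs]


def residuals(y_pred, y_true):
    """Residuals as plotted above: predicted value minus actual value."""
    return [p - t for p, t in zip(y_pred, y_true)]


# Made-up toy split (NOT taken from the Boston data)
x_train, y_train = [1.0, 2.0, 3.0, 4.0, 5.0], [2.2, 3.9, 6.1, 8.0, 9.8]
x_test, y_test = [6.0, 7.0], [12.1, 13.9]

# Fit on the training data only, then predict on both sets
slope, intercept = fit_line(x_train, y_train)
y_train_pred = predict(x_train, slope, intercept)
y_test_pred = predict(x_test, slope, intercept)

train_res = residuals(y_train_pred, y_train)
test_res = residuals(y_test_pred, y_test)
print(train_res, test_res)
```

For a least-squares line with an intercept, the training residuals sum to zero, so a healthy residual plot scatters evenly around the horizontal line at 0; visible curvature or a funnel shape would hint that a straight line is missing some structure in the data.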