EDA: Exploratory Data Analysis
Summary
Correlation
Heatmap
Linear Regression
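plt.figure(figsize=(12,10))
# robust=True down-weights outliers and requires statsmodels to be installed
sns.regplot(x=X, y=y, robust=True)
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show()
# recent seaborn versions use height= in place of the older size= argument
sns.jointplot(x='RM', y='MEDV', data=df, kind='reg', height=10)
plt.show()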
X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
model.fit(X, y)
plt.figure(figsize=(12,10))
sns.regplot(x=X, y=y)
plt.xlabel('% Lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show()
# recent seaborn versions use height= in place of the older size= argument
sns.jointplot(x='LSTAT', y='MEDV', data=df, kind='reg', height=10)
plt.show()
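The snippets in this section call model.fit(X, y) without showing where model comes from. A minimal sketch of one plausible setup follows, assuming model is a scikit-learn LinearRegression. The data here is synthetic and the numbers (slope 9, intercept -34) are made up for illustration; they are not taken from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for a single feature and a target (hypothetical values).
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=200)
price = 9.0 * rooms - 34.0 + rng.normal(0, 1.0, size=200)

# scikit-learn expects a 2D feature matrix, hence reshape(-1, 1).
X = rooms.reshape(-1, 1)
y = price

model = LinearRegression()
model.fit(X, y)

# The fitted line is y = slope * x + intercept.
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The reshape(-1, 1) step is the same one applied to the 'LSTAT' column above: a single column has to be passed as a matrix with one feature, not as a flat vector.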
Robust Regression
from sklearn.linear_model import RANSACRegressor
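# the first RANSAC plot is for rooms per dwelling, so fit on the 'RM' column
X = df['RM'].values.reshape(-1,1)
y = df['MEDV'].values
ransac = RANSACRegressor()
ransac.fit(X, y)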
RANSACRegressor(base_estimator=None, is_data_valid=None, is_model_valid=None,
loss='absolute_loss', max_skips=inf, max_trials=100,
min_samples=None, random_state=None, residual_metric=None,
residual_threshold=None, stop_n_inliers=inf, stop_probability=0.99,
stop_score=inf)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
np.arange(3, 10, 1)
line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))
# plot
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10))
plt.scatter(X[inlier_mask], y[inlier_mask],
            c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper left')
plt.show()
X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
line_X = np.arange(0, 40, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10))
plt.scatter(X[inlier_mask], y[inlier_mask],
            c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
            c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.show()
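To see what RANSAC does without needing the housing data, here is a small sketch on synthetic data. Everything in it is an assumption made for illustration: the true line (slope 2, intercept 1), the planted outliers, and the explicit residual_threshold=2.0 (the fits above use the default threshold instead).

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic points on a known line, plus a little noise (hypothetical data).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, size=100)

# Plant an outlier at every 10th point by shifting it far above the line.
planted = np.zeros(100, dtype=bool)
planted[::10] = True
y[planted] += 25.0

X = x.reshape(-1, 1)

# Points farther than residual_threshold from a candidate line count as outliers.
ransac = RANSACRegressor(residual_threshold=2.0, random_state=0)
ransac.fit(X, y)

inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

# The final model is refit on the inliers only.
slope = ransac.estimator_.coef_[0]
intercept = ransac.estimator_.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}, outliers={outlier_mask.sum()}")
```

Because the final estimator is fit only on the points RANSAC kept as inliers, the shifted points should have little influence on the recovered slope, which is the behavior the inlier/outlier scatter plots above visualize.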
Performance Evaluation of Regression Model
Method 1: Residual Analysis
plt.figure(figsize=(12,8))
plt.scatter(y_train_pred, y_train_pred - y_train, c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test, c='red', marker='*', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='k')
plt.xlim([-10, 50])
plt.show()
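The residual plot relies on y_train, y_test, y_train_pred and y_test_pred, which are not created in the snippet itself. One way they might be produced is sketched below, assuming a plain train/test split and a LinearRegression model. The data is synthetic and the 80/20 split is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the housing data set).
rng = np.random.default_rng(42)
X = rng.uniform(0, 40, size=(300, 1))
y = 35.0 - 0.9 * X[:, 0] + rng.normal(0, 3.0, size=300)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Same sign convention as the plot: predicted minus actual.
train_residuals = y_train_pred - y_train
test_residuals = y_test_pred - y_test
print(train_residuals.mean(), test_residuals.mean())
```

For a least-squares fit with an intercept, the training residuals average to zero, so what matters in the plot is the pattern: residuals scattered evenly around the zero line suggest the linear model is adequate, while a curve or funnel shape suggests it is missing something.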
Method 2: Mean Squared Error (MSE)
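MSE is the average of the squared differences between actual and predicted values: MSE = (1/n) * sum((y_i - y_pred_i)^2). A small hand-rolled sketch follows so the arithmetic is visible. The function name mse and the four data points are made up for illustration; in practice sklearn.metrics.mean_squared_error computes the same quantity.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: the errors are 0.5, 0.0, -1.0, -0.5.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

result = mse(y_true, y_pred)
print(result)  # (0.25 + 0.0 + 1.0 + 0.25) / 4 = 0.375
```

Lower is better, and comparing the MSE on the training set with the MSE on the test set is a quick check for overfitting: a much larger test MSE means the model does not generalize well.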
Method 3: Coefficient of Determination
SSE: Sum of squared errors
SST: Total sum of squares
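With those two quantities, the coefficient of determination is R^2 = 1 - SSE / SST, where SSE sums the squared differences between actual and predicted values and SST sums the squared differences between actual values and their mean. Here is a short sketch of the calculation; the function name r_squared and the data points are invented for the example, and sklearn.metrics.r2_score gives the same number.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    # SSE: squared distance between actual and predicted values.
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # SST: squared distance between actual values and their mean.
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / sst


# Made-up values: SSE = 1.5, mean = 6.0, SST = 20.0.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

r2 = r_squared(y_true, y_pred)
print(round(r2, 3))  # 1 - 1.5 / 20 = 0.925
```

An R^2 of 1 means the predictions match the data exactly, and 0 means the model does no better than always predicting the mean. On held-out data the value can even go negative.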
Thank you