Linear Regression with Python
Data Details
- ‘Avg. Area Income’: Avg. income of residents of the city the house is located in
- ‘Avg. Area House Age’: Avg. age of houses in the same city
- ‘Avg. Area Number of Rooms’: Avg. number of rooms for houses in the same city
- ‘Avg. Area Number of Bedrooms’: Avg. number of bedrooms for houses in the same city
- ‘Area Population’: Population of the city the house is located in
- ‘Price’: Price that the house sold at
- ‘Address’: Address of the house
Import Libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
Check out the Data
```python
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
```
```python
USAhousing.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income                5000 non-null float64
Avg. Area House Age             5000 non-null float64
Avg. Area Number of Rooms       5000 non-null float64
Avg. Area Number of Bedrooms    5000 non-null float64
Area Population                 5000 non-null float64
Price                           5000 non-null float64
Address                         5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.5+ KB
```

```python
USAhousing.describe()
```
```python
USAhousing.columns
```

```
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
```
Exploratory Data Analysis
```python
sns.pairplot(USAhousing)

# note: distplot is deprecated in newer seaborn; histplot is its replacement
sns.distplot(USAhousing['Price'])

sns.heatmap(USAhousing.corr())
```
Training a Linear Regression Model
X and y arrays
```python
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
```
Train Test Split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
```
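With `test_size=0.4`, 40% of the rows are held out for testing and the remaining 60% are used for training. A quick sketch on synthetic stand-in data (the shapes here are hypothetical, not the housing dataset's) confirms the split proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 rows, 2 features
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# test_size=0.4 reserves 40% of the rows (20 of 50) for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=101)
print(X_tr.shape, X_te.shape)  # (30, 2) (20, 2)
```

Fixing `random_state` makes the split reproducible across runs.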
Creating and Training the Model
```python
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```
Model Evaluation
```python
# print the intercept
print(lm.intercept_)
```

```
-2640159.7968526958
```

```python
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coeff_df
```
Interpreting the coefficients:
- Holding all other features fixed, a one-unit increase in Avg. Area Income is associated with an **increase of $21.52**.
- Holding all other features fixed, a one-unit increase in Avg. Area House Age is associated with an **increase of $164883.28**.
- Holding all other features fixed, a one-unit increase in Avg. Area Number of Rooms is associated with an **increase of $122368.67**.
- Holding all other features fixed, a one-unit increase in Avg. Area Number of Bedrooms is associated with an **increase of $2233.80**.
- Holding all other features fixed, a one-unit increase in Area Population is associated with an **increase of $15.15**.
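The "holding all other features fixed" reading follows directly from the linear form of the model: bumping one feature by one unit changes the prediction by exactly that feature's coefficient. A minimal sketch on synthetic data (hypothetical coefficients, not the housing dataset's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]  # exact linear target

lm = LinearRegression().fit(X, y)

# Take one row and increase only the first feature by one unit
x0 = X[0].copy()
x1 = x0.copy()
x1[0] += 1.0

delta = lm.predict(x1.reshape(1, -1))[0] - lm.predict(x0.reshape(1, -1))[0]
print(delta, lm.coef_[0])  # the prediction changes by the first coefficient (2.0)
```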
Predictions from our Model
```python
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
```
Residual Histogram
```python
sns.distplot((y_test - predictions), bins=50);
```
Regression Evaluation Metrics
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$
Comparing these metrics:
- MAE is the easiest to understand, because it’s the average error.
- MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.
All of these are loss functions, because we want to minimize them.
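The three formulas above can be computed directly with NumPy; a tiny hand-checkable example (the numbers are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
errors = y_true - y_pred          # [1, 0, -2]

mae = np.mean(np.abs(errors))     # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)        # (1 + 0 + 4) / 3 ≈ 1.667
rmse = np.sqrt(mse)               # ≈ 1.291
print(mae, mse, rmse)
```

Note how the single error of 2 dominates MSE relative to MAE, which is exactly the "punishes larger errors" behavior described above.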
```python
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
```

```
MAE: 82288.22251914945
MSE: 10460958907.208984
RMSE: 102278.829222909
```