Linear Regression with Python
Data Details
- ‘Avg. Area Income’: Avg. income of residents of the city the house is located in
- ‘Avg. Area House Age’: Avg. age of houses in the same city
- ‘Avg. Area Number of Rooms’: Avg. number of rooms for houses in the same city
- ‘Avg. Area Number of Bedrooms’: Avg. number of bedrooms for houses in the same city
- ‘Area Population’: Population of the city the house is located in
- ‘Price’: Price that the house sold at
- ‘Address’: Address of the house
Import Libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
Check out the Data
```python
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
```
```python
USAhousing.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income                5000 non-null float64
Avg. Area House Age             5000 non-null float64
Avg. Area Number of Rooms       5000 non-null float64
Avg. Area Number of Bedrooms    5000 non-null float64
Area Population                 5000 non-null float64
Price                           5000 non-null float64
Address                         5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.5+ KB
```

```python
USAhousing.describe()
```
```python
USAhousing.columns
```

```
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
```
Exploratory Data Analysis
```python
sns.pairplot(USAhousing)

# note: distplot is deprecated in newer seaborn; histplot is its replacement
sns.distplot(USAhousing['Price'])

sns.heatmap(USAhousing.corr())
```
Training a Linear Regression Model
X and y arrays
```python
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
```
Train Test Split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
```
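With `test_size=0.4`, 40% of the rows are held out for testing and the remaining 60% are used for training. A quick sketch on synthetic stand-in data (the shapes here are hypothetical, not the housing dataset's) confirms the split proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 rows, 2 features
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# test_size=0.4 reserves 40% of the rows (20 of 50) for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=101)
print(X_tr.shape, X_te.shape)  # (30, 2) (20, 2)
```

Fixing `random_state` makes the split reproducible across runs.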
Creating and Training the Model
```python
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, y_train)
```

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```
Model Evaluation
```python
# print the intercept
print(lm.intercept_)
```

```
-2640159.7968526958
```

```python
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
coeff_df
```
Interpreting the coefficients:
- Holding all other features fixed, a one-unit increase in Avg. Area Income is associated with an **increase of $21.52**.
- Holding all other features fixed, a one-unit increase in Avg. Area House Age is associated with an **increase of $164883.28**.
- Holding all other features fixed, a one-unit increase in Avg. Area Number of Rooms is associated with an **increase of $122368.67**.
- Holding all other features fixed, a one-unit increase in Avg. Area Number of Bedrooms is associated with an **increase of $2233.80**.
- Holding all other features fixed, a one-unit increase in Area Population is associated with an **increase of $15.15**.
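The "holding all other features fixed" reading follows directly from the linear form of the model: bumping one feature by one unit changes the prediction by exactly that feature's coefficient. A minimal sketch on synthetic data (hypothetical coefficients, not the housing dataset's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]  # exact linear target

lm = LinearRegression().fit(X, y)

# Take one row and increase only the first feature by one unit
x0 = X[0].copy()
x1 = x0.copy()
x1[0] += 1.0

delta = lm.predict(x1.reshape(1, -1))[0] - lm.predict(x0.reshape(1, -1))[0]
print(delta, lm.coef_[0])  # the prediction changes by the first coefficient (2.0)
```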
Predictions from our Model
```python
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
```
Residual Histogram
```python
sns.distplot((y_test - predictions), bins=50);
```
Regression Evaluation Metrics
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$
Comparing these metrics:
- MAE is the easiest to understand, because it’s the average error.
- MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.
All of these are loss functions, because we want to minimize them.
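The three formulas above can be computed directly with NumPy; a tiny hand-checkable example (the numbers are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
errors = y_true - y_pred          # [1, 0, -2]

mae = np.mean(np.abs(errors))     # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)        # (1 + 0 + 4) / 3 ≈ 1.667
rmse = np.sqrt(mse)               # ≈ 1.291
print(mae, mse, rmse)
```

Note how the single error of 2 dominates MSE relative to MAE, which is exactly the "punishes larger errors" behavior described above.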
```python
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
```

```
MAE: 82288.22251914945
MSE: 10460958907.208984
RMSE: 102278.829222909
```