Predicting Price Using Simple Linear Regression in Python

Keziya Thomas
3 min read · Jan 20, 2023


Two variables, x and y, are said to have a linear relationship if a change in one variable produces a proportional change in the other, so that plotting the two variables against each other gives a straight line. Linear relationships are represented mathematically by the following formula.

y = mx + c

where
y - dependent variable
x - independent variable
m - slope
c - y-intercept

One very common example of a linear relationship is the Celsius-to-Fahrenheit conversion.

Temperature in F = (Temperature in Celsius * 9/5) + 32

When plotted on a graph, it would look like this:

Celsius vs Fahrenheit Line Graph
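To make the slope and intercept concrete, here is the conversion written as a small Python function (celsius_to_fahrenheit is just an illustrative name, not something we use later):

# Fahrenheit = (9/5) * Celsius + 32, i.e. y = mx + c with m = 9/5 and c = 32
def celsius_to_fahrenheit(celsius):
    slope = 9 / 5      # m
    intercept = 32     # c
    return slope * celsius + intercept

print(celsius_to_fahrenheit(0))    # 32.0
print(celsius_to_fahrenheit(100))  # 212.0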

In this article, we are going to use simple linear regression to predict car prices using this dataset. We will use Pandas for data manipulation and scikit-learn to train the linear regression model and predict prices. First, load the dataset into a Pandas dataframe.

import pandas as pd

ford_df = pd.read_csv("ford.csv")

Refer to this article to learn more about data exploration and analysis using Pandas.
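As a minimal sketch, a quick first look at the loaded dataframe could use the standard Pandas inspection methods:

# Inspect the size, column types and summary statistics of the data
print(ford_df.shape)
ford_df.info()               # column names, dtypes and non-null counts
print(ford_df.describe())    # summary statistics for numeric columns
print(ford_df.head())        # first few rows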

On analysing this dataframe, we will find that some of the features are numerical while others are categorical. To fit data to a linear regression model, all of the features must be numeric. Therefore, we split the dataframe into numeric and categorical columns and convert the categorical columns into a numeric representation using the get_dummies() method in Pandas.

The get_dummies() method encodes categorical columns as numeric indicator columns. For example, if the original dataset contains a column ‘Gender’ with three possible values (Male, Female and Others), this function creates three new columns named Gender_Male, Gender_Female and Gender_Others. If a particular row in the original dataset had Gender set to Male, these three columns would hold the values 1, 0 and 0 respectively. The values of these columns are limited to 0s and 1s only.
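Here is a minimal sketch of that behaviour, applied to a small hypothetical dataframe rather than the car dataset:

toy_df = pd.DataFrame({'Gender': ['Male', 'Female', 'Others']})
# Creates one indicator column per category: Gender_Female, Gender_Male, Gender_Others
print(pd.get_dummies(toy_df))
# drop_first=True drops one category, since its value is implied when all the others are 0
print(pd.get_dummies(toy_df, drop_first=True))

The drop_first=True option used below removes one redundant indicator column per categorical feature.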

# Keep the numeric columns as-is and one-hot encode the categorical ones
df_numeric = ford_df[['year','price','mileage','mpg','engineSize']]
df_categorical = ford_df[['model','transmission','fuelType']]
df_categorical_encoded = pd.get_dummies(df_categorical, drop_first=True)

After the categorical columns are converted into numeric ones, merge the new columns with the remaining original columns, then split the dataset into a training set and a test set. The ‘price’ column is the label, and all the other columns become the predictors (features).

from sklearn.model_selection import train_test_split

ford_df_new = pd.concat([df_numeric, df_categorical_encoded], axis=1)
X = ford_df_new.drop('price', axis=1)
y = ford_df_new['price']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=100,
                                                    test_size=0.20,
                                                    shuffle=True)

Once the train-test split is done, we use sklearn’s linear regression class, sklearn.linear_model.LinearRegression, to fit the model to the training data. Here, fitting means using the training data to find the line (the combination of coefficients across all the predictors) that best represents the relationship between price and the predictors. After fitting, you can test the model on data it has not seen before.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
lr = model.fit(X_train, y_train)

Now, how do you know how well the model represents the data? There is a metric called the coefficient of determination, or R-squared, that tells you how well the model predicts an outcome. A score close to 1 indicates that the model is performing well. sklearn.linear_model.LinearRegression has a score() method that returns the R-squared score of the prediction. Another way to get this score is to use sklearn.metrics.r2_score.

print(lr.score(X_test,y_test))

OR

from sklearn.metrics import r2_score

y_pred = lr.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(r2)

A score of 0.82 means the model explains about 82% of the variance in car prices. A perfectly fitted model gives a score of 1, while a score of 0 means the model does no better than always predicting the mean price.
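For intuition, the same value can be computed directly from the definition of R-squared, one minus the ratio of the residual sum of squares to the total sum of squares:

import numpy as np

ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                      # same value as r2_score(y_test, y_pred)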

If you want the model to predict a few new data points, call the predict() method on the fitted model and pass an n x m array (or dataframe) as the argument, where n is the number of data points to predict and m is the number of features/predictors.
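As a sketch, a few rows of the test set can stand in for new data; in practice any dataframe with the same columns as X_train works:

# The input must have the same columns, in the same order, as the training features
new_points = X_test.iloc[:3]
predicted_prices = lr.predict(new_points)
print(predicted_prices)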

Hope this article made it easier for you to implement your first linear regression model. Please leave your comments below.
