Car Price Prediction with Machine Learning Models (Part 2)

David Obembe
Published in Analytics Vidhya
Jul 20, 2020 · 7 min read

In my last article, I gave a gentle overview of what Machine Learning is about and the framework for building a Machine Learning model. The data was cleaned and some Exploratory Data Analysis was carried out to gain insight into the data. In this article, we will be continuing from there. If you have not seen the last post, click here to read it, so you can follow along easily.

Photo by Erwann Letue on Unsplash

Feature Engineering

Feature engineering is a vital preprocessing step in building any machine learning model. It involves selecting, transforming, or removing features using domain knowledge. First, the 'Location' and 'Engine' columns were dropped from the dataframe: 'Location' because it does not affect the price, and 'Engine' because its entries were mostly unique, so the algorithm cannot learn anything general from the feature.

ML algorithms deal purely with numerical inputs and outputs. Therefore, the string-typed (categorical) features were converted to numerical data. This was done using one-hot encoding, which creates a separate column for every unique value in a categorical column and fills in 1 to indicate that value's presence and 0 otherwise. The pandas get_dummies function performed this operation.

# Drop the 'Location' and 'Engine' columns
data = data.drop(['Location', 'Engine'], axis=1)
# Collect the categorical (object-typed) columns
cat_features = [x for x in data.columns if data[x].dtype == 'O']
# One-hot encode the categorical columns
data = pd.get_dummies(data, columns=cat_features)
One Hot Encoding. (Source: Kaggle)
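To see what this does, here is a toy example (not the project data) of how get_dummies turns one categorical column into indicator columns:

import pandas as pd

# A toy frame, just to illustrate one-hot encoding
toy = pd.DataFrame({'Fuel_Type': ['Petrol', 'Diesel', 'Petrol']})
# Creates one indicator column per category:
# Fuel_Type_Diesel -> [0, 1, 0], Fuel_Type_Petrol -> [1, 0, 1]
print(pd.get_dummies(toy, columns=['Fuel_Type']))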

The data was then split into train and test sets. The ML model learns from the train data while the test data, as the name implies, is used to check how well the model has learnt. Differences in scale across the features were then handled with feature scaling, which compresses the data into a particular range of values. Features can either be standardized or normalized. In this case, the data was standardized using the StandardScaler class. The scaler was fitted on the train data alone, since the test data should be treated as unseen data.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the features (independent variables)
X = data.drop(['Price'], axis=1)
# Define the label (dependent variable)
y = data['Price']
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the continuous columns of the train data only
# (continuous_features is the list of continuous columns from the last article)
scaler = StandardScaler()
scaler.fit(X_train[continuous_features[0:-1]])
# Transform both the train and test data with the fitted scaler
X_train[continuous_features[0:-1]] = scaler.transform(X_train[continuous_features[0:-1]])
X_test[continuous_features[0:-1]] = scaler.transform(X_test[continuous_features[0:-1]])
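As an aside, standardization and normalization are different transforms. Here is a quick toy contrast (MinMaxScaler is shown only for comparison; it is not used in this project):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

vals = np.array([[1.0], [2.0], [3.0], [10.0]])
# Standardization: zero mean, unit variance
print(StandardScaler().fit_transform(vals).ravel())
# Normalization: rescaled into the [0, 1] range
print(MinMaxScaler().fit_transform(vals).ravel())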

Now that our data is in good shape, let's evaluate how some common regression algorithms perform on it.

Model Evaluation

Five algorithms were evaluated: linear regression, ridge regression, lasso regression, decision tree, and random forest regressors.

  1. A linear regressor finds the line that best fits the features and uses it to predict the output for new, unseen inputs
  2. A ridge regressor uses an L2 regularization technique to reduce the complexity of the model by shrinking its coefficients
  3. A lasso regressor uses an L1 regularization technique that can shrink some coefficients all the way to zero, completely eliminating those features
  4. A decision tree regressor splits the data into smaller and smaller subdivisions by asking yes/no questions about the features. Each answer improves the confidence of a correct prediction
  5. A random forest aggregates the predictions of many decision trees to produce a more stable prediction (see the sketch after this list)
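To make that last point concrete, here is a minimal sketch (on a synthetic dataset, not the car data) showing that a random forest's regression prediction is simply the average of its individual trees' predictions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fit a small forest on synthetic data
X_toy, y_toy = make_regression(n_samples=100, n_features=4, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Each tree predicts independently; the forest averages them
tree_preds = np.stack([tree.predict(X_toy) for tree in forest.estimators_])
print(np.allclose(tree_preds.mean(axis=0), forest.predict(X_toy)))  # True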

K-fold cross-validation was used to check how these models would perform on our data. Cross-validation is useful because it produces results with lower bias and lower variance than a single train/test split. The data is split into k folds, and each fold is used in turn as test data while the rest serve as train data. The average result across the folds therefore reveals how the algorithm performs across the board.

K-fold cross-validation (Source: Research Gate)
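To make the fold rotation concrete, here is a minimal sketch on ten toy samples; each fold serves exactly once as the test set:

import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10).reshape(-1, 1)
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(toy)):
    print(f'Fold {fold}: train={train_idx}, test={test_idx}')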

Ten splits were used, and the scoring metric was the r-squared (also called R²) score, a measure of how much of the variance in the target the model's predictions explain.

R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², where yᵢ are the observed values, ŷᵢ the predictions, and ȳ the mean of the observed values. (r2 formula. Source: Seattle Data Guy)

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Create a list of ML algorithms
models = []
models.append(('Linear Regression', LinearRegression()))
models.append(('Ridge Regression', Ridge()))
models.append(('Lasso Regression', Lasso()))
models.append(('Decision Tree', DecisionTreeRegressor()))
models.append(('Random Forest', RandomForestRegressor()))

# Evaluate each model with 10-fold cross-validation
for name, model in models:
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    score = cross_val_score(model, X, y, cv=cv, scoring='r2')
    print(f"{name} has an R² score of {np.round(score.mean() * 100, 2)}% (SD: {np.round(score.std(), 4)})")

From the results, you can quickly see that the random forest performs best on the data, with an R² score of 89.55%. Consequently, the random forest regressor was adopted.

from sklearn.metrics import r2_score
from sklearn import metrics

# Check the model's performance on the test data
rf = RandomForestRegressor()
%time rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f'r2 score: {np.round(r2_score(y_test, y_pred), 4) * 100}%')
print(f'Mean absolute error: {metrics.mean_absolute_error(y_test, y_pred)}')
print(f'Mean squared error: {metrics.mean_squared_error(y_test, y_pred)}')

On the held-out test data, the R² score was 87.32%, which was pretty good. Checking the other metrics, the mean absolute error (1.60) and the mean squared error (14.87) were quite low as well.

Hyperparameter Tuning

Having an R² score of 87.32% is good, but it can still be bumped up a little. A great way to improve the result is by tweaking the algorithm's hyperparameters. GridSearchCV and RandomizedSearchCV search through different combinations of hyperparameters and return the combination that produces the best result. RandomizedSearchCV is, however, less computationally intensive, so it was used to obtain the best hyperparameters.

After a list of values was defined for each hyperparameter, RandomizedSearchCV found the best estimator and parameters for the data. Using these parameters, the R² score increased to 87.93%. This was adopted as the final model.

from sklearn.model_selection import RandomizedSearchCV

# Number of trees in the forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ['sqrt', 'auto']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(5, 50, num=10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Collect the parameter distributions
params = {'n_estimators': n_estimators,
          'max_features': max_features,
          'max_depth': max_depth,
          'min_samples_split': min_samples_split,
          'min_samples_leaf': min_samples_leaf,
          'bootstrap': bootstrap}

# Apply RandomizedSearchCV with the defined parameters
model_search = RandomizedSearchCV(rf, param_distributions=params, scoring='r2')
%time model_search.fit(X_train, y_train)
y_pred_op = model_search.predict(X_test)
# Check metrics
print(r2_score(y_test, y_pred_op))
# Print the best combination of hyperparameters
print(model_search.best_params_)

To tie everything together, a function was defined that estimates the price of a car using the final model.

# Define a function that implements the model
def predict_price(name, year, km, fuel, transmission, owner, mileage, power, seats):
    # Find the column positions of the one-hot encoded categorical features
    name_index = np.where(X.columns == 'Name_' + name.upper())[0]
    fuel_index = np.where(X.columns == 'Fuel_Type_' + fuel)[0]
    transmission_index = np.where(X.columns == 'Transmission_' + transmission)[0]
    owner_index = np.where(X.columns == 'Owner_Type_' + owner)[0]
    # Build the feature vector and fill in the numerical inputs
    x = np.zeros(len(X.columns))
    x[0] = year
    x[1] = km
    x[2] = mileage
    x[3] = power
    x[4] = seats
    # Switch on the matching one-hot columns, if they exist
    if name_index.size > 0:
        x[name_index[0]] = 1
    if fuel_index.size > 0:
        x[fuel_index[0]] = 1
    if transmission_index.size > 0:
        x[transmission_index[0]] = 1
    if owner_index.size > 0:
        x[owner_index[0]] = 1

    return f'The estimated price of the car is {model_search.predict([x])[0]} Lakh Rupees'

Check out some of the estimations.
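For instance, a call like the one below returns an estimate (the input values here are made up for illustration; the brand and category strings must match columns present in the encoded data):

# Hypothetical inputs for illustration only
print(predict_price(name='Hyundai', year=2015, km=41000, fuel='Petrol',
                    transmission='Manual', owner='First',
                    mileage=19.7, power=118.0, seats=5))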

This final model can now be saved and used on completely unseen data. Moreover, it can be deployed to production using any of the available cloud platforms.

import pickle

# Save the model to disk
pickle.dump(model_search, open('model_final.pkl', 'wb'))
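Whenever it is needed again, the saved model can be loaded back and used for predictions without retraining:

# Load the saved model back into memory
model = pickle.load(open('model_final.pkl', 'rb'))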

Conclusion and Future Work

We went through how to build an ML model for car price prediction using five ML algorithms. We started off by cleaning the data, carrying out data visualizations, and preprocessing the data before feeding it to the ML algorithms. It was observed that the decision tree and the random forest both produced good results, but the random forest performed better. This is because a random forest is an aggregation of decision trees: the aggregation reduces variance and limits overfitting, therefore producing a better result. The trade-off is that a random forest takes more time to train and predict.

Some other ideas can be implemented for this project. Features that are strongly correlated, such as the mileage and the age of the car, can be merged together. Principal Component Analysis, an unsupervised learning approach, can be used to reduce the number of features. Also, other hyperparameter combinations can be checked with GridSearchCV over a wider range of values (a sketch follows below). With a larger dataset, the specific model of the car could be used rather than just its brand.
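For the GridSearchCV idea, a minimal sketch might look like the following (the grid values here are illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

# An illustrative grid; GridSearchCV tries every combination exhaustively
grid = {'n_estimators': [200, 600, 1000],
        'max_depth': [10, 30, None],
        'min_samples_leaf': [1, 2]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid=grid,
                           scoring='r2', cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)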

There you have it. If you want to try your hand at this project, you will find the Notebook file here. Thanks for your time.
