My First Machine Learning Algorithm

And maybe yours too

Douglas Rocha
7 min read · Sep 29, 2022

Hi! I’m Douglas and I’m a Data Science enthusiast trying to learn everything I think I need to become a Data Scientist.

I took a small step outside my studying plans and did a very quick Kaggle course on Intro to Machine Learning. There they use the famous housing prices example with both Melbourne and Iowa data and guide you through creating your first regression model with the Decision Tree and Random Forest algorithms.

Kaggle Completion Certificate

Well, if you read my About me post, you know this isn’t really my first machine learning algorithm; I actually have a bit of experience with machine learning. But this is my first time using the Decision Tree and Random Forest algorithms, and it might as well be your first algorithm ever, so let’s get right into it.

The Data

For this tutorial-ish article, I chose not to use the same housing prices dataset used in the Kaggle course. Instead, I went on Kaggle itself and chose a Used Car Prices Prediction dataset, a pretty new one I might say. To take a quick look at it, we are going to use the Pandas library.

import pandas as pd

dataset = pd.read_csv("car data.csv")
print(dataset.shape)
dataset.head()
(301, 9)
[table: output of dataset.head(), the first five rows of the dataset]

First of all, we see the shape of our dataset: 301 rows and 9 columns. Each one of those rows represents a used car, and each of those columns represents a feature of those cars. We see we have numerical features (Year, Selling_Price, Present_Price, Driven_kms, and Owner) and categorical features (Car_Name, Fuel_Type, Selling_type, and Transmission).
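If you would rather not eyeball the column types, pandas can report them directly. A small sketch of that check (the numerical_cols and categorical_cols names below are just for this illustration):

# Inspect the dtype pandas inferred for every column
print(dataset.dtypes)

# Split the columns into numerical and categorical groups automatically
numerical_cols = dataset.select_dtypes(include="number").columns.tolist()
categorical_cols = dataset.select_dtypes(include="object").columns.tolist()
print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)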

The purpose of this article is not to engage in the matters of Data Cleaning or Data Wrangling. But since this is not a dataset created for the purpose of being used in a tutorial, it might (and does) have some small problems we will want to solve. Those problems, fortunately, are pretty easy to solve. First of all, we check for null (or, rather, missing) values in our dataset.

dataset.isna().sum()

Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Driven_kms       0
Fuel_Type        0
Selling_type     0
Transmission     0
Owner            0
dtype: int64

That assures us that no column contains an empty or missing value. That is definitely a relief.
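Had any column shown missing values, a minimal fix could have looked like the sketch below. This is purely illustrative (our dataset has none, so neither line changes anything here), and the cleaned and filled names are just for the example:

# Option 1: drop every row that contains any missing value
cleaned = dataset.dropna()

# Option 2: fill numeric gaps with each column's median
filled = dataset.fillna(dataset.median(numeric_only=True))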

Further on, we check for duplicates in the dataset.

dataset.duplicated().sum()

2

Two duplicates mean we have 2 cars in our dataset that are each identical to some other car (not necessarily to one another). That could become a problem if it affects the training of our model. Thus, we are going to drop those duplicates and reassure ourselves that there are no others.

dataset.drop_duplicates(inplace=True)
print(dataset.duplicated().sum())
print(dataset.shape)
0
(299, 9)

We can see we now have 2 fewer cars in our dataset, but that is good because now each observation is unique.

Creating our first Decision Tree

So, to create, train, and test a Decision Tree (or any Machine Learning model, for that matter), we first need to know what we want to predict and which features we will use to try and predict it. In this case, I want to predict the Present Price of the used car, so I save that single column as a labels Series called y.

y = dataset['Present_Price']
print(y.shape)
y.head()
(299,)

0    5.59
1    9.54
2    9.85
3    4.15
4    6.87
Name: Present_Price, dtype: float64

Next, in order to decide which features we will use, I applied a simple decisive factor: we will not engineer features. By that I mean that, since the Decision Tree implementation we are using (Scikit-learn’s) does not handle categorical data, even though the algorithm in principle could, we will drop the categorical features from our model. In other words, the only features we are going to use are the numerical ones: Year, Selling_Price, Driven_kms, and Owner (which is arguably categorical, but is stored as a number).

X = dataset[["Year", "Selling_Price", "Driven_kms", "Owner"]]
print(X.shape)
X.head()
(299, 4)
[table: output of X.head(), the first five rows of the selected features]
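If you prefer not to type the column names by hand, the same feature table can be built by selecting the numerical columns and dropping the label. A quick sketch of that alternative (X_auto is just an illustrative name):

# Keep only numerical columns, then drop the label we want to predict
X_auto = dataset.select_dtypes(include="number").drop(columns=["Present_Price"])
print(X_auto.columns.tolist())
# Should list Year, Selling_Price, Driven_kms and Owner, assuming pandas
# inferred numeric dtypes for those columns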

To later evaluate our model on data it has not seen before, we will use Scikit-learn’s train_test_split function to get a randomized split between data to train the model and data to test it afterwards. We will set its random state to the number 7 so it runs exactly the same every time we run it.

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
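One detail worth knowing: when no sizes are given, train_test_split holds out 25% of the rows for testing by default. If you want that choice to be explicit (or want a different split), you can pass test_size yourself; a small sketch:

# Equivalent split, with the test fraction spelled out explicitly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)
print(X_train.shape, X_test.shape)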

Now, to actually create our model once and for all, we will use a Scikit-learn class that implements a Decision Tree for Regression (predicting continuous values), naturally called DecisionTreeRegressor. We will then fit it (or train it, as you wish) with the training data. At last, we will predict the labels for the X_test data and compare them to the actual labels for those cars, calculating the mean absolute error of that comparison so we can get an idea of how well (or poorly) our model did.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error


default_DT = DecisionTreeRegressor(random_state=7)
default_DT.fit(X_train, y_train)
mae_df = mean_absolute_error(y_test, default_DT.predict(X_test))
print(mae_df)
1.6693333333333331

Seeing an error of merely 1.67 might lead us to think this is a fantastic model. But if we analyze the data, we will see that that number is not so small when compared to the average of the labels we want to predict, that is, the present price of the cars.

average_present_price = y.mean()
print(average_present_price)

relative_error = round(mae_df/average_present_price *100, 2)
print(relative_error, "%")
7.541036789297662
22.14 %

As you can see, that error actually means a 22.14% average error in the predictions for that dataset. Is it enough? It depends on the scenario, how much time and money one can spend to further improve this model, and many other factors. In our case, we still want to improve it by as much as we can (or at least as much as this Kaggle course has taught us to).

Tuning the Parameters

Yes, I am aware the correct name is hyperparameters, but Kaggle did not get into that matter, so neither will I. If you are unfamiliar, the parameters we are talking about are characteristics of the Decision Tree that can interfere with its performance. We will not talk much about Overfitting or Underfitting here, but those two scenarios can happen and often worsen a model’s performance. In the case of Decision Trees, one parameter we can adjust to avoid these problems and improve the overall performance is the maximum number of leaves. Allowing too many leaves will lead the model to fit too closely to the training data and perform poorly on other data, such as the test data. On the other hand, if we allow only a small number of leaves, the model will not fit our data at all and will perform poorly on the train and test data alike.

So, to try and solve that problem and find out whether it really is what is hurting our model’s performance, we shall test a few possibilities for the maximum number of leaves and choose the one with the best performance, that is, the smallest mean absolute error.

import numpy as np
max_leaf_sizes = [5, 10, 50, 100, 500]
maes = []

for leaf_size in max_leaf_sizes:
    DT = DecisionTreeRegressor(random_state=7, max_leaf_nodes=leaf_size)
    DT.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, DT.predict(X_test))
    print(f'Max Leafs: {leaf_size} | MAE: {mae}')

    maes.append(mae)

best_leaf_size = max_leaf_sizes[np.argmin(maes)]
best_mae = np.min(maes)
print()
print(f'Best Leaf Size: {best_leaf_size} | Best MAE: {best_mae}')
Max Leafs: 5 | MAE: 2.3890808214008215
Max Leafs: 10 | MAE: 2.5523355559075074
Max Leafs: 50 | MAE: 1.6066773766058149
Max Leafs: 100 | MAE: 1.6366038095238098
Max Leafs: 500 | MAE: 1.6527999999999998

Best Leaf Size: 50 | Best MAE: 1.6066773766058149
relative_error = round(best_mae/average_present_price *100, 2)
print(relative_error, "%")
21.31 %

We see the error has slightly decreased. In some cases, a 1% improvement may be something to party over. But, in our case, there are still some things we can do to further improve this model.
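If you want to actually see the overfitting described earlier rather than take it on faith, one way is to compare each tree’s error on the training data with its error on the test data; an unrestricted tree will typically score far better on data it has already seen. A quick sketch reusing the variables defined above (tuned_DT is just an illustrative name):

# Compare train vs. test error for the default and the tuned tree
tuned_DT = DecisionTreeRegressor(random_state=7, max_leaf_nodes=best_leaf_size)
tuned_DT.fit(X_train, y_train)

for name, model in [("Default tree", default_DT), ("Tuned tree", tuned_DT)]:
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: train MAE = {train_mae:.2f} | test MAE = {test_mae:.2f}")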

Random Forests

Ensemble theory says that, if you properly combine a group of more than one model with different parameters (or, rather, hyperparameters, if you will) and use them to get a joint result, you will usually get better performance than with just one model (at least that is what the theory says). Random Forests are an Ensemble application with just Decision Trees as building blocks for a greater joint model. A Random Forest creates a large number of trees and uses them in conjunction to predict our label and, in theory, it should work better than even the best-performing single Decision Tree. We will use Scikit-learn’s RandomForestRegressor class for this test.

from sklearn.ensemble import RandomForestRegressor


RF = RandomForestRegressor(random_state=7)
RF.fit(X_train, y_train)
mae_rf = mean_absolute_error(y_test, RF.predict(X_test))
print(mae_rf)
1.217540000000001

relative_error = round(mae_rf/average_present_price *100, 2)
print(relative_error, "%")
16.15 %

As one can see, the result is clearly better than the one we had just a moment ago. Is it good yet? Probably not. Is it enough? As I’ve said, it depends on the scenario. Will we try to improve it further? Not now at least. We have used here everything we have learned this far from that introductory Kaggle course. Spoiler: the next step may be encoding the categorical data and normalizing the numerical ones, in other words, feature engineering. But that will only come at another time.
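Just to make that spoiler a bit more concrete, here is a minimal, non-definitive sketch of what the encoding step could look like with pandas’ get_dummies. The encoded, X_full, and y_full names are only for this illustration, Car_Name is dropped on the assumption that it has too many unique values to one-hot encode usefully, and nothing here has been tuned or evaluated:

# Illustrative only: one-hot encode the low-cardinality categorical columns
encoded = pd.get_dummies(
    dataset.drop(columns=["Car_Name"]),
    columns=["Fuel_Type", "Selling_type", "Transmission"],
)
X_full = encoded.drop(columns=["Present_Price"])
y_full = encoded["Present_Price"]
print(X_full.shape)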

In any case, this is my quick tutorial/walkthrough on how to create your first Decision Tree Machine Learning algorithm to predict the price of used cars. The notebook file will be available for download on my GitHub here alongside the dataset used. I hope this helped you in any way but really, as I always say, even if it didn’t help you, it helped me tremendously so it’s worth it. See you soon!

Douglas Rocha

Software Engineer | Working Java, React, SQL and Python | Writing Best Coding Practices, Clean Code and Software Engineering