Creating your First Regression Model

A beginner’s guide to getting started with Machine Learning

Rohit Baney
Analytics Vidhya
13 min read · Aug 6, 2021

Image by Mark Fletcher-Brown on Unsplash

If you want access to all the code herein, you can find it here.

Getting started with Data Science can be a daunting task. The field is vast with an enormous amount to learn. When I initially developed an interest in the field, I had no idea where to begin. Numerous online courses, various articles and many anxiety-laden breakdowns later, I have ascertained that the best way to get started is to simply create your first project. The following article will help you do just that.

If you have no idea what machine learning is, and want an introduction to the subject, please check out my article ‘Intro to Machine Learning for the Everyday Person’. Additionally, this article does assume you have a basic understanding of both Python and Jupyter Notebooks.

It is also important to note that this article is not a comprehensive guide to creating regression models. Its purpose is to introduce the reader to the process of pre-processing data and implementing a regression model. As such, it focuses on breadth rather than depth of the concepts involved. By the end of the article, you will have a working understanding of how to prepare data, create three different regression models, feed your data to the models to generate predictions and evaluate the accuracy of those predictions.

The dataset we are going to use contains information about cars. Our goal is to create a regression model using the data in this dataset to predict car prices. We shall use three different models, Decision Tree Regressor, Linear Regression, and XGBoost Regressor, and evaluate them to determine which model works best for our purpose.

Importing Libraries and Loading Data

The data that we are going to use can be downloaded here. All credit goes to the author of the dataset for the collection and the compilation of the data.

Most commonly, when you are doing a project on pre-compiled data, you will find that the data is given to you in a Comma Separated Values (.csv) format. Before we can use and manipulate this data, we need to import it into our Jupyter Notebook in the form of a dataframe.

Dataframes are data structures in Python that look a lot like Excel spreadsheets. They contain rows and columns of information which can be manipulated using code. Typically, every row in a dataframe represents an item, while every column represents information about that item. In our case, our dataframe has 205 rows, each representing a car, and 25 columns, each representing a feature of the cars.
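
As a quick, purely illustrative sketch of what a dataframe looks like in code (the car names and prices below are made up for the example, not taken from our dataset):

#A tiny, hypothetical dataframe: two rows (cars) and two columns (features)
import pandas as pd
toy_df = pd.DataFrame({
    'carname': ['car_a', 'car_b'], #Hypothetical names
    'price': [13495, 16500]        #Hypothetical prices
})
print(toy_df)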

To import the .csv file into a dataframe, we use the ‘pandas’ library. For those who are unfamiliar with libraries, they are essentially collections of pre-written code that we can import and use. Libraries such as pandas contain useful functions that Data Scientists use to manipulate dataframes. Some other useful libraries that we use are ‘NumPy’, which is used for mathematical functions, ‘Matplotlib’ and ‘seaborn’, which are used for creating visuals like graphs and charts, and ‘sklearn’ (scikit-learn), which houses the Machine Learning algorithms that we are going to use. The use of libraries is essential for anyone who uses Python. They are extremely convenient, easy to use and save you from having to write hours of code yourself.

We import libraries and load data in the following manner:

#Install Libraries
#Uncomment (remove '#') from the following 6 lines if you have never installed these libraries before
# !pip install numpy
# !pip install pandas
# !pip install matplotlib
# !pip install seaborn
# !pip install scikit-learn
# !pip install xgboost
#Import Libraries
#Data Manipulation
import numpy as np
import pandas as pd
#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
#Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
#Models
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
#Metrics
from sklearn.metrics import r2_score
#Load Data
cars_data = pd.read_csv('CarPrice_Assignment.csv', index_col = 'car_ID')
cars_data.head()
Image by author

We have now loaded the data from a .csv file and saved it in the form of a dataframe called ‘cars_data’. The cars_data.head() method gives us the first 5 rows of the dataframe. From it, we get an idea of what our data looks like. Additionally, cars_data.shape tells us that our dataframe consists of 205 rows, each representing a car, and 25 columns, each representing a feature of the cars.
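
If you want to confirm those dimensions yourself, the shape attribute can be printed directly; a one-line check, shown here for completeness:

#Checking the dimensions of the dataframe
print(cars_data.shape) #(205, 25): 205 cars, 25 feature columns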

Note: It is essential that you have the .csv data file in your current working directory and that it’s named ‘CarPrice_Assignment.csv’ for the above code to work.

Data Cleaning and Exploration

Now that we have imported our data in the form of a dataframe, we need to ‘clean’ it. Data found in the real world often contains flaws like missing values, poor formatting and spelling errors. Machine Learning models rarely do well with unclean data, and so, before we begin manipulating data, we need to clean it.

Cleaning data is often the most time-consuming part of a data science project. Fortunately for us, the dataset we are using is already squeaky clean: it has no missing values and no typos. This will almost never be the case for real-world data. The techniques you will use to clean a dataset will vary depending on the dataset itself. Spelling errors will have to be fixed, datetime errors will have to be corrected and missing values will have to be filled in. You will inevitably learn more about this as you do more projects, but for now, simply keeping in mind that data needs to be cleaned is enough.
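
Purely as an illustration of what that clean-up can look like (our dataset doesn’t need any of it), a common pattern is to count the gaps and then fill numeric ones with a sensible statistic such as the median; the fillna line below is an example pattern, not something we apply in this project:

#Counting missing values per column (all zeros for this dataset)
cars_data.isnull().sum()
#Example only: on a messier dataset you might fill numeric gaps with the column median
# cars_data['price'] = cars_data['price'].fillna(cars_data['price'].median())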

To get information about the dataset, and to check whether it has any missing values that need to be cleaned, you can use the following method:

#Getting Information about the dataset
cars_data.info()
Image by author

This useful function gives us a lot of information at a single glance. The first column is the column number. The second column gives us all the column names in our dataset. The third column tells us if there are any missing values for each of the columns. As we can see, all our columns have 205 non-null entries, so there are no missing values. The fourth column tells us the datatype housed in each of our columns.

There are various datatypes in our dataframe. They can be categorized into two groups: numerical and categorical. Every column with dtype ‘object’ is a categorical column, while every other dtype is numerical. Machine Learning models typically only deal with numerical data and don’t know what to do with categorical data, so this is an important distinction to make. We’ll learn how to deal with categorical data later.
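
To make that distinction concrete, you can list the two groups of columns yourself; this is a small sketch using the same select_dtypes call that appears again in the encoding section:

#Separating the column names by datatype
categorical_columns = cars_data.select_dtypes('object').columns #dtype 'object' -> categorical
numerical_columns = cars_data.select_dtypes(exclude = 'object').columns #everything else -> numerical
print('Categorical:', list(categorical_columns))
print('Numerical:', list(numerical_columns))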

Getting acquainted with your data is an important prerequisite to creating a good model. To this end, it would serve us well to be able to collect as much information about the data as possible. A useful way to collect data about the numerical columns is as follows:

#Acquiring information about the numeric data in the dataframe
cars_data.describe()
Image by author

This table gives us a lot of information about the numerical columns in our data. Using it, we can make useful observations about the distributions of our numerical features from their maximum, minimum, mean and standard deviation. In addition to this numerical analysis, however, it is extremely useful to be able to visualize our data and the relationships therein. A few useful ways of doing this are described below.

#Creating a boxplot
sns.boxplot(x = 'enginelocation', y = 'price', data = cars_data)
Image by author
#Creating a linear model plot
sns.lmplot(x = 'horsepower', y = 'price', hue = 'fueltype', data = cars_data)
Image by author
#Creating a scatter plot
sns.scatterplot(x = 'citympg', y = 'price', data = cars_data)
Image by author

Using the above graphs, we can easily tell that:

  • Cars with engines located in the back are typically a lot more expensive than those with engines located in the front
  • Diesel cars see a sharper increase in price as their horsepower increases compared to gasoline cars
  • Expensive cars typically have lower city mileage than cheaper cars

Such visual analysis is extremely beneficial: it gives you a better understanding of the data you have and comes in especially handy when you are ‘feature engineering’.

Feature Engineering

Feature engineering is what separates good data scientists from truly great ones. It is the process of using existing features to create new, useful features of our data. People with domain knowledge about the data have a huge advantage over everyone else here, because they might know ways to glean information that the dataset doesn’t explicitly contain.

For example, if you are trying to predict housing prices, you might know that the total area of a house is a strong predictor of its price. If the dataset only gives you the area of each room, you can combine this information to create a new feature that gives you the area of the entire house. Your dataset is then richer in information, and a model trained on it might outperform one trained on the original features.

There is no one way to feature engineer your dataset, but methods such as ‘principal component analysis’ can be useful here. This could be something you want to focus on as you gain more experience.
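
To give a rough idea of what that can look like, here is a minimal principal component analysis sketch on the numerical columns only; it is not used anywhere else in this project, and in practice you would usually scale the features first:

#A minimal PCA sketch on the numerical columns (illustration only, not used later)
from sklearn.decomposition import PCA
numeric_data = cars_data.select_dtypes(exclude = 'object')
pca = PCA(n_components = 2) #Keep the two strongest components
components = pca.fit_transform(numeric_data)
print(pca.explained_variance_ratio_) #Share of variance captured by each component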

To illustrate feature engineering in our example, we are going to combine the ‘carlength’, ‘carwidth’ and ‘carheight’ columns in our dataset to get the total volume of our car. We are going to call this new feature ‘totalsize’. The way we do this is as follows:

#Creating new features
cars_data['totalsize'] = cars_data['carlength'] * cars_data['carwidth'] * cars_data['carheight']
#Creating plot
plt.subplots(figsize = (17,3))
sns.boxplot(cars_data['totalsize'])
plt.xlabel('Total Size of Car', size = 15)
Image by author

Encoding

Remember how we had categorical values in our dataset and I had mentioned that machine learning models only take in numerical data? Encoding is the answer to this problem. Encoding is the process of converting categorical data into numerical data. There are multiple ways of doing this. The two most popular methods are ‘one-hot encoding’ and ‘label encoding’.

Label encoding is the easier of the two methods. It assigns a number to each of the categories in a column and replaces each category with its number. For example, in our dataset, the ‘fueltype’ column has two categories: gas and diesel. Label encoding replaces diesel with ‘0’ and gas with ‘1’. Simple, right? However, this method has its drawbacks. When we have more than 2 categories in a column, each category gets a number (1, 2, 3, 4…). Unfortunately, machine learning models interpret these numbers as an ordering, treating categories with higher numbers as somehow greater than categories with lower numbers. Such an ordering might not exist at all and might ultimately hurt performance. This is where our second encoding method comes in.

One-hot encoding is the more complex of the two methods. It involves creating a number of new columns depending on the number of categories in each column. Each one of these columns represents a single category from the original categorical column. It contains a 1 if the row in the original column contains the category it represents and a 0 otherwise. The image below makes this a little clearer.

Image by author
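
If you did want to one-hot encode a column, pandas’ get_dummies function is one common way to do it; a small sketch on the ‘fueltype’ column (not applied in this project) looks like this:

#One-hot encoding sketch using pandas (illustration only)
fuel_dummies = pd.get_dummies(cars_data['fueltype'], prefix = 'fueltype')
print(fuel_dummies.head()) #One column per category, holding 1s and 0s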

For simplicity, we are not going to use One-Hot Encoding in our example. Instead, we are going to use Label encoding throughout. This is not ideal, but it’ll still work while giving us a less complex view of what is happening to our dataframe. We implement label encoding as follows:

#Encoding the categorical data
categorical_columns = cars_data.select_dtypes('object').columns #Extract all the column names with categorical data
label_encoder = LabelEncoder() #Instantiate Label Encoder
#Go through all columns in the dataframe with categorical data and replace categories with numeric data
for column in categorical_columns:
    temp = label_encoder.fit_transform(cars_data[column])
    cars_data[column] = temp
cars_data[categorical_columns] #Image1
#Checking to see if all the data is in a numeric format
cars_data.dtypes #Image2
Image by author
Image by author

All our data has now been encoded using label encoding and is in a numeric format.

Regression Model

Before we actually feed the data we have into our regression models, it is standard practice to first split the data into training data and testing data. As the name suggests, we train our model on the training data and evaluate its performance on the testing data. We do this so that we can evaluate how our model performs on unseen data.

If we train our model on the same data we evaluate it on, our model might over-fit to our data, leading it to perform poorly on unseen, real-world data. Our goal is to be able to predict prices of cars that aren’t in our dataset, and so it is important for our model to generalize well and not over-fit on the data we have. Thus, we evaluate our model’s performance on data that it isn’t trained on. We split our data in the following way:

#Splitting data into features and label
x = cars_data.drop(columns = ['price']) #Features
y = cars_data['price'] #Label
#Splitting the data into Training and Testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 32)
print("Shapes of dataframes: \n",
      f'x_train:\t{x_train.shape} \n',
      f'x_test:\t{x_test.shape} \n',
      f'y_train:\t{y_train.shape} \n',
      f'y_test:\t{y_test.shape}')
Image by author

The ‘x’ variable contains all the features of our dataframe, that is, all the information that we are going to use to predict the price of a car. ‘y’ contains the price of the cars in our dataframe and is also called the ‘label’ or ‘target’. After splitting our dataset into training and testing data, we can see that the testing data contains about 30% of our overall data.

Decision Tree Regressor

Now, at long last, we are ready to create our first regression model - DecisionTreeRegressor. We do so as follows:

#Fitting the data to a Decision Tree Model
dtr = DecisionTreeRegressor()
dtr.fit(x_train, y_train) #Fitting the data
y_pred = dtr.predict(x_test) #Creating a prediction
print(f"{r2_score(y_test, y_pred):.4f}") #Evaluating results
Image by author

In just four lines of code, we initialize, fit, predict and evaluate our regression model. Creating a machine learning model is often the easiest part of the whole process. The challenge comes in optimizing the model and knowing which model to use in which situation. If you are interested in what happens behind the scenes when you call the above functions, I recommend you read this article; it explains the math behind regression models very well. If you only care about the execution of the model and not the math, this is all you will need.

The metric we use to evaluate our model is called the ‘r2_score’. There are numerous metrics you can use. Some make more sense depending on the problem at hand. I chose the ‘r2_score’ because it is easy to interpret. The closer to 1 your score is, the better your model is performing.
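
As a sketch of two other widely used options (reusing the y_test and y_pred variables from the Decision Tree cell above), mean absolute error and root mean squared error report the error in the same units as the price itself:

#Other common regression metrics (illustration)
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_test, y_pred) #Average absolute error, in price units
rmse = np.sqrt(mean_squared_error(y_test, y_pred)) #Penalizes large errors more heavily
print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}')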

Linear Regression

Now, let us run a different model called ‘Linear Regression’ and see how that performs.

#Fitting the data to the linear regression model and evaluating the prediction
lr = LinearRegression() #Instantiating the model
lr.fit(x_train, y_train) #Fitting the data
y_pred = lr.predict(x_test) #Creating a prediction
print(f"{r2_score(y_test, y_pred):.4f}") #Evaluating results
Image by author

Our Linear Regression model performed slightly worse than our Decision Tree model. This might not always be the case depending on the dataset at hand. Therefore, it is important to use and evaluate multiple regression models to find the best fit for your data.
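
If you prefer to compare candidates side by side rather than one cell at a time, a compact loop like the following sketch runs the same fit-predict-score cycle for each model:

#Comparing several models in one loop (sketch)
models = {'Decision Tree': DecisionTreeRegressor(), 'Linear Regression': LinearRegression()}
for name, model in models.items():
    model.fit(x_train, y_train)
    print(f'{name}: {r2_score(y_test, model.predict(x_test)):.4f}')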

XGBoost Regressor

Our third model is the XGBRegressor. XGBoost is an extremely powerful model that has been used to win numerous data science competitions in the recent past, and it often outperforms other models. Let’s see how it performs on our dataset.

#Fitting the data to the XGBoost model and evaluating the prediction
xgbr = XGBRegressor()
xgbr.fit(x_train, y_train) #Fitting the data
y_pred = xgbr.predict(x_test) #Creating a prediction
print(f"{r2_score(y_test, y_pred):.4f}") #Evaluating results
Image by author

As expected, our XGBRegressor model outperformed both of the other models. It is important to note, however, that none of the models we used have been optimized in any way. The next step in improving them would be to learn about ‘hyperparameters’ and tune them for each model. However, I shall leave that for another time.
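
To give a sense of what that tuning looks like, here is a minimal sketch using scikit-learn’s GridSearchCV; the parameter names are real XGBRegressor hyperparameters, but the grid values are arbitrary placeholders rather than recommended settings:

#A minimal hyperparameter search sketch (grid values are placeholders)
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [100, 300]}
search = GridSearchCV(XGBRegressor(), param_grid, cv = 5, scoring = 'r2')
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)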

Conclusion

Creating your first Machine Learning model can be an overwhelming task. Hopefully, I have managed to give you a preliminary understanding of the steps involved in creating a machine learning model.

This article is in no way exhaustive. Every step above could be expanded into an article of its own. Special care should be given to the data cleaning step, because it is the most time-consuming and also the most important. A model will perform only as well as the data that is given to it, and a model trained on unclean data will almost never perform well.

Now that you are familiar with the steps involved in creating a machine learning model, I encourage you to practice what you have learnt as much as possible. A really good place to start would be the Titanic dataset, which is beginner friendly. It is a fair bit more challenging than the dataset we used in this article, but following the steps listed here should get you through it.

If you have any questions/ comments/ suggestions, feel free to reach out to me on LinkedIn. If you want access to the entire code notebook, you can find it on my GitHub.

Thank you for reading and good luck with your data science journey!
