Published in MLearning.ai

Machine Learning Tutorial

7 Steps to Build a Machine Learning Model with Python

A guide to building a linear regression model using the medical cost personal dataset

Photo by KOBU Agency on Unsplash

Machine learning is a subset of artificial intelligence that gives a machine the ability to learn from experience without being explicitly programmed. It has applications in many fields, and one of the best ways to build expertise in it is to work through a lot of projects.

In this post, I’m going to show how to build a machine learning model step by step. I’ll cover the following topics:

  1. Loading the dataset
  2. Understanding the dataset
  3. Data preprocessing
  4. Data visualization
  5. Building a regression model
  6. Model evaluation
  7. Model prediction

Before getting started, please don’t forget to subscribe to my YouTube channel, where I create content about AI, data science, machine learning, and deep learning.

Let’s dive in!

1. Loading The Data

The dataset I’m going to use contains medical charges for patients. I highly recommend that you download this dataset and write the code along with me. You can access my code from here. First, let’s import the dataset. I’m going to load it using Pandas, so let me import Pandas first.

import pandas as pd

Pandas is an excellent library for data loading and data preprocessing. Now, let me load the dataset with the read_csv method.

data = pd.read_csv("insurance.csv")

Now, let’s take a look at the first rows of the dataset. To do this I’m going to use the head method.

data.head()

As you can see, there are 7 columns: age, sex, body mass index (bmi), number of children, smoker, region, and charges.

2. Understanding The Dataset

Understanding the data is very important before building a machine learning model. For example, let’s see the number of rows and columns of the dataset. I’m going to use the shape attribute to do this.

data.shape

As you can see, the dataset has 1338 rows and 7 columns. Now, I’m going to use the info method to get more information about the dataset.

data.info()

The non-null counts in the output show that there is no missing data in the dataset. You can also use the isnull method, which returns True for every missing cell.

data.isnull()

Let me chain the sum method to count the missing values in each column.

data.isnull().sum()

As you can see, there is no missing data in the dataset. Knowing the column types is very important for building a machine learning model. Now, let’s take a look at the column types. I’m going to use the dtypes attribute for this.

data.dtypes
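Since this dataset has no missing values, the calls above only ever report zeros. Here is a minimal sketch of what they report when values are missing. The frame and its numbers are made up for illustration and are not taken from insurance.csv.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not from insurance.csv).
toy = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "smoker": ["yes", "no", "no", None],
    "charges": [1200.0, 800.0, 950.0, np.nan],
})

# isnull() returns a frame of booleans with the same shape;
# sum() then counts the True values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Each column in this toy frame has exactly one missing cell, so the per-column counts are all 1.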

3. Data Preprocessing

The sex, region, and smoker columns hold text, so Pandas stores them as the object type. Let’s convert them to the category type.

data['sex'] = data['sex'].astype('category')
data['region'] = data['region'].astype('category')
data['smoker'] = data['smoker'].astype('category')

Let’s see the data types again.

data.dtypes
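To see what the category type actually stores, here is a small sketch on a made-up frame (the rows are invented, not from insurance.csv). A categorical column keeps the list of distinct labels once and represents each row with an integer code.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "age": [19, 33, 28, 45],
})

# Convert the text columns to the category dtype.
for col in ["sex", "smoker"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)

# The distinct labels, stored once per column.
print(list(toy["sex"].cat.categories))

# The integer code that stands in for each row's label.
print(toy["sex"].cat.codes.tolist())
```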

Now, let’s go ahead and take a look at the statistics of the numeric variables with the describe method. Transposing the result with .T makes the statistics easier to read.

data.describe().T

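Now, let’s look at the mean charges for smokers and non-smokers. To do this, let’s first group with the groupby method. I pass numeric_only=True so that the mean is computed only over the numeric columns; newer Pandas versions raise an error on the categorical columns otherwise. I’m going to use the round method to keep two decimal places.

smoke_data = data.groupby("smoker").mean(numeric_only=True).round(2)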

Let’s see smoke data.

smoke_data

As you can see, smokers pay more than non-smokers.
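Here is a minimal sketch of the same group-and-average pattern on a tiny made-up frame, so the arithmetic is easy to check by hand. The numbers are invented and do not come from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no"],
    "age": [20, 30, 40, 50],
    "charges": [3000.0, 1000.0, 5000.0, 2000.0],
})

# One row per group, one column per numeric variable.
mean_by_smoker = toy.groupby("smoker").mean(numeric_only=True).round(2)
print(mean_by_smoker)

# Selecting a single column first gives a Series of group means.
avg_charges = toy.groupby("smoker")["charges"].mean()
print(avg_charges)
```

In this toy frame, the smokers average (3000 + 5000) / 2 = 4000 and the non-smokers average (1000 + 2000) / 2 = 1500.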

4. Data Visualization

You can understand the dataset better with data visualization. Now, let’s look at the relationships between the numeric variables using seaborn. First, let me import seaborn.

import seaborn as sns

Seaborn is a library built on top of matplotlib and is used especially for statistical plots. Now, let’s choose the plot style.

sns.set_style("whitegrid")

I’m going to use the pairplot function to see the relationships between the numeric variables, with the points colored by smoker.

sns.pairplot(
    data[["age", "bmi", "charges", "smoker"]],
    hue="smoker",
    height=3,
    palette="Set1",
)

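For example, as age increases, charges increase for both smokers and non-smokers. Now, let’s look at the correlation between the numeric variables. I pass numeric_only=True because the categorical columns cannot be correlated directly.

sns.heatmap(data.corr(numeric_only=True), annot=True)

Notice that charges is positively correlated with each of the other numeric variables, most strongly with age.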
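Each cell in that heatmap is a Pearson correlation coefficient. Here is a small sketch that computes one by hand and compares it with what corr returns. The five rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "charges": [1500.0, 2100.0, 2400.0, 3300.0, 3700.0],
})

# Pearson r: the sum of products of the centered values,
# divided by the square root of the product of their sums of squares.
x = toy["age"] - toy["age"].mean()
y = toy["charges"] - toy["charges"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same value read from the correlation matrix.
r_pandas = toy.corr(numeric_only=True).loc["age", "charges"]

print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 means the two variables rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.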

One-Hot Encoding

Now, I’m going to one-hot encode the categorical variables in the dataset, because a linear regression model needs numeric inputs. This is very easy to do with Pandas: the get_dummies function automatically converts categorical columns into one-hot encoded columns.

data = pd.get_dummies(data)

Only the categorical columns were converted; the numeric columns are unchanged. Now let’s look at the columns of the dataset.

data.columns

As you can see, a new column has been created for each category. Using Pandas was very easy. Thanks, Pandas! The dataset is now ready for building the model. Let’s go ahead and build a regression model.
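If one-hot encoding is new to you, here is a minimal sketch on a three-row made-up frame that shows exactly which columns get_dummies produces. The rows are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "age": [19, 33, 28],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southwest"],
})

# Mirror the preprocessing step: text columns become categories.
for col in ["smoker", "region"]:
    toy[col] = toy[col].astype("category")

# Numeric columns pass through; each category becomes its own column.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one "on" value among the smoker columns and exactly one among the region columns, which is where the name one-hot comes from.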

5. Building a Regression Model

When building a model, you should start with the simplest one. If it doesn’t perform well, you can try more complex models. I’ll build a linear regression model because the output variable, charges, is numeric.

Before building a machine learning model, we need to determine the input and output variables. The input variables are features. In statistics, these are called independent variables. The output variable is the target variable. In statistics, this variable is called the dependent variable. Let’s assign the target variable charges to variable y.

y = data["charges"]

If we drop the target variable, the remaining columns are the features.

X = data.drop("charges", axis = 1)

Before the model is built, the dataset is split into training and test sets. The model is built with the training data and evaluated with the test data. You can use the train_test_split function in scikit-learn to do this split easily. First, let’s import this function.

from sklearn.model_selection import train_test_split

Let’s split the dataset into 80 percent training and 20 percent testing.

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.80,
    random_state=1,
)
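To make the split concrete, here is a small sketch on ten made-up samples. It checks the sizes of the two parts, that each feature row stays paired with its target, and that the same random_state gives the same split again.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples: row i of X is [2*i, 2*i + 1] and its target is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=1
)

# 80 percent of 10 samples go to training, the rest to testing.
print(X_train.shape, X_test.shape)

# Rows are shuffled, but each feature row keeps its own target.
rows_aligned = bool((X_train[:, 0] == 2 * y_train).all())
print(rows_aligned)

# Splitting again with the same random_state reproduces the split.
X_train_again, _, _, _ = train_test_split(
    X, y, train_size=0.80, random_state=1
)
same_split = bool((X_train == X_train_again).all())
print(same_split)
```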

Now, let’s build the model. Let me import the linear regression class from scikit-learn.

from sklearn.linear_model import LinearRegression

Let me create an instance of the LinearRegression class.

lr = LinearRegression()

I’m going to build the model using the training data.

lr.fit(X_train,y_train)

Beautiful. Our model was built.
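If you are wondering what fit actually learns, here is a minimal sketch on made-up data generated from a known line, y = 2x + 1. After fitting, the model's coef_ and intercept_ attributes recover the slope and the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that lies exactly on the line y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0

lr = LinearRegression()
lr.fit(X, y)

# One coefficient per feature, plus a single intercept.
print(lr.coef_, lr.intercept_)
```

On the insurance data, the same attributes hold one coefficient for each column of X_train.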

6. Model Evaluation

Let’s take a look at the performance of the model. To do this, I’m going to use the coefficient of determination (R squared), which is what the score method of a scikit-learn regression model returns. The closer this value is to 1, the better the model. First, let’s take a look at the score of the model on the test data.

round(lr.score(X_test, y_test), 3)
# Output: 0.762
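For reference, the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. Here is a small sketch that computes it by hand on made-up numbers and compares it with scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Squared error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Squared error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

Here ss_res is 1.5 and ss_tot is 20, so the score is 1 - 1.5 / 20 = 0.925.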

The coefficient of determination on the test data is greater than 0.7. Our model is not bad. Of course, it would be better if it were closer to 1. Now, let’s see the score of the model on the training data.

round(lr.score(X_train, y_train), 3)
# Output: 0.748

As you can see, the performance of the model on the training data is close to its performance on the test data. If the score on the training data were much higher than the score on the test data, it would mean that there is an overfitting problem. You may ask how to solve the overfitting problem. One common remedy is regularization; ridge or lasso models can be used for this.
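Here is a minimal sketch of what regularization does, assuming a small synthetic dataset in which only the first of ten features matters. The data, the alpha value, and the variable names are my own choices for illustration. Ridge shrinks the coefficients toward zero, which gives up a little training fit in exchange for a simpler model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of ten features affects the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=1
)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)  # alpha sets the penalty strength

# Compare the train and test scores of the two models.
for name, model in [("linear", ols), ("ridge", ridge)]:
    train_score = round(model.score(X_train, y_train), 3)
    test_score = round(model.score(X_test, y_test), 3)
    print(name, train_score, test_score)

# Ridge coefficients are smaller in overall size than the unpenalized ones.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)
```

A large gap between the training and test scores is the signal to look for. In practice, alpha is usually tuned rather than fixed by hand.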

Now let’s take a look at another metric, mean squared error, to evaluate the model. For this, let’s first predict the test data with the predict method.

y_pred = lr.predict(X_test)

Now, let’s import the mean_squared_error metric.

from sklearn.metrics import mean_squared_error

I’m going to use this metric now. First, let me import the math module because I’m going to calculate the square root of this metric.

import math

Let’s take a look at the square root of the mean squared error.

math.sqrt(mean_squared_error(y_test, y_pred))
# Output: 5956.45

This value, the root mean squared error (RMSE), is in the same units as charges. It means that the predictions on the test data are typically off by about 5956.45.
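Here is a small sketch that computes the same metric step by step on made-up numbers, so you can see where the value comes from, and then compares it with the scikit-learn result.

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

# Square each error, average the squares, then take the square root.
errors = y_true - y_pred
rmse_manual = math.sqrt(np.mean(errors ** 2))

rmse_sklearn = math.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```

The squared errors are 100, 100, 900, and 0, so the mean squared error is 275 and the RMSE is the square root of 275. Because the errors are squared, the one large miss of 30 dominates the result.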

7. Model Prediction

Now, I’m going to predict the first row as an example. First, let’s select the first row of the training data.

data_new = X_train[:1]

Let me predict the data with our model.

lr.predict(data_new)
# Output: 10508.42

Let’s take a look at the real value.

y_train[:1]
# Output: 10355.64

As you can see, our model predicted close to the real value. Keep in mind that this row comes from the training data, so the model has already seen it.
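One detail worth noting is that predict expects a two-dimensional input, even for a single row. Here is a minimal sketch on a made-up frame (the rows and the fitted model are for illustration only) that selects one row as a DataFrame and compares the prediction with the real value.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up features and targets for illustration only.
X = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "bmi": [22.0, 27.0, 31.0, 25.0],
})
y = pd.Series([1500.0, 2300.0, 3100.0, 3400.0], name="charges")

lr = LinearRegression().fit(X, y)

# Double brackets keep the result a one-row DataFrame, like X[:1] does.
row = X.iloc[[0]]
pred = lr.predict(row)
print(row.shape, pred.shape)

# Difference between the real value and the prediction for that row.
residual = y.iloc[0] - pred[0]
print(pred[0], y.iloc[0], residual)
```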

Closing Thoughts

I have shown an application using a real-world dataset. As you can see, it is very easy to build a machine learning model with Python libraries. That’s it. I hope you enjoy it. Thank you for reading.

