Published in Analytics Vidhya

Solving Regression Problems

Dataset Used: https://www.kaggle.com/mirichoi0218/insurance
My Notebook: https://www.kaggle.com/tanishsawant2002/regression-85-score

Description Of The Dataset: (An excerpt from Kaggle)

Context

Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.

Content

Columns

  • age: age of the primary beneficiary
  • sex: gender of the insurance contractor (female or male)
  • bmi: body mass index, an objective measure of body weight relative to height (kg/m²); ideally 18.5 to 24.9
  • children: number of children covered by health insurance / number of dependents
  • smoker: whether the beneficiary smokes
  • region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
  • charges: individual medical costs billed by health insurance

Acknowledgements

The dataset is available on GitHub here.

Inspiration

Can you accurately predict insurance costs?

Okay, enough about the dataset; now let’s get going with pandas.

Let’s import the data as a pandas DataFrame

import pandas as pd

df = pd.read_csv("/kaggle/input/insurance/insurance.csv")
df.head(10)
Output

I usually prefer to store the column names in a list, since some features might change later.

cols = list(df.columns)

Describe the numerical features

df.describe().T
Output

Checking for missing values

df.isna().sum()
Output

There are no missing values in the dataset🥳🥳

We can have a quick look at the relationships between features by using seaborn’s pairplot function.

(Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.)

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
plt.show()
Pairplot

By observing the dataset, it is clear that “sex”, “region” and “smoker” are string (object dtype) columns and can be one-hot encoded.
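A quick way to confirm this is to check the column dtypes; entries reported as “object” hold strings and are the ones that need encoding:

# Columns with dtype "object" are strings and need encoding
df.dtypes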

Pandas’ get_dummies() method can be used for this purpose. Note that assigning its output back to a single column (df["region"] = pd.get_dummies(df["region"])) would keep only the first dummy column and lose information for the four-level “region” feature, so it is safer to encode the whole frame at once:

# One-hot encode the categorical columns; drop_first avoids redundant dummies
df = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)
df.head()
Now the data is nicely encoded.

We are done with preprocessing and ready to go for building the model.

But wait, one more crucial step remains: splitting the dataset.

All columns except “charges” go into the feature table.

# Features are everything except the target column
X = df.drop(columns=["charges"])
y = df["charges"]

Splitting data into train and test:

from sklearn.model_selection import train_test_split

# Fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Building Models:

Here I have collected several models in a list so that it is easy to compare their performance. Regression performance is generally measured with the R² score or a squared-error metric.

from sklearn import linear_model

reg = linear_model.LinearRegression()
elas = linear_model.ElasticNet()
lasso = linear_model.Lasso()
huber = linear_model.HuberRegressor()

models = [reg, elas, lasso, huber]

# Fit each model and print its name and test-set score
for model in models:
    print(model)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
Output: each model’s name followed by its score

Using RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressor

# 200 trees, fitted in parallel on 6 workers
rr = RandomForestRegressor(n_estimators=200, n_jobs=6, verbose=True)
rr.fit(X_train, y_train)
rr.score(X_test, y_test)
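Optionally, the fitted forest’s feature_importances_ attribute gives a rough idea of which columns drive the predictions (a quick inspection, not required for the score):

# Rank features by the forest's impurity-based importance
pd.Series(rr.feature_importances_, index=X.columns).sort_values(ascending=False)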

Using XGBRegressor:

from xgboost import XGBRegressor

xgb = XGBRegressor()
xgb.fit(X_train, y_train)
xgb.score(X_test, y_test)
Score

Using XGBRFRegressor:

from xgboost import XGBRFRegressor

xgbrf = XGBRFRegressor()
xgbrf.fit(X_train, y_train)
xgbrf.score(X_test, y_test)
Score

Checking r2_score:

R² is the coefficient of determination regression score function.

The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0.
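In scikit-learn this is implemented by r2_score (used below); for intuition, here is a minimal hand-rolled version of the formula R² = 1 − SS_res / SS_tot (the helper name is illustrative):

import numpy as np

def r2(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot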

from sklearn.metrics import r2_score

# Include the ensemble models in the comparison as well
models += [rr, xgb, xgbrf]

for model in models:
    print("-----------------------------------")
    print(r2_score(y_test, model.predict(X_test)))
Output

Conclusion:

XGBRFRegressor gave the best R² score.

That is all we needed to do 😃.

Congratulations! We have successfully built a model to predict the medical charges. 🎉🎊

TANISH SAWANT

An avid learner who loves exploring the endless world of data science and artificial intelligence. Fascinated by the limitless applications of ML and AI.