A Regression Project in Python: Predict Diamond Prices Based on Cut, Color, Clarity and Other Attributes

Roi Polanitzer
16 min read · Dec 29, 2021


Motivation

In order to build an end-to-end regression (i.e., evaluation) project in Python based on the CRISP-DM methodology, I chose the diamonds price dataset, which is sourced from seaborn (a Python data visualization library). This dataset contains the prices and other attributes of almost 54,000 diamonds. It is a great dataset for practicing Python programming, data analysis and visualization, data science and machine learning.

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining and was developed in 1996 under the ESPRIT initiative. It has been a favourite for business analysts and data scientists alike owing to its easily adaptable model.

What Is The CRISP-DM Process?

CRISP-DM is one of the more structured approaches to solving a problem that requires data science. More precisely, CRISP-DM focuses on the data science part of the operation and features a 6-step process.

Step 1 — Business Understanding

The first step of the CRISP-DM process is business understanding. This is one of the big reasons it is popular among business intelligence practitioners: it takes a BI-first approach. This step lays the basic groundwork for the rest of the project, such as determining goals and objectives, producing a plan and defining business success criteria.

It is also important to gain an understanding of the workings of the situation, requiring a deep assessment of the situation. As the process requires data mining, it is also important to determine which features to explore and which to eliminate. The goals of the data mining procedure must also be established.

This will enable the project to have a much more focused view of things, leading to less time mining data which will not be used. Along with determining where the business needs improvement, this step also shows the pain points of the organization. Knowing the company inside out is important for deriving actionable insights.

  1. Business Needs- De Beers is the world’s largest diamond company. De Beers needs to know the updated market price (in US dollars) of any diamond it sells. This is a classic regression (i.e., evaluation) problem, in which I need to collect the relevant data, build a useful model and estimate the expected error.
  2. Data Science Objective- I need to build a model which predicts, with high accuracy, the market price in US dollars of a diamond by relating the prices of De Beers diamonds which were sold to their features. Since I want my model to be as accurate as possible, I will optimize the mean absolute error on the test set (which is a metric of accuracy) instead of the R² (i.e., the coefficient of determination) regression score on the test set (which is a metric of precision), as illustrated in the short example below.
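As a quick illustration of the difference between the two metrics, here is a minimal sketch with made-up prices (the numbers are purely hypothetical and not part of the analysis below):

from sklearn.metrics import mean_absolute_error, r2_score

y_true = [3200, 450, 980, 15000]   # hypothetical true diamond prices in US dollars
y_hat = [3000, 500, 900, 14500]    # hypothetical model predictions

print(mean_absolute_error(y_true, y_hat))  # average dollar error (207.5 here)
print(r2_score(y_true, y_hat))             # share of price variability explained (close to 1 here)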

Step 2 — Data Understanding

One of the biggest parts of data science is, of course, handling data. A well-managed set of data sources and collection of data marks the difference between a successful project and a confusing mess.

The second step of CRISP-DM involves acquiring the data listed in the project. All data relevant to the project goals must be collected, with reports being made at every stage. After collection, efforts must be made to explore the data using methods such as querying, data visualization and more.

It is also important to keep track of the quality of the data in order to ensure that unclean data doesn’t hamper the results. Moreover, there should be a back-and-forth with the business understanding step for a truly flexible approach.

import pandas as pd
import numpy as np
import seaborn as sb

Let’s collect the relevant data from the internet

df = sb.load_dataset("diamonds")  # the seaborn dataset name is lowercase

  1. Dataset kind- The dataset has 53,940 records of diamonds and contains 10 fields (9 of them are features and 1 is the target variable).
  2. Dataset size- Since the dataset consists of tens of thousands of observations, I can classify it as a large dataset.
  3. Main features:
  • carat (carat weight of the diamond)
  • cut (quality of the cut)
  • color (diamond color)
  • clarity (a measurement of how clear the diamond is)
  • x (length in mm)
  • y (width in mm)
  • z (depth in mm)
  • depth (total depth percentage = z / mean(x, y); see the quick sanity check after this list)
  • table (width of top of diamond relative to widest point)
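Since depth is defined in terms of x, y and z, we can sanity-check that relationship directly. Here is a quick check (a sketch, assuming the df loaded above; rows with zero dimensions are skipped to avoid division by zero):

# recompute the total depth percentage from the dimensions and compare it to the depth column
valid = df.query("x > 0 and y > 0")
computed_depth = 2 * valid.z / (valid.x + valid.y) * 100
(valid.depth - computed_depth).abs().describe()  # differences should be small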

Exploratory Analysis

df.shape         # number of rows and columns
df.columns       # field names
df.head()        # first five records
df.info()        # data types and non-null counts
df.isna().sum()  # missing values per column

I now use the describe() method to show the summary statistics of the numeric variables.

df.describe()

The count, mean, min and max rows are self-explanatory. The std shows the standard deviation, and the 25%, 50% and 75% rows show the corresponding percentiles.

Data Analysis & Visualization

corr_matrix = df.corr()
corr_matrix["price"].sort_values(ascending=False)
df.query("x==0 or y==0 or z==0")
sb.scatterplot(x=df.carat , y=df.price)
len(df.query("carat>3"))

32

len(df.query("carat>2"))

1889

sb.distplot(df.price)
sb.pairplot(df)

Step 3 — Data Preparation

Data preparation is the step where data to be used is determined. This makes the difference between looking in the wrong place and finding a solution that works. Data mining goals must be solidified, along with data cleaning and integration processes.

Records must be kept at every step in order to operate within the constraints of the project. The technical constraints and other factors determining the data must also be pinned down to eliminate bias and derive insights more easily.

  1. Missing values or outliers- The dataset doesn’t include any missing values, but diamonds with z equal to 0 or greater than 10, y equal to 0 or greater than 10, or x equal to 0 are outliers.
  2. Dummy variables for categorical variables- The dataset includes 3 categorical variables (cut, color, and clarity). I chose to encode them as integer codes using pandas’ replace method.
  3. Other techniques- First, I created a new feature called “vol” (for volume), which is the product of x, y, and z, and then I replaced x, y, and z with this new feature. Second, I split my data into train and test sub-sets with a train-to-test ratio of 67:33 and a random state of 42.

Removing Outliers

df.query("z>10 or y>10")
df.query("z>10 or y>10").index
df.drop(df.query("z>10 or y>10").index, inplace=True)

df.query("x==0 or y==0 or z==0")
df.query("x==0 or y==0 or z==0").index
df.drop(df.query("x==0 or y==0 or z==0").index, inplace=True)

sb.pairplot(df)
df.head()

Feature Engineering

df['vol'] = df.x * df.y * df.z
df.head()

df.drop(['x','y','z'], axis=1, inplace=True)
df.head()

Create dummy variables

df.cut.unique()
df.cut.replace({'Ideal':5, 'Premium':4, 'Good':2, 'Very Good':3, 'Fair':1}, inplace=True)
df.head()

df.color.unique()
df.color.replace({'E':2, 'I':6, 'J':7, 'H':5, 'F':3, 'G':4, 'D':1}, inplace=True)
df.head()

df.clarity.unique()
df.clarity.replace({'SI2':1, 'SI1':2, 'VS1':3, 'VS2':4, 'VVS2':5, 'VVS1':6, 'I1':7, 'IF':8}, inplace=True)
df.head()

Splitting to Train and Test

X = df.drop(['price'], axis=1)
X.head()

y = df['price']
y.head()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Step 4 — Modeling

This is where most of the work is done, with the modeling method being integral to the kind of problem to be solved. If the wrong method is used, the results obtained will not be comparable to results gained when the method is right.

Narrow down the technique and set the stage for it to be used effectively. This includes taking care of the assumptions and preparing the data for use with the model.

A test model must also be designed for proof-of-concept and suitability tests. The model should also be fitted for the problem, with testing and backpropagation being important parts if the model used is a neural network.

The approach must also be tailored with respect to the goals and the business and data understanding in order to create a good fit for the problem. In this manner, the model should be assessed.

  1. I have trained 3 Machine Learning models (Linear Regression, Decision Tree Regressor and Random Forest Regressor) “out of the box”, meaning without changing the hyperparameters of each model.
  2. For each model, I checked for overfitting by comparing the R-squared of the model on the test set to its R-squared on the training set.
  3. For each model, I created a scatter plot of the true prices from the market versus the predicted price from the model.

Train and Build a Linear Regression Model

Linear Regression is one of the most common regression algorithms.

import sklearn.linear_model as sl

linreg = sl.LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression()

print('R squared of the Linear Regression on training set: {:.2%}'.format(linreg.score(X_train, y_train)))
print('R squared of the Linear Regression on test set: {:.2%}'.format(linreg.score(X_test, y_test)))

R squared of the Linear Regression on training set: 88.40%
R squared of the Linear Regression on test set: 88.54%

The R squared on the training set is almost equal to the R squared on the test set. This indicates that our linear regression model is not overfitting and is therefore generalizing well to new data.

In addition, in our linear regression model, 88.54% of the variability in the diamond prices can be explained using the 7 features we chose (i.e., carat, cut, color, clarity, table, depth, and vol). This is very good.

y_pred = linreg.predict(X_test)
sb.scatterplot(x=y_test , y=y_pred, color="blue")
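To see how the linear model weighs each feature, we can inspect its fitted coefficients. This is a minimal sketch using the linreg and X_train objects defined above; the exact numbers will depend on the data and the split:

# pair each feature name with its fitted coefficient
coefs = pd.Series(linreg.coef_, index=X_train.columns).sort_values()
print(coefs)
print('intercept:', linreg.intercept_)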

Train and Build a Decision Tree Regressor Model

import sklearn.tree as st

tree = st.DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

DecisionTreeRegressor(random_state=42)

print('R squared of the Decision Tree Regressor on training set: {:.2%}'.format(tree.score(X_train, y_train)))
print('R squared of the Decision Tree Regressor on test set: {:.2%}'.format(tree.score(X_test, y_test)))

R squared of the Decision Tree Regressor on training set: 99.99%
R squared of the Decision Tree Regressor on test set: 96.73%

The R squared on the training set is noticeably higher than the R squared on the test set, which suggests the unconstrained decision tree memorizes the training data to some extent. Still, with an R squared of 96.73% on unseen data, the model generalizes reasonably well.

In addition, in our decision tree regressor model, 96.73% of the variability in the diamond prices can be explained using the 7 features we chose (i.e., carat, cut, color, clarity, table, depth, and vol). This is excellent.

y_pred1 = tree.predict(X_test)
sb.scatterplot(x=y_test , y=y_pred1, color="red")

Train and Build a Random Forest Regressor Model

Let’s apply a random forest consisting of 100 trees on the diamonds data set:

import sklearn.ensemble as se

rf = se.RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

RandomForestRegressor(random_state=42)

print('R squared of the Random Forest Regressor on training set: {:.2%}'.format(rf.score(X_train, y_train)))
print('R squared of the Random Forest Regressor on test set: {:.2%}'.format(rf.score(X_test, y_test)))

R squared of the Random Forest Regressor on training set: 99.72%
R squared of the Random Forest Regressor on test set: 98.14%

The R squared on the training set is a bit higher than the R squared on the test set, but that doesn’t mean that our random forest regressor model is overfitting. On the contrary, our random forest regressor model is generalizing well to new data.

In addition, in our random forest regressor model, 98.14% of the variability in the diamond prices can be explained using the 7 features we chose (i.e., carat, cut, color, clarity, table, depth, and vol). This is excellent.

y_pred2 = rf.predict(X_test)
sb.scatterplot(x=y_test , y=y_pred2, color="green")
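A fitted random forest also exposes impurity-based feature importances, which give a rough sense of which features drive the predicted price. This is a sketch using the rf and X_train objects from above; carat and vol would typically be expected to dominate:

# impurity-based feature importances of the fitted random forest
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances)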

Step 5 — Evaluation

This step will be for evaluating factors such as the accuracy and generality of the model. In addition to this, the process must also be put through a fine-combed inspection to ensure that there are no errors.

A revision sub-step is also present in this, as a way to fine-tune the solution offered by this process. This includes going back to the business understanding roots and seeing if the process makes sense in a sustainable and scalable fashion.

A report must also be compiled for documentation. In addition to this, any possible issues must be ironed out before the next step.

I checked each model’s MAE and MSLE scores on the test set.

Evaluating the Linear Regression Model

d = {'true': y_test, 'predicted': y_pred}
df_lr = pd.DataFrame(data=d)
df_lr['diff'] = df_lr['predicted'] - df_lr['true']
df_lr

sb.displot(y_pred - y_test, color="blue")

Calculate the model’s expected error in dollars using the MAE (Mean Absolute Error) metric:

import sklearn.metrics as mt

print('Mean Absolute Error of the Linear Regression on test set is {:.2f}'.format(mt.mean_absolute_error(y_test, y_pred)))

Mean Absolute Error of the Linear Regression on test set is 869.38

On average, our linear regression model predicts the price of a diamond in the test set within about $869.38 of the real price.

Calculate the model’s expected error in percentage using the MSLE (Mean Squared Log Error) metric:

print('Mean Squared Log Error of the Linear Regression on test set is {:.2%}'.format(mt.mean_squared_log_error(y_test, y_pred)))

ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.

It turns out that our linear regression model produces negative prices for some diamonds.

import matplotlib.pyplot as plt

plt.hist(y_pred[y_pred < 0])

From a product perspective, this is a bad model, because a negative price has no meaning.
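To see how widespread the problem is, we can count the negative predictions. A quick check using the y_pred array from the linear model above:

# how many test-set predictions from the linear model come out negative?
neg_mask = y_pred < 0
print(neg_mask.sum(), 'negative predictions out of', len(y_pred))
print('most negative prediction:', y_pred[neg_mask].min())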

Evaluating the Decision Tree Regressor Model

d = {'true': y_test, 'predicted': y_pred1}
df_dt = pd.DataFrame(data=d)
df_dt['diff'] = df_dt['predicted'] - df_dt['true']
df_dt

Calculate the model’s expected error in dollars using the MAE (Mean Absolute Error) metric:

print('Mean Absolute Error of the Decision Tree Regressor on test set is {:.2f}'.format(mt.mean_absolute_error(y_test, y_pred1)))

Mean Absolute Error of the Decision Tree Regressor on test set is 354.01

On average, our decision tree regressor model predicts the price of a diamond in the test set within about $354.01 of the real price.

Calculate the model’s expected error in percentage using the MSLE (Mean Squared Log Error) metric:

print('Mean Squared Log Error of the Decision Tree Regressor on test set is {:.2%}'.format(mt.mean_squared_log_error(y_test,y_pred1)))

Mean Squared Log Error of the Decision Tree Regressor on test set is 2.07%

Our decision tree regressor model achieves a mean squared log error of 2.07% on the test set; since MSLE compares predictions to true prices on a log scale, a low value means the predictions are close to the real prices in relative (percentage) terms.

Evaluating the Random Forest Regressor Model

d = {'true': y_test, 'predicted': y_pred2}
df_rf = pd.DataFrame(data=d)
df_rf['diff'] = df_rf['predicted'] - df_rf['true']
df_rf

Calculate the model’s expected error in dollars using the MAE (Mean Absolute Error) metric:

print('Mean Absolute Error of the Random Forest Regressor on test set is {:.2f}'.format(mt.mean_absolute_error(y_test, y_pred2)))

Mean Absolute Error of the Random Forest Regressor on test set is 277.00

On average, our random forest regressor model predicts the price of a diamond in the test set within about $277 of the real price.

Calculate the model’s expected error in percentage using the MSLE (Mean Squared Log Error) metric:

print('Mean Squared Log Error of the Random Forest Regressor on test set is {:.2%}'.format(mt.mean_squared_log_error(y_test, y_pred2)))

Mean Squared Log Error of the Random Forest Regressor on test set is 1.25%

Our random forest regressor model achieves a mean squared log error of just 1.25% on the test set, the lowest of the three models, meaning its predictions are the closest to the real prices in relative terms.

Selected Model

I chose the Random Forest Regressor model as the best model among the three, based on its MAE and MSLE scores on the test set.
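To make the comparison explicit, the test-set metrics of the three models can be collected into one small table. This is a sketch using the objects defined above (MSLE is left out because it cannot be computed for the linear model’s negative predictions):

# summarize the test-set performance of the three models in one table
summary = pd.DataFrame({
    'R squared': [linreg.score(X_test, y_test), tree.score(X_test, y_test), rf.score(X_test, y_test)],
    'MAE': [mt.mean_absolute_error(y_test, y_pred),
            mt.mean_absolute_error(y_test, y_pred1),
            mt.mean_absolute_error(y_test, y_pred2)]
}, index=['Linear Regression', 'Decision Tree', 'Random Forest'])
print(summary)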

model = rf
model

RandomForestRegressor(random_state=42)

Step 6 — Deployment

This step will differ depending on the kind of problem that the organization is facing. However, the basics remain mostly the same. The first thing to do is to summarize how the solution will be deployed in an organized manner.

The solution also needs to be future-proofed to ensure that it can be used easily for an extended period of time. Factors such as monitoring and maintenance should also be taken care of, along with a final report and review of the solution.

So, our Random Forest model is a pretty good model for predicting the market price of a diamond. Now how do we predict the market price of a new diamond?

Suppose there is a new diamond which has: carat=0.23, cut=5 (Ideal), color=2 (E), clarity=1 (SI2), depth=61.5, table=55, vol=38.20 (x=3.95, y=3.98 and z=2.43).
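The vol value is simply the product of the three dimensions, computed the same way as in the feature engineering step:

# volume of the new diamond, consistent with the vol feature created earlier
vol = 3.95 * 3.98 * 2.43
print(round(vol, 2))  # 38.2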

new_diamond = [0.23, 5, 2, 1, 61.5, 55, 38.20]
new_diamond

[0.23, 5, 2, 1, 61.5, 55, 38.2]

We can take these new data and use it to predict the market price of the new diamond.

prediction = model.predict([new_diamond])[0]
print("\033[1m The market price of this new diamond is ${:.2f}".format(prediction))

The market price of this new diamond is $382.34

Saving the finalized model with pickle saves us a lot of time, as we don’t have to train the model every time we run the application. Once the model is saved as a pickle file, we can load it later when making predictions.

import pickle

First, let’s open a new file for our finalized model and call it “fw_model1”

f1 = open("fw_model1", "wb")

Then, let’s save into this file our Random Forest model

pickle.dump(model , f1)

And let’s close this file.

f1.close()

Now, let’s open a new Python notebook and write

import pickle
f2 = open("fw_model1", "rb")
model = pickle.load(f2)

Now, let’s make a new prediction for the diamond above

model.predict([[0.23, 5, 2, 1, 61.5, 55, 38.20]])[0]

382.34

In Conclusion

The CRISP-DM, even today, remains a dependable method for developing data science solutions to enterprise problems. Its BI-first approach also enables better sourcing of insights and other such data knowledge.

The flexible and iterative approach of the CRISP-DM also makes it a future-proof alternative for anyone looking to solve data science problems. Even as it is important to develop a unique method, it should also be kept in mind that using methods such as CRISP-DM bring an element of professionalism and uniformity to operational procedures.

About the Author

Roi Polanitzer, PDS, ADL, MLS, PDA, CPD

Roi Polanitzer, PDS, ADL, MLS, PDA, CPD, F.IL.A.V.F.A., FRM, is a data scientist with extensive experience in solving machine learning problems, such as: regression, classification, clustering, recommender systems, anomaly detection, text analytics & NLP, and image processing. Mr. Polanitzer is the Owner and Chief Data Scientist of Prediction Consultants — Advanced Analysis and Model Development, a data science firm headquartered in Rishon LeZion, Israel. He is also the Owner and Chief Appraiser of Intrinsic Value — Independent Business Appraisers, a business valuation firm that specializes in corporates, intangible assets and complex financial instruments valuation.

Over more than 16 years, he has performed data science projects such as: regression (e.g., house prices, CLV- customer lifetime value, and time-to-failure), classification (e.g., market targeting, customer churn), probability (e.g., spam filters, employee churn, fraud detection, loan default, and disease diagnostics), clustering (e.g., customer segmentation, and topic modeling), dimensionality reduction (e.g., p-values, itertools Combinations, principal components analysis, and autoencoders), recommender systems (e.g., products for a customer, and advertisements for a surfer), anomaly detection (e.g., supermarkets’ revenue and profits), text analytics (e.g., identifying market trends, web searches), NLP (e.g., sentiment analysis, cosine similarity, and text classification), image processing (e.g., image binary classification of dogs vs. cats, and image multiclass classification of digits in sign language), and signal processing (e.g., audio binary classification of males vs. females, and audio multiclass classification of urban sounds).

Mr. Polanitzer holds various professional designations, such as a global designation called “Financial Risk Manager” (FRM, which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as K-Means, SVM and KNN for credit risk measurement and management) from the Global Association of Risk Professionals (GARP), a designation called “Fellow Actuary” (F.IL.A.V.F.A., which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as GLM, RF and NN for determining premiums in general insurance) from the Israel Association of Valuators and Financial Actuaries (IAVFA), and a designation called “Certified Risk Manager” (CRM, which indicates that its holder is proficient in developing, implementing and validating statistical models and mathematical algorithms such as DT, NB and PCA for operational risk management) from the Israeli Association of Risk Managers (IARM).

Mr. Polanitzer had studied actuarial science (i.e., implementation of statistical and data mining techniques for solving time-series analysis, dimensionality reduction, optimization and simulation problems) at the prestigious 250-hour training program of the University of Haifa, financial risk management (i.e., building statistical predictive and probabilistic models for solving regression, classification, clustering and anomaly detection) at the prestigious 250-hour training program of Ariel University, and machine learning and deep learning (i.e., building recommender systems and training neural networks for image processing and NLP) at the prestigious 500-hour training program of the John Bryce College.

He had graduated various professional trainings at the John Bryce College, such as: “Introduction to Machine Learning, AI & Data Visualization for Managers and Architects”, “Professional training in Practical Machine Learning, AI & Deep Learning with Python for Algorithm Developers & Data Scientists”, “Azure Data Fundamentals: Relational Data, Non-Relational Data and Modern Data Warehouse Analytics in Azure”, and “Azure AI Fundamentals: Azure Tools for ML, Automated ML & Visual Tools for ML and Deep Learning”.

Mr. Polanitzer had also graduated various professional trainings at the Professional Data Scientists’ Israel Association, such as: “Neural Networks and Deep Learning”, “Big Data and Cloud Services”, “Natural Language Processing and Text Mining”.


Roi Polanitzer

Chief Data Scientist at Prediction Consultants — Advanced Analysis and Model Development. https://polanitz8.wixsite.com/prediction/english