Published in The Startup

Exploratory Data Analytics

Analysis of diamond dataset to discover patterns & behaviors based on categorical and continuous features

Photo Credits: Edgar Soto. https://unsplash.com/photos/gb0BZGae1Nk

Diamond pricing is a complex mechanism influenced by multiple factors such as carat, cut, color, and clarity. This article analyzes the relationships between these factors and price and illustrates them with visualizations.

Exploratory data analysis
The R diamond.csv dataset includes approximately 54K observations with 10 variables: carat, cut, color, clarity, depth, table, price, x (length in mm), y (width in mm), and z (depth in mm). Overall, it is a clean dataset with no missing values, although x, y, and z each have a minimum of 0, which is not a physically valid dimension.
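The "no missing values" check can be reproduced outside R. The following is a minimal Python sketch, assuming the file has a header row with the ten column names listed above. The helper name `count_missing` and the inline three-row excerpt are illustrative assumptions (the third row has a deliberately blank depth so the check has something to find); they are not taken from the article.

```python
import csv
import io

# Hypothetical three-row excerpt in the same column layout as diamond.csv.
# Illustrative only; the blank depth in the last row is deliberate.
SAMPLE_CSV = """carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
0.70,Premium,G,VS2,,59,2400,5.70,5.71,3.53
"""


def count_missing(csv_file):
    """Return (row_count, {column: number_of_empty_cells})."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in reader.fieldnames:
            value = row.get(name)
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return rows, missing


rows, missing = count_missing(io.StringIO(SAMPLE_CSV))
print(rows, {name: n for name, n in missing.items() if n})
```

To run the same check on the real file, pass an open file handle instead, for example `open("diamond.csv", newline="")`.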

Structure of the dataset (R lang)

Data Description

  • Ordinal variables/categorical features include cut, color and clarity.
  • Continuous features/double-precision floating point variables include carat, depth, table, price, x, y and z.
  • Carat ranges from 0.2 to 5.01 with a median of 0.7.
  • Diamond cut has 5 ordinal values: Fair, Good, Very Good, Premium, and Ideal. The majority of the diamonds in the dataset have an Ideal cut (21551 observations), followed by Premium (13791 observations), Very Good (12082 observations), Good (4906 observations) and Fair (1610 observations).
  • Diamond color has 7 ordinal values ranging from D (best) to J (worst). G has the most observations at 11292, followed by E at 9797.
  • Diamond clarity measures how clear the diamond is. This variable has 8 ordinal values: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, and IF (best). The two most frequent values are SI1 and VS2, with 13065 and 12258 observations respectively. I1 and IF are grouped as "other" in the summary and together have approximately 2.5K observations.
  • Diamond depth (total depth percentage) ranges from 43 to 79 with the median at 61.8.
  • Diamond table (width of the top of the diamond relative to its widest point) ranges from 43 to 95 with the median at 57.
  • Diamond price ranges from $326 to $18.8K with the median at $2.4K.
  • Length in mm, denoted as variable x, ranges from 0 to 10.74 with the median at 5.7mm.
  • Width in mm, denoted as variable y, ranges from 0 to 58.9 with the median at 5.71mm.
  • Depth in mm, denoted as variable z, ranges from 0 to 31.8 with the median at 3.53mm.
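
The figures in the list above are minimum, median, and maximum for continuous features, and level counts for ordinal ones. As a rough Python sketch of how such a summary can be computed, the helpers below (`summarize_numeric` and `count_levels`, both hypothetical names) are applied to a small made-up sample, not to the actual dataset.

```python
from collections import Counter
from statistics import median

# Hypothetical toy sample of (carat, cut) pairs; not data from the article.
sample = [
    (0.23, "Ideal"),
    (0.31, "Good"),
    (0.70, "Premium"),
    (0.90, "Ideal"),
    (1.01, "Fair"),
]


def summarize_numeric(values):
    """Return the range and median of a continuous feature."""
    return {"min": min(values), "median": median(values), "max": max(values)}


def count_levels(labels):
    """Return (level, count) pairs, most frequent level first."""
    return Counter(labels).most_common()


carat_summary = summarize_numeric([carat for carat, _ in sample])
cut_counts = count_levels([cut for _, cut in sample])
print(carat_summary)
print(cut_counts)
```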

1.1 Scatter plot of diamond carat vs price

Findings 1.1:
The scatter plot shows a strong positive correlation between carat and price. Most observations are low-carat diamonds (carat is on the x-axis), and these diamonds have lower prices. As carat increases, price increases. Point color encodes diamond color as a third (z) dimension. G is the most common color, followed by E, F, H, D, I and J. There are some outliers at high carat and high price, which may be explained by the fact that some diamonds are cut for weight retention rather than beauty. When more of the rough diamond is cut away, we can assume that the loss in carat is compensated by a higher price.
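
The "strong positive correlation" read off the scatter plot can be quantified with a Pearson correlation coefficient. The sketch below is a plain Python implementation under that assumption; the article does not say which measure, if any, was computed. The function name `pearson_r` and the five carat/price pairs are hypothetical and only illustrate the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical carat/price pairs, not values from the dataset.
carats = [0.3, 0.5, 0.7, 1.0, 1.5]
prices = [400, 1200, 2400, 5000, 11000]
r = pearson_r(carats, prices)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship. Because price tends to grow faster than linearly with carat, a rank-based measure may describe the relationship better on the full data.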

1.2 Histogram of diamond carat vs frequency in dataset

Findings 1.2:
This plot shows carat on the x-axis and frequency on the y-axis. The majority of the diamonds fall in the 0.2 to 0.5 carat bin, which is the most populated bin in the dataset (~30K observations). 0.4 is the most frequently recorded carat. There are very few observations in the carat range of 3.2 to 5.01.
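
A histogram is a count of observations per bin. The sketch below shows that counting step in Python. The bin start and width are assumptions chosen for illustration (the article does not state the bin settings used for its plot), and the carat values are a hypothetical toy sample.

```python
from collections import Counter
from math import floor


def bin_counts(values, start, width):
    """Count values per half-open bin [start + k*width, start + (k+1)*width).

    Returns a dict mapping each non-empty bin's left edge to its count.
    """
    counts = Counter()
    for value in values:
        counts[floor((value - start) / width)] += 1
    return {round(start + k * width, 10): n for k, n in sorted(counts.items())}


# Hypothetical carat values, not taken from the dataset.
carats = [0.23, 0.3, 0.31, 0.4, 0.41, 0.7, 0.9, 1.01, 1.2, 2.6]
hist = bin_counts(carats, start=0.0, width=0.5)
print(hist)
```

The bin width changes how the distribution reads, so the tallest bin in one histogram need not match the single most frequent carat value.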

1.3 Box plot diamond cut vs carat

Findings 1.3:
This plot compares carat across the cuts (Fair, Good, Very Good, Premium, Ideal) and shows outliers. Outliers aside, most Good, Very Good and Ideal diamonds weigh less than 1 carat, with medians below 1 carat. For Fair and Premium cuts, some diamonds weigh slightly more than 1 carat, but the median is still <=1 carat. The Premium cut has the widest interquartile range (the spread between the 1st and 3rd quartiles). Bigger diamonds mostly have a Fair cut.
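
The box in a box plot spans the 1st to 3rd quartile, with the median drawn inside. The Python sketch below computes those numbers per cut, assuming quartiles are taken by linear interpolation (`method="inclusive"` in the standard library). The function name `box_stats_by_group` and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def box_stats_by_group(pairs):
    """Return quartiles and interquartile range for each group label."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    stats = {}
    for label, values in groups.items():
        # n=4 yields the three cut points Q1, median, Q3.
        q1, q2, q3 = quantiles(values, n=4, method="inclusive")
        stats[label] = {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}
    return stats


# Hypothetical (cut, carat) pairs, not values from the dataset.
pairs = [
    ("Ideal", 0.3), ("Ideal", 0.4), ("Ideal", 0.5), ("Ideal", 0.6), ("Ideal", 0.7),
    ("Fair", 0.7), ("Fair", 0.9), ("Fair", 1.0), ("Fair", 1.2), ("Fair", 1.5),
]
stats = box_stats_by_group(pairs)
for cut, values in stats.items():
    print(cut, values)
```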

1.4 Violin plot diamond cut vs carat

Findings 1.4:
This distribution plot compares cut and carat. The violin plot shows the density of observations for each cut by carat. The dot represents the mean for each cut. As in the box plot, most Good, Very Good, and Ideal diamonds weigh less than 1 carat, with medians below 1 carat. For Fair and Premium cuts, some diamonds weigh slightly more than 1 carat, but the median is still <=1 carat. The Premium cut has the widest interquartile range. Ideal cut diamonds are concentrated at lower carat weights, whereas Fair cut diamonds are concentrated at higher carat weights.
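
The width of a violin at a given carat reflects an estimated density, typically a kernel density estimate. The sketch below is a minimal Gaussian kernel density estimator in Python. The bandwidth of 0.1 and the two carat samples are assumptions for illustration; the article does not state the settings behind its plot.

```python
from math import exp, pi, sqrt


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` at a point x."""
    n = len(sample)
    if n == 0 or bandwidth <= 0:
        raise ValueError("need a non-empty sample and a positive bandwidth")
    norm = 1.0 / (n * bandwidth * sqrt(2.0 * pi))

    def density(x):
        # Average of Gaussian bumps centered on each observation.
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


# Hypothetical carat samples per cut, not values from the dataset.
ideal_density = gaussian_kde([0.3, 0.3, 0.4, 0.5, 0.7], bandwidth=0.1)
fair_density = gaussian_kde([0.7, 0.9, 1.0, 1.0, 1.5], bandwidth=0.1)

for carat in (0.3, 1.0):
    print(carat, round(ideal_density(carat), 3), round(fair_density(carat), 3))
```

In this toy sample the Ideal density peaks at low carat and the Fair density peaks near 1 carat, which is the kind of contrast the violin shapes make visible.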

Poonam Rao

Exec Director StratEx - I bring to the table a blend of data science, finance and strategy management skills with 20+ years of experience in insurance & fintech.
