Exploratory Data Analytics
Analysis of diamond dataset to discover patterns & behaviors based on categorical and continuous features
Diamond pricing involves a complex mechanism influenced by multiple factors such as carat, cut, color, and clarity. This article analyzes how these factors relate to price and to each other, and illustrates the relationships with visualizations.
Exploratory data analysis
The R diamond.csv dataset includes approximately 54K observations of 10 variables: carat, cut, color, clarity, depth, table, price, x (length in mm), y (width in mm), and z (depth in mm). Overall, it is a clean dataset with no missing values or messy data.
Structure of the dataset (R lang)
- Ordinal variables/categorical features include cut, color and clarity.
- Continuous features/double-precision floating point variables include carat, depth, table, price, x, y and z.
- Carat ranges from 0.2 to 5.01 with a median of 0.7
- Diamond cut has 5 ordinal values: Fair, Good, Very Good, Premium, and Ideal. The majority of the diamonds in the dataset have an Ideal cut (21551 observations), followed by Premium (13791 observations), Very Good (12082 observations), Good (4906 observations), and Fair (1610 observations).
- Diamond color has 7 ordinal values ranging from D (best) to J (worst). G has the most observations at 11292, followed by E at 9797.
- Diamond clarity measures how clear the diamond is. This variable has 8 ordinal values: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, and IF (best). The two most frequent values are SI1 and VS2, with 13065 and 12258 observations respectively. I1 and IF are grouped as "Other" in the summary and together have approximately 2.5K observations.
- Diamond depth ranges from 43 to 79 with the median at 61.80
- Diamond table ranges from 43 to 95 with the median at 57
- Diamond price ranges from $326 to $18.8K USD with the median at $2.4K
- Length in mm, denoted as variable x, ranges from 0 to 10.74 with the median at 5.7mm
- Width in mm, denoted as variable y, ranges from 0 to 58.9 with the median at 5.71mm
- Depth in mm, denoted as variable z, ranges from 0 to 31.8 with the median at 3.53mm
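The structure and summary figures listed above can be reproduced along the following lines. This is a minimal sketch that assumes the `diamonds` data frame shipped with the ggplot2 package holds the same data as diamond.csv; if you are working from the CSV file, load it with `read.csv()` instead.

```r
library(ggplot2)  # assumption: the built-in `diamonds` data matches diamond.csv

# Column types, including the ordered factors cut, color and clarity
str(diamonds)

# Min, quartiles, median, mean and max for numeric columns; level counts for factors.
# For a factor with many levels, summary() lumps the least frequent ones into "(Other)".
summary(diamonds)

# Full per-level counts for the ordinal variables
cut_counts <- table(diamonds$cut)
clarity_counts <- table(diamonds$clarity)
cut_counts
clarity_counts
```

Calling `table()` on a single factor is a convenient way to see the counts for levels that `summary()` groups together, such as I1 and IF for clarity.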
The scatter plot shows a strong positive correlation between carat and price. Carat is on the x axis, and most observations are low-carat diamonds. Lower-carat diamonds clearly have lower prices, and price rises as carat size increases. The third dimension of the plot encodes diamond color. G is the most common color, followed by E, F, H, D, I and J. There are some outliers at high carat and high price, which may be explained by the fact that some diamonds are cut for weight retention rather than beauty. Conversely, when more of the rough diamond is cut away, we can assume that the carat loss is compensated by a higher price.
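A plot of this kind could be drawn roughly as follows. This is a sketch rather than the exact code behind the figure: it assumes the ggplot2 `diamonds` data, and the transparency and point size are illustrative choices to reduce overplotting.

```r
library(ggplot2)

# Carat on x, price on y, diamond color mapped to point colour
p_scatter <- ggplot(diamonds, aes(x = carat, y = price, colour = color)) +
  geom_point(alpha = 0.3, size = 0.8) +  # alpha and size are illustrative
  labs(x = "Carat", y = "Price (USD)", colour = "Color")
print(p_scatter)

# Pearson correlation between carat and price, as a numeric check on the visual trend
carat_price_cor <- cor(diamonds$carat, diamonds$price)
carat_price_cor

# Observations per color, sorted from most to least frequent
color_counts <- sort(table(diamonds$color), decreasing = TRUE)
color_counts
```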
This plot shows carat on the x axis and its frequency on the y axis. The majority of the diamonds fall in the 0.2 to 0.5 carat bin, which is the most populated bin in the dataset (~30K observations). 0.4 is the most frequently recorded carat. There aren't any significant observations for diamonds in the carat range of 3.2 to 5.01.
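A frequency plot like this one can be sketched as below. The bin width and bin boundary are assumptions, chosen only so that one bin spans 0.2 to 0.5 as described in the text; the original figure may use different binning, so the counts per bin may differ.

```r
library(ggplot2)

# Histogram of carat; binwidth and boundary are assumed, not taken from the article
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.3, boundary = 0.2) +
  labs(x = "Carat", y = "Number of diamonds")
print(p_hist)

# Tabulate the same bins numerically (intervals are closed on the right)
carat_bins <- cut(diamonds$carat,
                  breaks = seq(0.2, 5.3, by = 0.3),
                  include.lowest = TRUE)
bin_counts <- table(carat_bins)
bin_counts
```

Changing `binwidth` is the quickest way to check how sensitive the shape of the distribution is to the choice of bins.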
This box plot shows carat for each cut (Fair, Good, Very Good, Premium, Ideal), along with outliers. Most Good, Very Good and Ideal diamonds weigh less than 1 carat, with a median <1. There are a few Fair and Premium cut diamonds whose weight is slightly higher than 1 carat, but the median is still <=1 carat. The Premium cut has the widest range between the 1st and 3rd quartiles. Bigger diamonds mostly have a Fair cut.
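The comparison can be reproduced with a box plot plus a numeric summary per cut. The sketch below assumes the ggplot2 `diamonds` data and uses base R's `tapply()` to compute the median and interquartile range that the boxes display.

```r
library(ggplot2)

# One box per cut; points beyond the whiskers are drawn as outliers
p_box <- ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot() +
  labs(x = "Cut", y = "Carat")
print(p_box)

# Median carat per cut (the line inside each box)
carat_median <- tapply(diamonds$carat, diamonds$cut, median)
carat_median

# Interquartile range per cut (the height of each box)
carat_iqr <- tapply(diamonds$carat, diamonds$cut, IQR)
carat_iqr
```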
Findings 1.4: This distribution plot compares cut and carat. The violin plot shows the density of observations for each cut by carat, and the dot represents the mean for each cut. Most Good, Very Good, and Ideal diamonds weigh less than 1 carat, with a median <1. There are a few Fair and Premium cut diamonds whose weight is slightly higher than 1 carat, but the median is still <=1 carat. The Premium cut has the widest range between the 1st and 3rd quartiles. Ideal cut diamonds are concentrated at lower carat weights, whereas Fair cut diamonds are concentrated at higher carat weights.
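A violin plot with a mean marker can be sketched as follows, again assuming the ggplot2 `diamonds` data. The `fun` argument of `stat_summary()` requires ggplot2 3.3.0 or later; the point size is an illustrative choice.

```r
library(ggplot2)

# Violin shows the density of carat within each cut; the dot marks the mean
p_violin <- ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_violin() +
  stat_summary(fun = mean, geom = "point", size = 2) +
  labs(x = "Cut", y = "Carat")
print(p_violin)

# Mean carat per cut, matching the dots on the plot
carat_mean <- tapply(diamonds$carat, diamonds$cut, mean)
carat_mean
```

Comparing `carat_mean` across cuts gives a quick numeric check of where each violin is centered.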