Systems in Sync: Cluster Analysis of Enterprise Laptops

Wasif Pervez
INST414: Data Science Techniques
May 1, 2024

Many companies rely on information technology analysts and consultants to decide what kinds of devices and resources employees will need to perform their day-to-day tasks. They are the best equipped to do so, with extensive knowledge of how to choose products that strike a balance between performance and cost. However, in office environments with (1) less funding to employ IT subject matter experts or (2) high demand for other work such as resolving service tickets, it can be difficult for IT professionals to juggle those responsibilities while also devoting time to staying on top of enterprise technology market trends.

This analysis is intended to help an IT professional explore a critical question: should further effort be invested in unsupervised and supervised learning models that, respectively, catalogue the categories of devices available to the company and predict the category to which any newly released device will belong?

Introducing the Dataset

The ideal data for this task would include all kinds of tech specs for various enterprise devices, including desktops, laptops, tablets, and more, since any of these might be useful for working professionals (depending on their individual fields and roles).

The subset of this data that I collected is a Kaggle dataset containing 16 features, including RAM, storage type/space, CPU, GPU, screen resolution, and more, for over 900 laptops. The dataset also includes each device's price (in INR, or Indian rupees) and weight (in kilograms).

The following kinds of laptops are included in this dataset:

  • Notebook: laptops with just enough performance power to function for everyday personal use; usually does not have high-end components in favor of reducing cost for the average consumer.
  • Ultrabook: laptops packed with components to provide strong performance power while also remaining lightweight to maintain ease of use and transport.
  • Netbook: minimally expensive laptops with low computing power that are primarily intended for accessing the Internet.
  • Gaming: high-performing laptops that almost always include dedicated graphics cards and powerful CPUs; not extremely expensive relative to quality of components but usually bulky.
  • 2 in 1 Convertible: hybrid laptops which can be folded into tablets; usually lightweight and lacking in high-end computing power to optimize for portability.
  • Workstation: extremely powerful devices which are meant for intensive applications such as animation, video editing, and data analysis; usually very heavy.

This data was published as a CSV file last year by Kaggle user arnabchaki, who used BeautifulSoup to scrape it from various e-commerce websites.

Data Cleaning I: Transformation

Sanitizing this dataset began with renaming the columns to remove spaces and apply consistent lowercase formatting. Then, since most of the features were categorical, I used one-hot encoding to represent them as Boolean variables. Some variables (screen size, weight, clock speed, etc.) were actually numerical but stored as object or string values, either because they included units (inches, kilograms, GHz, etc.) or because they were lumped into a string with other data. I used regular expressions to automate the isolation of these values.
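A minimal sketch of both steps with pandas; the column names and raw string formats here are assumptions standing in for the real dataset:

```python
import pandas as pd

# Toy frame standing in for the laptop dataset (column names are assumed).
df = pd.DataFrame({
    "typename": ["Notebook", "Ultrabook", "Gaming"],
    "weight": ["2.1kg", "1.3kg", "2.8kg"],
    "cpu": ["Intel Core i5 2.5GHz", "Intel Core i7 1.8GHz", "Intel Core i7 2.8GHz"],
})

# Isolate the numeric portion of unit-laden strings with regular expressions.
df["weight_kg"] = df["weight"].str.extract(r"([\d.]+)", expand=False).astype(float)
df["clock_speed_ghz"] = df["cpu"].str.extract(r"([\d.]+)GHz", expand=False).astype(float)

# One-hot encode a categorical feature into Boolean columns.
df = pd.get_dummies(df, columns=["typename"], prefix="type")
```

`pd.get_dummies` produces one column per category (e.g. `type_Gaming`), which is what later makes most features bounded between 0 and 1.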

In the case of the unit price feature, I discovered that the values were much higher than expected after converting them to US dollars. After researching the dataset further (i.e., consulting Kaggle forum messages from others who had worked with it), I realized I had to multiply by the conversion rate twice to get more realistic answers (although they were still a bit off).

It also became apparent that the column containing the operating system version held almost 100 null values. I used pandas' built-in dropna function to remove the rows containing NaNs.
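That step is a one-liner; a sketch with an assumed column name in place of the real one:

```python
import numpy as np
import pandas as pd

# Toy frame: the real dataset's OS-version column held ~100 NaNs (name assumed).
df = pd.DataFrame({
    "opsys_version": ["10", np.nan, "10 S", np.nan],
    "ram_gb": [8, 4, 16, 8],
})

# dropna() with no arguments drops every row that contains a NaN anywhere.
df = df.dropna()
```

Passing `subset=["opsys_version"]` would restrict the drop to rows missing that one column, which can preserve more data when other columns are complete.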

Data Cleaning II: Removing Outliers

This part of the process was made easier by the fact that many of the features became Boolean variables and were bounded between 0 and 1. I created boxplots for the true numerical variables to examine their distributions.

Figure 1: Boxplots of numerical variables in the laptop specs dataset.

For most of these features, the outlier values were less egregious than they seemed once viewed in the context of domain knowledge. For example, it is not impossible for laptop computers to have 32 GB of RAM or screen resolutions of 3840 x 2400 (4K UHD). However, the outliers in the weight feature plot (second from the left) were noteworthy, with the highest one clocking in at over 800 kg. I filtered out the rows where the weight value exceeded the end of the plot's whisker (~200 kg).
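The whisker cutoff can be computed directly rather than read off the plot. A sketch using the standard Tukey rule (matplotlib's default whisker extends to the last point within Q3 + 1.5 × IQR); the toy weights are made up:

```python
import pandas as pd

# Toy weights (kg); the real data's worst outlier exceeded 800 kg.
df = pd.DataFrame({"weight_kg": [1.2, 1.5, 2.0, 2.4, 2.8, 3.1, 850.0]})

# Tukey's rule: the upper whisker ends at Q3 + 1.5 * IQR.
q1 = df["weight_kg"].quantile(0.25)
q3 = df["weight_kg"].quantile(0.75)
upper_whisker = q3 + 1.5 * (q3 - q1)

# Keep only rows at or below the whisker; the 850 kg row is dropped.
df = df[df["weight_kg"] <= upper_whisker]
```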

Data Cleaning III: Normalization

Although the other numerical features had more acceptable distributions, their magnitudes were much larger than those of the Boolean features. To help reconcile this issue, I performed min-max normalization on these columns, rescaling each to the [0, 1] range. One additional note: throughout the data cleaning process, I renamed columns to either a) reflect the categorical feature from which they originated, in the case of Boolean features, or b) make clear the unit of measurement, in the case of numerical features.
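Min-max normalization can be written in one line per column; a sketch with assumed column names and toy values:

```python
import pandas as pd

# Toy numeric columns (names and values are assumptions).
df = pd.DataFrame({"ram_gb": [4, 8, 16, 32], "weight_kg": [1.2, 1.8, 2.4, 3.0]})

# Min-max normalization rescales each column to [0, 1], putting numeric
# features on the same footing as the Boolean (0/1) encoded ones.
for col in ["ram_gb", "weight_kg"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
```

scikit-learn's `MinMaxScaler` does the same thing and can be reused to transform new data with the original column minima and maxima.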

Measures and Features of Similarity

After completing this final step of data cleaning, I had the following list of features:

Figure 2: Finalized list of dataset features after cleaning/transformation, along with their respective datatypes.

The similarity metric I used for clustering was Euclidean distance (the default for k-means).
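Concretely, Euclidean distance is the square root of the sum of squared differences between two feature vectors, which is why the normalization above matters: unscaled features would dominate the sum. A sketch with two made-up post-cleaning vectors:

```python
import numpy as np

# Two toy laptop feature vectors after normalization and one-hot encoding.
a = np.array([0.5, 1.0, 0.0, 0.25])
b = np.array([0.5, 0.0, 1.0, 0.25])

# Euclidean distance: sqrt of the sum of squared coordinate differences.
dist = np.sqrt(np.sum((a - b) ** 2))
```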

Selecting Value of k

Figure 3: k vs. inertia graph.

I created an inertia vs. k plot and used the elbow method to select an appropriate value for k. Based on this plot, I chose 8 as the value for k.
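The elbow method comes from fitting k-means at several values of k and recording the inertia (within-cluster sum of squared distances) of each fit. A sketch with random stand-in data, since the cleaned dataset isn't reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random matrix standing in for the cleaned, normalized laptop features.
rng = np.random.default_rng(0)
X = rng.random((200, 5))

# Fit k-means for a range of k and record inertia for the elbow plot.
inertias = {}
for k in range(2, 12):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
```

Plotting `inertias` against k and picking the point where the curve's decrease flattens (the "elbow") is what led to k = 8 here.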

Cluster Analysis

Cluster 0: Mid-Range Notebooks

Figure 4: Top features by average value (left) and two sample data points (right) from the current cluster.

Based on the top features in the dataset, this cluster consists entirely of Windows notebook laptops with average Intel CPUs and dedicated low-end NVIDIA graphics cards; both sample data points have the GeForce 930MX and 8GB of RAM. Also, both sample data points have Full HD display resolution (1920x1080).

Cluster 1: Ultrabooks

Figure 5: Top features by average value (left) and two sample data points (right) from the current cluster.

Based on the top features in the dataset, this cluster consists of (mostly Windows) ultrabook laptops with low-end Intel CPUs and Intel graphics (no dedicated graphics cards). Both samples have 256GB of storage.

Cluster 2: Low-End Notebooks

Figure 6: Top features by average value (left) and two sample data points (right) from the current cluster.

Based on the top features in the dataset, this cluster mostly consists of Windows notebook laptops with low-end Intel CPUs and Intel graphics (no dedicated graphics cards); these sample points only have 4GB of RAM and 64GB of storage.

Cluster 3: High-Performance Laptops

Figure 7: Top features by average value (left) and two sample data points (right) from the current cluster.

Based on the top features in the dataset, this cluster consists mostly of Windows gaming laptops with high-end Intel CPUs (both samples have i7 chips) and dedicated NVIDIA graphics cards. Also, both sample data points have Full HD display resolution (1920x1080). The laptops in this cluster would be well suited for heavy use in applications like programming and data science.

Cluster 4: Dell Notebooks

Figure 8: Top features by average value (left) and two sample data points (right) from the current cluster.

Based on the top features in the dataset, this cluster consists of Windows Dell notebook laptops with a range of Intel CPU chips. Both sample data points have Full HD display resolution (1920x1080), 256GB of solid state drive (SSD) storage, and dedicated AMD graphics cards.

Cluster 5: Windows 2-in-1 Laptops

Figure 9: Top features by average value (left) and two sample data points (right) from the current cluster.

Based on the top features in the dataset, this cluster consists of Windows 2-in-1 convertible laptops.

Cluster 6: Windows AMD Notebooks

Figure 10: Top features by average value (left) and two sample data points (right) from the current cluster.

Based on the top features in the dataset, this cluster consists mostly of Windows notebooks with AMD components. Both sample data points have 1366x768 display resolution and low RAM (no more than 4GB).

Cluster 7: High-Performance Dell Laptops

Based on the top features in the dataset, this cluster consists of Windows laptops with Intel CPUs (both samples have i7 chips) and dedicated NVIDIA graphics cards; most of them are manufactured by Dell, and a slightly smaller proportion of the data points in this cluster represent gaming laptops. Both sample data points have Full HD display resolution (1920x1080) and at least 1TB of solid state drive (SSD) storage space.

Limitations and Reflections

The biggest issue I had was understanding how to represent certain tech specs as features in my data. For example, when it came to various CPUs and GPUs, I was unsure how to classify different models (ex: NVIDIA GTX 1060 vs. AMD Radeon 7000). I knew it would be wrong to treat them as numerical values, but I also wanted to keep the number of dimensions low to avoid any performance issues. Having now worked through the project, I would like to try incorporating more one-hot encoded categorical variables into the mix and see what effect (or lack thereof) that has on the model's performance.

Another issue was the strange values for some features. Although I did further research and used standard methods to resolve the outliers and outlandish values, some of the data still appeared off. Moving forward, further investigation may be needed into what the acceptable ranges of values are for each of the numerical features I used.

Reference

The code used to perform data transformation, cleaning, and manipulation of the laptop features dataset can be found at the link below.
