Cluster Analysis: Exploring Salary Patterns in Data Science Careers

Chukwunedu Onwuka
INST414: Data Science Techniques
Apr 1, 2024

Introduction:

In the rapidly evolving field of data science careers, understanding the underlying patterns and trends in salary distributions can provide valuable insights for professionals and employers alike. This analysis aims to utilize cluster analysis techniques to uncover distinct salary patterns within the data science job market, shedding light on potential salary brackets and associated factors.

Data Description:

The dataset used for this analysis comes from Kaggle.com and comprises information on data science job roles, including work year, experience level, employment type, job title, salary, remote work ratio, company location, and company size. It is well suited to understanding salary distributions and exploring potential salary clusters within the data science field.

The fields included are:

  • Work Year: The year in which the salary was paid.
  • Experience Level: The experience level in the job during the year. Possible values: EN (Entry-level / Junior), MI (Mid-level / Intermediate), SE (Senior-level / Expert), EX (Executive-level / Director).
  • Employment Type: The type of employment for the role. Possible values: PT (Part-time), FT (Full-time), CT (Contract), FL (Freelance).
  • Job Title: The role worked in during the year.
  • Salary: The total gross salary amount paid.
  • Salary Currency: The currency of the salary paid as an ISO 4217 currency code.
  • Salary in USD: The salary amount converted to USD using the average USD exchange rate for the respective year, via fxdata.foorilla.com.
  • Employee Residence: Employee’s primary country of residence during the work year as an ISO 3166 country code.
  • Remote Ratio: The overall amount of work done remotely. Possible values: 0 (no remote work, less than 20%), 50 (partially remote), 100 (fully remote, more than 80%).
  • Company Location: The country of the employer’s main office or contracting branch as an ISO 3166 country code.
  • Company Size: The average number of people that worked for the company during the year. Classifications: S (small, fewer than 50 employees), M (medium, 50 to 250 employees), L (large, more than 250 employees).

These attributes provide comprehensive insights into the factors influencing salary distributions, making the dataset highly relevant for analyzing distinct salary patterns in the data science field.

Measuring Similarity:

We use the Euclidean distance metric to measure similarity between data points. The Euclidean distance is the straight-line distance between two points in a multi-dimensional space. For our analysis, each data point is a vector in a five-dimensional space, with one dimension per selected feature: salary, experience level, employment type, remote ratio, and company size. Because experience level, employment type, and company size are categorical, they must be numerically encoded (and the features scaled) before distances can be computed.

Features Used for Measuring Similarity:

  1. Salary: The total gross salary amount paid.
  2. Experience Level: The experience level in the job during the year with possible values: Entry-level, Mid-level, Senior-level, Executive-level.
  3. Employment Type: The type of employment for the role with possible values: Part-time, Full-time, Contract, Freelance.
  4. Remote Ratio: The overall amount of work done remotely, with values indicating the percentage of remote work (0%, 50%, 100%).
  5. Company Size: The average number of people that worked for the company during the year, categorized as small, medium, or large.

By computing the Euclidean distance between pairs of data points based on these five features, we can quantify the similarity between them. Data points with smaller Euclidean distances are considered more similar, while those with larger distances are considered less similar. This approach allows us to identify clusters of data points that exhibit similar characteristics within the dataset, providing insights into salary patterns and factors influencing compensation in the data science job market.
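This computation can be sketched as follows. The column names, ordinal encodings, and toy values below are assumptions for illustration, not the actual notebook's data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

# Toy rows standing in for the Kaggle data (column names and encodings are assumptions)
df = pd.DataFrame({
    "salary_in_usd":    [60000, 150000, 155000],
    "experience_level": [0, 2, 2],    # EN=0, MI=1, SE=2, EX=3 (ordinal encoding)
    "employment_type":  [1, 1, 1],    # e.g. FT encoded as 1
    "remote_ratio":     [0, 100, 100],
    "company_size":     [0, 2, 2],    # S=0, M=1, L=2
})

# Standardize so salary (on a much larger scale) doesn't dominate the distance
X = StandardScaler().fit_transform(df)

# Pairwise straight-line distances in the 5-D feature space
D = pairwise_distances(X, metric="euclidean")
# Rows 1 and 2 (two similar senior remote roles) end up much closer to each
# other than either is to row 0 (a junior on-site role)
```

Standardizing first matters because salary is several orders of magnitude larger than the other features and would otherwise dominate the distance.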

Selecting K:

In choosing the number of clusters (k) for our analysis, we used a method called the Elbow Method. This method involves plotting the within-cluster sum of squares (WCSS) against different k values and looking for a point where the rate of decrease in WCSS slows down, resembling an “elbow” in the plot.

When we plotted the WCSS values for k=1 to k=10, we found a clear “elbow” point at k=4. This point indicates a significant reduction in WCSS, suggesting that adding more clusters beyond k=4 doesn’t provide much improvement in clustering quality. Therefore, we chose k=4 as the optimal number of clusters for our analysis.
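The elbow computation can be sketched as follows, using synthetic blob data as a stand-in for the encoded Kaggle features (an assumption for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four well-separated groups
# (the real notebook would fit on the encoded Kaggle features instead)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

# WCSS always decreases as k grows; the "elbow" is where the drop flattens
```

Plotting `wcss` against k (e.g. with matplotlib) makes the flattening visible; here the drop levels off around k=4 because the synthetic data contains four groups.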

Cluster Interpretation:

The clustering analysis reveals distinct groups within the dataset based on various features such as experience level, salary, remote work ratio, and company size. These clusters likely represent different segments of the job market within the data science field. For example, one cluster may represent entry-level positions with lower salaries and minimal experience requirements, while another cluster may represent senior-level roles with higher salaries and extensive experience. Each cluster provides insight into the diversity of roles and characteristics present in the dataset, helping to understand the job landscape in the data science industry.
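One common way to arrive at such interpretations is to profile each cluster by its feature means. A minimal sketch, using assumed ordinal encodings and toy values rather than the real data:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy encoded data (column names, encodings, and values are assumptions)
df = pd.DataFrame({
    "salary_in_usd":    [55000, 60000, 150000, 160000, 95000, 100000, 230000, 240000],
    "experience_level": [0, 0, 2, 2, 1, 1, 3, 3],   # EN=0, MI=1, SE=2, EX=3
    "remote_ratio":     [0, 0, 100, 100, 50, 50, 100, 0],
})

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(df)
df["cluster"] = km.labels_

# Average feature values per cluster: a cluster with low mean salary and an
# EN-level mean reads as "entry-level"; a high-salary EX cluster as "executive"
profile = df.groupby("cluster").mean()
```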

How It Answers the Question:

The Elbow Method identified k=4 as the point where the rate of decrease in WCSS slowed significantly, suggesting that four clusters best represent the salary patterns in the dataset. Segmenting the market this way provides valuable insight into salary distributions and their associated factors, aiding professionals and employers in decisions about career choices, job offerings, and salary negotiations in the data science field.

Data Cleanup Process:

To clean up the data, I performed the following steps as part of the code:

  1. Handling Missing Values: I checked for missing values and handled them before clustering, either by imputing them or by dropping the affected rows or columns.
  2. Data Type Conversion: Where necessary, I converted feature data types so they were appropriate for analysis, for example converting strings to a categorical data type.
  3. Encoding Categorical Variables: Because the clustering algorithm requires numerical input, I encoded categorical variables using techniques such as one-hot encoding or label encoding.
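These steps can be sketched roughly as follows; the column names and toy values are assumptions, and the actual notebook may differ:

```python
import pandas as pd

# Toy frame mimicking the Kaggle columns (names and values are assumptions)
df = pd.DataFrame({
    "salary_in_usd":    [60000, None, 150000],
    "experience_level": ["EN", "SE", "SE"],
    "employment_type":  ["FT", "FT", "CT"],
    "company_size":     ["S", "L", "M"],
})

# 1. Handle missing values: here, drop rows with a missing salary
df = df.dropna(subset=["salary_in_usd"])

# 2. Type conversion: make categorical columns explicit
for col in ["experience_level", "employment_type", "company_size"]:
    df[col] = df[col].astype("category")

# 3. Encoding: an ordinal map for ordered categories, one-hot for the rest
exp_order = {"EN": 0, "MI": 1, "SE": 2, "EX": 3}
df["experience_level"] = df["experience_level"].map(exp_order)
df = pd.get_dummies(df, columns=["employment_type", "company_size"])
```

Mapping experience level to ordered integers preserves its natural ranking, while one-hot encoding avoids imposing a spurious order on employment type.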

Limitations and Biases:

  1. Limited Features: The analysis only considered a few features such as experience level, employment type, salary, remote work ratio, and company size. Other relevant factors such as educational background, specific skills, industry, and geographical location could significantly impact salary but were not included in the analysis.
  2. Homogeneity Assumption: The analysis assumes that data points within each cluster are homogeneous, which may not always be the case. There could be significant variations within clusters that are not captured by the chosen features.
  3. Data Quality: The accuracy and reliability of the salary data could vary, leading to potential biases in the analysis. For example, self-reported salaries might not reflect actual earnings accurately.
  4. Clustering Algorithm Sensitivity: The choice of clustering algorithm and its parameters (e.g., number of clusters) can significantly affect the results. Different algorithms or parameter settings might produce different clusterings.
  5. Sample Bias: The dataset used for analysis might not be representative of the entire data science job market. It could be biased towards certain types of roles, industries, or geographic locations.
  6. Temporal Bias: The analysis is based on data collected at a specific point in time. Salary trends and job market dynamics may change over time, affecting the relevance of the analysis.

Github Link — https://github.com/ChukwuneduOnwuka/Clustering/blob/main/Clustering.ipynb
