Clustering Data Science Jobs

Prince Okpoziakpo
INST414: Data Science Techniques
5 min read · May 15, 2023

Introduction

This project aims to cluster data science jobs based on their similarity with respect to demographic data such as salary, company size, and location. Insights from this clustering can reveal which data science roles are closely related and can guide an individual’s efforts to develop the skills needed to transition into a new role; the results are therefore most useful to people trying to change roles within the data science field. The project uses the Principal Component Analysis (PCA) dimensionality reduction method to reduce the number of variables under study (effectively compressing the dataset), and the KMeans machine learning algorithm to cluster the jobs. The results of the analysis are interpreted and discussed in later sections of this study.

Tools and Sources

The primary tools used for this project were the Kaggle API, the pandas data analysis library, and the scikit-learn machine learning library:

  • The kaggle API was used to retrieve the ds_salaries.csv dataset.
  • pandas is a popular data analysis library and was used to store and manipulate the data in tabular form.
  • scikit-learn is an industry-standard machine learning library and was used to perform the dimensionality reduction and clustering operations.

The dataset used for this project is the ‘Data Science Salaries 2023 💸’ dataset, acquired from Kaggle using their Python library, kaggle. This dataset was chosen because it contains sufficient demographic data to complete the analysis portion of this study, including the following columns: job_title, salary_in_usd, company_size, and company_location.

import kaggle
import pandas as pd

# Download the dataset file into the local "data" directory
kaggle.api.dataset_download_file(
    'randomarnab/ds_salaries.csv',
    'ds_salaries.csv',
    path='data'
)

# Load the dataset into a pandas DataFrame
salaries_df = pd.read_csv("data/ds_salaries.csv")
print(f"Rows/Columns: {salaries_df.shape}", '\n')
print(f"Variables: {salaries_df.columns}")
‘Data Science Salaries 2023 💸’ Dataset from Kaggle

Data Cleaning

Before starting the analysis portion of the study, the data had to be cleaned. The following steps were taken to ensure the data was properly formatted:

  • Convert categorical variables to dummy variables: The scikit-learn estimators used in this project require that every column be numeric, so each categorical column was expanded into binary indicator (dummy) columns.
# Converting categorical variables into dummy variables

salaries_df_with_dummies = pd.get_dummies(salaries_df)  # one-hot encode the categorical columns
print(salaries_df_with_dummies.shape, '\n')
print(salaries_df_with_dummies.columns)
  • Encode column names: For ease of access and programmability, the column names had to be encoded to remove whitespace and special characters that conflict with Python syntax.
# Cleaning up the column names of the dataframe

def encode(s):
    """s: A string to be encoded.

    Returns a formatted string, with all whitespace removed."""

    s = s.strip()            # remove leading/trailing whitespace
    s = s.replace(' ', '_')  # replace spaces with underscores
    s = s.replace('-', '_')  # replace hyphens with underscores

    # replace parentheses
    s = s.replace('(', '_')
    s = s.replace(')', '_')

    # replace colons and semicolons
    s = s.replace(':', '_')
    s = s.replace(';', '_')

    # make the text lowercase
    s = s.lower()

    try:
        # int() raises ValueError if the first character is not a digit;
        # if it is a digit, prefix the name with an underscore so it is a
        # valid Python identifier
        int(s[0])
        s = "_" + s
    except ValueError:
        pass

    return s

salaries_df_with_dummies.columns = salaries_df_with_dummies.columns.map(encode)
salaries_df_with_dummies.columns
‘ds_salaries’ DataFrame with Encoded Column Names

Analysis and Results

After converting the categorical variables to dummies and encoding the column names, the number of columns in the dataset exploded to 278, from 11 prior to cleaning. Clustering on that many features would yield inadequate results, so Principal Component Analysis (with Singular Value Decomposition) was used to reduce the number of features.

from sklearn.decomposition import PCA

# Convert the dataframe to a numpy array
X = salaries_df_with_dummies.to_numpy()

# Use Principal Component Analysis to
# reduce the dimensionality of the dataset
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
X_pca.shape
Rows x Columns after Principal Component Analysis (PCA)
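As a quick check (not shown in the original write-up), the fitted PCA object’s explained_variance_ratio_ attribute can be inspected to see how much of the dataset’s total variance the five components retain:

# How much of the total variance do the 5 principal components retain?
# (exact numbers depend on the dataset version)
print(pca.explained_variance_ratio_)
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.2%}")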

According to the scikit-learn documentation, “The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.” The inertia metric has two main limitations: (1) it is not normalized, so it can take on enormous, hard-to-interpret values, and (2) it assumes that clusters are convex and isotropic, which is often not the case for real data.
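As an illustrative sketch (not part of the original analysis), a single KMeans fit exposes this inertia value through its inertia_ attribute; the parameter values here are assumptions:

from sklearn.cluster import KMeans

# Fit one KMeans model and inspect its within-cluster sum-of-squares
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_pca)
print(f"Inertia (within-cluster sum-of-squares): {kmeans.inertia_:.2f}")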

The elbow method was used to determine the number of clusters, k, to separate the dataset into. The elbow method works by plotting the number of clusters k against the within-cluster sum-of-squares and choosing the k at the “elbow” of the curve, i.e., the point where adding more clusters stops producing large reductions in the within-cluster sum-of-squares. This method yielded k=5 as the optimal number of clusters.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Use elbow method to determine the optimal number of clusters
def calculate_WSS(points, kmax):
    sse = []
    for k in range(1, kmax + 1):
        kmeans = KMeans(n_clusters=k).fit(points)
        centroids = kmeans.cluster_centers_
        pred_clusters = kmeans.predict(points)
        curr_sse = 0

        # add the squared Euclidean distance of each point from its
        # cluster center (across all dimensions) to the current WSS
        for i in range(len(points)):
            curr_center = centroids[pred_clusters[i]]
            curr_sse += np.sum((points[i] - curr_center) ** 2)

        sse.append(curr_sse)
    return sse

# Calculate the within-cluster sum of squares for different values of k
kmax = 10
sse = calculate_WSS(X_pca, kmax)

# Plot the curve
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(range(1, kmax + 1), sse)
Optimizing the Number of Clusters for the k-Means algorithm
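The final clustering step is not shown in the original write-up; a minimal sketch of what it could look like, assuming k=5 and a hypothetical “cluster” column name, is:

# Fit the final model with the k chosen by the elbow method and attach
# the cluster labels to the original dataframe ("cluster" is a
# hypothetical column name used for illustration)
final_kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
salaries_df["cluster"] = final_kmeans.fit_predict(X_pca)

# Compare average salary across the resulting clusters
print(salaries_df.groupby("cluster")["salary_in_usd"].mean())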

To determine which original features each principal component emphasizes, the analysis would have to examine the component loadings, i.e., the eigenvectors of the covariance matrix used to produce the reduced matrix. Skipping that step, it can be inferred that each of the clusters (shown below) groups the jobs with the most similar salary and company size, as these were the variables with the highest variance according to preliminary analysis.
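A hedged sketch of that loading inspection, using the fitted PCA object’s components_ attribute (this step was not performed in the original study), might look like the following:

# Each row of pca.components_ is a principal component expressed as
# weights (loadings) over the original dummy-encoded features
loadings = pd.DataFrame(
    pca.components_,
    columns=salaries_df_with_dummies.columns,
    index=[f"PC{i+1}" for i in range(pca.n_components_)],
)

# Features contributing most strongly to the first component
print(loadings.loc["PC1"].abs().sort_values(ascending=False).head(10))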

Limitations

One potential limitation of this project is that it uses Principal Component Analysis (PCA), a linear method for dimensionality reduction, to reduce the features of the dataset. As a result, PCA may not be effective at capturing non-linear relationships between variables in the dataset.

Converting categorical variables into binary (dummy) variables can be helpful, but PCA is designed to find principal components that capture the most variance in the data, and categorical variables do not have a variance structure. Therefore, PCA may not be able to find principal components that capture the most information about the categorical variables. A future iteration of this project would consider using a different method, such as Multiple Correspondence Analysis (MCA), which is designed to identify relationships between categorical variables, as sketched below.
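A rough sketch of that alternative, using the third-party prince library (an assumption; any MCA implementation would serve), could look like this:

# pip install prince  (assumed third-party library providing MCA)
import prince

# MCA works directly on raw categorical columns, so the original
# (non-dummy-encoded) categorical features are used here
categorical_cols = salaries_df.select_dtypes(include="object")

mca = prince.MCA(n_components=5)
mca = mca.fit(categorical_cols)
X_mca = mca.transform(categorical_cols)  # row coordinates in the reduced space
print(X_mca.shape)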

Conclusion

This study set out to cluster data science jobs based on their similarity, using Principal Component Analysis (PCA) for dimensionality reduction and the KMeans machine learning algorithm for clustering. The cleaned dataset was reduced from 278 columns to 5 using PCA, and the optimal number of clusters was determined to be 5 using the elbow method. One potential limitation of this project is that PCA is a linear method for dimensionality reduction and may not capture non-linear relationships between variables in the dataset.

Link to the study repository: clustering-data-science-jobs
