Cancer Genomics III: Discovering genetic patterns of liver cancer using Python
By Sohil Shah [LinkedIn][GitHub]
Contributing Authors: Nicholas Giangreco [LinkedIn][GitHub], Jordi Frank [LinkedIn][GitHub], and Matthew Eng [LinkedIn] [GitHub]
This is Post 3 of a 4-part series on Cancer Genomics. You can find the code from this post on Github. If you haven’t, check out Post 1 which explains how a genomic perspective is needed to find new treatments for liver cancer and Post 2 which elaborates on the exploratory analysis illustrating publicly available liver cancer genomic data.
This post will shed light on how we utilized an unsupervised learning approach to identify underlying patterns between the RNA-Seq data representing the hallmarks of cancer and liver cancer progression (i.e. tumor stage). As our initial hypothesis, there is an association between patients’ cluster membership derived from gene expression data and patients’ liver cancer stage.
An unsupervised learning approach helps uncover structure within data to establish relationships without any previously assigned labels. We grouped patients, using their gene expression and clinical information, into clusters by utilizing dimension reduction and clustering algorithms. We used the merged dataset constructed in the second blog post and eliminated the liver cancer stage information (i.e. the labels). We restricted our methodology to only those 4220 genes representative of the hallmarks of cancer.
full_df_stage = pd.merge(rnaseq_sub.reset_index(), clinical[['submitter_id','gender','race','ethnicity','tumor_stage']], left_on='bcr_patient_barcode', right_on='submitter_id', how='inner') \
.set_index('bcr_patient_barcode') \
.drop('submitter_id', axis=1)#ensuring ID uniqueness
full_df_stage.index = [x + '-' + str(i) for i,x in enumerate(full_df_stage.index)]
print(full_df_stage.shape)
full_df_stage.head()
Machine Learning — Clustering Pipeline:
As we see in the data snippet above, the data contains gene expression data from the hallmarks of cancer and demographic information like gender, race, ethnicity, sex, etc. The figure below entails our entire end-to-end clustering pipeline.
Categorical Encoding:
As our first step after data preparation, we one-hot encoded the categorical fields like gender, race, sex, etc. for use by our machine learning models. As shown in the sample data snippet below, we now have new binary columns like gender_female, race_asian, etc. after one-hot encoding. We utilized pandas’ get_dummies function for applying the encoding technique.
# One hot encoding on full data frame to convert categorical fields into binary fields
full_df_onehot = pd.get_dummies(full_df, drop_first=False)
full_df_onehot.head()
Data Standardization & Scaling:
As we can observe in the data snippet above, each RNA gene expression has a different range of values. Some values are in thousand denominations, and some are even binary numbers obtained after categorical encoding. To eliminate the bias introduced due to scale differences in the data, we standardized the entire dataset using a feature scaling technique called min-max scaling. Hence, this feature scaling approach keeps all the feature values between a standard range of 0 and 1.
# Transforming the data within a range e.g. [0, 1]:Feature Scaling
x = full_df_onehot_filter #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
genome_clinic_df = pd.DataFrame(x_scaled,columns=full_df_onehot_filter\
.loc[:, full_df_onehot_filter.columns != 'index'].columns)# Standardized data fame
index_df = full_df_onehot_filter.reset_index()
genome_clinic_std_concat = pd.concat([index_df['index'],genome_clinic_df],axis=1)genome_clinic_std_concat.set_index('index', inplace=True)
genome_clinic_std_concat.head()
As we see in the snippet above, feature scaling has normalized the high scale values also between 0 and 1.
Dimensionality Reduction with sparse PCA:
After the data transformations discussed above, we were still dealing with a very high dimensional feature space, implying that around 4200+ features describe the dataset. For K-Means clustering approach to function accurately, we should focus on a lower-dimension representation of the features. Hence, we applied one of the most renowned techniques — Principal Component Analysis (PCA), which can reduce the feature space of our data set from 4200+ columns to 2 principal components capturing the most variance explaining the structure of the dataset.
There are thousands of genes expressed in any given sample, but a patient may have very different genes expressed based on factors such as the tissue type, the sampling technique, and the time of sampling. This implies a lot of sparseness in the feature space among our samples. We have leveraged the sparse PCA module available in the scikit learn library in Python. This module helps to reduce the feature space to the most variation-contributing features (also called principal components or PCs), and also deal with the sparseness of the data. The following snippet explains how the data narrows down to the two most variation-contributing principal components.
# Dimensionality Reduction using Principal Component Analysis
from sklearn.decomposition import PCA, SparsePCAn=2
pcs = ['PC'+str(x) for x in range(n)]
pca = SparsePCA(n_components=n,max_iter=20,n_jobs=4)
principalComponents = pca.fit_transform(genome_clinic_std_concat)
#print(pca.explained_variance_)
principalDf = pd.DataFrame(data = principalComponents
, columns = pcs)principalDfConcat = pd.concat([index_df['index'],principalDf],axis=1)
principalDfConcat.head()
K-Means Clustering:
- Determining optimal number of clusters:
We now have a 2-dimensional representation of our data, which is conducive to the clustering process. K-Means Clustering is a popular clustering algorithm that segments data into K groups based on the underlying data patterns. We have leveraged the ‘sci-kit learn’ library for applying the clustering on the principal components. The information on the number of clusters, i.e., K needs to be known either by knowledge base or by Elbow method. We have approached the Elbow method to determine the best K clusters for our data set.
# Checking for best K when number of groups or clusters are not known - Used Elbow Plot.
import matplotlib.pyplot as pltdistortions = []
for k in range(1,11):
kmeans = KMeans(
n_clusters=k, init = "random",
n_init=10, max_iter=300, random_state=0
)
kmeans.fit(principalDfConcat.set_index('patient_id'))
distortions.append(kmeans.inertia_)
#plot
plt.plot(range(1,11), distortions, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Distortions")
plt.show()
The Elbow plot above does not display an exact elbow structure. However, the first bend is at K=3, which implies that it can be assumed to be the best choice for the number of clusters. Thus, we can assume that these three clusters represent our underlying data.
2. Assessing similarity between cluster outcomes and cancer stages provided in clinical data:
At this stage, we applied the K-Means algorithm to group the patients into three groups 0, 1, and 2. These clusters represent entire data. But, they still do not indicate the cancer stage information of the patient.
# Clustering Model building using KMeans and concatenating labels with the corresponding patient
#from scipy import stats
from sklearn.cluster import KMeanskmeans = KMeans(n_clusters=3, random_state=0).fit(principalDfConcat.set_index('index'))
labels = kmeans.labels_#Glue back to original data
principalDfConcat['clusters'] = labels
cols = ['patient_id']
cols.extend(pcs)
cols.extend(['clusters'])
principalDfConcat.columns= cols
principalDfConcat.head()
To validate our initial hypothesis about the relationship between clusters representing the RNA gene expressions of the patients on one end and the liver cancer stages on the other end, we calculated the percentage of each cluster group and for each cancer stage.
#Validating relationship between clusters and liver cancer stages
df_stage_valid = pd.merge(df_stage[['index','tumor_stage']], principalDfConcat[['patient_id','clusters']], right_on='patient_id', left_on='index', how='left').set_index('patient_id') \
.drop('index', axis=1)#ensuring ID uniqueness
df_stage_valid.index = [x + '-' + str(i) for i,x in enumerate(df_stage_valid.index)]
df_stage_valid.head()
The code snippet above showcases how each representative cluster is spread across multiple cancer stages. We observe from the above results that gene expression variation does not fully capture the given tumor stages for the samples. It begs the question, what are the commonalities between these clusters? Hence, we will take a deep-dive to understand which hallmarks of cancer associate with each of the cancer stages using a supervised classification approach in our next post.
Stay tuned!