Text Clustering: Identifying Relationships in Clinical Documents

Gaurika Tyagi · Published in Analytics Vidhya
Jun 9, 2020 · 3 min read

This is the final section of the 4-part series! Until now, we have talked about:

  1. Pre-processing and Cleaning
  2. Text Summarization
  3. Topic Modeling using Latent Dirichlet allocation (LDA)
  4. Clustering — We are here!!

If you want to try the entire code yourself or follow along, go to my published Jupyter notebook on GitHub: https://github.com/gaurikatyagi/Natural-Language-Processing/blob/master/Introdution%20to%20NLP-Clustering%20Text.ipynb

Input Data for Clustering (recap of section 3)

The text was cleaned and summarized using POS tagging and a hierarchy of words within phrases. The summaries were then fed into a Latent Dirichlet allocation (LDA) algorithm to extract topics:

LDA output

In this section, we will transform the above into a dataframe with one row per text and, as features, the % propensity of each topic:

import pandas as pd

# document_topic (from part 3) maps each document to its topic propensities
topics_all = pd.DataFrame.from_dict(document_topic, orient='index')
topic_column_names = ['topic_' + str(i) for i in range(0, 30)]
topics_all.columns = topic_column_names
display(topics_all.head())
Clustering Input
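
The document_topic dictionary above comes from part 3 of this series. If you are building it from scratch, here is a minimal sketch of how it could be derived, assuming a gensim LdaModel named lda_model and a bag-of-words corpus (both names carried over from part 3 for illustration):

# A minimal sketch, assuming lda_model (a gensim LdaModel) and corpus
# (bag-of-words) exist from part 3; the names are illustrative.
document_topic = {}
for i, bow in enumerate(corpus):
    # minimum_probability=0 returns a propensity for all 30 topics,
    # not just the dominant ones
    doc_topics = lda_model.get_document_topics(bow, minimum_probability=0)
    document_topic[i] = {topic_id: prob for topic_id, prob in doc_topics}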

We can now visualize the hierarchy of these topics within the chart notes! This will help us identify the number of clusters we need.

Visualization to identify # of Clusters

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
# Ward linkage on the topic-propensity features
dend = hierarchy.dendrogram(hierarchy.linkage(topics_all, method='ward'))
# Candidate cut-off at distance 9
plt.axhline(y=9, color='r', linestyle='--')
plt.show()

The x-axis shows the samples and the y-axis the distance between them. The vertical line with the maximum distance is the blue one, so we can set a threshold of 9 and cut the dendrogram at that point (the horizontal dotted line).

The threshold line cuts the dendrogram at 4 points, so we have 4 clusters. Let's now apply hierarchical clustering with that number of clusters.
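As a programmatic cross-check of this visual cut, scipy's fcluster can cut the same ward linkage at the chosen distance threshold. A minimal sketch, using the threshold of 9 read off the dendrogram above:

from scipy.cluster.hierarchy import linkage, fcluster

# Cut the ward linkage at the distance threshold chosen from the dendrogram
Z = linkage(topics_all, method='ward')
labels = fcluster(Z, t=9, criterion='distance')
print(len(set(labels)))  # expected: 4 clusters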

Clustering

Before we cluster, we should know which parameters we are choosing and why. The linkage criterion determines which distance to use between sets of observations; the algorithm merges the pairs of clusters that minimize this criterion:

  1. ward minimizes the variance of the clusters being merged.
  2. average uses the average of the distances of each observation of the two sets.
  3. complete or maximum linkage uses the maximum of the distances between all observations of the two sets.
  4. single uses the minimum of the distances between all observations of the two sets.

I want to minimize the variance of the clusters being merged, so I choose ward linkage. With ward, only Euclidean distance can be used.

from sklearn.cluster import AgglomerativeClustering

# Ward linkage with Euclidean distance, cut into the 4 clusters identified above
cluster_model = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
cluster = cluster_model.fit_predict(topics_all).tolist()
topics_all["cluster"] = cluster

## VISUALIZATION
df_for_h_visual = df  ## derived from topics_all to get 1 topic per text; refer to part 3 of this series
df_for_h_visual["cluster"] = topics_all["cluster"]
df_for_h_visual.drop(['propensity'], axis=1, inplace=True)
df_for_h_visual.topic.fillna(value="Unknown", inplace=True)
df_for_h_visual.head()
1 topic per text and its cluster
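
Since the choice of 4 clusters came from eyeballing the dendrogram, a quick numeric sanity check can help. This sketch (not part of the original notebook) compares silhouette scores across a few cluster counts on the topic features, dropping the cluster column we just added; higher scores indicate better-separated clusters:

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

features = topics_all.drop(columns=['cluster'])
for k in range(2, 8):
    # affinity matches the sklearn version used in this article;
    # newer releases renamed this parameter to metric
    model = AgglomerativeClustering(n_clusters=k, affinity='euclidean', linkage='ward')
    labels = model.fit_predict(features)
    print(k, round(silhouette_score(features, labels), 3))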

Visualizing the Clusters

df_histo = df_for_h_visual.groupby(['topic', 'cluster']).count().reset_index()
df_histo = df_histo.pivot(index='topic', columns='cluster', values='SUMMARY')
df_histo.columns = ["c0", "c1", "c2", "c3"]

# Stacked bar chart of cluster membership per topic
ax = df_histo.plot.bar(stacked=True, colormap='inferno', edgecolor='black', linewidth=1)
ax.legend(loc='center left', bbox_to_anchor=(1.0, .5))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.show()

Conclusion

~50 documents from the “Unknown” category of modeled topics actually lie in the same domain space as topic_23 and topic_5 (cluster 1).

In this 4-part tutorial, you have seen how to go from raw text to summarized text. The summaries were then modeled for topics and, finally, those topics were clustered to identify relationships amongst texts that were close but not of the same topic. This also helped in determining “potential topics” for previously unidentified clinical notes. I hope you enjoyed it!
