Text Summarization for Topic Modeling and Clustering

Reduce bulky text to a short summary

Gaurika Tyagi
Towards Data Science


This is part 2 of a series on analyzing healthcare chart notes using Natural Language Processing (NLP).

In the first part, we talked about cleaning the text and extracting sections of the chart notes that might be useful for further annotation by analysts, reducing the time they spend manually reading an entire chart note when they are only looking for “allergies” or “social history”.
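For reference, here is a minimal sketch of the kind of section extraction covered in part 1 (the function name and the header pattern are illustrative assumptions, not the exact code from that post):

import re

# Hypothetical helper: pull one named section (e.g. "Allergies" or
# "Social History") out of a raw chart note. Assumes section headers
# are lines ending in a colon, which is typical of MIMIC notes.
def extract_section(note, section_name):
    pattern = rf"{section_name}:\s*(.*?)(?=\n[A-Z][A-Za-z ]+:|\Z)"
    match = re.search(pattern, note, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

allergies = extract_section(note_text, "Allergies")  # note_text: one raw chart note string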

NLP Tasks:

  1. Pre-processing and Cleaning
  2. Text Summarization — We are here
  3. Topic Modeling using Latent Dirichlet Allocation (LDA)
  4. Clustering

If you want to try the entire code yourself or follow along, go to my published Jupyter notebook on GitHub: https://github.com/gaurikatyagi/Natural-Language-Processing/blob/master/Introdution%20to%20NLP-Clustering%20Text.ipynb

DATA:

Source: https://mimic.physionet.org/about/mimic/

Doctors take notes on their computers, and 80% of what they capture is unstructured, which makes processing the information even more difficult. Let’s not forget that interpreting healthcare jargon is not an easy task either; it requires a lot of context. Let’s see what we have:

Image by Author: Text Data as Input
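For reference, the notes can be loaded into pandas roughly like this (a sketch: MIMIC-III ships its free-text chart notes in NOTEEVENTS.csv with the note body in a TEXT column; the file path is an assumption about your local setup):

import pandas as pd

# Load the MIMIC-III chart notes; the TEXT column holds the raw note
# body referenced as notes_data["TEXT"] throughout this post.
notes_data = pd.read_csv("NOTEEVENTS.csv", low_memory=False)
print(notes_data.shape)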

Text Summarization

spaCy isn’t great at Named Entity Recognition (NER) on healthcare documents. See below:

import spacy
import pandas as pd
from IPython.display import display, HTML

# Load a general-purpose English pipeline (the notebook's exact model may differ).
nlp = spacy.load("en_core_web_sm")

# Run the pipeline on one chart note and tabulate the entities spaCy finds.
doc = nlp(notes_data["TEXT"][178])
text_label_df = pd.DataFrame({"label": [ent.label_ for ent in doc.ents],
                              "text": [ent.text for ent in doc.ents]})

display(HTML(text_label_df.head(10).to_html()))
Image by Author: Poor entity recognition in healthcare jargon

But that does not mean it cannot be used to summarize our text. It is still good at identifying dependencies in the text using part-of-speech (POS) tagging. Let’s see:

# Process the first 100 characters of one note
doc = nlp(notes_data["TEXT"][174][:100])
print(notes_data["TEXT"][174][:100], "\n")

# Iterate over the tokens, skipping determiners, punctuation,
# whitespace, and conjunctions.
for token in doc:
    if not (token.pos_ == "DET" or token.pos_ == "PUNCT"
            or token.pos_ == "SPACE" or "CONJ" in token.pos_):
        print(token.text, token.pos_)
        print("lemma:", token.lemma_)
        print("dependency:", token.dep_, "-", token.head.orth_)
        print("prefix:", token.prefix_)
        print("suffix:", token.suffix_)
Image by Author: Dependency identification

So we can summarize the text based on this dependency tracking. YAYYYYY!!!
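The full summarization code is in the notebook linked at the top; the sketch below shows the general idea with a standard frequency-based extractive approach (it assumes the same nlp pipeline and a single note string called note, so the details may differ from the notebook):

from collections import Counter
from heapq import nlargest

doc = nlp(note)  # note: one raw chart note string (assumed)

# Score words by frequency, keeping only content parts of speech
# and skipping stop words, so the POS tags drive the summary.
keep_pos = {"NOUN", "PROPN", "VERB", "ADJ"}
freq = Counter(token.lemma_.lower() for token in doc
               if token.pos_ in keep_pos and not token.is_stop)

# Score each sentence by the frequency of the content words it contains.
sent_scores = {sent: sum(freq.get(token.lemma_.lower(), 0) for token in sent)
               for sent in doc.sents}

# Keep the highest-scoring sentences as the extractive summary.
summary = nlargest(3, sent_scores, key=sent_scores.get)
print(" ".join(sent.text.strip() for sent in summary))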

Here are the results for the summary! (By the way, I tried zooming out my Jupyter notebook to show you the text difference, but still failed to capture the chart notes in their entirety. I’ll paste these separately as well, or you can check my output on the GitHub page mentioned at the top.)

Image by Author: Summarized Text

Isn’t it great how we could distill the gist of an entire document into concise, crisp phrases? These summaries will be used for topic modeling in part 3 and for clustering the documents in part 4.
