Simple Natural Language Processing Projects for Health Sciences

Part 2: How to train fastText word embeddings on a Standard Treatment Guideline

Wuraola Oyewusi
Analytics Vidhya
6 min read · Dec 27, 2019


This tutorial is the second in a series for people in the health sciences doing Natural Language Processing. Part 1 is here.
The principles are general and anyone can follow the train of thought; only the datasets are health sciences domain-specific.

In this tutorial we’ll:

train fastText word embedding on Nigeria’s 2008 standard treatment guidelines

See the effect of training fastText embedding models for 3 different numbers of epochs (10, 20, 30) on this particular text; the default is 5

Find semantically similar words generated based on the trained models.

Use the trained models to pick out words that do not match in a series.

Calculate the similarity between two words.

Create a visualization, after dimensionality reduction with PCA, of some semantically similar words (most of the code snippets for this are from this article by Usman Malik)

Colaboratory notebook here. The cell outputs were not cleared so outputs of each code snippet can be followed.

What are Standard treatment guidelines?

Standard treatment guidelines (STGs) list the preferred pharmaceutical and nonpharmaceutical treatments for common health problems experienced by people in a specific health system. As such, they represent one approach to promoting therapeutically effective and economically efficient prescribing.
Management Sciences for Health and World Health Organization. 2007. Drug and Therapeutics Committee Training Course

What is fastText?

fastText is a library for learning of word embeddings and text classification created by Facebook’s AI Research (FAIR) lab. The model allows one to create an unsupervised learning or supervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages. fastText uses a neural network for word embedding. — Wikipedia, Dec 2019

The text was downloaded online, the indexes and appendices were removed, and extraction from PDF to text was done using Python Tika.
The text was converted from lists to strings, lowercased, and punctuation was removed.
Stop words were not removed and words were not lemmatized (I have not seen a good, easy-to-use lemmatizer for medical words yet, and I’m not ready for “gastritis” to come out as “gastriti”).
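A minimal sketch of this preprocessing, assuming the guideline PDF is saved locally as stg.pdf and the cleaned text is written to stg_clean.txt (both filenames are placeholders):

from tika import parser
import string

# Extract raw text from the PDF (indexes and appendices already removed)
raw = parser.from_file("stg.pdf")        # placeholder path
text = raw["content"]

# Lowercase and strip punctuation; stop words are kept and nothing is lemmatized
text = text.lower()
text = text.translate(str.maketrans("", "", string.punctuation))

with open("stg_clean.txt", "w") as f:    # placeholder path
    f.write(text)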

The 3 models were trained using the original fastText library; the only varying factor was the number of epochs (10, 20, 30). The time taken for each model training is noted in the code snippet. The models were saved as .bin files.
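A minimal training sketch, assuming the cleaned text is in stg_clean.txt (placeholder path) and 300-dimensional skip-gram vectors, which is what the PCA step later assumes:

import fasttext

# Train three unsupervised models, varying only the number of epochs (the library default is 5)
for n_epochs in (10, 20, 30):
    model = fasttext.train_unsupervised(
        "stg_clean.txt",   # placeholder path to the preprocessed text
        model="skipgram",
        dim=300,           # vector size, reduced to 2 components later with PCA
        epoch=n_epochs,
    )
    model.save_model("fasttext_trained_model_{}.bin".format(n_epochs))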

For ease of use, the models were loaded using gensim’s FastText implementation.
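A sketch of the loading step; load_facebook_model is the gensim 3.8+ helper for native fastText .bin files (older gensim versions use FastText.load_fasttext_format instead):

from gensim.models.fasttext import load_facebook_model

# Load each native fastText binary into a gensim FastText object
fasttext_trained_model_10 = load_facebook_model("fasttext_trained_model_10.bin")
fasttext_trained_model_20 = load_facebook_model("fasttext_trained_model_20.bin")
fasttext_trained_model_30 = load_facebook_model("fasttext_trained_model_30.bin")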

This is what the vocabulary looked like after training. It’s nice that I didn’t have to tokenize; fastText handled that.
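A quick way to peek at the vocabulary (model.wv.vocab is the gensim 3.x attribute; gensim 4+ renamed it to key_to_index):

# Inspect the vocabulary learned from the guideline text
vocab = list(fasttext_trained_model_10.wv.vocab)
print(len(vocab))    # number of unique words kept
print(vocab[:20])    # a few sample entries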

Check the top 5 words similar to “dizziness”; it’s a common medical symptom and side effect, so we’d expect it to be predicted as similar to other symptoms and words like drowsiness, fainting, maybe headache.

print("For 10 epochs model{}\n".format(fasttext_trained_model_10.wv.most_similar(["dizziness"], topn=5)))

print("For 20 epochs model{}\n".format(fasttext_trained_model_20.wv.most_similar(["dizziness"], topn=5)))

print("For 30 epochs model{}\n".format(fasttext_trained_model_30.wv.most_similar(["dizziness"], topn=5)))

The 3 models did a good job predicting “drowsiness” and “syncope”.
The 10 epochs model had the highest confidence in its predictions, but the 20 epochs model predicted more related words like “syncope” and “lethargy”;
the 30 epochs model had good predictions too, but there was no improvement in performance from training up to 30 epochs.

For 10 epochs model[('drowsiness', 0.9582035541534424), ('headache', 0.936184287071228), ('dryness', 0.920276403427124), ('headaches', 0.9076910018920898), ('nausea', 0.8979116678237915)]

For 20 epochs model[('drowsiness', 0.8123891353607178), ('shortness', 0.7257866859436035), ('syncope', 0.7250152826309204), ('dryness', 0.6867588758468628), ('lethargy', 0.661155104637146)]

For 30 epochs model[('drowsiness', 0.6565455794334412), ('shortness', 0.64012211561203), ('dryness', 0.5671372413635254), ('syncope', 0.5500494241714478), ('weakness', 0.5497733354568481)]

For the word “pain”, the most similar words included “pains” and “painless”, as expected; if the words had been lemmatized, they would probably have collapsed to just the root word “pain”.
The 20 epochs model captured words like “painful” and “complaints”. Cool.
The 30 epochs model didn’t give an extraordinary outcome.

For “ciprofloxacin”, the models were smart enough to give other antibiotics as similar words; the 20 epochs model picked up details like associating “ciprofloxacin” with one of its dosing regimens, “every12”. While preprocessing this text, numbers were retained because of dosing and medication strengths, which are very important in clinical texts.

In this example, I misspelt “gonorrhoea” as “gonorhhoea”.
All the models suggested the right spelling as an option; the 30 epochs model even suggested the genus name “Neisseria”.

Other interesting examples for top 5 semantically similar words.

For words that do not match, I tried [“paracetamol”, “headache”, “diarrhoea”, “dizziness”]; paracetamol is the only drug, all the others are symptoms.
The 3 models predicted paracetamol as the odd one out.

All the drugs on this list, [“Hydrochlorothiazide”, “Furosemide”, “Amlodipine”], are antihypertensives, but the first two are diuretics and the third is a calcium channel blocker. The three models predicted the third as the odd one out.
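Both checks use gensim’s doesnt_match; a short sketch with the 10 epochs model (lowercase queries are assumed, since the training text was lowercased):

# Pick the word that doesn't belong in each list
print(fasttext_trained_model_10.wv.doesnt_match(["paracetamol", "headache", "diarrhoea", "dizziness"]))   # expected: paracetamol
print(fasttext_trained_model_10.wv.doesnt_match(["hydrochlorothiazide", "furosemide", "amlodipine"]))     # expected: amlodipine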

The similarity between “drowsiness” and “dizziness” was predicted to be 0.95 by the 10 epochs model. I’ll agree; they are close in meaning and use.

The similarity between “drowsiness” and “amlodipine” is about 0.53 for the 10 epochs model. Drowsiness is a common side effect of amlodipine.
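The pairwise scores come from gensim’s similarity method; a minimal sketch with the 10 epochs model:

# Cosine similarity between word pairs
print(fasttext_trained_model_10.wv.similarity("drowsiness", "dizziness"))    # ~0.95
print(fasttext_trained_model_10.wv.similarity("drowsiness", "amlodipine"))   # ~0.53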

For visualization, generate the top similar words for 10 words.

Do a dimensionality reduction of the vectors from the 300 dimensions used in model training down to two principal components, using the Principal Component Analysis implementation from sklearn.
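A minimal sketch of these two steps with the 10 epochs model; the seed word list here is only illustrative, and the plotting follows the PCA approach from the article referenced earlier:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = fasttext_trained_model_10

# Collect each seed word plus its top similar words (seed list is illustrative)
seed_words = ["dizziness", "pain", "ciprofloxacin", "amlodipine", "malaria"]
words = []
for seed in seed_words:
    words.append(seed)
    words.extend(w for w, _ in model.wv.most_similar([seed], topn=5))

# Reduce the 300-dimensional vectors to 2 principal components
vectors = np.array([model.wv[w] for w in words])
points = PCA(n_components=2).fit_transform(vectors)

# Scatter plot with word labels; semantically similar words should cluster together
plt.figure(figsize=(10, 8))
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, xy=(x, y))
plt.show()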

This is what the words look like after dimensionality reduction; semantically similar words are close to each other.

Visualization after dimensionality reduction using PCA

I hope you had a great time working through this. As with all machine learning tasks, more quality data means better performance. The text data used here is high quality because it’s a standard medical guide put together by professionals. To improve model performance, more data can be used, maybe other medical references like pharmacopoeias, drug formularies, EMDEX, etc.
Other parameters like the learning rate and minimum word count can be tweaked.

For this task, you will agree with me that the 10 epochs model did just fine, so save your compute.

Happy Holidays

References:

