Demystifying ‘black box’ methods of text feature extraction from Rx Data
Natural Language Processing lets us transform unstructured text into normalised, structured data suitable for analysis and for use in ML algorithms. In the Health & Wellness Data Science team, we wanted to derive meaningful insights from select text fields in electronic prescriptions.
E-prescribing is a technology framework that allows physicians and other medical practitioners to write and send prescriptions to a participating pharmacy electronically, instead of handwriting them on paper or calling the pharmacy directly.
At Walmart Pharmacies, we process millions of prescriptions at 5000+ locations across the country. ~15% of all prescriptions come with additional notes written by the doctor. These notes may or may not contain critical information related to the directions for the drug.
Many parts of our Pharmacy System have been optimized for electronic Rxs and simplified for use by pharmacists and technicians. Currently, business rules ensure that if additional notes are present, a blue highlight is placed on the notes section to draw attention, as shown in the sample image below.
Currently, this highlighting does not distinguish notes by importance. We want to ensure that the pharmacist does not miss any notes with critical information; on the other hand, we also want to minimize the highlighting of non-clinical data to avoid desensitisation to the blue highlight. If we get this right, we gain multiple advantages: improved customer experience, reduced labor effort, and enhanced clinical quality.
Thus, our objective in this experiment was to develop a multi-class classification model that ingests the note text and classifies each note into one of three classes based on how critical the information is to the drug directions…
1. Class 0 — Critical
2. Class 1 — Not Critical
3. Class 2 — Maybe Critical
The pharmacist would only need to read the notes highlighted ‘Critical’ or ‘Maybe Critical’ to extract any meaningful information.
The challenge was to build a model with not only high accuracy but near-100% recall for Class 0. Classifying a ‘Non-Critical’ note as ‘Critical’ poses no risk to the patient and only increases processing time. However, if a ‘Critical’ note is incorrectly classified as ‘Non-Critical’, the pharmacist could miss important drug-related information.
To build the model, we used a sample of 15,000 notes labelled by pharmacists into the three classes. We converted all the text into numerical features and split the dataset into training and test sets with an 80:20 ratio. As the text is very specific to healthcare and medications, we could not use pre-trained vector space models such as GloVe or Gensim’s word2vec trained on the Google News dataset.
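For illustration, the split step might look like the sketch below with scikit-learn. The data here is only a toy stand-in for the labelled notes, and stratifying by class is an assumption on our part rather than a stated detail of the pipeline.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the 15,000 pharmacist-labelled notes
notes = ["stop immediately if rash develops", "take with food",
         "thank you, have a nice day"] * 5
labels = [0, 2, 1] * 5  # 0 = Critical, 1 = Not Critical, 2 = Maybe Critical

# 80:20 split; stratify keeps the class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    notes, labels, test_size=0.2, stratify=labels, random_state=42
)
```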
To create the vector space models, we used a corpus of 1 million unlabelled notes collected over one year. The text was preprocessed: we removed unwanted symbols and numbers, normalized the case, and performed stemming, which reduces words to their base form. Using the feature models built from this 1-million-note corpus, we transformed our training and test sets into numeric features.
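A minimal sketch of this kind of preprocessing, assuming NLTK’s Porter stemmer (the exact cleaning rules and stemmer in our pipeline may differ):

```python
import re
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def preprocess(note: str) -> str:
    """Lowercase, strip symbols and numbers, and stem each token."""
    note = note.lower()
    note = re.sub(r"[^a-z\s]", " ", note)  # drop anything that is not a letter or space
    return " ".join(stemmer.stem(token) for token in note.split())

print(preprocess("Take 1 TABLET by mouth daily - stop immediately if rash develops!"))
# -> 'take tablet by mouth daili stop immedi if rash develop'
```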
For feature extraction we focused on three approaches…
1. Bag of Words
The simplest way to represent a set of documents is as a collection of words with their counts, disregarding the order in which they appear. For texts with large vocabularies, BOW vectors can be very large and sparse, making computation less efficient.
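As an illustration, a bag-of-words matrix can be built with scikit-learn’s CountVectorizer (shown here on toy notes; our pipeline is not tied to this particular API):

```python
from sklearn.feature_extraction.text import CountVectorizer

notes = [
    "take one tablet daily with food",
    "stop immediately if rash develops",
    "take one tablet twice daily",
]

vectorizer = CountVectorizer()           # word counts, order ignored
bow = vectorizer.fit_transform(notes)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(bow.toarray())                     # one row per note, one column per word
```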
2. TF-IDF
One of the most popular feature extraction methods is Term Frequency-Inverse Document Frequency (TF-IDF). Intuitively, this calculation determines how relevant a given word is in a particular document: a word’s count in a note is scaled down by how common the word is across all notes. For example, if a word like ‘patient’ occurs in almost all the notes, its value will be scaled down, while rarer words will receive higher values.
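A hedged sketch of the same idea with scikit-learn’s TfidfVectorizer, again on toy notes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

notes = [
    "patient may take one tablet daily",
    "patient must stop immediately if rash develops",
    "patient prefers morning pickup",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(notes)

# 'patient' appears in every note, so its inverse document frequency (and
# therefore its weight) is low; rarer words such as 'rash' score higher.
vocab = tfidf.get_feature_names_out()
print(dict(zip(vocab, weights.toarray()[1].round(2))))
```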
3. Word2Vec
“You shall know a word by the company it keeps.”– J.R. Firth
The goal of word2vec is to predict, given a particular word such as ‘tablet’, which words are most likely to appear in the vicinity, or ‘window’, of that word. Conversely, it can predict the target word ‘tablet’ given a context, i.e. the surrounding words in a window, for example ‘give 1 _____ for 30 days’. The first variation is called the skip-gram model and the second is called the ‘continuous bag of words’ (CBOW) model.
If the vocabulary has 50,000 words, each word is represented by a 50,000-dimensional one-hot input vector. This is passed through a hidden layer, whose learned weights become the word embeddings. The output layer is a softmax layer, which outputs, for each word in the vocabulary, the probability that it occurs within the window of ‘tablet’.
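Training such a model is straightforward with gensim; the snippet below is only a sketch, with a tiny stand-in corpus in place of the 1 million preprocessed notes:

```python
from gensim.models import Word2Vec

# Stand-in corpus; in practice this would be the ~1M preprocessed notes, tokenized.
sentences = [
    ["give", "1", "tablet", "by", "mouth", "daily", "for", "30", "days"],
    ["take", "1", "tablet", "twice", "daily", "with", "food"],
    ["stop", "immediately", "if", "rash", "develops"],
] * 100

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word embeddings
    window=5,         # context window around each word
    min_count=1,
    sg=1,             # 1 = skip-gram, 0 = continuous bag of words (CBOW)
)

print(model.wv["tablet"][:5])                   # first 5 dimensions of the 'tablet' vector
print(model.wv.most_similar("tablet", topn=3))  # nearest words in the embedding space
```

One common way to turn a whole note into a fixed-length feature vector is to average the embeddings of its words.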
The next step was to feed the numeric text features into different classification models and compare the results. We used several types of classifiers with each of the three feature models: Logistic Regression, Random Forest, SVM, and Neural Networks. Hyper-parameters for each classifier were tuned through grid search. For each feature extraction technique, we selected the best model on the basis of the highest accuracy and ‘Class 0’ recall.
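As an illustration of the tuning step, a grid search over one classifier (logistic regression on TF-IDF features) might look like the following; the actual grids, classifiers, and scoring we used were broader than this sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy labelled data standing in for the 15,000 pharmacist-labelled notes
notes = ["stop immediately if rash develops", "take with food",
         "thank you, have a nice day"] * 20
labels = [0, 2, 1] * 20

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 'recall_macro' is used here for brevity; model selection in our experiment
# weighed overall accuracy together with recall for Class 0 specifically.
search = GridSearchCV(pipeline, param_grid, scoring="recall_macro", cv=3)
search.fit(notes, labels)

print(search.best_params_, search.best_score_)
```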
Additional Dictionary Check
Along with the classifier, we added an additional safety check in which every note is matched against a dictionary of ‘critical keywords’ provided by the Quality team. Any note containing a critical keyword such as ‘stop immediately’ or ‘renew appointment’ is also marked as critical. In case of a conflict between the classifier and the keyword search, the keyword search takes precedence. This leads to very high recall for the ‘Critical’ class.
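A simplified sketch of such an override (the keyword list and function below are illustrative, not the production dictionary):

```python
# Illustrative subset; the real dictionary is maintained by the Quality team.
CRITICAL_KEYWORDS = {"stop immediately", "renew appointment"}

def final_label(note: str, classifier_label: int) -> int:
    """Any keyword hit overrides the classifier and forces Class 0 (Critical)."""
    text = note.lower()
    if any(keyword in text for keyword in CRITICAL_KEYWORDS):
        return 0  # Critical
    return classifier_label

print(final_label("Stop immediately if rash develops", classifier_label=1))  # -> 0
print(final_label("Thank you!", classifier_label=1))                         # -> 1
```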
When a prescription is processed, the pharmacist sees the prescription details along with the note on the e-prescribing input screens. Of the ~15% of prescriptions that come with notes, about 11% of the notes contain critical information, and these need to be marked specifically so that the pharmacist does not miss them. To enable this, an intuitive user interface will leverage the classifier-plus-keyword prediction to call attention to ‘Critical’ notes while presenting ‘Non-Critical’ notes more subtly. This ensures the most critical information is surfaced and highly visible, improving the current input and verification processes. To further improve accuracy and recall, we are considering ensemble techniques for feature extraction and classification.
This work was done by our team comprising Jingying Zhang, Paridhi Kabra, Qiming Chen, and myself. I also want to thank Vinay NP for his guidance, Mike Sapoznik for a very detailed review, and Naresh Patel for his diligence in labelling the notes.