Unlocking the Power of Medical Text: Insights from our Internship Project

Han Ching Yong
Published in d*classified
May 18, 2023

This article recaps the internship journey of Kayla Marsh and Eugene Ang, Year 5 students from Temasek Junior College, who had the opportunity to join DSTA as interns in January 2023, working with their mentor, Data Scientist Lim Kai Le, on a Medical Notes Classification task.

Why DSTA?

Kayla: I wanted to try something that I had not done before, and was considering taking Computing as a subject. I felt that this was a good opportunity for me to learn coding and, at the same time, determine for myself if this is a career that I would consider pursuing. Although I ended up not choosing to take computing subjects in JC, I am definitely interested in coding and am considering a computer science-related career in the future. I am aiming to join DSTA’s Brainhack competition in May-June this year to learn more about programming.

Kayla Marsh (bottom row, rightmost) and Eugene Ang (top row, rightmost), with their mentor Lim Kai Le (bottom row, leftmost).

Eugene: I chose to pursue an internship at DSTA, and this project in particular, as I wanted to continue pursuing my interest in coding and machine learning. Before this short internship at DSTA, I did a research project at DSO, reflecting my keen interest in defence technology as well as machine learning. I also want to pursue a computing-related career in the future, especially in machine learning and AI. This was thus a golden opportunity for me to expand my knowledge of machine learning and to understand more of the work culture in a tech-related workspace.

What exactly did we do there?

We were tasked with modifying a simple machine learning model to perform multi-class classification, using natural language processing to sort medical notes into 3 categories: Heart Attack (label 0), Coronary Atherosclerosis (label 1), and Aortic Valve (label 2). Python was the main programming language used in the project.

Photo by Robina Weermeijer on Unsplash

Machine learning involves the creation of a model that can ‘learn’ from previously available data in order to classify and predict future data. If refined, machine learning (ML) could potentially reduce the time wasted on processes that can be easily streamlined (e.g. facial recognition technology uses ML to let people quickly unlock their phones). ML is also interesting because it can analyse large chunks of data and simplify them for us, e.g. through clustering.

Photo by AltumCode on Unsplash

In the project, we were tasked with modifying the code to improve its precision, recall and F1 score metrics, and to add an accuracy metric that checks the percentage of data points that have been predicted correctly. The model that gave the best results was the Linear SVC model, which will be elaborated on in Section 4.

For the evaluation of our model, we used the confusion matrix, which is a method for analysing the success or failure of a machine learning model.

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

A typical binary classification confusion matrix

This confusion matrix compares which domains the data points in the test dataset (explained in part 2) have been predicted to be in against the domains they actually belong to.

Let us assume that this is referring to a set of COVID-19 Tests.

True positives (TPs) refer to data points that have been assigned to the positive domain accurately (E.g. If someone has COVID-19, the test showed a positive result)

False Positives (FPs) refer to data points that have been assigned to the positive domain even though they belong to the negative domain (E.g. Someone testing for COVID-19 does not actually have COVID-19 but tested positive for it)

False Negatives (FNs) refer to data points that have been assigned to the negative domain even though they are supposed to be in the positive domain (E.g. Someone testing for COVID-19 has COVID-19, and should have tested positive but tested negative for it)

True Negatives (TNs) refer to data points that have been assigned to the negative domain and are supposed to be assigned to the negative domain (E.g. Someone testing for COVID-19 is supposed to have tested negative and actually tested negative for COVID-19)

Precision measures, out of all the data points predicted to be positive, how many truly are positive (E.g. Out of all those who tested positive for COVID-19, how many of them actually have COVID-19?).

Precision = TP / (TP + FP)

Recall measures, out of all the data points that actually belong to the positive domain, how many were predicted as positive (E.g. Out of those who have COVID-19, how many of them tested positive for COVID-19?). It is calculated as Recall = TP / (TP + FN).

The F1 score, on the other hand, is the harmonic mean of these 2 figures, as seen in the equation below.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
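To make these definitions concrete, here is a small illustrative example using scikit-learn’s metrics functions on made-up binary labels (this is not the project’s data, just a demonstration of the formulas above):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[5 1], [1 3]]
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of the two = 0.75
```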

Natural Language Processing

To begin, Natural Language Processing (NLP) has 4 main steps, covered in the four sections below.

1. Text Pre-processing

The first step in NLP is text pre-processing, which cleans up the text and reduces the number of unique words fed into the machine. This is done mainly through the removal or normalisation of characters and words; listed below are some examples:

  1. Punctuation — ! , # $ %
  2. Numbers — 12345
  3. Leading and trailing spaces — This describes the removal of spaces before the first character and after the last character in each sentence.
  4. Stopwords — Words that have no relevant contextual meaning in the sentence (e.g. ‘the’, ‘he’, ‘I’, ‘and’)
  5. Stemming — Removes suffixes from words to reduce them to a root form, ignoring any contextual meaning they had
  6. Lemmatizing — Removes suffixes from words while taking their contextual meaning into account

This table below provides a simple comparison between stemming and lemmatizing.

Examples of stemming and lemmatizing

As you can see, lemmatizing tends to better preserve the original meaning of the word than stemming: for example, a stemmer reduces ‘studies’ to ‘studi’, while a lemmatizer returns ‘study’.

One thing to note is that different combinations of text pre-processors affect different metrics differently, and using them all at once may not give the best results. For example, in our code we used all the pre-processors except stemming, for the reason stated above: lemmatizing, unlike stemming, better preserves the original meaning of each word. In addition, extra stopwords were added to the stopwords list in order to remove more unnecessary words.
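Below is a minimal sketch of what such a pre-processing pipeline might look like, using NLTK for illustration; the extra stopwords and exact cleaning steps here are assumptions for demonstration, not the project’s actual code.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
# Standard English stopword list, extended with additional words as described
# above (the added words here are hypothetical examples)
stop_words = set(stopwords.words("english")) | {"patient", "pt"}

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and numbers
    tokens = text.split()                  # splitting also drops leading/trailing spaces
    return " ".join(
        lemmatizer.lemmatize(t)
        for t in tokens
        if t not in stop_words and len(t) > 1  # drop stopwords and stray single letters
    )

print(preprocess("The patient's BP was 120/80; he was feeling better."))
```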

2. Dataset splitting

This involves the division of the dataset into at least 2 subsets, the train and test datasets, to ensure that the model cannot “memorise” the classification of the data points. It can be compared to doing practice questions in school versus sitting the examination itself: if the examination reused the exact same questions as the practice, students would end up memorising how to answer the paper instead of understanding it, preventing them from learning. In the same vein, this “memorisation” is known as overfitting, which results in excellent metrics on the dataset the model trained on, but poor results when different datasets are used for testing.

In the original code, the data was split into two sets, the training set and the test set. However, this meant that any experimentation to improve the metrics had to be evaluated on the test set itself, which defeats its purpose as unseen data. To combat this, we added a validation set (initially a part of the training set), which allowed us to test and tune the model before using it on the final test set. To be more specific, it is a way to validate model performance during training before applying the model to the final test set, as seen in the figure below.

Example of train-test-validation split
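A minimal sketch of how such a three-way split might be done with scikit-learn; the toy data and split ratios here are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split

# Hypothetical corpus: pre-processed notes and their labels (0, 1 or 2)
texts = [f"note {i}" for i in range(100)]
labels = [i % 3 for i in range(100)]

# Hold out 20% of the data as the final test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
# Carve a validation set out of the remaining training data (25% of it,
# i.e. 20% of the full dataset) for tuning before touching the test set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```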

3. Text embedding

These embedders transform unstructured text into a numerical representation that the computer can work with. They can also help ensure that words with similar meanings have similar encodings, so that related words end up close together in the representation.

One example that we used in our code was term frequency-inverse document frequency (TF-IDF). It weights each word by how often it appears in a document and by how rare, and therefore important, it is across the whole document set, so that both relative frequency and importance are captured, through the use of the equation shown below:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

Equation for calculating TF-IDF, where tf(t, d) is how often term t appears in document d, N is the total number of documents, and df(t) is the number of documents containing t.

Another simple text embedder that we attempted to use was the count vectorizer, which operates in a similar manner to the TF-IDF vectorizer but does not weight words by their importance. As such, it was not as effective as TF-IDF in improving the metrics (precision, recall, F1, accuracy).

In addition, we were also able to tweak some of the vectorizer’s hyperparameters, max_features, max_df and min_df, to change how the words were represented, as sketched below.
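Here is a rough sketch of how these hyperparameters could be set on scikit-learn’s TfidfVectorizer; the toy documents and parameter values are illustrative, not the project’s actual settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for pre-processed medical notes
train_texts = [
    "chest pain shortness breath",
    "aortic valve stenosis replacement",
    "coronary artery plaque narrowing",
]

vectorizer = TfidfVectorizer(
    max_features=5000,  # keep at most the 5,000 most frequent terms
    max_df=0.95,        # drop terms that appear in over 95% of documents
    min_df=1,           # on a real corpus, a higher value (e.g. 2) drops rare terms
)
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on training text only
print(X_train.shape)
print(vectorizer.get_feature_names_out())
```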

4. Models used

The models we used were Support Vector Classifiers (SVC) and gradient boosting classifiers, such as the Light Gradient Boosting classifier and the CatBoost gradient boosting classifier. From here, we also used Grid Search to determine the best hyperparameters for our classification task. Our results varied across models, and the Linear SVC (LSVC) model performed the best, with an F1 score of 0.854 and the confusion matrix shown below.

Labels: 0 — Heart Attack, 1 — Coronary Atherosclerosis, 2 — Aortic Valve
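As an illustration of how an LSVC model might be tuned with Grid Search, here is a self-contained sketch built around a scikit-learn pipeline; the toy data and parameter grid are assumptions for demonstration, not the project’s actual setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the pre-processed notes and their 3 labels
texts = [
    "chest pain pressure sweating",
    "coronary artery plaque buildup",
    "aortic valve stenosis murmur",
] * 10
labels = [0, 1, 2] * 10

# A pipeline re-fits the vectorizer inside each cross-validation fold,
# so nothing from a held-out fold leaks into the learned vocabulary
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", LinearSVC())])

# Hypothetical search grid; the project's actual grid may have differed
param_grid = {"svc__C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
grid.fit(texts, labels)
print(grid.best_params_, round(grid.best_score_, 3))
```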

What we learnt

Kayla

Coming into this internship programme, I did not know what to expect, especially since I did not have much experience in machine learning. However, I was excited, as I was looking forward to learning more about coding and applying it to machine learning. The programme did not disappoint: I gained a much clearer understanding of what machine learning is, and it helped me improve my coding skills by a large degree from almost none. I enjoyed learning how to create a basic machine learning model using the SKLearn module.

I did not merely learn about programming and machine learning, but also learnt important lessons that I can apply in the future. For example, I came to understand the importance of organising my work and cleaning it up to ensure that it is readable.

In terms of the working environment, the office has an open-space concept that allows for easy collaboration, which made it easier for me to ask others for help since I did not feel as awkward approaching them. My mentor, Lim Kai Le, was friendly and did not judge me despite the many questions that I asked, and he made concepts easier to understand.

I was hoping to use this internship as a way to explore potential future careers, especially since I am still not sure what I would like to do as a career. From this experience, I have gained an interest in Data Science and will consider it as an option in the future.

I enjoyed the programme and hope to build on it by attending other DSTA and coding-related programmes in my free time that will allow me to learn more.

Eugene

Prior to this work experience programme at DSTA, I already had an interest in computer science and data science, namely machine learning (ML), with experience in computer vision (convolutional neural networks, or CNNs) and in using linear regression and K-means clustering models, while keeping Natural Language Processing (NLP) in the back of my mind. Through this internship, I gained valuable insight into NLP, learning about the different skills it involves, the general NLP pipeline, and the different models used for NLP. This further confirmed my interest in the field of data science and ML, as well as my goal of working in this field in the future.

In regard to the work culture, I had already immersed myself in the general workplace culture through prior internships, with friendly and helpful mentors and colleagues, so there was not much new insight there. However, the environment at DSTA felt very fresh, as the workspace was open concept, with no cubicle walls, an office style that I had never worked in before in my other internships. This made me feel like I was a part of DSTA rather than a short-stint intern, especially with the friendly colleagues around us.

Overall, I found this experience fulfilling and enriching, taking away new knowledge and skills, as well as immersing myself in a different kind of office environment.
