Building Machine Learning models with BERT

Emma Findlow
Published in Wallscope
4 min read · Aug 11, 2020

Darwon Rashid is a Machine Learning Engineer at Wallscope. As part of our R&D work he’s been working with Cirrus, a supercomputer at the Edinburgh Parallel Computing Centre (EPCC), to develop an AI model that recognises medical entities. Situated within the University of Edinburgh, EPCC has been an international centre for excellence in high-performance computing for 30 years.

I chatted to Darwon about his work, the challenges he’s faced and how having a supercomputer to play with has helped…

Darwon (right) deep in concentration at a Wallscope event

Hi Darwon! Can you tell us a bit more about what you’re working on at the moment?

An AI language model that can recognise medical entities (such as diseases and drugs), so that we can then link them to the appropriate medical codes.

This is very valuable because it automates part of that coding work and reduces the burden on GPs and other medical professionals. It also creates a powerful tool for extracting information from medical notes.

Right now, I’m also working on building a general purpose entity finder which will recognise locations, organisations, and other general entities (18 different ones to be exact). The focus right now is on health data, but ultimately the idea is to train it with the appropriate data to recognise entities from various domains of knowledge.
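As a rough illustration of what calling such an entity finder can look like, here is a minimal sketch using the Hugging Face transformers library and a publicly available general-purpose BERT NER model; both the library choice and the model name are assumptions made for illustration, not Wallscope’s own code or model.

```python
# Illustrative only: a publicly available BERT-based NER model from the
# Hugging Face model hub, not the Wallscope model described above.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

text = "Darwon Rashid works at Wallscope with EPCC in Edinburgh."
for entity in ner(text):
    # Each result carries a label, the matched text span and a confidence score.
    print(entity["entity_group"], entity["word"], f"{entity['score']:.3f}")
```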

And what Machine Learning models are you using?

The machine learning model I’m using is called BERT, which stands for Bidirectional Encoder Representations from Transformers. It is a pre-training technique for natural language processing (NLP) developed by Google.

BERT uses the mechanism of “paying attention” to better understand the contextual relationships between the words (or sub-words) in a sentence.

I’d like to know more about how BERT pays attention!

For example, consider the phrase “It looked like a painting”. BERT looks at the word “it” and then checks its relationship with every other word in the sentence. This way, BERT can tell that “it” refers strongly to “painting”, which allows it to understand the context of each word in a given sentence. The model also understands the difference in the word “bank” between the phrases “I am by the river bank” and “I just withdrew money from the bank”: it can tell that “bank” has a different contextual meaning in each. This blog post explores BERT in more technical detail.
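To make the “bank” example concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is confirmed in the article) that extracts BERT’s contextual embedding of “bank” from each phrase and compares them.

```python
import torch
from transformers import BertModel, BertTokenizer

# The checkpoint and the use of the final hidden layer are illustrative choices.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Run the sentence through BERT and pull out the final-layer hidden state
    # for the token "bank", i.e. its contextual embedding in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return outputs.last_hidden_state[0, tokens.index("bank")]

river = bank_vector("I am by the river bank")
money = bank_vector("I just withdrew money from the bank")

# The two "bank" vectors differ because BERT conditions each word's
# representation on the surrounding context.
print(torch.cosine_similarity(river, money, dim=0).item())
```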

A screenshot of BERT’s attention mechanism on an input example

For example, as BERT processes the input above, we can see which words it focuses on when considering the word “it” in the sentence. BERT attends most strongly to “animal”, indicating that it has a contextual understanding of the relationships between the words in the sentence.
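The screenshot above shows a visualisation of these attention weights; as a rough sketch of the same idea, the raw weights can be read straight out of the model. The sentence, checkpoint and choice of layer below are illustrative assumptions, and in practice different heads and layers show the pattern with different strength.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The animal didn't cross the street because it was too tired"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
it_index = tokens.index("it")

# outputs.attentions holds one tensor per layer of shape
# (batch, heads, seq_len, seq_len); the row for "it" says how strongly "it"
# attends to every other token. Averaging the heads of the last layer is only
# a rough summary.
weights = outputs.attentions[-1][0].mean(dim=0)[it_index]
for token, weight in sorted(zip(tokens, weights.tolist()),
                            key=lambda pair: pair[1], reverse=True)[:5]:
    print(f"{token:>10}  {weight:.3f}")
```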

What are the main challenges you’ve faced during this work?

BERT is a relatively large model and it takes time to train (the largest version has 340 million trainable parameters). Running experiments and testing hypotheses can be slow and time-consuming on poor hardware. Since BERT is also a relatively new model, finding reliable resources and guidelines can be hard, and the community is still uncovering new insights into its performance. I spend a lot of my time trying things out in the hope of improving its performance, and when that process becomes time-consuming it can bog down development.

Cirrus logo

And how has the collaboration with EPCC helped you with this?

Using a supercomputer speeds things up by quite a lot! An average computer might have 4 physical cores and 8GB of RAM; this supercomputer has 280 compute nodes, and each node gives you access to 36 physical cores, 256GB of RAM and four GPUs. When Cirrus was finally set up to fine-tune BERT using four GPUs, training time dropped from around two hours per epoch to around ten minutes per epoch, a drastic difference! Because I could run several experiments on Cirrus at once, I was able to test multiple hypotheses and parameter settings in parallel.
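The article doesn’t describe exactly how the fine-tuning run was parallelised on Cirrus, but as a hedged sketch, this is one common way to let a single PyTorch fine-tuning job spread its batches across all four GPUs on a node; the model name and label count are placeholders.

```python
import torch
from transformers import BertForTokenClassification

# On one of the GPU nodes described above, PyTorch should report four cards.
print(torch.cuda.device_count())

model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=19)  # placeholder: e.g. 18 entity types + "O"

# DataParallel replicates the model and splits each batch across the visible
# GPUs; DistributedDataParallel is usually faster on a cluster but needs more
# setup with the job scheduler.
if torch.cuda.is_available():
    model = torch.nn.DataParallel(model).cuda()
```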

What’s been your favourite part of working on this project?

I get to work on state-of-the-art technologies and I have the freedom to do what I find interesting; I’m like a kid who gets to play with very cool toys! I’m also very focused on R&D, so this project fits me well.

What are the next steps?

Train other complicated models if I have a supercomputer to play with! There are two other language models, RoBERTa and XLNet, that I would like to explore and compare with BERT on the same task.
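Because BERT, RoBERTa and XLNet all expose the same interface in the Hugging Face transformers library, such a comparison is largely a matter of swapping the checkpoint name and re-running the same fine-tuning and evaluation pipeline. A minimal sketch, assuming that library and using a placeholder label count:

```python
# Illustrative only: the same token-classification head can sit on top of
# BERT, RoBERTa or XLNet by changing the checkpoint name.
from transformers import AutoModelForTokenClassification

for checkpoint in ["bert-base-cased", "roberta-base", "xlnet-base-cased"]:
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=19)  # placeholder label count
    total = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {total / 1e6:.0f}M parameters")
```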

In general what kind of problems are you personally most interested in solving with these types of technologies?

Because Wallscope works a lot with semantic technologies, my hope is that the industry catches up and goes in this direction.

The power of this technology is that it is applicable across so many areas of knowledge. A lot of businesses are sitting on information without knowing what to do with it, and we thrive on solving business problems that involve unstructured data.

https://wallscope.co.uk
