Exploring the oft-neglected building block of transformers

Image credits: Unsplash

The truth is, tokenizers are not that interesting. When I first read the BERT paper, I raced past the section on WordPiece tokenization because it wasn’t as exciting as the rest of the paper. But tokenization has evolved from word-level to sub-word tokenization, and different transformers use different tokenizers, which are quite a handful to understand.

There are already good articles that discuss and explain tokenizers; the ones I like most are a detailed blog post by FloydHub and a short tutorial by Hugging Face.

Instead, I want to focus on application: specifically, how the tokenizers of different models behave out of the box, and how that affects our models’ ability to comprehend text. If you start your NLP task with a pre-trained transformer model (which usually makes more sense than training from scratch), you are stuck with the model’s pre-trained tokenizer and its vocabulary. Knowing its behaviour and quirks allows you to choose the best model and debug issues more easily. …
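To make the "stuck with its vocabulary" point concrete, here is a minimal sketch of a greedy longest-match-first sub-word lookup in the style of WordPiece. It is a toy under stated assumptions, not the code of any real library: the function names and the tiny `TOY_VOCAB` are hypothetical, and a real pre-trained vocabulary holds tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest sub-word pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, for illustration only.
TOY_VOCAB = {"token", "##izer", "##s", "bank", "##rupt", "##cy"}


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then split each word into sub-words."""
    return [p for w in text.lower().split() for p in wordpiece_tokenize(w, vocab)]


print(tokenize("Tokenizers"))   # ['token', '##izer', '##s']
print(tokenize("bankruptcy"))   # ['bank', '##rupt', '##cy']
print(tokenize("EBITDA"))       # ['[UNK]']
```

A domain term that is missing from the fixed vocabulary is either broken into generic pieces or collapsed into a single unknown token, which is the kind of quirk worth checking before committing to a model.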

Image source: Unsplash

Do we still need humans to read boring financial statements?

Machine learning models aim to learn a good representation of their input data to perform their task. The way models learn to represent words in natural language processing (NLP) has evolved in recent years, and in this article we explore the notable changes in how models understand language to make financial decisions.


We focus on a direct application of NLP to financial markets: automating sentiment classification of text documents to make fast and accurate investment calls that are free of human bias.

For us to compare meaningfully across the technology shifts, we use a Financial Phrase Bank[1] that contains labelled financial phrases (example…

Explaining evaluation metrics in basic terms

Machine learning terms can seem very convoluted, as if they were made to be understood by machines. There are unintuitive and similar-sounding names like False Negatives and True Positives, Precision, Recall, Area Under ROC, Sensitivity, Specificity and Insanity. OK, the last one wasn’t real.

There are some great articles on precision and recall already, but when I read them and other discussions on Stack Exchange, the messy terms all mix up in my mind and I’m left more confused than an unlabelled confusion matrix, so I’ve never felt like I understood them fully.

Image: A confused confusion matrix

But to know how our model is working, it is important to master the evaluation metrics and understand them at a deep level. So what does a data scientist really need to know to evaluate a classification model? I explain the most important ones below using visuals and examples so they stick in our brains for good. …
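As a quick anchor for the terms above, here is a minimal sketch that counts the four cells of a binary confusion matrix and derives precision, recall (also called sensitivity) and specificity from them. The helper names and the toy labels are made up for illustration; they are not taken from any dataset or library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Toy labels, invented for this example.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision = safe_ratio(tp, tp + fp)    # of predicted positives, how many were right
recall = safe_ratio(tp, tp + fn)       # of actual positives, how many were found
specificity = safe_ratio(tn, tn + fp)  # of actual negatives, how many were found

print(tp, fp, fn, tn)  # 3 2 1 4
print(precision)       # 0.6
print(recall)          # 0.75
```

Precision and recall share the same numerator (true positives) and differ only in the denominator, which is a handy way to keep the two apart.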


Neo Yi Peng

I condense insights/ideas from my machine learning experiments in the markets. | LinkedIn: https://bit.ly/3dsThtb
