Machine learning models use increasingly many parameters and floating-point operations (the GPT-3 model has 175 billion parameters, while the Switch Transformer has 1 trillion weights), and the trend is likely to continue, given that scaling up is still one of the most sure-fire ways of increasing performance. However, these numbers demand a careful Return-on-Investment (ROI) analysis. Sure, by training a model that is 100x bigger than your current one you might gain 10% in accuracy on some academic benchmarks, but is it worth the financial and environmental cost for your not-yet-FAANG startup?

A standard ROI analysis…


Google Brain’s language model that switches itself on and off

In the last three years, Transformer-based language models (LMs) have been stealing the show in the natural language processing (NLP) world. Transformers are usually huge networks pre-trained on massive amounts of unstructured text, capturing generally useful linguistic properties. Pre-trained models can then be fine-tuned for a myriad of end-tasks like question answering or machine translation, even on modest amounts of labeled data (see this article for the latest trends in pre-training LMs). The T5 model, Google’s record holder on multiple NLP benchmarks, was recently outranked by its own Switch Transformer.
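
As a concrete (if simplified) illustration of the pre-train-then-fine-tune recipe, here is a minimal sketch using the Hugging Face transformers library; the checkpoint, task, and hyperparameters are my own choices for illustration, not anything prescribed by the article.

```python
# A minimal sketch of fine-tuning a pre-trained LM, assuming the Hugging Face
# `transformers` library (my choice of tooling, not the article's).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # reuse pre-trained weights, add a fresh task head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step on a toy labeled example (sentiment: 1 = positive).
batch = tokenizer("A delightful little film.", return_tensors="pt")
loss = model(**batch, labels=torch.tensor([1])).loss
loss.backward()
optimizer.step()
```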

Photo by Thomas Jensen on Unsplash

Not all knowledge is useful all the time. This observation…
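
The Switch Transformer acts on this observation through sparsity: a learned router sends each token to a single "expert" feed-forward network, so only a small fraction of the trillion weights is active for any given input. Below is a minimal NumPy sketch of that top-1 routing; the shapes and names are mine, and the real model adds load-balancing losses and expert capacity limits that I omit here.

```python
import numpy as np

def switch_route(x, router_w, experts):
    """Top-1 expert routing, the core idea behind the Switch Transformer.

    x: (d,) one token's representation; router_w: (d, n_experts);
    experts: list of callables, each a small feed-forward network.
    """
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # router softmax
    k = int(np.argmax(probs))            # each token visits exactly ONE expert
    return probs[k] * experts[k](x)      # gate value keeps routing differentiable

# Example: three tiny linear "experts" on 4-dim inputs.
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(4, 4)): x @ W for _ in range(3)]
y = switch_route(rng.normal(size=4), rng.normal(size=(4, 3)), experts)
```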


A BERTology contribution by Huawei

The ongoing trend of building ever larger models like BERT and GPT-3 has been accompanied by a complementary effort to reduce their size at little or no cost in accuracy. Smaller, efficient models are built via distillation (Pre-trained Distillation, DistilBERT, MobileBERT, TinyBERT), quantization (Q-BERT, Q8BERT), or parameter pruning.
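
For intuition, here is what the distillation ingredient typically looks like: the student is trained to match the teacher's softened output distribution (Hinton-style knowledge distillation). A minimal sketch; the temperature value and tensor shapes are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the softened teacher and student distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's softened predictions
    return -(T ** 2) * np.sum(p * np.log(q + 1e-12))
```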

On September 27, Huawei introduced TernaryBERT, a model that leverages both distillation and quantization to achieve accuracy comparable to the original BERT model with a ~15x decrease in size. What is truly remarkable about TernaryBERT is that its weights are ternarized, i.e. …
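
For a flavor of what ternarization involves, here is a minimal sketch of threshold-based ternarization in the spirit of Ternary Weight Networks, one of the methods TernaryBERT draws on: every weight is mapped to a scaled value in {-1, 0, +1}. The 0.7 threshold factor is the standard TWN heuristic; TernaryBERT's actual scheme has further refinements.

```python
import numpy as np

def ternarize(w):
    # TWN-style approximation: each weight becomes alpha * {-1, 0, +1},
    # so the tensor can be stored with ~2 bits per weight plus one scale.
    delta = 0.7 * np.mean(np.abs(w))                        # heuristic threshold
    mask = np.abs(w) > delta                                # weights that survive
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0   # per-tensor scale
    return alpha * np.sign(w) * mask
```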


An intuitive high-level overview

Memory-Augmented Neural Networks (MANNs) were introduced in 2014 by two concurrent research efforts: Neural Turing Machines and Memory Networks. Since then, they have expanded into a broader topic that spans beyond these original implementations. However, I will stick to a high-level intuitive overview. This article is meant to distill the last 7 years of research into a 7-minute read, removing paper-specific terminology and implementation details that didn't pass the test of time.

Memory-Augmented Neural Networks (MANNs) are differentiable versions of the von Neumann architecture. …
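
To make "differentiable memory" concrete, here is a minimal sketch of content-based addressing, the read mechanism shared by Neural Turing Machines and Memory Networks: the controller emits a key, and the memory is read as a softmax-weighted average of its slots, keeping everything differentiable. Names and the sharpness parameter are my own.

```python
import numpy as np

def content_read(memory, key, beta=1.0):
    # memory: (n_slots, d); key: (d,); beta sharpens the addressing.
    # Cosine similarity between the key and every memory slot.
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )
    w = np.exp(beta * sims)
    w /= w.sum()          # soft (differentiable) addressing weights
    return w @ memory     # read vector: weighted blend of all slots
```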


A BERTology contribution by ByteDance (yes, the TikTok people!)

Just when we thought that all name variations of BERT were taken (RoBERTa, ALBERT, FlauBERT, ColBERT, CamemBERT etc.), along comes AMBERT, another incremental iteration on the Transformer Muppet that has taken over natural language understanding. AMBERT was published on August 27 by ByteDance, the developer of TikTok and Toutiao.

AMBERT proposes a simple twist to BERT: tokenize the input twice, once with a fine-grained tokenizer and once with a coarse-grained tokenizer.
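
A toy illustration of the two views of the same input is below. The segmentations are hypothetical; AMBERT uses learned vocabularies (e.g., characters vs. words for Chinese, words vs. phrases for English).

```python
# Toy illustration only: the phrase segmentation here is hand-picked,
# not produced by AMBERT's actual tokenizers.
sentence = "new york is cold"

fine_tokens = sentence.split()               # fine-grained: individual words
coarse_tokens = ["new york", "is", "cold"]   # coarse-grained: multi-word phrases

# AMBERT encodes both token sequences with encoders that share parameters,
# so the model sees two granularities of the same input.
```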

This article is mostly a summary of the AMBERT paper meant to distill the main ideas without the nitty-gritty details, but I will occasionally chime in with personal observations…


Even the most popular NLP benchmarks are facing these challenges

Photo by Jachan DeVol on Unsplash

Garbage in, garbage out. You don’t have to be an ML expert to have heard this phrase. Models uncover patterns in the data, so when the data is broken, they develop broken behavior. This is why researchers allocate significant resources towards curating datasets. However, despite best efforts, it is nearly impossible to collect perfectly clean data, especially at the scale demanded by deep learning.

This article discusses popular natural language datasets that turned out to disobey fundamental principles of machine learning and data science, despite being produced by experts in the field. Some of these flaws were exposed and quantified…


The Uncertain Future of Token Prediction

Photo by Patrick Tomasso on Unsplash

Pre-training is now ubiquitous in natural language understanding (NLU). Regardless of the target application (e.g., sentiment analysis, question answering, or machine translation), models are first pre-trained on vast amounts of free-form text, often hundreds of gigabytes. The intention is to initialize models with general linguistic knowledge that can be later leveraged in multiple contexts. A pre-trained model that is linguistically well-versed can then be fine-tuned on a much smaller dataset to perform the target application.
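
One popular pre-training objective is BERT-style masked language modeling: hide a random subset of tokens and train the model to reconstruct them from context. A minimal sketch follows; BERT masks ~15% of tokens, the helper name is mine, and I omit BERT's 80/10/10 replacement scheme for brevity.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Hide a random subset of tokens; the model is trained to predict
    # the originals at the masked positions from surrounding context.
    masked, labels = [], []
    for t in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(t)        # loss is computed on this position
        else:
            masked.append(t)
            labels.append(None)     # no loss on unmasked positions
    return masked, labels
```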

While we’ve empirically established the usefulness of exposing a model to endless Internet blabber, it’s still not obvious how the model should interact with it…

Iulia Turc

iuliaturc.com | Software Engineer | Natural Language Processing Research
