Riding the Tailwind of NLP Explosion

Rongyao Huang · Published in CBI Engineering · Apr 25, 2022

Frontline story from the R&D team @CB Insights

We ingest 2 million documents monthly at CB Insights (CBI) to empower tech decision-makers and researchers. As ML/NLP practitioners, we know all too well the challenge behind this statement: data doesn't turn into insights by itself; the very first hurdle is extracting relevant information with scale, speed, and precision.

When we started at CBI, NLP was still in its prehistoric era, when the "bag of words" walked the earth. Fast forward ten years: the birth of the "attention mechanism" set off an NLP explosion and a strong tailwind for teams big and small to ride.

In this 3-part series, we'll share how we modernized our NLP stack @ CBI Delphi (yup, that's our R&D team), the challenges we wrestled with, and the lessons we learned. Part I gives an overview of CBI's NLP stack along the "attention revolution" timeline: how we started experimenting with deep learning in NLP and where the pipeline stands today. Part II jumps into the trenches and shares lessons learned using transformer models across various tasks and languages, whether fine-tuning for financial NER or zero-shotting customer testimonial extraction. Part III peeks above the canopy and discusses where the field of DS is going and what that means for its practitioners.

If you find them relevant or interesting, I’d love to hear from you.

Ok, here we go.

Part I–III Agenda

Part I — The Wind Rises

Language is the gateway to the vast majority of accumulated human knowledge. This is what the field of Natural Language Processing (NLP) aims to crack: enable machines to understand and use human language.

It starts with the problem of representation. As you read this post or listen to your favorite podcast, information is encoded into your brain, where it can be retrieved and used later in unforeseen contexts. What 50,000 years of evolution gave human intelligence, we're now building from scratch for machines: how do we embed language in a vector space that captures the syntactic, the semantic, and ultimately fundamental concepts and relationships?

From Prehistoric to Bronze Age

The past decade has seen NLP travel from prehistoric times to the bronze age in how it represents language.

Evolution of NLP from the Prehistoric to Bronze Age

Some ancient inventions never go out of style. In a sense, bag-of-words and TF-IDF are like paper: they remain pillars of modern NLP applications such as search engines. Come 2013, Word2Vec marked the beginning of the stone age, and people are still amazed by the king − man + woman ≈ queen discovery of vector space. The bronze age debuted with neural language models (2017) and prospered with the "attention mechanism." From ELMo to GPT, BERT, and its thousand variations, it's the era in which language models ride the scaling laws into the trillion-parameter weight class and achieve unprecedented generalizability. Super transformers are now omnipresent.

Timeline of the Attention Revolution & Supermodel Births

Three Pillars

The prosperity of the bronze age stands on three pillars: the "attention" mechanism, self-supervised learning, and power-law scaling.

At its core, "attention" is a more effective way to encode information, alleviating critical problems that sequential models face: slowness, difficulty with long-range dependencies, exploding/vanishing gradients, etc. It enables context-aware embeddings that are deeply bidirectional and can be learned in parallel.

multi-head attention
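To make this concrete, below is a minimal sketch of scaled dot-product attention, the building block that multi-head attention runs many copies of in parallel. It's a NumPy illustration under simplifying assumptions (single head, no masking, no learned projections), not production code:

```python
# Minimal scaled dot-product attention (single head, no masking).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of queries, keys, and values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # context-aware mix of values

# Every token attends to every other token in one matrix multiply,
# which is why attention parallelizes where RNNs cannot.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```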

At the same time, "self-supervision," such as masked language pre-training, has unlocked the whole internet as a playground for language models. Its effect feels no less magical than watching my kid learn to read.

masked language pre-training
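You can see the end product of masked language pre-training in a few lines with the Hugging Face transformers library; the model and sentence here are illustrative, not part of our pipeline:

```python
# Ask a pre-trained BERT to fill in a masked token.
from transformers import pipeline

unmask = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
# The model ranks plausible fillers ("paris", ...) learned purely from
# raw text, with no human labels involved.
```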

Finally, the empirical scaling law of language models strongly favors spending extra compute on model size N: over the range studied, the optimal model size increases by six orders of magnitude, while the data required increases by only two.

Optimal allocation of increase in compute towards model size N and data requirement D
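As a back-of-the-envelope check, here is that allocation worked out numerically, assuming the approximate compute-optimal exponents from Kaplan et al. (2020), which are an assumption on my part rather than numbers from the figure above:

```python
# Compute-optimal allocation, roughly N_opt ~ C^0.73 and D_opt ~ C^0.27
# (approximate exponents from Kaplan et al., 2020).
import math

C = 1e8                        # suppose available compute grows 8 orders of magnitude
alpha_N, alpha_D = 0.73, 0.27  # share of growth going to model size vs. data
print(f"optimal model size grows ~10^{math.log10(C ** alpha_N):.1f}x")  # ~10^5.8
print(f"required data grows    ~10^{math.log10(C ** alpha_D):.1f}x")    # ~10^2.2
```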

Since BERT, super language models have been arriving at an unprecedented pace, starting in the 100-million-parameter range and quickly growing to the trillion level. We've seen a paradigm shift from "supervised learning" to "pre-train, fine-tune" and then to "pre-train, prompt, predict," all within the past five years. The advancement in generalizability is astounding, and early trends in efficiency and explainability are exciting.

NLP Stack Modernization @CBI

So far, this hasn’t yet turned into a sad story of extinction but remains a happy one of increasing abundance.

As it currently stands, CBI's NLP pipeline reflects this brief history of NLP like layers of canyon rock. The pipeline starts with raw text ingestion, goes through a chain of standardized processing, and fans out into various applications. Aside from the CoreNLP heavy lifting, the standardized pipe encompasses a familiar bag of tricks such as stopwords, regex, and string similarity. It lives and breathes like a crocodile, with all its prehistoric might. At the NLP application layer, we now have 20+ transformer models running in production, from classification to custom NER and QA.

High-level NLP Pipeline at CBI

The wind started to rise in Nov. 2018, when Google open-sourced BERT on GitHub. Having never touched a pre-trained language model, and with only surface knowledge of the transformer architecture, I set out to experiment on a company hack day in spring 2019. The task was to predict news sentiment for companies based on topics ranging from 'partnership' and 'product launch' (examples of positives) to 'lawsuit' and 'layoff' (examples of negatives). The labeled set had 3,447 data points, and I had just spent a month on it to get a decent precision-recall improvement. I figured 3k data points might be a joke to a hundred-million-parameter model, but it wouldn't hurt to try. The result shocked me: the pre-trained neural language model learned from a training split of roughly 2,000 examples and generalized well. I got a >10 ppt boost in precision and recall within a couple of hours.
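For anyone who wants to reproduce the spirit of that hack-day run, a bare-bones fine-tuning loop looks roughly like the sketch below. The model name, toy data, and labels are stand-ins; the real task used our internal news dataset:

```python
# Fine-tune a pre-trained BERT for binary news-sentiment classification.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["Acme launches new product line", "Acme hit with patent lawsuit"]
labels = [1, 0]  # 1 = positive topic, 0 = negative topic (toy data)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok(texts, truncation=True, padding=True, return_tensors="pt")

class NewsDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=NewsDataset(),
)
trainer.train()  # with a few thousand real examples, a couple of hours
                 # can translate into double-digit precision/recall gains
```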

That result both piqued our interest and made the shift feel real. Compared to the few big and wealthy, many data science teams are 'poor' in the sense that they must relentlessly focus on delivering practical results under resource constraints. Data annotation is often the first hurdle: not every project enjoys the luxury of abundant training data. Then there are production and maintenance costs, not to mention the collective learning curve the team has to climb. The pretrain-finetune paradigm shift brought by BERT is a game-changer. It's a poor man's deep learning dream come true.

Since that first hackathon experiment, we have fine-tuned and zero-shotted transformer models on various tasks, with success on dataset sizes ranging from the 100s to the 100,000s, in English and beyond.
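As a taste of the zero-shot side, an off-the-shelf NLI model can classify text against arbitrary labels with no training data at all. The snippet and labels below are illustrative, not one of our production models:

```python
# Zero-shot classification via natural language inference.
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = clf("Acme announced a $30M Series B round led by Example Ventures.",
             candidate_labels=["funding", "lawsuit", "product launch"])
print(result["labels"][0], round(result["scores"][0], 3))  # top label: "funding"
```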

Transformer Application Timeline @CBI

Tips for New Tech Adoption

Adopting a new tech in an organization is never easy. Reflecting on our own journey in modernizing the NLP stack at CBI, we have a couple of learnings to share.

1. Experiment downstream and propagate upstream
  • For us, experiments started in the application layer. Upstream renovation is typically a problem of incremental efficiency, and therefore a tougher sell than something that unlocks new business capabilities.
  • Downstream applications, on the other hand, have a clearer value prop and cause fewer ripple effects. Once we had amassed enough evidence of success, the newly accumulated knowledge and confidence made revamping the upstream easier.

2. Invest in building test harnesses and performance benchmarks

  • In most places I've worked, people don't spend enough time on test harnesses and performance benchmarks because they're less exciting. Deploy-and-forget is a common practice. Nevertheless, harnesses and benchmarks are critical to the continuous improvement of ML products, and their absence is often the first roadblock to renovating an existing one. In software engineering, a maintainable codebase starts with good test coverage. The same should become standard in machine learning.
  • A new hire should be able to replicate the existing performance benchmark, swap out any component/function in an ML pipeline, and know its impact on performance right away. So if your train/val/test data is tucked away, your success metrics (both technical and product) are poorly documented, and the criteria for building a test harness are nowhere to be found, it's time to change that (a minimal sketch of such a check follows below).
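Here is roughly what the smallest useful version of that harness can look like: a regression check that pins the benchmark metrics. The metric floors and the evaluate() stub are illustrative, not our actual harness:

```python
# Pytest-style performance-regression check against a frozen benchmark.
BASELINE = {"precision": 0.91, "recall": 0.87}  # committed alongside the model
TOLERANCE = 0.01                                # allowed drop before CI fails

def evaluate():
    # Stand-in for running the pipeline on the versioned test split and
    # computing metrics; replace with your actual evaluation call.
    return {"precision": 0.92, "recall": 0.88}

def test_model_meets_benchmark():
    metrics = evaluate()
    for name, floor in BASELINE.items():
        assert metrics[name] >= floor - TOLERANCE, (
            f"{name} regressed: {metrics[name]:.3f} < {floor - TOLERANCE:.3f}")
```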

3. Learning fast and slow

  • Experiment fast. Going back to the hack-day BERT story: don't be afraid to run quick-and-dirty experiments on something you don't completely understand; even a production deployment can be justified if all the safety boxes are checked.
  • However, for any new knowledge to stick and prosper, there's a learning curve the team needs to climb collectively. Bootstrap with quick experiments, but understanding the fundamentals and propagating that knowledge goes a long way.

4. Phased transition

  • Even when we're confident about a new solution, it's almost always preferable to soft-release it and conduct an actual impact analysis than to simply cut over from the existing workflow.
  • One transformer-heavy pipeline at CBI is company funding extraction from news. It's bread-and-butter data ingestion and demands high precision and recall. We ran parallel systems for all transformer experiments in this pipeline: human reviewers controlled whether to turn on the new classifier, and we analyzed performance data for months before decommissioning the old workflow. A minimal sketch of this parallel-run pattern follows below.
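In code, the parallel-run pattern can be as simple as the sketch below; the classifier stubs and the reviewer flag are illustrative, not our funding pipeline:

```python
# Run old and new classifiers side by side, log both, serve one.
import logging

def old_classifier(text):   # stand-in for the legacy workflow
    return "funding" if "raised" in text.lower() else "other"

def new_classifier(text):   # stand-in for the new transformer model
    return "funding" if "series" in text.lower() else "other"

def classify(text, reviewer_opts_in):
    old, new = old_classifier(text), new_classifier(text)
    # Both predictions are logged so agreement can be analyzed for
    # months before the old workflow is decommissioned.
    logging.info("old=%s new=%s agree=%s", old, new, old == new)
    return new if reviewer_opts_in else old

print(classify("Acme raised a $30M Series B", reviewer_opts_in=True))
```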

— — —

That’s it for now. I hope you’ve stumbled upon something interesting here.

Stay tuned for Part II :)
