EconoBERT: Bridging the gap between NLP and Economics

Samuel Chaineau
5 min read · Jul 23, 2023


Over the past month, I decided to start a personal project linking my two areas of “expertise” (in quotation marks because it is never up to you to define yourself as a specialist): Natural Language Processing and economics. After much time spent looking for studies and articles on NLP applied to economics, I found the existing work dated, of limited relevance, and difficult for the broader community to understand. I decided to take a first step towards opening NLP to economics.

This article presents a model I published on HuggingFace: EconoBert. It is a version of vanilla BERT fine-tuned on central bank speeches scraped from the Bank for International Settlements’ website. As a result, EconoBert’s embeddings better represent the vocabulary used in economics. I will shortly submit a second model derived from the first, econo-sentence, which aims to provide better embeddings for similarity tasks, meaning it better captures common topics in economics.
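To give a concrete feel for what “better embeddings” means, here is a minimal sketch of how one could compare two economics sentences with EconoBert today. Mean-pooling the encoder’s last hidden state is only a rough baseline for sentence embeddings (not the upcoming econo-sentence model itself), and the example sentences are my own.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load EconoBert as a plain encoder (no task head)
tokenizer = AutoTokenizer.from_pretrained("samchain/EconoBert")
model = AutoModel.from_pretrained("samchain/EconoBert")

sentences = [
    "The central bank raised interest rates to curb inflation.",
    "Monetary policy was tightened in response to rising prices.",
]

# Tokenize both sentences in one padded batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence vectors
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```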

TLDR:

I built a BERT for economists.

Link to the model: https://huggingface.co/samchain/EconoBert

Link to the dataset: https://huggingface.co/datasets/samchain/BIS_Speeches_97_23
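As a quick start, the model can be queried through the standard transformers fill-mask pipeline. Since EconoBert is a fine-tuned bert-base-uncased, I assume here that the checkpoint exposes the masked-language-modeling head it was trained with; the prompt is just an illustrative sentence of mine.

```python
from transformers import pipeline

# Fill-mask is EconoBert's native task (it was trained with MLM)
fill = pipeline("fill-mask", model="samchain/EconoBert")

# A domain-flavored prompt; the model should rank economics vocabulary highly
for pred in fill("The central bank decided to raise its policy [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```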

Why NLP x Economics?

Economics is a fascinating science at the frontier of many fields. It requires studying and understanding complex systems involving agents that behave more or less rationally. Its role in our modern society has been growing (no opinion on whether that is for better or worse). 2008 and 2012 were the years when we truly understood how badly things can go when the economy is unregulated.

Economics relies heavily on quantitative methods. The vast majority of students in economics (if not all) are familiar with statistical tests, models and even some Machine Learning techniques. The economy is getting ever more complex, requiring ever more advanced and sophisticated methods to meet those challenges. Coming from a background in economics myself, I can say it is what guided me towards Data Science and now Deep Learning. Economics has the right mindset.

What is even more interesting is the incredible amount of textual data in the field. As monitoring the economy is key, its coverage produces countless reports, comments, articles and speeches. This data should neither sit unused nor be diluted into a generic corpus that strips away its specificity. Hence, there is a need to bring value to those files sitting isolated on the web, in offices and on SharePoints. Economics has the right data.

Some articles can be found where economists try NLP tools like TF-IDF, LDA or Word2Vec for research. Few gain recognition, and even fewer see regular use in the field. Economics does not yet have the right maturity.

Economics needs to step into the NLP/transformers era, all the more since the release of GPT-4, LLaMa and other LLMs. Using these kinds of models can dramatically improve many studies and analyses. However, not all economists are comfortable with these technologies, nor with their use cases. This shift will require commitment from both communities (deep learners and economists) and SMART objectives. Economics can reach this stage.

The first move:

As a first move, I decided to release one simple model that can nonetheless find numerous applications in various fields of economics. The corpus of documents used is a complete scrape of the BIS speeches, representing 18k speeches from 119 public financial institutions.

EconoBert:

EconoBert aims at adapting BERT to the specific vocabulary of economics by using central bank speeches as training data. This decision has two motivations:

  • Central Banks’ speeches tend to cover a broad range of topics (inflation, financial stability, employment, interest rates, banking, GDP, crises…)
  • Speeches are delivered frequently, making it easy to refresh the model and adapt it to current trends

EconoBERT is the well-known “bert-base-uncased” fine-tuned on 12k speeches from central bankers, representing approximately 33.8M tokens. Details of the procedure are available on the model card: https://huggingface.co/samchain/EconoBert

The related dataset is also available on HuggingFace: https://huggingface.co/datasets/samchain/BIS_Speeches_97_23
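The corpus can be pulled straight from the Hub with the datasets library. A minimal sketch follows; I am not assuming anything about the column names, so it only inspects the splits and one record.

```python
from datasets import load_dataset

# Download the BIS speeches corpus from the HuggingFace Hub
ds = load_dataset("samchain/BIS_Speeches_97_23")

# Inspect splits and columns; "train" is the usual default split name,
# an assumption here -- check the dataset card for the actual schema
print(ds)
print(ds["train"][0])
```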

The BERT training scheme
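For readers who want to reproduce this kind of domain adaptation, here is a hedged sketch of the standard masked-language-modeling fine-tuning loop with transformers. It is not the exact procedure from the model card (which has the real hyperparameters); the "text" column name and all the training values below are assumptions for illustration.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the vanilla checkpoint EconoBert was derived from
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

ds = load_dataset("samchain/BIS_Speeches_97_23")

# "text" is an assumed column name; adapt to the actual dataset schema
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(tokenize, batched=True, remove_columns=ds["train"].column_names)

# The collator masks 15% of tokens on the fly (BERT's standard MLM setup)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="econobert-mlm",
    per_device_train_batch_size=8,
    num_train_epochs=3,  # illustrative values, not the model card's
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```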

The short-, mid- and long-term philosophy:

Short:

The objective is not to create a second HuggingFace for economics or any kind of duplicated work. The main objective is to use existing libraries, frameworks and materials from an economist’s perspective. Every model will be published on HuggingFace. I wish to release various BERT-like models (and some text summarization models) trained on curated datasets with common tasks (MLM, Question Answering, Text Summarization…). There is also a strong interest in multilingual models and datasets, enabling a wider and fairer representation of economic concerns.

Mid:

Two objectives could be set.

The first objective is to provide simple APIs, annotated datasets, visualizations and tools usable by beginners or people without a programming background. Many applications could be found by asking academics and public institutions (topic modeling, sentiment analysis, quarterly summaries…).

The second would be to tackle the current LLM trend and try to define economics tasks suitable for a small LLM (LLaMa 7B, for instance). This second point is still unclear to me and may not be kept.

Long:

The long-term objective, in my opinion, is to converge more easily towards multimodality between time series (which we economists constantly work with) and text (which we deep learners constantly work with). This ultimate goal has yet to be better defined, but it would definitely mark the starting point of a new era.

Of course, these milestones are not necessarily linked to one another and may evolve over time. However, reaching a multimodal model that maps time series and text into a single representation space seems like a dream and a world of potential to me.

How to join the move?

This article has been written by me and only me at this point. The views and opinions expressed are my own.

However, I would be more than pleased to discuss and exchange with students, professionals, economists, bankers, politicians and, really, anyone who feels enthusiastic about this project.

Depending on whether this article finds some echo, I will later see how to structure a long-term community.

Kudos to all, and let the nlpnomics begin :)
