Finance NLP 2.0: Better embeddings, better models

The new release comes with a redesign of important models and improved performance on financial NLP tasks.

David Cecchini
John Snow Labs
3 min readJul 12, 2024

--

Photo by Annie Spratt on Unsplash

New state-of-the-art embedding models

Current advances in NLP are built on top of vector representation of text, usually called embeddings. To provide the capabilities to use, train, and finetune models for Finance, Finance NLP is releasing two categories of embedding models: word embedding models that create vector representations of words (or tokens), and sentence embedding models that create vector representations of longer chunks of text (e.g., sentences, paragraphs, documents).

Word Embedding model

For word embedding models, we aimed to create fast and lightweight models that can be used on token classification tasks such as Named Entity Recognition (NER). The newly released model has a dimension of 200 and was trained from scratch on a mix of financial and general texts, allowing the model to learn usual English words, but with a special focus on financial terms.

The architecture is based on Word2Vec, which provides the best tradeoff between representation capabilities and speed, allowing for fast and accurate models built on top of it (e.g., NER, Relation Extraction).

To use the model in Finance NLP, use the pretrained method of the corresponding annotator:

word_embedding_model = (
nlp.WordEmbeddings.pretrained(
"finance_word_embeddings", "en", "finance/models"
)
.setInputColumns(["sentence", "token"])
.setOutputColumn("word_embedding")
)

With the new word embedding model, we trained new models and improved existing models:

  • De-identification model: finner_deid_sec_fe
  • NER models: finner_sec_edgar_fe , finner_aspect_based_sentiment_fe , finner_financial_xlarge_fe
  • Assertion models: finassertion_aspect_based_sentiment_md_fe

The new models achieve similar or better performance than the existing ones and are optimized with new embedding models with better generalization capabilities. While some models will demonstrate improvements up to 15%, others may have reduced performance but will generalize better to different datasets.

Sentence Embedding model

Now for the sentence embedding model, we based our model on the BGE architecture and trained from scratch on our own domain-specific datasets together with public general English datasets. The model maps sentences/documents/paragraphs to a vector of 768 dimensions and is designed to improve RAG applications.

To use the model, use the corresponding annotator:

sentence_embeddings = (
nlp.BGEEmbeddings.pretrained(
"finance_bge_base_embeddings", "en", "finance/models"
)
.setInputCols("document")
.setOutputCol("sentence_embeddings")
)

With the new model, we improved the following ones built with it:

  • Improved entity linking (resolver) models: finel_edgar_company_name_fe , finel_nasdaq_company_name_stock_screener_fe , finel_names2tickers_fe , finel_tickers2names

With better sentence embedding models, Finance practitioners, researchers, and developers can design better text classification, entity linking, and RAG systems.

To demonstrate the higher capabilities of the sentence embedding model, we tested the retrieval recall on the FiQA dataset. By comparing the top 5 retrieved documents with the associated ones in the dataset, we obtained 20% and 300% recall improvements when comparing with the base model (not finetuned on domain-specific data) and on domain-specific sentence transformers model.

Conclusion

Finance NLP 2.0 arrives with better capability to train specialized models based on performant embedding models. Practitioners and data scientists in this area can benefit from them. All models can be found in the Spark NLP Models Hub.

We are motivated to start releasing new specialized models in future releases, stay tuned.

Fancy trying?

We’ve got 30-day free licenses for you with technical support from our financial team of technical and SMEs. This trial includes complete access to more than 150 models, including Classification, NER, Relation Extraction, Similarity Search, Summarization, Sentiment Analysis, Question Answering, etc., and 50+ financial language models.

Just go to https://www.johnsnowlabs.com/install/ and follow the instructions!

Don’t forget to check our notebooks and demos.

How to run

Finance NLP is relatively easy to run on both clusters and driver-only environments using johnsnowlabs library:

Install the johnsnowlabs library:

pip install johnsnowlabs

Then, on Python, install Finance NLP with

from johnsnowlabs import nlp


nlp.install(force_browser=True)

Then, we can import the Finance NLP module and start working with Spark.

from johnsnowlabs import nlp, finance


# Start Spark Session
spark = nlp.start()

For alternative installation methods of how to install in specific environments, please check the docs.

--

--