SentenceTransformer: Text Embeddings Model

Published in axinc-ai · Dec 9, 2023

This is an introduction to "SentenceTransformer", a machine learning model that can be used with ailia SDK. You can use this model, along with many other ready-to-use ailia MODELS, to build AI applications with ailia SDK.

Overview

SentenceTransformer was built by fine-tuning the language model BERT in order to output high quality text embeddings.

OpenAI’s text-embedding-ada-002 is a well-known API for obtaining good text embeddings, but if you do not need that level of accuracy, you can use SentenceTransformer, which can be run offline without API fees.

The multilingual model is 1.1 GB in size.

sentence-transformers/paraphrase-multilingual-mpnet-base-v2 · Hugging Face

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can…

huggingface.co

Architecture

SentenceTransformer computes the embedding of a text by pooling BERT's per-token embeddings into a single vector. The network is fine-tuned so that texts with the same meaning get embeddings that are close to each other.

SBERT stands for Sentence-BERT. Source: https://arxiv.org/pdf/1908.10084.pdf
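
The pooling step can be illustrated with a minimal sketch of mean pooling, one of the pooling strategies described in the Sentence-BERT paper. The function name `mean_pool` and the toy vectors are assumptions made for illustration; they are not taken from ailia SDK or from the model's code. Padding positions are skipped with an attention mask so that they do not distort the average.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # 1 = real token, 0 = padding
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in summed]


# Toy input (hypothetical values): three real tokens and one padding token,
# each represented by a 2-dimensional vector instead of 768 dimensions.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

In the real model each token vector comes from the last BERT layer, and the pooled vector is what fine-tuning pulls closer together for texts with the same meaning.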

If we simply average BERT's per-token embeddings, the average score on the semantic textual similarity (STS) tasks evaluated in the paper (Spearman rank correlation × 100) is 54.81. With fine-tuning, the score improves to 76.68.

Source: https://arxiv.org/pdf/1908.10084.pdf

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair…

arxiv.org

Multilingual Model

A multilingual model, paraphrase-multilingual-mpnet-base-v2, has been published for SentenceTransformer. It produces 768-dimensional embeddings for 50+ languages, including Japanese.

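The multilingual model uses knowledge distillation to extend a monolingual embedding model to other languages. A multilingual student model is trained so that both a source sentence and its translation map to the embedding that the monolingual teacher model produces for the source sentence.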

Source: https://arxiv.org/pdf/2004.09813.pdf
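
The training objective from the paper cited above can be sketched as follows. For each pair of a source sentence and its translation, the student's embeddings of both sentences are pushed toward the teacher's embedding of the source sentence using a squared-error loss. The function names and the toy vectors are assumptions made for illustration, and real training code typically also averages over the embedding dimensions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distillation_loss(teacher_src, student_src, student_tgt):
    """Batch mean of |T(s) - S(s)|^2 + |T(s) - S(t)|^2.

    teacher_src: teacher embeddings of the source sentences
    student_src: student embeddings of the same source sentences
    student_tgt: student embeddings of the translated sentences
    """
    total = 0.0
    for t, s_src, s_tgt in zip(teacher_src, student_src, student_tgt):
        total += squared_distance(t, s_src) + squared_distance(t, s_tgt)
    return total / len(teacher_src)


# Toy batch with one sentence pair (hypothetical 2-dimensional embeddings).
teacher_src = [[1.0, 0.0]]  # teacher, source sentence
student_src = [[1.0, 0.0]]  # student already matches on the source sentence
student_tgt = [[0.0, 0.0]]  # student is still off on the translation

print(distillation_loss(teacher_src, student_src, student_tgt))  # 1.0
```

The loss reaches zero only when the student reproduces the teacher's vector for both languages, which is what places translations at the same point in the embedding space.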

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to…

arxiv.org

Tokenizer

SentenceTransformer uses XLMRoBERTa's tokenizer, which is based on SentencePiece. XLMRoBERTa token IDs can be obtained from the output of SentencePieceProcessor by remapping the IDs of the special symbols.
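
The remapping can be sketched as below. The mapping mirrors the one used by the Hugging Face XLMRobertaTokenizer, where the special symbols take IDs 0 to 3 and every other SentencePiece ID is shifted by one; whether the ailia sample performs it in exactly this way is an assumption. The helper names and the input IDs are hypothetical, and the sketch takes IDs that a SentencePieceProcessor would have produced rather than calling SentencePiece itself.

```python
# IDs of the special symbols on the XLMRoBERTa side.
SPECIAL_TOKEN_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
# Ordinary SentencePiece pieces are shifted by this offset.
FAIRSEQ_OFFSET = 1


def spm_id_to_xlmr_id(spm_id):
    """Convert one SentencePiece ID to the corresponding XLMRoBERTa ID."""
    if spm_id == 0:  # SentencePiece uses ID 0 for the unknown piece
        return SPECIAL_TOKEN_IDS["<unk>"]
    return spm_id + FAIRSEQ_OFFSET


def build_input_ids(spm_ids):
    """Remap the piece IDs and wrap them in <s> ... </s>."""
    body = [spm_id_to_xlmr_id(i) for i in spm_ids]
    return [SPECIAL_TOKEN_IDS["<s>"]] + body + [SPECIAL_TOKEN_IDS["</s>"]]


# Hypothetical SentencePiece output containing one unknown piece (ID 0).
print(build_input_ids([10, 0, 25]))  # [0, 11, 3, 26, 2]
```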

Usage in ailia SDK

SentenceTransformer can be run with ailia SDK using the following command.

$ python3 sentence_transformer_japanese.py -i input.txt

For example, you can ask a question about the input text. The sample computes the similarity between the embedding of the question and the embedding of each sentence in the text, then outputs the closest sentence.

User (press q to exit): How fast nnapi?
Text: In fact, we have confirmed that the NNAPI NPU (int8) runs 15 times faster than the CPU (float) on Snapdragon 8+ Gen1 and yolox_tiny. (Similarity:0.592)

Because the model is multilingual, it can also be queried in other languages, such as Japanese in the example below.

User (press q to exit): nnapiの速度
Text: In fact, we have confirmed that the NNAPI NPU (int8) runs 15 times faster than the CPU (float) on Snapdragon 8+ Gen1 and yolox_tiny. (Similarity:0.592)
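
The search step behind this output can be sketched as a nearest-neighbour lookup over sentence embeddings. The sketch assumes cosine similarity as the score, which is a common choice but is not confirmed here as the exact formula used by the sample script. The function names, sentences and 2-dimensional embeddings are hypothetical stand-ins for the 768-dimensional vectors produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_sentence(query_embedding, sentences, sentence_embeddings):
    """Return the sentence whose embedding is most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), text)
        for text, embedding in zip(sentences, sentence_embeddings)
    ]
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text, best_score


# Hypothetical sentences and embeddings for illustration only.
sentences = ["NNAPI speed", "License terms"]
sentence_embeddings = [[0.9, 0.1], [0.1, 0.9]]
query_embedding = [1.0, 0.0]

text, score = closest_sentence(query_embedding, sentences, sentence_embeddings)
print(f"Text: {text} (Similarity:{score:.3f})")
```

Because the question and the document sentences are embedded into the same space, the lookup itself does not depend on the language of the query.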

ailia-models/natural_language_processing/sentence_transformers_japanese at master ·…

TEXT or PDF file. The sentence closest to the input prompt. This model requires additional module if you want to load…

github.com

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.