Custom Embedding Models from Hugging Face in Snowflake
Many of our customers want to build their own Retrieval Augmented Generation (RAG) applications directly within Snowflake. The platform ships with a robust suite of built-in functions designed to streamline exactly this; the most pivotal for RAG applications are EMBED_TEXT, VECTOR_L2_DISTANCE, and COMPLETE.
Let's delve deeper into the EMBED_TEXT function and explore how it can be used effectively in your projects.
Simplifying Text Embeddings with EMBED_TEXT
The EMBED_TEXT function is a straightforward tool that transforms text into embeddings using the e5-base-v2 model. It is particularly useful for processing large volumes of text efficiently, enabling advanced text analysis and machine learning applications. However, EMBED_TEXT currently supports only English-language text, which might seem limiting if you are working with multilingual data. (Note: EMBED_TEXT is currently available in private preview only.)
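For orientation, here is a minimal sketch of how these built-in functions fit together once EMBED_TEXT is enabled in your account. It assumes an existing Snowpark session object, a hypothetical table MY_TEXTS with a TEXT column, and the preview signature SNOWFLAKE.CORTEX.EMBED_TEXT(<model>, <text>); the exact function name and arguments may differ in the private preview.
# Minimal sketch, not the final API:
# - assumes the PrPr signature SNOWFLAKE.CORTEX.EMBED_TEXT('<model>', <text>)
# - MY_TEXTS is a hypothetical table with a single TEXT column
results = session.sql("""
    SELECT
        t.TEXT,
        VECTOR_L2_DISTANCE(
            SNOWFLAKE.CORTEX.EMBED_TEXT('e5-base-v2', t.TEXT),
            SNOWFLAKE.CORTEX.EMBED_TEXT('e5-base-v2', 'How does AI improve data analytics?')
        ) AS VECTOR_DISTANCE
    FROM MY_TEXTS AS t
    ORDER BY VECTOR_DISTANCE
""")
results.show()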
Overcoming Language Barriers
Despite the language limitation of the standard EMBED_TEXT function, there is no need to halt projects involving non-English texts. Snowflake offers a flexible alternative: you can integrate custom models from Hugging Face, a leading repository of state-of-the-art machine learning models. These models support a wide variety of languages, significantly broadening the scope of your data analytics capabilities.
Integrating Custom Models with Ease
Incorporating such models into Snowflake is straightforward, thanks to Snowflake's model registry and its new support for sentence-transformers models. This feature simplifies the management of custom NLP models, making it easy to use them within your RAG applications. With the model registry, you can seamlessly extend your applications with embeddings for languages other than English.
Here’s a straightforward example of how to use a multilingual model from Hugging Face that supports 100 different languages in Snowflake. The example embeds German texts and then retrieves similar texts based on vector distances:
# Snowpark imports (assumes an existing Snowpark `session` object)
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

# Create some test data to work with
ai_texts_german = [
    "KI revolutioniert die Geschäftsanalytik, indem sie tiefere Einblicke in Daten bietet.",
    "Unternehmen nutzen KI, um die Analyse und Interpretation komplexer Datensätze zu transformieren.",
    "Mit KI können Unternehmen nun große Datenmengen verstehen, um die Entscheidungsfindung zu verbessern.",
    "Künstliche Intelligenz ist ein Schlüsselwerkzeug für Unternehmen, die ihre Datenanalyse verbessern möchten.",
    "Der Einsatz von KI in Unternehmen hilft dabei, bedeutungsvolle Informationen aus großen Datensätzen zu extrahieren."
]
different_texts_german = [
    "Der große Weiße Hai ist einer der mächtigsten Raubtiere des Ozeans.",
    "Van Goghs Sternennacht stellt die Aussicht aus seinem Zimmer in der Anstalt bei Nacht dar.",
    "Quantencomputing könnte potenziell viele der derzeit verwendeten kryptografischen Systeme brechen.",
    "Die brasilianische Küche ist bekannt für ihre Vielfalt und Reichhaltigkeit, beeinflusst von Europa, Afrika und den amerindischen Kulturen.",
    "Das schnellste Landtier, der Gepard, erreicht Geschwindigkeiten von bis zu 120 km/h."
]
search_text = "Maschinelles Lernen ist eine unverzichtbare Ressource für Unternehmen, die ihre Dateneinblicke verbessern möchten."
df = session.create_dataframe(ai_texts_german+different_texts_german, schema=['TEXT'])
# Get the model registry object
from snowflake.ml.registry import Registry
reg = Registry(
    session=session,
    database_name=session.get_current_database(),
    schema_name=session.get_current_schema()
)
# Get the embedding model from Hugging Face
# Make sure it fits into a Snowflake warehouse and does not require GPUs
# Otherwise the model must be deployed in Snowpark Container Services
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("intfloat/multilingual-e5-small")
# Register the model to Snowflake
snow_model = reg.log_model(
    model,
    model_name='multilingual_e5_small',
    sample_input_data=df.limit(10)
)
# Create Embeddings from Huggingface Model
embedding_df = snow_model.run(df)
One of the standout features of Snowflake's model registry is that it can automatically infer the model's signature from the provided sample input data. While this offers a seamless setup, you also have the option to define the signature yourself. In our example, we ensure that the model's output column is named "EMBEDDING," which streamlines the downstream processing steps.
from snowflake.ml.model.model_signature import FeatureSpec, DataType, ModelSignature
# In this example the output column will be called EMBEDDING
# and have a shape of (384,)
model_sig = ModelSignature(
    inputs=[
        FeatureSpec(dtype=DataType.STRING, name='TEXT')
    ],
    outputs=[
        FeatureSpec(dtype=DataType.DOUBLE, name='EMBEDDING', shape=(384,))
    ]
)
# Register the model to Snowflake
# "encode" is the model's function we want to call
snow_model_custom = reg.log_model(
    model,
    model_name='multilingual_e5_small_custom',
    signatures={'encode': model_sig}
)
# Create Embeddings from Huggingface Model
embedding_df = snow_model_custom.run(df)
# Convert the Hugging Face model's output to Snowflake's VECTOR data type
embedding_df = embedding_df.with_column('EMBEDDING', F.col('EMBEDDING').cast(T.VectorType(float, 384)))
embedding_df.write.save_as_table('EMBEDDED_TEXTS', mode='overwrite')
embedding_df = session.table('EMBEDDED_TEXTS')
embedding_df.show()
After creating the embeddings, we can immediately use them to retrieve similar texts based on a search query:
# Finally we can calculate the distance between all the embeddings
# and our search vector
closest_texts = embedding_df.with_column(
    'VECTOR_DISTANCE',
    F.vector_l2_distance(
        F.col('EMBEDDING'),
        F.call_builtin(
            'MULTILINGUAL_E5_SMALL_CUSTOM!ENCODE',
            F.lit(search_text)
        )['EMBEDDING'].cast(T.VectorType(float, 384))
    )
).cache_result()
# As we can see, all of the closest texts are AI-related, like our search vector
closest_texts.order_by('VECTOR_DISTANCE').drop('EMBEDDING').show(max_width=100)
-----------------------------------------------------------------------------------------------------------------------------
|"TEXT" |"VECTOR_DISTANCE" |
-----------------------------------------------------------------------------------------------------------------------------
|Künstliche Intelligenz ist ein Schlüsselwerkzeug für Unternehmen, die ihre Datenanalyse verbesser... |0.5223915125274216 |
|KI revolutioniert die Geschäftsanalytik, indem sie tiefere Einblicke in Daten bietet. |0.5508320832298432 |
|Mit KI können Unternehmen nun große Datenmengen verstehen, um die Entscheidungsfindung zu verbess... |0.5517107107937466 |
|Der Einsatz von KI in Unternehmen hilft dabei, bedeutungsvolle Informationen aus großen Datensätz... |0.5768123622873043 |
|Unternehmen nutzen KI, um die Analyse und Interpretation komplexer Datensätze zu transformieren. |0.5782022682310389 |
|Quantencomputing könnte potenziell viele der derzeit verwendeten kryptografischen Systeme brechen. |0.6388733105787354 |
|Der große Weiße Hai ist einer der mächtigsten Raubtiere des Ozeans. |0.6887633247263519 |
|Die brasilianische Küche ist bekannt für ihre Vielfalt und Reichhaltigkeit, beeinflusst von Europ... |0.6959150254004067 |
|Das schnellste Landtier, der Gepard, erreicht Geschwindigkeiten von bis zu 120 km/h. |0.7112655721076001 |
|Van Goghs Sternennacht stellt die Aussicht aus seinem Zimmer in der Anstalt bei Nacht dar. |0.7352039852718184 |
-----------------------------------------------------------------------------------------------------------------------------
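To close the RAG loop, the retrieved texts can be passed as context to the COMPLETE function mentioned at the beginning. The snippet below is only a sketch: it assumes the SNOWFLAKE.CORTEX.COMPLETE(<model>, <prompt>) signature and uses 'mistral-7b' as a placeholder model name.
# Sketch only: build a prompt from the top 3 retrieved texts and answer it with COMPLETE
# (assumes SNOWFLAKE.CORTEX.COMPLETE('<model>', <prompt>); 'mistral-7b' is a placeholder model name)
context_rows = closest_texts.order_by('VECTOR_DISTANCE').select('TEXT').limit(3).collect()
context = "\n".join(row['TEXT'] for row in context_rows)
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {search_text}"
answer_df = session.create_dataframe([[prompt]], schema=['PROMPT']).select(
    F.call_builtin('SNOWFLAKE.CORTEX.COMPLETE', F.lit('mistral-7b'), F.col('PROMPT')).alias('ANSWER')
)
answer_df.show(max_width=100)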
Final Note
This method deploys the model as a User-Defined Function (UDF) that runs on standard Snowflake virtual warehouses. These compute nodes lack GPUs, which can result in longer inference times for larger models. To speed up inference, you can deploy such models in Snowpark Container Services, which supports GPU acceleration. Please note that this deployment option is still in development.
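For completeness, here is a purely hypothetical sketch of what such a container-based deployment could look like. The create_service-style API, its parameter names, and the compute pool and image repository used below are assumptions, not the final interface; check the current snowflake-ml-python documentation before relying on it.
# Hypothetical sketch: serving the registered model from Snowpark Container Services
# with GPU acceleration. All names and parameters below are assumptions.
gpu_service = snow_model_custom.create_service(
    service_name='E5_EMBEDDING_SERVICE',         # assumed service name
    service_compute_pool='MY_GPU_COMPUTE_POOL',  # assumed GPU compute pool
    image_repo='MY_IMAGE_REPO',                  # assumed image repository
    gpu_requests='1',                            # request one GPU
    ingress_enabled=True
)
# Inference would then run against the service instead of a warehouse UDF
embedding_df_gpu = snow_model_custom.run(df, service_name='E5_EMBEDDING_SERVICE')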
As always, all the code is located in this GitHub repository.
Michael Gorkow | Field CTO Datascience @ Snowflake