Using Snowflake for AI completion in Manychat

Ready to make your chats smarter and your users happier? That has been one of our goals at Manychat, the leading chat marketing platform for Instagram, Messenger, and WhatsApp, trusted by over 1 million businesses and powering more than 4 billion conversations. And now we're integrating AI completion features to fuel chat automation and increase user engagement.

Building an AI-Powered Data Platform at Manychat

Continuous innovation is fundamental to how we approach development at Manychat. We have built a great data platform, which allows us to design and refine numerous new AI-based features in a data-driven way, such as Intention Recognition (an advanced trigger that understands the intention behind a user’s message, unlike traditional keyword triggers) and Text Improver (a feature that enhances your message’s clarity and effectiveness by providing suggestions on tone, grammar, and phrasing).

To implement these features, we actively use LLMs with chat completion capabilities. We started with the OpenAI API and Azure OpenAI Service using GPT-3.5, GPT-4, and GPT-4o. However, we had plenty of reasons to switch to hosted open-source LLMs like Mixtral and Meta Llama 3, including the need for more flexibility with models and independence from a single vendor.
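
For context, a chat completion call through the OpenAI Python SDK looks roughly like the sketch below; the model name, prompt, and token limit are illustrative, not our production configuration.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You classify the intention behind a user's message."},
        {"role": "user", "content": "Where is my order?"},
    ],
    max_tokens=20,
)
print(response.choices[0].message.content)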

The main reasons we started thinking about hosting LLMs on our own:

  1. Business continuity. Although OpenAI LLMs are great, they are prone to change. Imagine that you've tuned your prompt for the GPT-3 Davinci model, but suddenly the model is deprecated and you need to find a replacement fast. With an open-source LLM, you can freeze the version and forget about these risks.
  2. PII protection. Sending user requests to another company is an additional risk, even if all the agreements are signed. If you host an open-source LLM internally, your company can be certain about its ability to protect, cleanse, and obfuscate personally identifiable information (PII), and to reliably remove data if a client asks to be forgotten completely.
  3. Performance. Open-source LLMs can be fine-tuned to answer certain types of questions faster without affecting the quality of the answers. So, if a company hosts LLMs internally, it can achieve significantly lower latency per call.

Based on the three reasons above, we decided to try using hosted LLMs for Manychat's Intention Recognition feature mentioned earlier.

But how can we host LLMs in production?

Hosting Open-Source LLMs with Snowflake

As our main datastore, we use Snowflake, a scalable, cloud-based platform that integrates data lakes and data warehouses. There are two options in Snowflake for hosting open-source LLMs:

  • Snowpark Container Services provides fully managed containers with easy, optimized integration with data in Snowflake, allowing us to host a simple server app that serves the model on a GPU.
  • Cortex AI provides out-of-the-box open-source LLMs that are already hosted and fully managed by Snowflake.

Evaluating Performance: Snowpark Container Services vs. Cortex AI

It was important for us to understand if we could use these services for our AI tasks. The primary concern was not model accuracy (modern open-source LLMs are comparable in accuracy with OpenAI models) but performance: would Snowpark Container Services and Cortex be fast enough for us?

Performance Results with Snowpark Container Services

We developed a simple model-serving app using Flask + Hugging Face and deployed it to Snowpark Container Services. We also created a user-defined function so that the model can be scored through the same SQL interface as a model from Cortex.
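
For illustration, here is a minimal sketch of such a serving app, assuming the Hugging Face transformers pipeline and the Meta Llama 3 8B Instruct checkpoint; our actual app differs in details, and the route name is hypothetical.

import torch
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup and keep it on the GPU.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

@app.route("/complete", methods=["POST"])
def complete():
    payload = request.get_json()
    result = generator(
        payload["prompt"],
        max_new_tokens=payload.get("max_new_tokens", 20),
    )
    return jsonify({"completion": result[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)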

We used the smallest Compute Pool with a GPU: GPU | S (GPU: NVIDIA A10G, 24 GB; RAM: 27 GB). Its size is perfect for smaller LLMs like Mistral 7B and Meta Llama 3 8B, and we decided to use the latter.

In our experiment, we queried the model every few seconds with a typical text input and 20 tokens to complete. The goal was to estimate the average scoring time and check the reliability of the services. The average scoring time on Snowpark Container Services was 3.1 s, with quite a lot of variance.
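
As a rough sketch of how such a measurement can be taken (the endpoint URL and prompt are hypothetical), we simply timed repeated calls and averaged the latencies:

import statistics
import time

import requests

URL = "http://localhost:8080/complete"  # hypothetical endpoint of the serving app
PROMPT = "Classify the intention behind this message: where is my order?"

latencies = []
for _ in range(100):
    start = time.perf_counter()
    requests.post(URL, json={"prompt": PROMPT, "max_new_tokens": 20}, timeout=60)
    latencies.append(time.perf_counter() - start)
    time.sleep(5)  # query the model every few seconds

print(f"avg {statistics.mean(latencies):.2f}s, stdev {statistics.stdev(latencies):.2f}s")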

Performance Results with Cortex AI

We carried out a similar experiment using Cortex AI. The best thing about Cortex is that, in the simplest case, you can use it with a single query:

SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'llama3-8b', 'What are large language models?'
);

And that's it; no additional work is needed. So, we carried out the same experiment as with Snowpark Container Services. The result was great: an average scoring time of 0.2 s.
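
The same call can also be issued from Python via Snowpark, which is convenient when the caller is a service rather than a SQL client; here is a minimal sketch, assuming the connection parameters come from your own configuration.

from snowflake.snowpark import Session

# Placeholder connection parameters; in practice these come from config/secrets.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
}).create()

row = session.sql(
    "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3-8b', 'What are large language models?') AS completion"
).collect()[0]
print(row["COMPLETION"])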

Conclusion: Optimizing AI Completion with Snowflake

We confirmed that both methods of hosting open-source LLMs in Snowflake work well, providing acceptable response times and reliability. However, Cortex is faster and more optimized out of the box, while Snowpark Container Services offers greater flexibility and more potential for model-serving optimization.

How do we plan to use Cortex in production?

Here is the basic illustration (below). We'll hide all the complexity of the hosted LLM from the main backend. Instead, the backend will just call an API endpoint of a Python-based microservice, which will then call a model from Cortex.
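
A minimal sketch of what such a microservice might look like (the route name and connection details are hypothetical; a production version would add authentication, retries, and error handling):

from flask import Flask, jsonify, request
from snowflake.snowpark import Session

app = Flask(__name__)

# One long-lived Snowpark session for the service; parameters are placeholders.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
}).create()

@app.route("/v1/complete", methods=["POST"])
def complete():
    prompt = request.get_json()["prompt"]
    # Delegate the completion to Cortex; the main backend never talks to the LLM directly.
    # The params argument does qmark binding (available in recent snowflake-snowpark-python versions).
    row = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3-8b', ?) AS completion",
        params=[prompt],
    ).collect()[0]
    return jsonify({"completion": row["COMPLETION"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)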

As we continue to push the boundaries of what's possible with AI in the Manychat application, we will explore and document the capabilities and performance of LLMs in various scenarios. We aim to investigate more precise characteristics of these LLMs, focusing on accuracy and other metrics.

We also plan to share our amazing results with multilingual embeddings in Snowpark Container Services for hundreds of millions of texts, as well as our progress in fine-tuning and in improving business metrics in practice with AI/LLMs.

Stay tuned — we’ll cover these (and more!) in our upcoming articles. You can also follow us on LinkedIn and Instagram for the latest!
