Enrich your user analytics using LLM Cortex Functions!

We are entering a new AI era where with the right tools, a single analyst can perform the work of an entire data analytics provider — natively within Snowflake, thanks to its AI platform called Cortex. I still remember my analyst days, when I’d pour over customer usage patterns to get a better understanding of how they used our service. The task was expensive and tedious, and often required investing in third party vendors for market intelligence. Now, however, Large Language Models (LLMs), particularly through Cortex LLM Functions, provide instant access to a huge information extraction and enrichment capabilities. These functions combine the power of an LLM with the simplicity of a SQL function, meaning analysts can focus on producing insights that help drive business growth. Combine LLMs with structured data processing, and voila — entire workflows from my previous career can be implemented in a few lines of SQL! Let’s see how.

Industry segmentation, made easy

Consider this simple but representative market intelligence use-case. Suppose your customer usage data includes the email domain associated with your customer’s company. Imagine the input table we have are a list of email domains:

Now as they say, to build a lasting business, know your customer. Let’s try to get additional information about our customers’ businesses. The table we want to produce includes the company name, business description, industry, number of employees, and the country headquarters for each email domain.

Choosing the right tool for the job

Typically, this kind of task would require access to a large external database of business facts. Today however, LLMs inherently “know” much of this type of information, and can be accessed nearly instantaneously from Cortex, Snowflake’s AI platform. For the task at hand, we recommend using the CORTEX.COMPLETE ( ) SQL function because it is purpose-built to power data processing and data generation tasks. The beauty of CORTEX.COMPLETE ( ) and all of the CORTEX LLM functions (currently in private preview) is that they are accessible within your Snowflake account, which means that you don’t have to learn a new tool or establish a new process. With the CORTEX.COMPLETE ( ) function, you have the ability to choose which LLM to use. For complex data generation tasks like this we are going to pick the Llama-2–70B-Chat, a very capable open source LLM created by Meta.

Define the Job to Be Done

We are now ready to provide our CORTEX.COMPLETE ( ) functions with instructions on what segmentation information we want it to produce given a list of email domains. The instructions we provide LLMs are often referred to as “Prompts”, and they are used to guide the models to generate an appropriate response or output given our instructions. Think about it as giving your LLM a set of business requirements!

You can see that the business requirements we input in the SQL statement below are simple, clear, and complete instructions. For example, we can explicitly tell our LLM its job is to classify email domains and what type of classification we are expecting. This is important because classification alone is a broad term, which means that an ambiguous instruction will likely result in a response spanning many types of categorizations. Furthermore, we can define the constraints of the desired output. Specifically, we explicitly tell it we expect it to respond in JSON format and the fields we expect to see. To learn more about “prompting” we recommend looking at this guide.

SELECT SNOWFLAKE.CORTEX.COMPLETE(
'llama2-70b-chat',
CONCAT('You are a helpful data assistant. Your job is to classify email domains. If you are unsure, return null. For a company email domain, identify the company\'name, industry, number of employees based on newest information, country of headquaters, single sentence business description. Respond exclusively in JSON format.

{
email_domain:
company_name:
business_description:
country_headquaters:
industry:
number_employees:

}
'
, name
, 'Return results'
)
) FROM email_domain_list

Execute

Here’s an example of the results and they look promising! Now, we know that Zalando.de is a German e-commerce company, that its in the E-Commerce business, and that its a large organization with over 10,000 employees (as of 2022).

 {
"email_domain": "zalando.de",
"company_name": "Zalando",
"business_description": "Zalando is a German e-commerce company that specializes in fashion and lifestyle products.",
"country_headquarters": "Germany",
"industry": "E-commerce",
"number_employees": "10,000+ (as of 2022)"
}

To create the final table we can easily query and run analytics on, we need to first transform the JSON formatted string the CORTEX.COMPLETE ( ) function returns into JSON. Yes, the CORTEX.COMPLETE ( ) returns a string not an actual JSON file. To convert the string into JSON, Snowflake offers native functions for JSON, in our case we will use the PARSE_JSON function. Once the values are in JSON, we can easily query the JSON files to construct the table we want.

CREATE TABLE FINAL_TABLE AS (
WITH parse_data as (
SELECT
PARSE_JSON(json_output) AS parse_data
FROM output )

SELECT

TRIM(parse_data:email_domain,'""') as email_domain,
TRIM(parse_data:company_name,'""')as company_name,
TRIM(parse_data:business_description, '""') as business_description,
TRIM(parse_data:industry,'""') as industry,
TRIM(parse_data:number_employees,'""') as number_employees,
TRIM(parse_data:country_headquarters,'""') as country_headquarters

FROM parse_data;)


SELECT * FROM FINAL_TABLE

Here’s an example of the final result!

Conclusion

I love that Cortex LLM Functions combine the power of an LLM with the simplicity of a SQL function because it means analysts can focus on producing insights that help drive business growth. This is a humble yet powerful example of how using LLMs are changing the way that we work, and am excited to hear how Cortex LLM Functions are helping you transform your business.

Stay tuned, for our upcoming segment on how we can use the CORTEX.COMPLETE ( ) to help us produce features used to build better predictive models.

--

--