Automated Marketing Lead Screening Pipelines Using Snowflake Cortex LLMs

I lead the Marketing Data Science team at Snowflake and my team and I are responsible for the ML models that help our Marketing be insights-driven and to optimize our marketing campaigns. One of the most critical workstreams my team works on is our lead scoring model that our sales and marketing team uses to prioritize following up with our most valuable prospects. To set the stage, let me share a little more about how we source leads at Snowflake. By far our largest source of leads is our website, where people can register for in-person Snowflake marketing events like Data for Breakfast or access digital content like webinars, ebooks, and virtual hands-on labs. In the process of registering, we ask for a few pieces of information including a person’s name, job title, company, and email address. We then send the content or the registration confirmation to the email they’ve provided to us.

Example of a form on Snowflake’s website

Screening Lead Gen Forms

A key step for my team in this process is screening this data to prevent jokes or nonsensical data from appearing in front of our sales team and wasting their time. Until we gained access to Snowflake Cortex, the primary way we filtered these was through a manual effort of analyzing responses and building complex regular expressions to eliminate common types of spam that we would encounter.

What we’ve found is that regular expressions cannot keep up with people’s creativity on the internet. Recently we had a lead come through with the following data:

Michael Jackson is actually a common name in many places (we have many valid leads with that name in our marketing database!), but when you combine all the data points together it’s clear to a human looking at this lead that it’s probably not the King of Pop who’s interested in Snowflake. This is also the type of data that is very difficult to design any rule based system to filter out since it’s really the combination of data across several fields that gives it away. With this in mind we put together a prompt that has been very successful at screening leads for us. Here’s the prompt in action running in Snowflake:

Arriving at this prompt took our team some trial and error so that we could have the output structured in a way that downstream systems can use and to really reduce the false positive rate. For this use case, we really want to have a high bar for what we’re flagging as spam as we don’t want to lose out on the opportunity to reach out to someone interested in Snowflake. A great resource for us was this set of llama recipes which gave us the idea to use “Role Prompting” in our prompt and helped us with how we structured the data inputs.

Useful tip: Ask the model to wrap it’s response in xml tags. There are times when the reply from the LLM does not perfectly follow the guidance in our prompt. By asking that the response be wrapped in xml tags we can extract it and pass consistent data to our downstream jobs.

Cortex in our production dbt models

But running this one at a time is not a real solution to our problem so we worked to build our Snowflake Cortex LLM Function into our core data pipelines which are written using dbt. To keep ourselves efficient in our use of Snowflake credits we only will run our screening prompt on leads that have filled out a lead gen form in the past two weeks and we keep a log of everyone we’ve screened so that we don’t need to run the prompt a second time if all the inputs they give us are the same.

with recent_engagements as (
-- score people who have registered for a campaign in the past 14 days
select distinct
person_id
from marketing.core.dim_campaign_member
where campaign_member_first_associated_date between current_date - 14 and current_date
) , people_to_score as (
-- only score those are new or their data has changed since they were last scored
select
sp.id as person_id
, sp.first_name
, sp.last_name
, sp.title
, sp.company
from marketing.prep_person.salesforce_person_all as sp
left join marketing.prep_person.legitimate_persona as lp
on sp.id = lp.person_id
where sp.id in (select person_id from recent_engagements)
and (
lp.person_id is null
or sp.first_name != lp.first_name
or (sp.first_name is not null and lp.first_name is null)
or sp.last_name != lp.last_name
or (sp.last_name is not null and lp.last_name is null)
or coalesce(sp.title, '') != lp.clean_title
or (sp.title is not null and lp.title is null)
or coalesce(sp.company, '') != lp.clean_company
or (sp.company is not null and lp.clean_company is null)
)
)
select
p.person_id
, p.first_name
, p.last_name
, p.title
, p.company
, coalesce(p.first_name, '') as clean_first_name
, coalesce(p.last_name, '') as clean_last_name
, coalesce(p.title, '') as clean_title
, coalesce(p.company, '') as clean_company
, snowflake.cortex.complete('mistral-7b'
, [
{'role': 'system', 'content': 'You are a marketing expert working at Snowflake Inc. Your job is to evaluate the quality of leads who have expressed interest in Snowflake on a form available to anyone who visits the Snowflake website. '
|| 'Please classify if the data entered is one of these two categories: legitimate or spam. Consider each field on its own and in combination with the other fields. '
|| 'Here are some attributes of high quality leads: The job title should be one that would use or buy cloud software. The company name should appear to be a real organization. '
|| 'Here are some attributes of spam leads: The data contains references to famous people who would not be buying Snowflake. Any of the fields contain gibberish text or offensive content. Most fields contain just a single character or invalid punctuation. '
|| 'Wrap the category classification with an xml tag <response>.'
|| 'You should provide only "legitimate" or "spam" and no text other than the xml tag' },
{'role': 'user', 'content': '<data>'
|| 'Name: '
|| clean_first_name || ' ' ||clean_last_name
|| ', job title: '
|| clean_title
|| ', company: '
|| clean_company
|| '</data>'}
]
, {'max_tokens': 10}
):choices[0]:messages::string as spam_filter_raw
, regexp_substr(spam_filter_raw, '<response>(.*?)<\/response>', 1, 1, 'e', 1) as spam_filter
from people_to_score p

Conclusion

After a few weeks of running this process we’ve found that 70% of the leads that our Cortex-based solution has flagged as spam would have slipped past our traditional regex-based filtering. We’ve had sessions reviewing the outputs of the model with our internal stakeholders and found an extremely low false positive rate of otherwise legitimate leads being marked as spam. Being able to deploy Snowflake Cortex LLM Functions directly into our dbt models that our marketing data pipelines are already based on made this a very quick project for our team to deliver and provided immediate value to our internal stakeholders.

--

--