Stories by David Klein on Medium

How I Connected Strava to ChatGPT to Get Real Post-Workout Analysis

David Klein — Tue, 02 Dec 2025 08:00:50 GMT

(This setup requires ChatGPT Plus.)

Training apps are good at logging data, but they rarely tell you what it means.
Strava shows the ride. TrainingPeaks shows the numbers and long-term plans. Zwift helps me hit my numbers. What I always missed was something simple:

“Tell me what actually happened in my workout, like a coach would.”

I used to work with a real trainer, but with my limited time budget, family, work and travel, it took too much effort from both sides to make it really work. I still don’t think you can substitute a trained human coach if you have the chance to work with one.

When OpenAI released custom GPTs with API connections, I realized I could streamline this part.

This article explains how I built a post-workout analysis GPT, how it connects to Strava, and what you should and shouldn’t expect from it.

This is not a training plan generator.
It’s not a replacement for a long-term coach.
It’s a tool for interpreting single workouts extremely well.

Why I Wanted a Post-Workout GPT

After every ride I found myself asking:

Was my cadence stable or fading?
Did I pace intervals correctly?
Was there HR drift?
Was this SST session actually SST or something else?
Did fatigue show up?

The insights you get from Strava are nice to have, but honestly, they don’t tell me much, or at least they don’t answer these questions sufficiently.
As I said before: human coaches do, but they cost money and don’t check every single workout.

What I wanted was:

an immediate breakdown after each ride,
based on my Strava data: no screenshots,
with the logic of a trained endurance coach,
without trying to plan my whole season.

And that’s exactly what a custom GPT can do well.

The Limitations You Should Know Up Front

A GPT is not a season planner.
It has no long-term memory, cannot remember previous days, and cannot build continuity in the way a dedicated training platform or real coach can.

Yes, you could upload PDFs with training philosophy, rider history, or block structures, but this is only static reference material. GPTs do not continuously remember your actual training unless you paste it into the chat every time.

That’s why I decided to split responsibilities:

Standard ChatGPT Chat → Spreadsheets / external notes → long-term planning
Custom GPT → per-workout interpretation

This keeps expectations realistic and the tool genuinely helpful. (I might write about long-term planning with a standard chat in the future, but honestly I still haven’t fully figured out this part myself.)

Allowing the Strava API to Talk to ChatGPT

Before building any training logic or analysis prompts, the first thing you need is the Strava data itself. Fortunately, it’s now possible to let a custom GPT pull your activities directly from the Strava API — but only if you’re a ChatGPT Plus subscriber, because custom GPTs and Actions are part of the Plus tier.

Here’s a simple, reliable walkthrough of how to connect Strava to ChatGPT so your workouts can be analyzed automatically.

1. Create a Strava API Application (in the Strava web interface)

In Strava, go to:

Settings → My API Application → Create App

Fill in the basic fields (name, website, etc.), and make sure you set:

Authorization Callback Domain:

chat.openai.com

This is required so Strava can send the OAuth approval back into ChatGPT.

Once the app is created, Strava provides:

Client ID
Client Secret

Keep those ready for later.

This is what it should look like (in my case in German, sorry for that).

2. Create a Custom GPT (in ChatGPT)

Inside ChatGPT:

GPTs → + Create

Give it a name and a short description.
You can refine the instructions later; the important part right now is the Strava connection.

3. Add a Strava Action

Scroll down in the GPT builder until you reach:

Actions → Create new action

This is where you configure the interface between ChatGPT and Strava.

After you create the GPT, create the Action here.

4. Configure OAuth (the authentication layer)

In the action settings:

Change Authentication: None → OAuth
Insert your Client ID and Client Secret from Strava
Then fill in these required fields:

Authorization URL:

https://www.strava.com/oauth/authorize

Token URL:

https://www.strava.com/api/v3/oauth/token

Scope:

read,read_all,activity:read,activity:read_all,profile:read_all

This scope gives the GPT enough permission to:

read your activities
read your heart-rate, power, cadence, distance, etc.
read your profile and zones

Once done, click Save OAuth configuration.

In the end it should look similar to this:
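To make the OAuth fields a bit less abstract, here is a minimal Python sketch of the authorization URL that a client assembles from them. ChatGPT builds this request for you, so nothing here needs to be run for the setup to work. The client ID and redirect URI are placeholders, and the two scopes are simply the ones the schema in the next step declares as its default security requirement.

```python
from urllib.parse import urlencode

AUTHORIZE_URL = "https://www.strava.com/oauth/authorize"


def build_authorize_url(client_id: str, redirect_uri: str, scopes: list[str]) -> str:
    """Assemble the URL the browser is sent to so the athlete can approve access."""
    query = urlencode({
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",       # authorization-code flow
        "scope": ",".join(scopes),     # Strava takes a comma-separated scope list
    })
    return f"{AUTHORIZE_URL}?{query}"


url = build_authorize_url(
    client_id="12345",                                  # placeholder, use your own
    redirect_uri="https://chat.openai.com/placeholder", # placeholder callback
    scopes=["activity:read_all", "profile:read_all"],
)
print(url)
```

After the athlete approves, Strava redirects back with a short-lived code, which is then exchanged at the Token URL for an access token.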

5. Add the Strava API Schema

Now scroll down to the Schema field.

This tells ChatGPT which API endpoints it may call.
You don’t need the full Strava API, just a minimal, stable subset. You don’t need to write it yourself either; just copy and paste this into the Schema field:

{
  "openapi": "3.1.0",
  "info": {
    "title": "Strava API",
    "description": "OpenAPI specification for accessing Strava fitness data via OAuth2.",
    "version": "v1.0.0"
  },
  "servers": [
    {
      "url": "https://www.strava.com/api/v3"
    }
  ],
  "paths": {
    "/athlete": {
      "get": {
        "operationId": "getAuthenticatedAthlete",
        "summary": "Get the currently authenticated athlete",
        "security": [
          {
            "stravaOAuth": [
              "profile:read_all"
            ]
          }
        ],
        "responses": {
          "200": {
            "description": "Returns the authenticated athlete",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/Athlete"
                }
              }
            }
          }
        }
      }
    },
    "/athlete/zones": {
      "get": {
        "operationId": "getAthleteZones",
        "summary": "Get athlete heart rate and power zones",
        "security": [
          {
            "stravaOAuth": [
              "profile:read_all"
            ]
          }
        ],
        "responses": {
          "200": {
            "description": "Athlete zones data",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {}
                }
              }
            }
          }
        }
      }
    },
    "/athletes/{id}/stats": {
      "get": {
        "operationId": "getAthleteStats",
        "summary": "Get aggregate athlete stats",
        "security": [
          {
            "stravaOAuth": [
              "profile:read_all"
            ]
          }
        ],
        "parameters": [
          {
            "name": "id",
            "in": "path",
            "required": true,
            "schema": {
              "type": "integer"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "Aggregate stats",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {}
                }
              }
            }
          }
        }
      }
    },
    "/athlete/activities": {
      "get": {
        "operationId": "getAthleteActivities",
        "summary": "List activities of the authenticated athlete",
        "security": [
          {
            "stravaOAuth": [
              "activity:read_all"
            ]
          }
        ],
        "parameters": [
          {
            "name": "before",
            "in": "query",
            "schema": {
              "type": "integer"
            }
          },
          {
            "name": "after",
            "in": "query",
            "schema": {
              "type": "integer"
            }
          },
          {
            "name": "page",
            "in": "query",
            "schema": {
              "type": "integer"
            }
          },
          {
            "name": "per_page",
            "in": "query",
            "schema": {
              "type": "integer"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "List of activities",
            "content": {
              "application/json": {
                "schema": {
                  "type": "array",
                  "items": {
                    "$ref": "#/components/schemas/ActivitySummary"
                  }
                }
              }
            }
          }
        }
      }
    },
    "/activities/{id}": {
      "get": {
        "operationId": "getActivityById",
        "summary": "Get activity details",
        "security": [
          {
            "stravaOAuth": [
              "activity:read_all"
            ]
          }
        ],
        "parameters": [
          {
            "name": "id",
            "in": "path",
            "required": true,
            "schema": {
              "type": "integer"
            }
          },
          {
            "name": "include_all_efforts",
            "in": "query",
            "schema": {
              "type": "boolean"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "Activity details",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/ActivityDetail"
                }
              }
            }
          }
        }
      }
    },
    "/activities/{id}/streams": {
      "get": {
        "operationId": "getActivityStreams",
        "summary": "Get time-series streams (HR, Power, Cadence, GPS)",
        "security": [
          {
            "stravaOAuth": [
              "activity:read_all"
            ]
          }
        ],
        "parameters": [
          {
            "name": "id",
            "in": "path",
            "required": true,
            "schema": {
              "type": "integer"
            }
          },
          {
            "name": "keys",
            "in": "query",
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "key_by_type",
            "in": "query",
            "schema": {
              "type": "boolean"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "Activity streams",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {},
                  "additionalProperties": {
                    "type": "object",
                    "properties": {
                      "original_size": {
                        "type": "integer"
                      },
                      "resolution": {
                        "type": "string"
                      },
                      "series_type": {
                        "type": "string"
                      },
                      "data": {
                        "type": "array",
                        "items": {
                          "type": "number"
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    },
    "/activities/{id}/zones": {
      "get": {
        "operationId": "getActivityZones",
        "summary": "Get activity zone distribution",
        "security": [
          {
            "stravaOAuth": [
              "activity:read_all"
            ]
          }
        ],
        "parameters": [
          {
            "name": "id",
            "in": "path",
            "required": true,
            "schema": {
              "type": "integer"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "Zone distribution",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": {}
                }
              }
            }
          }
        }
      }
    }
  },

  "components": {
    "schemas": {
      "Athlete": {
        "type": "object",
        "properties": {
          "id": {
            "type": "integer"
          },
          "username": {
            "type": "string"
          },
          "firstname": {
            "type": "string"
          },
          "lastname": {
            "type": "string"
          },
          "city": {
            "type": "string"
          },
          "country": {
            "type": "string"
          },
          "sex": {
            "type": "string"
          },
          "created_at": {
            "type": "string",
            "format": "date-time"
          }
        }
      },
      "ActivitySummary": {
        "type": "object",
        "properties": {
          "id": {
            "type": "integer"
          },
          "name": {
            "type": "string"
          },
          "distance": {
            "type": "number"
          },
          "moving_time": {
            "type": "integer"
          },
          "elapsed_time": {
            "type": "integer"
          },
          "type": {
            "type": "string"
          },
          "sport_type": {
            "type": "string"
          },
          "start_date": {
            "type": "string",
            "format": "date-time"
          },
          "visibility": {
            "type": "string",
            "enum": [
              "everyone",
              "followers_only",
              "only_me"
            ]
          }
        }
      },
      "ActivityDetail": {
        "type": "object",
        "properties": {
          "id": {
            "type": "integer"
          },
          "name": {
            "type": "string"
          },
          "description": {
            "type": "string"
          },
          "distance": {
            "type": "number"
          },
          "moving_time": {
            "type": "integer"
          },
          "elapsed_time": {
            "type": "integer"
          },
          "total_elevation_gain": {
            "type": "number"
          },
          "type": {
            "type": "string"
          },
          "sport_type": {
            "type": "string"
          },
          "start_date": {
            "type": "string",
            "format": "date-time"
          },
          "visibility": {
            "type": "string",
            "enum": [
              "everyone",
              "followers_only",
              "only_me"
            ]
          },
          "device_watts": {
            "type": "boolean"
          },
          "average_watts": {
            "type": "number"
          },
          "weighted_average_watts": {
            "type": "number"
          },
          "average_heartrate": {
            "type": "number"
          },
          "max_heartrate": {
            "type": "number"
          }
        }
      }
    },
    "securitySchemes": {
      "stravaOAuth": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "https://www.strava.com/oauth/authorize",
            "tokenUrl": "https://www.strava.com/oauth/token",
            "scopes": {
              "read": "Read public segments and routes",
              "read_all": "Read all segments and routes",
              "profile:read_all": "Read full athlete profile",
              "activity:read": "Read public activities",
              "activity:read_all": "Read all activities, including private",
              "activity:write": "Write activities"
            }
          }
        }
      }
    }
  },

  "security": [
    {
      "stravaOAuth": [
        "activity:read_all",
        "profile:read_all"
      ]
    }
  ]
}

Click Update.

If everything is correct, you’ll see no errors and your GPT can now pull activities from your Strava account.

If everything is filled in correctly there should be no errors and the Schema will allow certain actions to be taken
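If you later trim or extend the schema, a quick local check can save a round of trial and error in the builder. The sketch below is a hypothetical helper, not part of the GPT setup: it lists every operation in an OpenAPI document and flags any that lack an operationId, on the assumption that the builder uses that field to name each action. It runs on a shortened excerpt of the schema; in practice you would load the full JSON with `json.load`.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_operations(spec: dict) -> list[tuple[str, str, str]]:
    """Return (METHOD, path, operationId) for every operation in an OpenAPI dict."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method in HTTP_METHODS:
                operations.append((method.upper(), path, operation.get("operationId", "")))
    return operations


def missing_operation_ids(spec: dict) -> list[str]:
    """List the operations that have no operationId."""
    return [f"{method} {path}" for method, path, op_id in list_operations(spec) if not op_id]


# Shortened excerpt of the schema, just enough to run the check.
spec_excerpt = {
    "paths": {
        "/athlete/activities": {"get": {"operationId": "getAthleteActivities"}},
        "/activities/{id}": {"get": {"operationId": "getActivityById"}},
        "/activities/{id}/streams": {"get": {"operationId": "getActivityStreams"}},
    }
}

for method, path, op_id in list_operations(spec_excerpt):
    print(method, path, "->", op_id)
print("missing operationId:", missing_operation_ids(spec_excerpt))
```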

6. Add Your Custom Instructions (Important!)

This is where you give the GPT its “brain.”

Scroll back up to the Instructions section of your GPT. Here you define how the GPT should use the Strava data, e.g.:

what type of coach it is
what exactly you are interested in (HR Drift, Cardiac Lag, Heart Rate Recovery (HRR), Neuromuscular Fatigue, Power Decoupling…)
how it should interpret the activity streams
how deep the analysis should be
whether it should pull only certain sports
how to structure its output
what context about you (the athlete) should guide its decisions

This is the part where you tell it:

“If I say ‘judge my latest workout’, fetch my newest ride and analyze it deeply.”
“Always pull power, HR, cadence streams.”
“Use my private notes in Strava to interpret fatigue.”
“Give me coaching-style feedback, not summaries.”

These instructions are the core logic of the GPT: the API schema only lets it fetch data; the instructions tell it what to do with the data.

You can change and refine this text anytime.

My tip: I use private notes to tell it about RPE, fatigue, sleep quality, nutrition etc., so the GPT gets the context of the session.

These are the instructions I use personally (feel free to change them and comment on what works for you). I used ChatGPT to create them:

You are a cycling coach.
Your ONLY job is to analyze single cycling workouts from Strava in depth.

When the user says things like:
- “judge my latest workout”
- “analyze my last ride”
- “deep dive on yesterday’s Zwift session”
- “go deeper on my endurance ride”
you MUST produce a full detailed analysis. Never keep it short unless the user explicitly asks for short.

------------------------------------------------------------
1. DATA ACCESS RULES
------------------------------------------------------------

Always fetch recent activities:
GET /athlete/activities?per_page=3

Activity selection logic:
- Use only activities where sport_type is one of:
  ["Ride", "VirtualRide", "GravelRide", "RoadRide", "EBikeRide"].
- If the user mentions a day (“yesterday”, “Friday”), pick the matching one.
- If the user mentions a name fragment (“Tempus Fugit”, “UNBOUND”), pick that one.
- Otherwise, pick the most recent cycling activity.
Never ask “public or private?” or “which ride?” if you can infer it.

Get full details:
GET /activities/{id}

Get streams:
GET /activities/{id}/streams
keys = "time,watts,heartrate,cadence,distance"
key_by_type = true

If stream call fails, state that only summary data is available and do a structured analysis anyway.

------------------------------------------------------------
2. ANALYSIS LOGIC
------------------------------------------------------------

Using activity details + streams, you MUST evaluate:

• Workout type:
  recovery, Z2 endurance, tempo, sweet spot, threshold, VO2max, long ride, race-like, mixed.

• Power:
  pacing consistency, variability index, interval identification, fade/no fade, execution vs intended zone.

• Heart rate:
  HR drift (first vs second half), interpretation of drift in context, HR vs power alignment (suppressed/elevated/normal).

• Cadence:
  average, stability, fatigue signals (falling cadence late), neuromuscular control.

• Fatigue markers:
  rising HR at same power, falling power at same HR, unstable cadence, poor efficiency.

• Context:
  Use description AND private notes.
  Private notes may contain:
  - sickness
  - bad sleep
  - high stress
  - poor fueling
  - dehydration
  - illness, fatigue
  ALWAYS integrate private notes into the interpretation.
  If private notes contradict physiological data, highlight the mismatch.

------------------------------------------------------------
3. EXECUTIVE SUMMARY (ALWAYS FIRST)
------------------------------------------------------------

Always start your answer with a 4–5 line performance judgment.
It must evaluate the athlete, not describe the workout.

The Executive Summary MUST clearly state:
- whether execution was strong / weak / overpaced
- what HR–power–cadence signals say about the body today
- whether efficiency was good or bad
- whether pacing was correct or too hard
- ONE actionable line for next time

Examples:

GOOD:
“Strong execution today. HR and power aligned well, drift low, cadence stable. Very efficient day. Same intensity next time.”

BAD:
“Low efficiency today. HR rose too fast for this power, drift high, cadence faded. Likely fatigue or underfueling. Reduce intensity next session.”

OVERPACED:
“You pushed above the intended zone. HR too high for given power and drift spiked. Good effort but too costly. Keep surges controlled next time.”

------------------------------------------------------------
4. FULL OUTPUT FORMAT
------------------------------------------------------------

After the Executive Summary, produce the deep-dive analysis:

1. Session Summary
   - name, date, duration, distance
   - avg power, NP, avg/max HR, avg cadence

2. Workout Type & Inferred Intent

3. Power Execution
   - pacing, interval quality, variability, effort discipline

4. Heart Rate & Aerobic Drift
   - numerical drift and interpretation

5. Cadence & Neuromuscular Aspects

6. Fatigue & Efficiency Indicators
   - HR–power alignment
   - durability
   - metabolic response
   - INCLUDE interpretation from private notes

7. Coach’s Verdict

8. Recommendation
   - specific next-session suggestion (Z2, SST, rest, cadence work, longer intervals, shorter recoveries, lower surges, etc.)

------------------------------------------------------------
5. GENERAL RULES
------------------------------------------------------------

• Always respond in the same language the user used last.  
• Never ask follow-up questions when the data is available.  
• This GPT NEVER handles multi-week planning.  
• This GPT ALWAYS does deep-dive analysis.  
• Provide specific performance judgments and next-session intensity suggestions.  
• Always read & interpret private notes if present.
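The instructions ask for “HR drift (first vs second half)” and “HR vs power alignment” without defining them, so it helps to know roughly what to expect. The Python sketch below shows one common way to compute both from the stream arrays; it is an assumption about a reasonable definition, not a description of what the GPT actually calculates internally. The sample values are made up for illustration.

```python
from statistics import mean


def split_halves(values):
    """Split a stream into first and second half by sample count."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    mid = len(values) // 2
    return values[:mid], values[mid:]


def hr_drift_percent(heartrate):
    """Percent change of average HR from the first to the second half."""
    first, second = split_halves(heartrate)
    return (mean(second) - mean(first)) / mean(first) * 100


def decoupling_percent(watts, heartrate):
    """Percent drop in power-per-heartbeat from the first to the second half."""
    w_first, w_second = split_halves(watts)
    h_first, h_second = split_halves(heartrate)
    ratio_first = mean(w_first) / mean(h_first)
    ratio_second = mean(w_second) / mean(h_second)
    return (ratio_first - ratio_second) / ratio_first * 100


# Made-up steady ride: constant power, HR creeping up in the second half.
watts = [200] * 8
heartrate = [140] * 4 + [147] * 4

print(round(hr_drift_percent(heartrate), 1))           # 5.0
print(round(decoupling_percent(watts, heartrate), 1))  # 4.8
```

Plain HR drift only makes sense when power is roughly constant; the ratio-based variant is the safer one for sessions with uneven pacing.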

7. Authenticate Once

After saving your GPT:

Open it once
It will show a Strava login window
Approve access

From now on, your GPT can pull your workouts automatically.

I generally use a simple prompt like “Judge my latest workout” after Zwift has uploaded my ride to Strava.

It will ask you to allow the action every time, unless you hit “Always allow”.

That’s it: the Strava → ChatGPT bridge is done.

Your GPT can now:

fetch activities
read power, cadence, HR streams
analyze trends
compare intervals
evaluate fatigue markers

Everything else (the training philosophy, the analysis depth, progressions, block planning, etc.) is controlled by the Instructions section from Step 6.

See my example here (this is just the first snippet with the management summary, it goes into much more detail, skipped for privacy reasons ;) )

Addition: API Limits, Long Workouts & How Streams Behave

Strava’s API has generous limits, but there are a few points worth knowing:

1. Very long activities have extremely long streams

A 6-hour ride can easily produce:

20,000–30,000 power points
the same for heart rate
the same for cadence

ChatGPT can load them but:

large arrays slow down the response
sometimes the Streams endpoint fails halfway
extremely large responses may trigger ChatGPT’s safety or memory limits
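The numbers above follow from simple arithmetic. As a rough sketch, assuming the device records one sample per second (a common default, but an assumption here), the stream length grows linearly with ride duration:

```python
def samples_per_stream(hours: float, sample_interval_s: int = 1) -> int:
    """Number of data points in one stream for a ride of the given length."""
    return int(hours * 3600) // sample_interval_s


for hours in (2, 4, 6):
    n = samples_per_stream(hours)
    print(f"{hours} h ride: {n} samples per stream, {n * 3} across power, HR and cadence")
```

A 6-hour ride lands at 21,600 samples per stream, which is where the 20,000–30,000 range comes from.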

2. What usually fails first

Strava often returns:

Error: request returned too much data to process

That’s normal for:

> 3 hour indoor rides
> 4 hour outdoor rides

For me this doesn’t pose a huge challenge, as I’m mostly interested in judging my classic Zwift training sessions, which are usually around 1.5–2.5 hrs.

3. You can control granularity

You can instruct your GPT to:

downsample streams (e.g., “use every 5th data point”)
truncate (e.g., “limit to first/last 90 minutes”)
summarize at segment level rather than second-by-second

Example instruction you can add:

“For rides longer than 3 hours, automatically downsample the time-series streams to 5-second or 10-second resolution before analysis.”

ChatGPT handles that very well.
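For reference, this is roughly what the two downsampling styles mean in code. It is a minimal Python sketch with made-up power values; the GPT may implement the instruction differently. Keeping every n-th point is the cheapest option, while averaging over fixed windows keeps short spikes from disappearing entirely.

```python
def downsample_every_nth(values, n):
    """Keep every n-th sample, starting with the first."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return values[::n]


def downsample_mean(values, window):
    """Replace each run of `window` samples by its average."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Ten seconds of made-up power data at 1-second resolution.
watts = [200, 210, 190, 205, 195, 300, 310, 290, 305, 295]

print(downsample_every_nth(watts, 5))  # [200, 300]
print(downsample_mean(watts, 5))       # [200.0, 300.0]
```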

4. When it’s better to analyze summary metrics

If the GPT struggles to fetch full streams, it can fall back to summary data:

NP
Avg HR
HRmax
HR drift based on laps
Power curves (summaries)
Suffer Score
Elevation
Average cadence

It can still deliver a high-quality coaching interpretation, though obviously not as detailed as a point-by-point one.

5. You can add fallback instructions

Example:

“If streams fail, say so clearly and perform the best possible summary-based analysis instead.”

This prevents frustration and model confusion.
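The fallback rule boils down to a try-first, degrade-gracefully pattern. Here is a minimal Python sketch of that control flow; the fetch functions are hypothetical stand-ins for the two API calls, and the field names in the fake summary follow the schema from Step 5.

```python
def analyse_workout(activity_id, fetch_streams, fetch_summary):
    """Prefer full streams; fall back to summary data and say so."""
    try:
        return {"source": "streams", "data": fetch_streams(activity_id), "note": None}
    except Exception as error:  # e.g. response too large or a timeout
        return {
            "source": "summary",
            "data": fetch_summary(activity_id),
            "note": f"Streams unavailable ({error}); analysis is based on summary data.",
        }


# Hypothetical stand-ins so the sketch runs without network access.
def failing_streams(activity_id):
    raise RuntimeError("request returned too much data to process")


def fake_summary(activity_id):
    return {"id": activity_id, "average_watts": 210.0, "average_heartrate": 148.0}


result = analyse_workout(42, failing_streams, fake_summary)
print(result["source"], "-", result["note"])
```

The important part is the note: the reader of the analysis should always be told which data it was based on.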

Have fun analyzing (and riding!)

Let Snowflake (Cortex) Auto-Doc Your Columns Like a Pro!

David Klein — Fri, 21 Feb 2025 08:59:07 GMT

Streamlining Database Documentation with Snowflake Cortex

Snowflake Cortex is changing how we handle data management tasks by harnessing the power of large language models (LLMs). There are endless applications of Cortex AI and LLMs out there today.

Today I want to focus on one data management task: by using Snowflake Cortex, we can quickly generate meaningful, accurate descriptions for tables, views, and columns directly within Snowflake’s SQL-based environment. This helps reduce manual effort, improves data transparency, and enhances the discoverability of data, which is becoming increasingly important as LLMs enable users who were traditionally distanced from the data to access and leverage it more easily.

Generate Descriptions Easily in Snowsight UI

First of all, what I’m writing here might be obsolete in a few months, and I definitely hope so! Already today, you can generate descriptions directly from the Snowsight UI with just a few clicks. When you select “Generate descriptions” in the Table/Columns View, Snowflake automatically analyses the metadata and, if desired, sample data to generate a concise description for each column, table, or view. The descriptions are then saved in the COMMENT property, which can be viewed in Table Details or the DESCRIBE TABLE output. This simple UI feature is a fast and intuitive way to ensure that your tables and columns are well-documented, reducing the need for manual descriptions.

“Generate descriptions” Button in Snowsight UI

Why Bulk Documentation Makes Sense for Large Datasets

In reality, managing thousands of tables in production means even simple tasks, like adding column descriptions, can quickly become time-consuming. Unfortunately, at the moment, we can’t automate this above-mentioned feature across multiple tables or bulk-edit it via code. Fear not! Snowflake Cortex steps in, allowing you to automate this task in bulk using SQL, making it easy to keep your database documentation up-to-date. That’s exactly why I’m outlining this workflow here.

By generating descriptions for all columns at once, you save time and ensure your database is consistently documented — no matter how large or complex the dataset. This scalability is especially important as data volumes continue to grow, making bulk documentation a key tool for maintaining an efficient, well-governed data ecosystem.

Example of a Workflow, Step-by-Step

Let’s dive into an example workflow. I’ll show you how to leverage Snowflake Cortex to create column descriptions, generate the corresponding ALTER statements, and then iterate through them to apply the changes to your database. This could later be wrapped in a stored procedure and a triggered task to automate the process easily.

Step 1: Create and Populate the Table I want to describe

-- Create the table
CREATE OR REPLACE TABLE ANALYTICS.PUBLIC.CUSTOMER_FINANCIALS (
    CUSTOMER_ID INT AUTOINCREMENT PRIMARY KEY,
    CUSTOMER_NAME STRING,
    INDUSTRY STRING,
    REVENUE_USD NUMBER(12,2),
    PROFIT_MARGIN FLOAT,
    EMPLOYEE_COUNT INT,
    COUNTRY STRING,
    LAST_PURCHASE_DATE DATE,
    CREDIT_RATING STRING,
    ACCOUNT_MANAGER STRING
);
-- Insert sample data
INSERT INTO ANALYTICS.PUBLIC.CUSTOMER_FINANCIALS (CUSTOMER_NAME, INDUSTRY, REVENUE_USD, PROFIT_MARGIN, EMPLOYEE_COUNT, COUNTRY, LAST_PURCHASE_DATE, CREDIT_RATING, ACCOUNT_MANAGER) 
VALUES
    ('Acme Corp', 'Manufacturing', 50000000.00, 12.5, 1200, 'USA', '2024-01-15', 'A', 'John Doe'),
    ('Beta Systems', 'Software', 25000000.00, 20.3, 500, 'Germany', '2024-02-10', 'B', 'Jane Smith'),
    ('Gamma Innovations', 'Healthcare', 12000000.00, 15.8, 200, 'France', '2023-12-05', 'A', 'Alice Brown'),
    ('Delta Dynamics', 'Retail', 78000000.00, 9.7, 3000, 'UK', '2024-03-01', 'C', 'Bob Johnson');
...

The table I’m using for this example workflow looks like this:

Step 2: (Optional) Check the Auto-Generated Descriptions for All Columns

With your table in place, the next step is to generate descriptions for all the columns and check whether they make sense at all; after all, LLMs can hallucinate. Snowflake Cortex provides the COMPLETE function, which can generate text based on a prompt. Here, I’ll use this function to automatically generate column descriptions for all columns in the CUSTOMER_FINANCIALS table. This step is optional of course, but I personally like to build these things out step-by-step to avoid complex debugging later on.

WITH ai_descriptions AS (
    SELECT 
        c.column_name,
        SNOWFLAKE.CORTEX.COMPLETE(
            'mistral-7b', 
            'Provide a concise description (≤10 words) for the following database column. 
            Do not repeat the column name or mention the data type. 
            Example: "Total amount paid", "Date of last transaction". Column: ' || c.column_name
        ) AS generated_description
    FROM information_schema.columns c
    WHERE c.table_schema = 'PUBLIC' 
    AND c.table_name = 'CUSTOMER_FINANCIALS'
)
SELECT * FROM ai_descriptions;

This query utilises the SNOWFLAKE.CORTEX.COMPLETE function to generate descriptions for every column in the table. You can see that the COMPLETE function takes a model (mistral-7b in this case) and a prompt that requests a concise description for each column — I personally found mistral-7b to work quite well and efficiently in this type of task. (Feel free to play around with the prompt and be as verbose as needed for the depth of detail for your description.) The results will include the column name and the generated description.
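Since the point of this step is to check the output before applying it, part of that check can be automated. The Python sketch below is a hypothetical helper, run outside Snowflake on the query result, that flags descriptions breaking the rules stated in the prompt (at most 10 words, no repeated column name). The example descriptions are made up and are not actual model output.

```python
def description_issues(column_name: str, description: str, max_words: int = 10) -> list[str]:
    """Return the prompt rules that a generated description violates."""
    issues = []
    text = description.strip().strip('"')  # models often add spaces or quotes
    if not text:
        issues.append("empty")
    if len(text.split()) > max_words:
        issues.append(f"more than {max_words} words")
    if column_name.lower() in text.lower():
        issues.append("repeats the column name")
    return issues


# Made-up examples, not real Cortex output.
print(description_issues("REVENUE_USD", ' "Annual revenue in US dollars"'))
print(description_issues("REVENUE_USD", "REVENUE_USD holds the revenue"))
```

Anything this flags still needs a human look; an empty list only means the formal rules were followed, not that the description is true.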

The descriptions that Cortex created for me look roughly like this:

Step 3: Create the Alter Statements

Once the descriptions are generated, you can use them to update the table’s column comments. Just as in Step 2, I utilise our prompt and Cortex to generate the descriptions and wrap them in the correct ALTER statements:

CREATE OR REPLACE TEMP TABLE temp_alter_statements AS
SELECT 
    'ALTER TABLE IDENTIFIER(''"ANALYTICS"."PUBLIC"."CUSTOMER_FINANCIALS"'' ) 
     ALTER COLUMN "' || column_name || '" COMMENT ' || 
     '''' || REPLACE(comment, '''', '''''') || '''' AS ALTER_SQL
FROM (
    SELECT 
        column_name,
        SNOWFLAKE.CORTEX.COMPLETE(
            'mistral-7b', 
            'Provide a concise description (≤10 words) for the following database column.
             Do not repeat the column name or mention the data type. 
             Example: "Total amount paid", "Date of last transaction". Column: ' || column_name
        ) AS comment
    FROM information_schema.columns 
    WHERE table_schema = 'PUBLIC' 
    AND table_name = 'CUSTOMER_FINANCIALS'
);

This query generates a temporary table with all the ALTER SQL statements that will update the column comments in the CUSTOMER_FINANCIALS table. The contents of the table will be simple statements, looking like this:
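To make the quoting in that expression easier to follow, here is a minimal Python sketch of the same string assembly. It mirrors the logic of the SQL (wrap the column in double quotes, double up any single quotes in the comment) but not its exact whitespace, and the description text is made up for illustration.

```python
def build_alter_statement(qualified_table: str, column_name: str, comment: str) -> str:
    """Assemble one ALTER statement that sets a column comment."""
    escaped = comment.replace("'", "''")  # same escaping as REPLACE(comment, '''', '''''')
    return (
        "ALTER TABLE IDENTIFIER('" + qualified_table + "') "
        + 'ALTER COLUMN "' + column_name + '" '
        + "COMMENT '" + escaped + "'"
    )


statement = build_alter_statement(
    '"ANALYTICS"."PUBLIC"."CUSTOMER_FINANCIALS"',
    "CREDIT_RATING",
    "Customer's credit grade",  # made-up description containing a quote
)
print(statement)
```

Without the escaping step, a single apostrophe in a generated description would end the string literal early and break the statement.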

Step 4: Execute the Statements

In this final step, I use a cursor to iterate over the ALTER statements stored in temp_alter_statements. The cursor fetches each ALTER_SQL and executes it using EXECUTE IMMEDIATE. This process can be wrapped in a stored procedure, allowing for the bulk application of column descriptions efficiently with a single call.

DECLARE cur CURSOR FOR 
    SELECT ALTER_SQL FROM temp_alter_statements;
BEGIN
    FOR rec IN cur DO 
        EXECUTE IMMEDIATE rec.ALTER_SQL;
    END FOR;
END;
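The stored procedure wrapper mentioned above could look roughly like the following. This is a minimal sketch, not a tested production procedure: the name apply_column_comments is made up for illustration, and EXECUTE AS CALLER is an assumption so that the procedure runs in the calling session, where the temporary table temp_alter_statements lives.

```sql
-- Sketch: wrap the cursor loop in a Snowflake Scripting stored procedure.
-- apply_column_comments is a hypothetical name; temp_alter_statements must
-- already exist in the calling session when the procedure is called.
CREATE OR REPLACE PROCEDURE apply_column_comments()
RETURNS VARCHAR
LANGUAGE SQL
EXECUTE AS CALLER
AS
$$
DECLARE
    stmt VARCHAR;
    applied INTEGER DEFAULT 0;
    cur CURSOR FOR SELECT ALTER_SQL FROM temp_alter_statements;
BEGIN
    FOR rec IN cur DO
        -- Copy the statement into a variable, then run it.
        stmt := rec.ALTER_SQL;
        EXECUTE IMMEDIATE :stmt;
        applied := applied + 1;
    END FOR;
    RETURN 'Applied ' || applied || ' column comments';
END;
$$;

-- Apply all generated comments with a single call.
CALL apply_column_comments();
```

Returning the count is a small convenience: it lets you compare the number of applied statements with the number of columns in the table.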

Last thing to note — I’m really not that deep into SQL, so there might be a more elegant way to achieve this or automate it further. Feel free to share your thoughts in the comments!

The Magic Behind Snowflake Cortex: Reducing Manual Work

By automating the generation and application of column descriptions, Snowflake Cortex drastically reduces the time and effort traditionally required for this task. What once required manual writing of descriptions or using external tools can now be done directly in SQL with Snowflake’s Cortex-powered functions.

The beauty of this solution is that it doesn’t require deep knowledge of machine learning models or Python scripting. Everything happens within the familiar SQL environment of Snowflake, making it accessible for anyone with SQL experience.

What’s Next for Snowflake Cortex?

As Snowflake Cortex continues to evolve, we can expect to see even more powerful tools for data management tasks. The ability to automate data documentation is just the beginning. In the future, Cortex will likely help simplify other data management tasks, making the Snowflake platform even more powerful and reducing the need for manual data operations. With Snowflake’s integration of LLMs, we’re witnessing a new era of AI-driven data management that promises to make complex workflows easier and more efficient than ever before.

Imagine a future where much of your data management tasks — from data cleaning to data enrichment and beyond — are automated by AI. Snowflake is making this vision a reality, and it’s just the beginning.

Transforming Lead Nurturing with Large Language Models in Snowflake Cortex

David Klein — Mon, 12 Feb 2024 06:18:00 GMT

In the realm of sales and marketing, managing leads effectively is paramount for business growth. Lead management encompasses the process of identifying potential customers, nurturing their interest, and guiding them through the sales funnel. However, this task often poses challenges due to the sheer volume of data and the need for personalized engagement.

Automation with Large Language Models (LLMs) offers a way to tackle these challenges. With LLMs, businesses can streamline lead management processes, automate repetitive tasks, and gain insights from large volumes of data.

LLMs excel at understanding and analyzing natural language, enabling them to extract meaningful information from lead interactions, classify prospects based on their interests and behaviors, and generate personalized responses. This automation not only enhances efficiency but also allows sales and marketing teams to focus on high-value activities, such as building relationships and closing deals.

This post explores the role of lead nurturing in sales and marketing and how LLM-powered automation can change the process. We’ll discuss the challenges businesses encounter with lead management, how LLM-based automation can address them, and the benefits of using Snowflake’s LLM capabilities for lead nurturing. Stay tuned to discover how this approach can drive efficiency, personalization, and growth in your sales pipeline.

Imagine something like this, fully machine-generated and automatically sent to your prospects!

What are LLMs — for those who have missed the last year in tech industry headlines…

Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand and generate human-like text based on vast amounts of training data. These models have revolutionized natural language processing tasks and are increasingly being applied across various domains. Here are some common use cases for LLMs:

Generative Tasks
Summarization
Rewriting
Searching
Question Answering
Clustering
Classification

Snowflake Cortex: Empowering SQL Users with AI Capabilities

Snowflake Cortex revolutionizes AI accessibility by bringing the power of generative AI and large language models (LLMs) to users of all backgrounds within the Snowflake platform. With a suite of serverless functions, Cortex enables seamless analytics and AI app development, offering access to specialized ML and LLM models tailored for specific tasks. By integrating these capabilities into Snowflake’s unified governance framework, Cortex ensures secure data management while enhancing user experiences with intuitive features like Snowflake Copilot and Universal Search. This groundbreaking service democratizes AI, empowering organizations to unlock dynamic insights from their data assets with ease.

The coolest thing about Cortex, especially for SQL users like myself who aren’t deeply into Python and don’t have an extensive background in Data Science or ML Engineering, is its complete reliance on SQL. The magic lies in the simplicity of SQL functions such as snowflake.cortex.complete, snowflake.cortex.sentiment, snowflake.cortex.summarize and more, which I’ll demonstrate shortly.
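For orientation, here is a minimal sketch of two of the task-specific functions named above, applied to literal strings so it does not depend on any table. The input texts are invented for illustration. As far as I know, SENTIMENT returns a numeric score (higher means more positive) and SUMMARIZE returns a shortened version of the text.

```sql
-- Sketch: task-specific Cortex functions on literal, made-up input text.
SELECT
    SNOWFLAKE.CORTEX.SENTIMENT(
        'The webinar was excellent and the demo answered all my questions.'
    ) AS sentiment_score,
    SNOWFLAKE.CORTEX.SUMMARIZE(
        'The prospect attended our data analytics webinar last week, downloaded two eBooks on machine learning afterwards, and asked the support team for pricing details and a product demo for their analytics department.'
    ) AS summary;
```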

The Scenario: Everyday CRM Data at NextGen Innovations

At NextGen Innovations (a fictitious company of course…), we specialize in providing advanced solutions for data analytics, AI, and cybersecurity. Our CRM system is filled with valuable data, including leads from companies across the globe. These leads, like those from Siemens AG in Germany, Airbus SE in France, and Microsoft Corporation in the United States, represent typical entries in CRM databases.

Snapshot of our CRM Data: Initial Lead Details in the table “leads”

In a traditional setup, our Business Development Representatives (BDRs) and Sales Development Representatives (SDRs) would utilize this CRM data for lead nurturing. They’d engage with leads based on their activities, such as downloading eBooks, attending events, or registering for webinars, to move them through the sales funnel.

For instance, if a lead like John Smith, a Data Analyst at Siemens AG, downloads an eBook on “Mastering Data Science,” our team would follow up with targeted communications tailored to his interests. This might involve sharing additional resources, scheduling personalized demos, or inviting him to relevant events.

The goal is to build relationships with leads, understand their needs, and guide them towards becoming qualified opportunities for our sales team. Traditional lead nurturing relies heavily on manual efforts to identify, prioritize, and engage with leads based on their behaviors and interests within the CRM system.

Snowflake.Cortex.Complete: Leveraging General-Purpose LLMs in Snowflake

In the following section, we’ll explore the functionality of snowflake.cortex.complete within Snowflake Cortex and its implications for your lead nurturing efforts. Snowflake Cortex uses Large Language Models (LLMs), in this case Llama 2, for general-purpose tasks such as text completion and generation. Note that Cortex is currently in private preview within Snowflake (as of February 2024, when this was written).

Snowflake.cortex.complete allows users to generate text completions based on the prompts provided, using LLMs like Llama 2 7B or Llama 2 70B. However, it’s crucial to understand that these models are not yet tailored to individual company data. As Cortex continues to evolve within Snowflake, future iterations may include fine-tuning LLMs on company-specific data, offering more precise and tailored outcomes for lead nurturing endeavors.

The Cortex function can be used as follows:

SNOWFLAKE.CORTEX.COMPLETE(model, prompt) -- returns string

model: A string containing the name of the model to be used, either "llama2-70b-chat" or "llama2-7b-chat".
prompt: A string containing the prompt to be used to begin generating the response. In an interactive chat scenario, this would be something the user typed; here, it is a plain-English description of what you want.
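A minimal call, as a sketch, looks like this. The prompt text is just an example, and model availability is an assumption: which model names are accepted depends on your Snowflake region and on the current state of the preview.

```sql
-- Sketch: simplest possible COMPLETE call with a literal prompt.
-- Returns the generated text as a single string column.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'llama2-70b-chat',
    'In one sentence, what does a Business Development Representative do?'
) AS response;
```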

Part 1: Using the LLM for Data Extraction, Enrichment & Classification

In this phase, we embark on a comprehensive process to extract, enrich, and classify data from our CRM systems. Our primary objectives are to assign an industry classification based on the company’s profile, enrich existing company information, and extract language profiles based on the company’s headquarters. Additionally, we aim to identify the area of focus for each prospect by analyzing their interactions with our company’s content and events. All this information will later help us prompt the LLM to write an automated follow-up email to our prospects.

1. Data Extraction: We begin by extracting pertinent data from our CRM systems, including lead information, company profiles, and interaction histories.

2. Data Enrichment: Once extracted, we enrich the data by supplementing it with additional information sourced from external databases and APIs. This may include details such as company size, revenue, industry trends, and market insights.

3. Classification: Utilizing the enriched data, we perform zero-shot classification to categorize leads based on various criteria like industry verticals, company size, and engagement levels. This segmentation enables tailored approaches to effectively nurture leads.

By combining data extraction, enrichment, and classification, we lay the groundwork for a highly personalized and effective lead nurturing strategy. This phase sets the stage for the subsequent generation of dynamic content and engagement strategies tailored to each lead’s unique profile and preferences.

We will use different prompting approaches for the classification to show you the alternatives. You can either hardcode the classes you want the LLM to use in the prompt, or use an additional SQL query to list ("listagg()") the categories/classes the LLM is supposed to base its decisions on. For the latter approach we will use the data in the table industries.

industries table (n=20), standard industry classifications
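To see what the listagg() approach actually feeds into the prompt, you can run the aggregation on its own first. This sketch uses only the industries table and the industry_name column that appear in the prompt query further down.

```sql
-- Sketch: preview the comma-separated class list that gets spliced into the prompt.
SELECT LISTAGG(industry_name, ', ') WITHIN GROUP (ORDER BY industry_name) AS industry_list
FROM industries;
```

Keeping the classes in a table means you can add or rename an industry without touching the prompt text.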

OK, let’s get our hands dirty and dive into the code for the tasks mentioned above. We’ll incorporate all of the tasks into one prompt to maximize the output with a limited number of tokens and iterations. The LLM will be prompted to return a JSON structure so we can easily parse it into new columns for subsequent use cases.

-- Classify and enrich every lead with one COMPLETE call per row.
-- Column references resolve against the row being updated, so no FROM clause
-- or alias is needed (a self-join without a join condition would pair every
-- lead with every other lead).
UPDATE leads
SET cortex_json = 
    SNOWFLAKE.CORTEX.COMPLETE(
        'llama2-70b-chat',
        CONCAT(
            'You are a classification and information extraction and enrichment bot.
            If you are unsure, return null. 
                For industry identify the industry of the company, choose exclusively from the list given below (choose only one),
                for profile single-sentence (maximum 12 words) business description
                for language extract the language in format "en" for english speaking countries, "de" , "fr" etc from the information of the country column you are given
                for area of focus try to match into the categories: Artificial Intelligence, Cybersecurity, Applications, Internet of Things (if you choose multiple write them in json array format)
            Respond exclusively in JSON format without any additional comments.

            {
            industry: ',
            -- Allowed industry classes, pulled from the industries table
            (SELECT LISTAGG(industry_name, ', ') WITHIN GROUP (ORDER BY industry_name) 
             FROM industries),
            ',
            profile: 
            language:
            area_of_focus:

            }
            Company: ', company, '
            Country: ', country, '
            Return results'
        )
    );

As you can see, the prompt includes the "listagg()" command to pull the information for industry classifications from a different table, while the area_of_focus categories are hardcoded into the prompt, to give you an idea of the various approaches possible. We also allowed multiple values for the area of focus in order to see how well the LLM handles arrays with multiple elements.

As always, the precision and quality of our responses hinge on the level of detail provided in the prompt. Therefore, we strive to be as prescriptive and comprehensive as possible in crafting our prompts to ensure optimal results, which look like this:

This is a preview of the first ten rows passed through our Cortex function

 {
  "industry": "Manufacturing",
  "profile": "Airbus SE: A global leader in aeronautics, space, and related services.",
  "language": "fr",
  "area_of_focus": ["Artificial Intelligence", "Cybersecurity"]
}

Pretty cool I guess! The JSON has 4 fields: industry, company profile, language profile, and area of focus. The information comes from the general knowledge of our LLM. Based on the time it was trained, this might of course be a bit outdated, but for this use case it works quite well.

Now we can easily parse this information into new columns if we like.

UPDATE leads
SET 
    industry = TRIM(PARSE_JSON(cortex_json):industry, '"'),
    profile = TRIM(PARSE_JSON(cortex_json):profile, '"'),
    language = TRIM(PARSE_JSON(cortex_json):language, '"'),
    area_of_focus = TRIM(PARSE_JSON(cortex_json):area_of_focus, '"');

Which leaves us with this structure. Certainly we wouldn’t need to create these new columns, but I always find it easier to work through this step-by-step while engineering this kind of workflow.
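Because the model is not guaranteed to return valid JSON for every row, a defensive variant can be useful. The sketch below is an alternative to the TRIM approach, not what I ran above. It assumes cortex_json is a text column. TRY_PARSE_JSON returns NULL instead of raising an error on malformed input, and the ::STRING cast removes the surrounding quotes.

```sql
-- Sketch: find leads whose model response is not valid JSON.
SELECT company, cortex_json
FROM leads
WHERE cortex_json IS NOT NULL
  AND TRY_PARSE_JSON(cortex_json) IS NULL;

-- Sketch: typed extraction without manual quote trimming.
SELECT
    company,
    TRY_PARSE_JSON(cortex_json):industry::STRING AS industry,
    TRY_PARSE_JSON(cortex_json):profile::STRING  AS profile,
    TRY_PARSE_JSON(cortex_json):language::STRING AS language,
    TRY_PARSE_JSON(cortex_json):area_of_focus    AS area_of_focus  -- stays a VARIANT (single value or array)
FROM leads;
```

Rows returned by the first query are candidates for a re-run of the COMPLETE call or for manual review.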

Part 2: Automating Personalized Engagement with Snowflake Cortex

In this phase, we harness the power of generative AI within Snowflake Cortex to craft personalized follow-up outreach to our prospects. While the first phase focused on data extraction, enrichment, and classification, this stage delves into content generation using the Llama 2 model.

Llama 2 stands out for its prowess in content creation, making it the ideal choice for crafting engaging and personalized follow-up communications. Fine-tuned models like Llama 2 have demonstrated their ability to expedite content generation across various platforms, from web content to social media posts and beyond. Moreover, Llama 2 can aid in brainstorming new ideas by suggesting keywords, topics, or formats tailored to our preferences.

By leveraging Llama 2 within Snowflake Cortex, we can automate the creation of dynamic and personalized follow-up outreach to our prospects. This enables us to engage with leads in a meaningful and tailored manner, increasing the likelihood of conversion and fostering stronger customer relationships. For this we will use the information we extracted, classified and enriched in the first part. For the second part we will again use Snowflake.cortex.complete with a new prompt in order to create our outreaches.

-- Generate one outreach email per lead.
-- Column references resolve against the row being updated, so no FROM clause
-- or alias is needed. Spaces around the concatenated values keep the words
-- in the prompt from running together.
UPDATE leads
SET OUTREACH =
SNOWFLAKE.CORTEX.COMPLETE(
'llama2-70b-chat',
CONCAT(
'You are an email-writing robot. Your task is to compose an email to prospects of our company called NextGen Innovations. Address them by their name (', NAME, ').
They have shown interest in ', AREA_OF_FOCUS, '. The company ', COMPANY, ' is in the industry of ', INDUSTRY, ', and their major business is ', PROFILE, '. The person you are writing to works as ', TITLE, '.
They have been engaging with our product via: ', LAST_INTEREST, ', the date they have been engaging is: ', LAST_MARKETING_ACTIVITY, '.
However, do not use the exact date but refer to it relative to today (', TO_VARCHAR(CURRENT_DATE()), '), such as ("I have seen you downloaded ... yesterday, a couple of days ago, a week ago, a couple of weeks ago, etc.").
Use all the information such as industry, their title, their areas of focus, the content they have been looking at to write a compelling email asking about typical challenges they would face in the industry and their job position and how our product would solve this.
Ask them if they would be interested in learning more, add a call to action, etc. End the email with greetings from the person who owns the contact, which is: ', LEAD_OWNER, ' and add their title after a comma, Business Development Associate.
Respond exclusively in JSON format without any additional comments. Be creative in writing style and feel free to be a bit challenging.

{
subject:
email_body:
}'

)
);

Similar to Part 1, we prompt Cortex to deliver a JSON output, but this time, the prompt is tailored specifically for crafting personalized email outreach. The prompt includes various columns from our previous step, such as the prospect’s name, company, industry, job title, areas of focus, and engagement history. Additionally, we include the current date to provide context for the LLM, as the model has no built-in notion of today’s date. This ensures that the generated emails feel timely and relevant to the recipient.
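An alternative to letting the model do the date arithmetic is to compute the relative wording in SQL and pass the finished phrase into the prompt. This is only a sketch of that idea, not what I used above. It assumes LAST_MARKETING_ACTIVITY is stored as a DATE, and the day thresholds are arbitrary choices for illustration.

```sql
-- Sketch: derive a relative time phrase in SQL instead of in the prompt.
-- Assumes last_marketing_activity is a DATE column; thresholds are illustrative.
WITH activity AS (
    SELECT
        name,
        last_marketing_activity,
        DATEDIFF('day', last_marketing_activity, CURRENT_DATE()) AS days_since_activity
    FROM leads
)
SELECT
    name,
    days_since_activity,
    CASE
        WHEN days_since_activity <= 0  THEN 'today'
        WHEN days_since_activity = 1   THEN 'yesterday'
        WHEN days_since_activity <= 6  THEN 'a couple of days ago'
        WHEN days_since_activity <= 13 THEN 'a week ago'
        ELSE 'a couple of weeks ago'
    END AS relative_activity
FROM activity;
```

The resulting relative_activity value could then be concatenated into the prompt in place of the raw date, which removes one source of variation from the output.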

The prompt itself is structured to guide the LLM in composing a natural and compelling email. It instructs the model to address the prospect by name, acknowledge their interest and engagement with our company, and tailor the email content based on their industry, job position, and areas of focus. The email prompts the prospect to consider typical challenges they face in their industry and role, highlighting how our product can address these challenges. A call to action is included to encourage further engagement, and the email concludes with greetings from the lead owner, adding a personal touch to the outreach.

Please note: Due to the nature of LLMs and their propensity for generating diverse responses, the output of the Cortex function may vary with each run. This means that every time we run the code, the responses will be a bit different.

This leaves us (for example) with this output:

{"subject": "Revolutionizing Data Science with NextGen Innovations",
"email_body": "Dear Emma Johnson,

I hope this email finds you well. I came across your profile and noticed that you have been engaging with our product, specifically in the areas of Artificial Intelligence, Cybersecurity, and Applications. As the Head of Data Science at Airbus SE, I'm sure you're constantly looking for ways to stay ahead in the industry.

I wanted to reach out and ask if you've encountered any challenges in your role, such as managing large datasets, ensuring data security, or struggling to find the right tools to support your team's work? Our product has been designed to address these exact pain points, and I believe it could be a valuable addition to your toolkit.

Our AI-powered solutions have been developed to help data scientists like yourself streamline their workflows, improve data quality, and enhance decision-making capabilities. Additionally, our cybersecurity features ensure that your data is protected from unauthorized access and breaches.

I'd love to schedule a call to discuss how our product can support your team's efforts and help you overcome any obstacles you may be facing. Would you be interested in learning more?

Please let me know if you're available this week or next, and I'll make sure to schedule a time that works for you.

Best regards,
Sophia Müller, Business Development Associate

P.S. I noticed that you downloaded our whitepaper on AI and Machine Learning yesterday. Great timing! I'm excited to share more insights with you on how our product can help you achieve your goals."
}

I think the output isn’t quite that bad. Sure, a good BDR/SDR cannot be easily replaced by an LLM, as there’s a human touch and nuanced understanding that an AI model may lack. However, what’s cool about the output is how it captures the essence of a personalized outreach. It references specific actions taken by the prospect, such as downloading a whitepaper ‘yesterday,’ and addresses potential challenges they may face in their industry or role, even if at a high level. This level of detail adds authenticity to the communication and can help foster a deeper connection with the recipient. Might be enough to “nurture” that lead and get them engaged again.

Looking Ahead: Current Limitations and Outlook

Cool stuff in my opinion. But as always, it’s important to consider some limitations and potential future outlooks for the platform.

One current limitation is that the LLM used in Cortex is not fine-tuned on company-specific data. While this may impact the specificity of generated content, it also highlights the platform’s versatility in handling diverse datasets. As Cortex continues to evolve, future iterations may include fine-tuning LLMs on company-specific data, offering more precise and tailored outcomes.

Additionally, I’ve experimented with the Snowflake.cortex.translate function to regionalize outreach based on the language extracted in previous steps. However, the current model quality may not suffice for highly personalized, coherent texts. Some text bits were translated weirdly, but honestly I didn’t go down that road for long; a closer look at the translate function might be something for a future blog post. In Snowflake’s defense, remember these functions are all in Private Preview currently.
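For reference, such a translation experiment can be sketched like this. It assumes the TRANSLATE signature of (text, source language, target language), that the outreach column holds the JSON from Part 2, and that the language column contains codes the function supports. None of this is tuned. The alias email_body_localized is a made-up name.

```sql
-- Sketch: translate the generated email body into the lead's language.
-- Assumes outreach holds the JSON from Part 2 and language holds codes like 'de' or 'fr'.
SELECT
    name,
    language,
    SNOWFLAKE.CORTEX.TRANSLATE(
        TRY_PARSE_JSON(outreach):email_body::STRING,
        'en',
        language
    ) AS email_body_localized
FROM leads
WHERE language <> 'en';
```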

In considering alternative avenues, Snowflake’s introduction of Snowpark Container Services presents an intriguing prospect. This fully managed container offering facilitates the deployment, management, and scaling of containerized applications, hence LLMs, within the Snowflake platform. While this avenue would require self-hosting it in Snowflake, it offers a pathway to circumvent current limitations by providing greater flexibility and control over model deployment processes.

Conclusion

Snowflake Cortex LLM Functions democratize AI by making it accessible to everyone through the simplicity of SQL. This accessibility means that even those without extensive data science backgrounds can leverage powerful AI capabilities to drive business growth and innovation. Imagine the possibilities for automation in lead nurturing, marketing outreach, and beyond — it’s a game-changer in terms of efficiency and effectiveness.

Moreover, Snowflake’s platform ensures that these capabilities can scale effortlessly to handle vast amounts of data. With Snowflake’s built-in governance and security features, organizations can trust that their data remains protected while unlocking the full potential of Cortex LLM Functions. It’s not just about utilizing AI — it’s about doing so in a way that is scalable, secure, and aligned with the highest standards of governance.

Let’s automate the s***t out of stuff!