Transforming Data Science: Building a Topic Modelling App with Cohere and Databutton

Elle Neal
Databutton
15 min read · Jun 11, 2023

Learn how to build and deploy a Cohere AI Generated Topic Modelling Application using Databutton, the all-in-one workspace designed to streamline the process of creating, deploying, and managing data apps.

Introduction

This tutorial builds upon the Topic Modelling App tutorial Topic Modeler (cohere.com); the addition of Databutton’s data storage functionality will enable you to iterate through the clusters and use Cohere’s generate model to label and store the dataset.

Databutton comes with features like Pages, Jobs, Libraries, and Data Storage. Pages allow you to create multipage UIs for your users, Jobs enable scheduling of Python code, Libraries provide a place to write reusable code across your app, and Data Storage offers a simple put/get data store for various types of data.
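
For example, Data Storage exposes a simple put/get API. Here is a minimal sketch (the keys are illustrative, not ones used by this app) of storing and retrieving data:

import databutton as db
import pandas as pd

# Store a dataframe under a key, then read it back anywhere in the app
df = pd.DataFrame({"utt": ["set an alarm for 7am"]})
db.storage.dataframes.put(key="example.csv", df=df)
df = db.storage.dataframes.get(key="example.csv")

# The same put/get pattern works for JSON-serialisable objects
db.storage.json.put(key="example.json", value={"status": "ok"})
data = db.storage.json.get(key="example.json")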

Databutton also includes an AI assistant called Databutler, which is built on top of OpenAI and can help with code generation and problem-solving.

Deploying with Databutton enables you to use Streamlit functionality with additional backend and AI-assisted features, making it a more robust solution for larger or more complex projects.

Read this article for a more in-depth explanation of Databutton features: Beyond Streamlit: Databutton’s Revolution in AI App Development | by Elle Neal | Jun, 2023 | Medium

Let’s jump in!

Building a Topic Modelling Application using Cohere and Databutton

We are going to build and deploy an AI-powered topic modelling application; here is a demo of the app for you to explore.

There are two options for following this guide, either create a new app in Databutton or ‘Fork this app’ directly to your Databutton account using this link.

Step 1: Load and Pre-process Data

Step 2: Cluster Data to Identify Groups

Step 3: AI Generated Labels

Step 4: Deploy and Share Application

Application Setup

  • Create a free account with Databutton
  • Create a free account with Cohere and get your API Key
  • Create a new app: once you have signed up for your free Databutton account, you can create a new app in seconds by clicking on ‘New app’
  • Add secrets: easily add and manage your API secrets and use them anywhere within your app by calling ‘COHERE_API_KEY = db.secrets.get(name="COHERE_API_KEY")’ (see the snippet after this list)
  • Install packages: Databutton enables you to easily install and manage packages for your application
cohere
scikit-learn
hdbscan
umap-learn
setuptools
plotly
matplotlib
datasets
  • Create multipage UI: adding Streamlit pages has never been easier. Databutton’s multipage UI enables users to simply click ‘+ New Page’.
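
Before moving on, it is worth sanity-checking the setup. Here is a minimal sketch (using the secret name from the step above) that reads the API key from Databutton’s secret store and initialises the Cohere client used throughout this app:

import databutton as db
import cohere

# Read the API key from Databutton's secret store and create the client
COHERE_API_KEY = db.secrets.get(name="COHERE_API_KEY")
co = cohere.Client(COHERE_API_KEY)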

Step 1: Load and Pre-process Data

Load Dataset, Generate & Reduce Embeddings, Save to Data Storage

During this stage, we separate this task out so the user only needs to generate the embeddings once. The process produces two outputs that are saved to Databutton’s Data Storage:

  • The embeddings, as a JSON file
  • The reduced embeddings, appended to the dataframe

As the file is now saved to storage, the user can access the data throughout the application without having to perform this task again. There is no need to worry about caching the dataset for the user, with one line of code, you can return the data to the user anywhere within the application.

User Interface

The user can now either work with the preloaded file or upload their own; the pre-processing steps will run and the app will direct the user to the next steps.

Helper Functions

# Function to generate Cohere embeddings (sleep added to respect trial key rate limits)
def embed_text(texts):
    embeddings = []
    for i in range(0, len(texts), 90):
        batch = texts[i:i+90]
        output = co.embed(
            model="embed-english-v2.0",
            texts=batch)
        embeddings.extend(output.embeddings)
        time.sleep(60)
    return embeddings

# Function to reduce dimensionality of embeddings using UMAP
def reduce_dimensionality(embeddings):
    reducer = umap.UMAP()
    umap_embeddings = reducer.fit_transform(embeddings)
    return umap_embeddings[:, 0], umap_embeddings[:, 1]

# Function to save embeddings into a json file in Databutton data storage
def save_embeddings_to_json(df):
    # Create a dictionary where each key is the index of the DataFrame and each value is the corresponding embedding
    embeddings_dict = df['embedding'].to_dict()
    # Save the dictionary to storage under the key read later in the app
    db.storage.json.put(key="embeddings.json", value=embeddings_dict)

Streamlit Code within Databutton IDE

The code below completes the following tasks:

  • Upload a CSV file; if no file is uploaded, use the sample MASSIVE dataset, which contains a list of commands that people give to their AI-based personal assistant (e.g., Alexa).
  • Generate text embeddings using Cohere; the embeddings will be reduced to 2 dimensions using UMAP (Uniform Manifold Approximation and Projection).

Learn about embeddings here: The Embed Endpoint (cohere.com)

  • The embeddings will be saved to Databutton storage as a JSON file
  • The reduced CSV will be saved to Databutton’s Data Storage as a new file

from datasets import load_dataset
import umap.umap_ as umap
import databutton as db
import streamlit as st
import pandas as pd
import numpy as np
import cohere
import json
import time

# helper functions...

# Initialize the Cohere client used by embed_text (the key is read from Databutton secrets)
co = cohere.Client(db.secrets.get(name="COHERE_API_KEY"))

# 1. Upload a CSV file; if no file is uploaded, use the sample dataset
st.write("""
### 1. Upload a CSV file (if no file is uploaded, the sample dataset is used)
Load a dataset or use the pre-loaded MASSIVE dataset, which contains a list of commands
that people give to their AI-based personal assistant (e.g., Alexa). This type of clustering
exploration is similar to how a company would analyze incoming customer messages when designing a
chatbot.""")

# Get (a small sample of) the dataset
dataset = load_dataset("AmazonScience/massive", "en-US", split="train")

# Convert dataset to pandas DataFrame
df = pd.DataFrame(dataset).sample(300)
df.reset_index(drop=True, inplace=True)
db.storage.dataframes.put(key="uploaded_file.csv", df=df)

# Or upload a file
uploaded_file = st.file_uploader("Choose a csv file", type="csv")
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    db.storage.dataframes.put(key="uploaded_file.csv", df=df)

st.dataframe(df)
st.write("""
### 2. Generate and Reduce Embeddings then Save File
Generate text embeddings using Cohere; the embeddings will be reduced to 2 dimensions using UMAP
and a reduced CSV will be saved to Databutton's Data Storage""")
if st.button("Generate and Reduce Embeddings then Save File", type="primary"):

    # 2. Get Cohere embeddings
    df["embedding"] = embed_text(df["utt"].tolist())
    st.write("Embeddings Complete")

    # 3. Reduce dimensionality of embeddings using UMAP
    st.write("Reducing Dimensionality")
    umap_x, umap_y = reduce_dimensionality(df['embedding'].tolist())
    df['umap_x'] = umap_x
    df['umap_y'] = umap_y

    # 4. Save embeddings into a json file in Databutton data storage
    save_embeddings_to_json(df)

    # 5. Drop embeddings and save a new reduced file to Databutton's Data Storage
    df.drop('embedding', axis=1, inplace=True)
    db.storage.dataframes.put(key="reduced.csv", df=df)
    st.write("Reduced Dataset Saved to Data Storage")

Let’s take a look at our datasets…

Reduced embeddings

The reduced embeddings are now saved as a new file with the addition of umap_x and umap_y. We can copy the code snippet to import and use the dataframe anywhere in our application; here is an example of how you call the data: ‘df = db.storage.dataframes.get(key="reduced.csv")’

With Databutton, you can very quickly create a page for your user to explore, edit and add to the dataframe within their UI.

JSON Embeddings

We can now call the embeddings anywhere in our app with a line of code: data = db.storage.json.get(key="embeddings.json")
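
If you later need the full-dimensional vectors as an array (for example, to experiment with a different reduction), here is a minimal sketch for rebuilding them from the stored JSON; it assumes the save format above, and that JSON serialisation turns the integer index keys into strings:

import databutton as db
import numpy as np

# Load the stored embeddings and rebuild an array ordered by the original index
data = db.storage.json.get(key="embeddings.json")
embeddings = np.array([data[k] for k in sorted(data, key=int)])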

Step 2: Cluster Data to Identify Groups

This next stage in the workflow uses data science and machine learning to identify common patterns in the data: we seek to find clusters or groups in our data that share similar properties. This kind of analysis can help reveal patterns and structures in the data that might not be apparent otherwise.

User Interface

The user interface provides interactivity at crucial points to allow users to make decisions based on the results of initial analyses. Here are the step-by-step instructions:

  1. Load the DataFrame: The script will load the DataFrame previously saved in Databutton’s storage.
  2. Extract UMAP Coordinates: The script will extract the 2D UMAP coordinates that were previously computed. These coordinates are reduced representations of your data.

Learn about visualising your embeddings here: Visualizing Data (cohere.com)

  3. Determine Optimal Number of Clusters: The script will compute the sum of squared errors (SSE) for a range of potential numbers of clusters. This is part of the process to use the KMeans algorithm, which requires specifying the number of clusters beforehand. The script will plot an elbow plot, which can be used to select the optimal number of clusters: look for the “elbow” in the curve where adding more clusters doesn’t significantly decrease SSE.
  4. Create the Elbow Plot: The script will then visualize the SSE for different numbers of clusters as an elbow plot. This visualization will help you in choosing the optimal number of clusters.
  5. User Selection of Clusters: You will select the number of clusters based on the elbow plot. The selection is done using a slider in the Streamlit app.
  6. Run KMeans with User-selected Number of Clusters: With the chosen number of clusters, the script will run the KMeans algorithm, which classifies each point in your data into one of the clusters.

Learn about clustering embeddings here: Clustering Using Embeddings (cohere.com)

  7. Plot the Clusters: The script will create a scatter plot of your data points, colored by their assigned cluster. This visualization gives you a spatial representation of how the algorithm has classified your data.
  8. Calculate Cluster Centers and Distances from Centroids: The script calculates the centroid (or geometric center) of each cluster. It then computes how far each point in your data is from its cluster’s centroid. We will need this in the next step of the process where we label our clusters.
  9. Display and Save the Results: The script will display the DataFrame that now includes cluster labels and distances from centroids. If you’re satisfied with the results, you can save this labelled data back into Databutton’s storage for later use.

import databutton as db
import streamlit as st
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
from scipy.spatial import distance


# 1: Load dataframe
df = db.storage.dataframes.get(key="reduced.csv")

# 2: Extract UMAP coordinates
umapx = df['umap_x']
umapy = df['umap_y']
umap_coords = np.column_stack((umapx, umapy))

# 3: Define a range of potential clusters and compute SSE
clusters = range(2, 10)  # You may want to modify this range
sse = []
for k in clusters:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(umap_coords)
    sse.append(kmeans.inertia_)

# 4: Plot the elbow plot
fig, ax = plt.subplots(figsize=(10, 5))
plt.plot(clusters, sse, 'bx-')
plt.xlabel('k (number of clusters)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.title('Elbow Plot For Optimal Number of Clusters')
st.pyplot(fig)

# 5: User selects number of clusters based on elbow plot
n_clusters = st.slider('Number of Clusters', min_value=2, max_value=10, value=2)

# 6: Run KMeans with optimal number of clusters
kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
df['cluster_labels'] = kmeans_model.fit_predict(umap_coords)

# 7: Plotting the clusters
fig = px.scatter(df, x='umap_x', y='umap_y', color='cluster_labels', hover_data=['utt'])
st.plotly_chart(fig)

# 8: Calculate cluster centers and distances from centroids
centroids = df.groupby('cluster_labels')[['umap_x', 'umap_y']].mean().reset_index()

def calc_distance(row):
    # Distance from each point to its cluster's centroid
    centroid = centroids[centroids['cluster_labels'] == row['cluster_labels']]
    centroid_coords = (centroid['umap_x'].values[0], centroid['umap_y'].values[0])
    row_coords = (row['umap_x'], row['umap_y'])
    return distance.euclidean(row_coords, centroid_coords)

df['distance_from_centroid'] = df.apply(calc_distance, axis=1)

# 9: Display and save the results
selected_cluster = st.selectbox('Select a cluster label', df['cluster_labels'].unique())
temp_df = df[['utt', 'cluster_labels', 'distance_from_centroid']]
st.write(temp_df[temp_df['cluster_labels'] == selected_cluster])


if st.button("Save Labelled Data"):
    db.storage.dataframes.put(key="cluster.csv", df=df)

Step 3: AI Generated Labels

This code provides a user-friendly UI for your AI generated labels. The user can then save the labelled data for later use. The use of AI for generating initial labels and the option to revise these labels gives users a mix of automated and manual control over the labelling process.

User Interface

  • When the user clicks Generate AI Labels, the application will run the AI generated labels function and display each cluster’s keywords, AI generated label and the list of utterances. The user can either keep the AI generated labels or amend the name after inspection.
  • The user is then presented with a scatterplot containing the generated labels and the ability to hover over the datapoints.
  • When the user is satisfied, they can then save the new dataset containing the labels to Databutton’s Data Storage.

Prompt Engineering

The utterance_prompt in this code is a predefined structured text that is used as an input for the AI model to generate descriptive labels for the data clusters. It's designed to instruct the AI model about the context of the data and what the AI is supposed to do.

In this specific application, the prompt is defined to mimic a situation where clusters of user commands (utterances) given to an AI-based personal assistant are being summarized. It instructs the AI that each cluster should be summarized by a list of keywords and a name that captures the common theme of the utterances in that cluster. The prompt also includes example clusters with sample utterances, keywords, and cluster names.

In a nutshell, the utterance_prompt is designed to help guide the AI model in its task, in this case, generating a concise and descriptive label for a cluster of user utterances based on a set of keywords. The AI model uses the structure and content of the prompt to understand the task and to generate appropriate labels for new data clusters.

Learn about prompt engineering here: Prompt Engineering (cohere.com)

utterance_prompt = """
These are clusters of commands given to an AI-based personal assistant. Each cluster represents a specific type of task or query that users often ask their personal assistant to perform. A list of keywords summarizing the collection is included, along with the name of the cluster. The name of each cluster should be a brief, precise description of the common theme within the utterances.
---
Cluster #0
Sample utterances from this cluster:
- status for the pizza delivery from pizza hut
- find and order rasgulla of janta sweet home pvt ltd
- i will be at pizza hut in ten minutes and will stay there for next forty minutes arrange an uber for me that can drop me home

Keywords for utterances in this cluster: pizza, delivery, uber, order
Cluster name: Food Delivery

---
Cluster #1
Sample utterances from this cluster:
- show me where i can find a train
- can you show me the directions to go museum of flight in seattle
- please book train ticket to new york

Keywords for utterances in this cluster: train, directions, museum, book, ticket
Cluster name: Travel and Directions

---
Cluster #2
Sample utterances from this cluster:
- get route for los angles from here
- nearest restaurants available at this time
- i want you to book a train ticket for me

Keywords for utterances in this cluster: route, los angeles, restaurants, time, book, train, ticket
Cluster name: Route Navigation and Reservations

---
Cluster #3
Sample utterances from this cluster:
"""

Helper Functions

We create several helper functions for processing the dataframe, generating keywords, creating labels, and displaying information to the user. The extract_top_n_words function generates the most relevant keywords for each cluster. The generate_label function uses an AI model to generate a descriptive label for each cluster. The generate_keywords_and_label function wraps up these processes for each cluster and updates the dataframe accordingly. The present_cluster_data function is used to present the information about each cluster to the user.

# Function to generate AI labels for each cluster
def generate_label(utterance_prompt, text_series, retries=3):
    # Initialize Cohere model
    COHERE_API_KEY = db.secrets.get(name="COHERE_API_KEY")
    co = cohere.Client(COHERE_API_KEY)
    text_list = text_series.tolist()
    formatted_text_list = ""
    for text in text_list:
        formatted_text_list += "- " + text + "\n"
    try:
        response = co.generate(
            model="command-nightly",
            prompt=utterance_prompt + formatted_text_list,
            max_tokens=800,
            temperature=0.2,
            k=0,
            stop_sequences=[],
            return_likelihoods="NONE",
            truncate="END",
        )
        prompt = utterance_prompt + formatted_text_list
        return response.generations[0].text, prompt
    except Exception as e:
        # Back off and retry on rate-limit or transient errors
        time.sleep(30)
        if retries > 0:
            return generate_label(utterance_prompt, text_series, retries=retries - 1)
        st.error(e)

# Function to generate keywords for each cluster
def extract_top_n_words(vectorizer, tfidf_matrix, n=10):
    """
    Given a TfidfVectorizer and a TF-IDF matrix, return the `n` words with the highest TF-IDF scores.
    """
    # Sum tfidf frequency of each term through documents
    summed_tfidf = np.sum(tfidf_matrix, axis=0)

    # Connect each term to its summed frequency
    words_freq = [(word, summed_tfidf[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    # Return the n words with highest tfidf
    return [word[0] for word in words_freq[:n]]

# Function to run the workflow for extracting keywords and AI generated labels
@st.cache_resource()
def generate_keywords_and_label(_df, cluster, utterance_prompt):
    df = _df
    # Filter the dataframe for each cluster
    df_cluster = df[df["cluster_labels"] == cluster].sort_values(
        by="distance_from_centroid", ascending=True
    )

    # Generate the TF-IDF matrix
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(df_cluster["utt"])

    # Extract the top N keywords from each cluster
    keywords = extract_top_n_words(vectorizer, tfidf_matrix, n=10)

    # Generate a summary label using the AI model
    prompt = (
        utterance_prompt
        + "\nKeywords for messages in this cluster: "
        + ", ".join(keywords)
        + "\n"
    )
    summary, prompt = generate_label(prompt, df_cluster["utt"].head(n=5))

    # Extract cluster name from AI generated label
    start = summary.find("Cluster name:") + len("Cluster name:")
    end = summary.find("\n", start)
    cluster_name = summary[start:end].strip()

    # Update original dataframe with generated label and keywords
    df.loc[df["cluster_labels"] == cluster, "label"] = cluster_name
    df.loc[df["cluster_labels"] == cluster, "keywords"] = ", ".join(keywords)

    return df, keywords, cluster_name

# Function to display each cluster to the user
def present_cluster_data(df, cluster, keywords, label):
    df_cluster = df[df["cluster_labels"] == cluster].sort_values(
        by="distance_from_centroid", ascending=True
    )

    st.markdown(f"**Cluster {cluster}**")
    st.markdown(f"**Generated Keywords:** {', '.join(keywords)}")
    st.markdown(f"**AI Proposed Label:** {label}")
    st.dataframe(df_cluster[["utt", "distance_from_centroid"]])

  1. Data Loading: The dataframe is loaded from Databutton’s storage with the key "cluster.csv".
  2. Cluster Processing: For each unique cluster in the dataframe, the generate_keywords_and_label function is called to generate relevant keywords and an AI generated label. These are added to the dataframe. Then, the present_cluster_data function is used to display this information to the user.

Learn about Topic Modelling here: Topic Modeling (cohere.com)

  3. User Interactions: The user is given the option to rename the AI generated label for each cluster. If the user enters a new label, the dataframe is updated with this new label.
  4. Saving Changes: Finally, the user can click a button to save their changes to the dataframe. When the “Save changes” button is clicked, the updated dataframe is saved back to the Databutton storage with a new key of "labeled_cluster.csv".

from sklearn.feature_extraction.text import TfidfVectorizer
import databutton as db
import streamlit as st
import pandas as pd
import plotly.express as px
import cohere
import numpy as np
import time

# helper functions...

# Load Data
df = db.storage.dataframes.get(key="cluster.csv")

# Initialize an empty dictionary to hold cluster labels
cluster_labels = {}

# Initialize the button's session state on first run
if "load_state" not in st.session_state:
    st.session_state.load_state = False

# Define the TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words="english")

clusters = df["cluster_labels"].unique()
if (
    st.button("Generate AI Labels", key="labelling", type="primary")
    or st.session_state.load_state
):
    st.session_state.load_state = True
    for cluster in clusters:
        df, keywords, label = generate_keywords_and_label(df, cluster, utterance_prompt)
        present_cluster_data(df, cluster, keywords, label)
        # Add user interaction to rename the label
        state_key = f"user_label_{cluster}"
        new_label = st.text_input(
            f"Enter a new label for cluster {cluster} (leave empty to keep the AI proposed label)",
            value=st.session_state.get(state_key, label),
            key=state_key,
        )
        if new_label != label:
            df.loc[df["cluster_labels"] == cluster, "label"] = new_label

    # For each cluster, find the utterance that is closest to the centroid
    for cluster in df["cluster_labels"].unique():
        min_distance_idx = df[df["cluster_labels"] == cluster][
            "distance_from_centroid"
        ].idxmin()
        df.loc[min_distance_idx, "closest_centroid_utt"] = df.loc[
            min_distance_idx, "utt"
        ]

    # Create the scatter plot
    fig = px.scatter(
        df, x="umap_x", y="umap_y", color="cluster_labels", hover_data=["utt", "label"]
    )

    # Add labels to the points that are closest to the centroid in each cluster
    for i in range(len(df)):
        if df.iloc[i]["utt"] == df.iloc[i]["closest_centroid_utt"]:
            fig.add_annotation(
                x=df.iloc[i]["umap_x"], y=df.iloc[i]["umap_y"], text=df.iloc[i]["label"]
            )

    # Display the plot
    st.plotly_chart(fig)


save = st.button("Save changes", type="primary")
if save:
    # Reset the input session states after save. Also reset the button state
    st.session_state.load_state = False
    for key in list(st.session_state.keys()):  # copy keys so we can delete while iterating
        if key.startswith('user_label_'):
            del st.session_state[key]
    db.storage.dataframes.put(key="labeled_cluster.csv", df=df)
    st.write("Labelled Data Saved")

Step 4: Deploy and Share Application

Now your app is ready to deploy and share with a couple of clicks.

Next Steps

The application can form part of a workflow where the user can then build and deploy a classification model using the AI generated labels.

Learn about classification models here: Classification Models (cohere.com)
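
As a sketch of that next step (assuming the labeled_cluster.csv file saved above and a 2023-era cohere SDK, where the Example class lives under cohere.responses.classify; the import path varies by version), the labelled utterances can serve as few-shot examples for Cohere’s Classify endpoint:

import databutton as db
import cohere
from cohere.responses.classify import Example  # import path varies by SDK version

# Use the labelled utterances as few-shot training examples
df = db.storage.dataframes.get(key="labeled_cluster.csv")
examples = [Example(row["utt"], row["label"]) for _, row in df.iterrows()]

co = cohere.Client(db.secrets.get(name="COHERE_API_KEY"))
response = co.classify(inputs=["wake me up at 7am tomorrow"], examples=examples)
print(response.classifications[0].prediction)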

Conclusion

This tutorial presented a comprehensive guide to building and deploying an AI-powered Topic Modelling application using Cohere and Databutton. We explored a few of the key features of Databutton, including Pages and Data Storage, and showed how these can be leveraged to create a seamless and interactive user experience.

We built a topic modeling application that uses Cohere’s generate model to label and store data sets, a functionality further enhanced by Databutton’s data storage capabilities. Throughout the process, we highlighted the importance of user interactions and Databutton’s ability to facilitate these, allowing for optimal customization of data and results.

The walkthrough showcased how to conduct various data science processes, including loading and preprocessing data, clustering data to identify groups, and using AI to generate labels. It also highlighted the benefits of Databutton’s IDE in facilitating code implementation and its AI assistant, Databutler, in aiding code generation and problem-solving.

In the final stage, the tutorial demonstrated how to deploy and share the application, ready to be utilized as part of a broader workflow or standalone for topic modeling.

In conclusion, this guide proves the power and convenience of using Cohere and Databutton together for AI applications. The streamlined process, interactive UI, and the ease of managing data apps make this approach a robust solution for both beginner and seasoned data scientists and developers.

Elle Neal
Databutton

AI & Data Science enthusiast, passionate STEM Ambassador teaching Lego robotics and coding to children and building AI apps for neurodiverse learners.