Summarizing Topics in Higher Ed Survey Data with LLMs

Kevin Chovanec
Jul 2, 2024 · 11 min read


Topic Modeling Student Survey Data

This is a post in a series on how pre-trained language models (PLMs) have begun to transform work done in the context of higher education. I’ve shared details about the project here.

Universities have some of the richest text data of any institutions: application essays, student writing housed in online learning management systems, student surveys, teaching evaluation comments, etc.

Unlike open text data mined online, much of this data is also very high quality, curated by professionals. Well-designed, methodologically sound student surveys have collected both structured and unstructured data from students throughout their careers (first years, graduates, alumni), often for decades. On these surveys, in addition to all the Likert-scale questions and activity selections, students usually have one or more opportunities to answer open text questions. Historically, this open text data has received less attention, simply because of how difficult it is to process manually: small offices rarely possess the resources to devote staff to wading through and marking up thousands of student survey responses, and many universities thus have a valuable data asset sitting in storage, relatively unused.

In this post, we’ll look at one of the most fundamental questions we can ask of unstructured text: what is all this about? What are the main themes or topics that have come up in student responses, and how have these changed over time or by demographic group? LLMs once again present a quick, powerful tool for this exploration, allowing us to label our text data in ways that offer us a holistic, high-level understanding of our student survey responses.

We will use BERTopic, a pre-built package that divides topic modeling into five modular steps and combines traditional document clustering algorithms with pre-trained language models. Usually, this method will be more efficient and give better results than simply pasting surveys into a generative model and asking for a summary.

Since student data is protected by FERPA and cannot be shared, I’ve created a dummy dataset with ChatGPT, asking it to generate one hundred example survey responses and saving the result as an Excel file. Here is the prompt I used:

You are a graduating senior answering a survey question from your university: how would you improve your undergraduate experience? Generate one hundred example answers, covering multiple topics commonly important to undergraduates. Topics can overlap, and you should have multiple answers associated with each topic. Answers should be between fifty and one hundred and fifty words, and they should reflect personal experiences.
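For reference, here is a minimal sketch of how those generated answers could be saved into the spreadsheet loaded in the code below; the responses list here is just a hypothetical stand-in for the text copied out of ChatGPT.

import pandas as pd

# Hypothetical stand-in for the one hundred answers generated by ChatGPT
responses = [
    "I would add more quiet study spaces in the library during finals week...",
    "Advising felt rushed; I wish my advisor had more time for course planning...",
    # ...the remaining generated answers
]

# Save to the file read in the next code block
pd.DataFrame({"text": responses}).to_excel("fake_student_survey_data.xlsx", index=False)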

Topic Modeling

Like so many tasks now done by LLMs, topic modeling has long existed in NLP. Before we begin coding, I thought it might be helpful to offer a very brief overview of the theory. Topic modeling is an unsupervised method for clustering texts, meaning that we can use it to sort documents that we have not yet processed into groups; we do not need any training data. (If, on the other hand, we already had labels, and we wanted to apply these labels to a new year of survey data, this would be “supervised” text classification, which we will see next time.)

Traditionally, topic models look at the distribution of words over a corpus, or set of documents (in this case, our survey responses), learning words associated with a set number of clusters. Topic modeling algorithms will not tell us the number of clusters — that has to be provided, and it usually requires experimentation and both qualitative and quantitative analysis to find the ideal number of clusters, a process Jonathan Chang has referred to as “reading tea leaves.”

The output of topic modeling is twofold: first, a list of words associated with each topic; and second, for each document, the proportion that falls into each topic. Therefore, we do not get a “theme” or topic label directly, but rather a list of weighted words for each topic and a distribution over topics for each document; we’ll look at using a generative model to turn these into readable themes in the second part of this post. Since traditional topic models are mixed-membership models, we also assume every document has at least some relevance to almost every topic: maybe a student is talking a bit about both facilities and faculty, for example.
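To make that output concrete, here is a minimal sketch of a traditional topic model using scikit-learn’s LatentDirichletAllocation, a stand-in for the classic approach rather than the BERTopic pipeline we use below: we get back a word list per topic and a topic distribution per document.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A handful of toy "survey responses"
texts = [
    "The dining hall food could be better",
    "More vegetarian options in the dining hall",
    "Advising appointments were hard to schedule",
    "My advisor helped me plan my courses",
]

counts = CountVectorizer(stop_words="english").fit(texts)
X = counts.transform(texts)

# We must choose the number of topics up front
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Output 1: the top words for each topic
words = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(i, [words[j] for j in topic.argsort()[-3:]])

# Output 2: the topic proportions for each document
print(lda.transform(X))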

As a side note, in my previous field of English/Digital Humanities, we used topic modeling quite often. In the early twenty-first century, topic modeling became a trend, as we started exploring whether computers might be able to tell us, for example, the main themes of Shakespeare’s Macbeth, or the main topics in seventeenth-century novels.

We will be using the base BERTopic, though other flavors of this model are available on Huggingface. In fact, maybe someday someone will create a model designed for working with survey data or educational data? However, we’ve found that the base model works quite well, a quick, secure, low-resource solution that offers us a high-level overview of the main topics discussed by students in their survey responses.

A full description of how BERTopic works is available from its creators here. As a quick summary, the model first uses pre-built sentence transformer models to convert text to numerical embeddings, reduces the dimensions of those embeddings through a dimensionality reduction technique (UMAP by default), clusters the documents with HDBSCAN, applies a bag-of-words approach to each cluster to find representative words, and then fine-tunes those representative words through various representation models.

Their guide is excellent, and this post simply explores how we can use the flexibility inherent to the model to get the best results for educational survey data.
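As a rough sketch of that modularity, each step maps to an argument of the BERTopic constructor; the components below approximate the defaults as I understand them (the final fine-tuning step is optional and off by default), and every one of them can be swapped out, which is exactly what the rest of this post does.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # 1. convert text to embeddings
    umap_model=UMAP(n_components=5, metric="cosine"),         # 2. reduce the embedding dimensions
    hdbscan_model=HDBSCAN(min_cluster_size=10),               # 3. cluster the documents
    vectorizer_model=CountVectorizer(),                       # 4. bag of words per cluster
    representation_model=KeyBERTInspired(),                   # 5. optionally fine-tune the representative words
)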

Once again, we can create a basic model with only a few lines of code.

from bertopic import BERTopic
import pandas as pd

#This is a dataset of fake student survey responses generated with ChatGPT
surveys = pd.read_excel("fake_student_survey_data.xlsx")
docs = surveys['text']

#An example of the first response
print(docs[0])

#Here is a base model with all the hyperparameters left in place
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

topic_model.get_topic_info()
#Let's take a look at one of the topics:
topic_model.get_topic(0)

>>[('and', 0.1088069298742026),
('campus', 0.08619195218903308),
('was', 0.08086341588854194),
('community', 0.07326148866137135),
('for', 0.07074522534467909),
('that', 0.06299987817928654),
('students', 0.061176607040223686),
('cultural', 0.06106494569875059),
('my', 0.05898717873257478),
('diverse', 0.0587606564452443)]

At least for our fake survey data, BERTopic doesn’t produce great results if we do not adjust any hyperparameters: only three topics, and they’re all dominated by rather meaningless words (like ‘was’ or ‘and’ — traditional stop words). However, BERTopic offers a number of ways that we can improve this performance.

First, BERTopic, like all NLP models, converts text to numbers, and we can choose from any available sentence transformer on Huggingface to create our embeddings. Researchers are updating and improving these models constantly, and you can find the current leaderboard here. In general, for this task, we probably want to balance effectiveness and model size, especially if our corpus is large. The base model is all-MiniLM-L6-v2, and I’ve substituted a slightly newer model, ‘gte-small’. We also tried using a model based on distilBERT for feature extraction, and all three produced solid results.

from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("thenlper/gte-small")
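One practical note: the embedding step is usually the slowest, and BERTopic also accepts precomputed embeddings, so we can encode the responses once and reuse them while experimenting with the later steps. A minimal sketch, assuming the docs variable from above:

# Encode the survey responses once and reuse the embeddings while tuning the rest of the pipeline
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Later, pass both the documents and the precomputed embeddings to fit_transform
# topics, probs = topic_model.fit_transform(docs, embeddings)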

Perhaps most simply, we can remove stop words with the count vectorizer. If useful, we could also explore the ngram_range parameter, which allows us to consider phrases in addition to words when clustering the survey responses. In practice with surveys, we found leaving ngram_range at 1 produced the most readable results.

from sklearn.feature_extraction.text import CountVectorizer

#remove stop words
vectorizer_model = CountVectorizer(stop_words="english")
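If we did want to include phrases, the same vectorizer can count bigrams as well; this variant is just for illustration, since we kept single words in the end.

# Illustrative variant that also counts two-word phrases when describing each cluster
bigram_vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))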

Next, BERTopic uses a clustering algorithm called HDBSCAN. We can experiment with its hyperparameters, reducing the minimum cluster size or changing the distance metric to produce better results. Since we want more than the three initial topics, I’ve lowered min_cluster_size and left the other parameters in place.

from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=3, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

Finally, we can use various representation models to refine our results. This can be relatively simple, such as the built-in KeyBERTInspired model used below, which compares the documents in each topic with candidate topic keywords, using our embedding model to fine-tune the keywords toward the best match. As we’ll see below, we can also add generative models to the pipeline to help create and refine topic labels.

from bertopic.representation import KeyBERTInspired

# Create your representation model
#KeyBERTInspired is one of the pre-packaged representation models. It does some fine tuning of the results
representation_model = KeyBERTInspired()

With these updates, we now have a model that produces sensible, recognizable topics.
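Here is a sketch of how those pieces fit back into BERTopic and get refit on the survey responses; it mirrors the fuller pipeline shown later in the post.

# Rebuild the model with our customized components and refit on the survey responses
topic_model = BERTopic(
    embedding_model=sentence_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    verbose=True
)
topics, probs = topic_model.fit_transform(docs)

# The topics should now be readable keyword clusters rather than stop words
topic_model.get_topic_info()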

BERTopic also has built-in tools for visualizing our results. For example, we might want to explore the topics themselves, perhaps visualizing the most important keywords.

#We can visualize our terms to understand the topics a bit better
topic_model.visualize_barchart()

Or, more interestingly, we might like to explore these results across other fields in our data, checking to see how survey results have changed over time, for example, or how they are distributed across various demographic or academic groups. I’ve added some fake demographic groups to our fake survey results, and we can check on the distribution to see which groups focus on which topics:

#Create a random group assignment to stand in for demographic data (for example, gender)
import random
n = len(surveys)
group = [random.randint(0, 1) for _ in range(n)]

# Visualize topics per this randomly created group
surveys['group'] = group
surveys['group_label'] = surveys['group'].apply(lambda x: 'Male' if x == 0 else 'Female')
classes = surveys['group_label']

topics_per_class = topic_model.topics_per_class(docs, classes=classes)
topic_model.visualize_topics_per_class(topics_per_class)
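If the surveys also carried a timestamp, such as the year each survey was administered, the same idea extends over time; here is a sketch using a hypothetical, randomly generated year for each response.

# Hypothetical survey years; real data would carry the actual administration year
years = [random.choice([2021, 2022, 2023]) for _ in range(n)]

topics_over_time = topic_model.topics_over_time(docs, years)
topic_model.visualize_topics_over_time(topics_over_time)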

Finally, we can explore using generative models to facilitate this process. As mentioned above, the result of a topic model is not a label or theme, but rather a set of weighted keywords associated with each topic/cluster. BERTopic allows us to add a generative model to the pipeline as our representation_model, the last step in the process, feeding the keywords (and the most representative documents) of each cluster to the model and asking it to assign a single label. This step again allows for flexibility and creativity, as we can experiment both with the thousands of text generation models available on Huggingface and with the default prompt we feed to the model.

I updated the suggested prompt to clarify the context — educational survey data — and specify the desired labels, experimenting with the language a bit to try to get readable, sensible labels that fit the example data.

For this task, we might turn to available small(ish) language models on Huggingface, since summarizing a few keywords is relatively straightforward. We can find a low-bit, open-source LLM leaderboard, maintained by Intel, here.

We explored four possibilities. First, we found that the small FLAN-T5 models from Google produced very good results almost instantly, since the model has only ~60 million parameters. Below is the setup using ‘flan-t5-small’ as our representation model:

from transformers import pipeline
from bertopic.representation import TextGeneration
from bertopic import BERTopic

#A prompt with context and explicit directions
prompt = "I have a topic in student surveys described by the following keywords: [KEYWORDS]. Based on the previous keywords, what is this topic about in a single word?"

# Create your representation model
generator = pipeline('text2text-generation', model='google/flan-t5-small', max_new_tokens=50)
representation_model = TextGeneration(generator, prompt=prompt)

topic_model = BERTopic(
    # Pipeline models
    embedding_model=sentence_model,
    # umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,

    # Hyperparameters
    top_n_words=10,
    verbose=True
)
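With flan-t5-small wired in as the representation model, we refit exactly as before; the same two lines apply to each of the alternative representation models tried below.

# Refit with the generative representation model included in the pipeline
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()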

Next, we tried substituting in TinyLlama, a condensed model built on the architecture of Llama 2, as our representation model. Despite experimenting with various prompts, TinyLlama tended to produce labels that were a bit long and bombastic; even when told to use a single word, the model often offered a sentence, like that eager student who volunteers to answer every question and writes five pages for an assignment requiring three.

#Let's try to use TinyLlama, a condensed version of Llama2 that should run a bit quicker.

import torch
from transformers import pipeline
from bertopic.representation import TextGeneration
from sklearn.feature_extraction.text import CountVectorizer

generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

#Providing more context in the prompt, with the structure required for Llama models
messages = [
    {
        "role": "system",
        "content": "You are a helpful, honest assistant labeling student survey responses",
    },
    {
        "role": "user",
        "content": "I have a topic in student surveys described by the following keywords: [KEYWORDS]. Here are a few representative survey responses: [DOCUMENTS] Based on the previous keywords, what is this topic about in a single word or phrase?",
    },
]
prompt = generator.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


representation_model = TextGeneration(generator, prompt=prompt)

We also tried Google’s smaller Gemma model. This produced stronger results than TinyLlama, tending more toward pithy phrases that captured the topic well, such as “Career Development” or “Community Engagement.”

#An attempt to use Google's Gemma 2b as our representation model
#You will need to log in to Huggingface and be approved to use the model

from transformers import pipeline
from bertopic.representation import TextGeneration
from sklearn.feature_extraction.text import CountVectorizer

prompt = "I have a topic in student surveys described by the following keywords: [KEYWORDS] Here are four representative survey responses: [DOCUMENTS]. Based on the previous keywords, what is this topic about in a single word or phrase?"

# Create your representation model
#Gemma is a decoder-only (causal) model, so we use the text-generation pipeline task
generator = pipeline('text-generation', model='google/gemma-1.1-2b-it', max_new_tokens=50)
representation_model = TextGeneration(generator, prompt=prompt)

Finally, we explored a Zephyr model (a fine-tuned Mistral 7B), running it on CPU only to test time and resource use. It took a bit longer to process, though it arguably produced the most satisfying representations, such as “Sustainable Campus Initiatives” or “Financial Transparency in Tuition Costs.”


from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline
from bertopic.representation import TextGeneration

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-GGUF",
    model_file="zephyr-7b-alpha.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)

prompt = """<|system|>You are a helpful and honest assistant for labeling student surveys.</s>
<|user|>
I have divided survey responses into topics, grouping similar responses together.:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create one word or phrase label of this topic. Make sure you to only return the label and nothing more.</s>
<|assistant|>"""

# Text generation with Zephyr
zephyr = TextGeneration(generator, prompt=prompt)
representation_model = {"Zephyr": zephyr}
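Passing the representation model as a dictionary registers the Zephyr labels as a named "aspect" alongside the default keyword representation; as I understand the BERTopic API, those labels can then be pulled out with the full flag, roughly as follows.

# Rebuild the topic model with the Zephyr aspect and refit on the survey responses
topic_model = BERTopic(
    embedding_model=sentence_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    verbose=True
)
topics, probs = topic_model.fit_transform(docs)

# Returns both the default keywords and the Zephyr label for the first topic
topic_model.get_topic(0, full=True)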

In the end, some combination of human review and a small model, like Flan-t5-small, gave us sensible and useful labels. Using this approach, we can divide a new set of a few thousand surveys into topics in only a few minutes, allowing us to quickly understand the main interests and concerns of our students.

The full notebook for this project can be found here.
