NLP with iMessage data
Welcome to my 2023 year wrap for a group chat entitled “The Real Housewives of Georgia Tech” (a college friend group chat). I performed natural language processing analysis on all of our text messages over the past year — and I’m sharing the results in this article!
I will soon release a step-by-step post on how to access this iMessage data on a Mac. I may even include a GitHub link with the code to clean it (; so stay tuned.
But for now, I’m going to skip to the fun part: the results (and code snippets showing how I got them).
As most group chats go, one person kept it alive and thriving. The Real Housewives of Georgia Tech’s 2023 Yapper of the Year goes to Stormi ❤
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# df is the cleaned iMessage DataFrame (read from 'engineerd_df.csv' in the Data Prep section)
# get just text messages, no reactions
text_messages = df[(df.reaction.isnull()) & (df.body.notnull())].reset_index(drop=True)
text_messages
# plot the number of messages sent by each person
colors = sns.color_palette("rocket", n_colors=text_messages["Name"].nunique())
sns.countplot(x=text_messages["Name"], hue=text_messages["Name"], palette=colors, legend=False).set(title="Pure Messages")
Regarding reactions received, my friends care about one thing more than the rest. Laughs.
2023’s Jokester of the Year goes to Adeline 💀
and since this was such a close category…
2023’s Honorable mention — Got Some Laughs goes to Rain 😅
# Thank you Stack Overflow post for a guideline: https://stackoverflow.com/questions/34615854/countplot-with-normalized-y-axis-per-group 🙏
x, y = 'reaction_name', 'reaction'
# count each reaction type, grouped by the reaction_name column
df2 = df.groupby(x)[y].value_counts()
# total rows per person, used to normalize
countDict = df['Name'].value_counts().to_dict()
# normalize: divide every count by that person's total
df2 = df2 / df2.index.get_level_values(x).map(countDict).to_numpy()
df2 = df2.mul(100)
df2 = df2.rename('percent').reset_index()
# plot
colors = sns.color_palette("rocket", n_colors=df2["reaction"].nunique())
g = sns.catplot(data=df2, x='reaction_name', y='percent', hue='reaction', kind='bar', palette=colors).set(title="Reactions Received Normalized")
g.ax.set_ylim(0, 22)
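To make the normalization step concrete, here is a minimal sketch on a tiny made-up table. The people and rows are invented, and I'm assuming `Name` is whoever sent the row and `reaction_name` is the person whose message got the reaction. Like the code above, it divides by each person's total row count.

```python
import pandas as pd

# Toy data (invented): one row per message or reaction.
# Assumption: 'reaction' is empty for plain messages.
toy = pd.DataFrame({
    "Name":          ["Ann", "Ann", "Bo", "Bo", "Bo", "Cy", "Cy", "Bo", "Ann", "Cy"],
    "reaction":      [None, None, None, None, None, "Laughed", "Laughed", "Loved", "Laughed", None],
    "reaction_name": [None, None, None, None, None, "Ann", "Bo", "Ann", "Bo", None],
})

# Reactions received per person, split by reaction type (rows without a reaction drop out)
counts = toy.groupby("reaction_name")["reaction"].value_counts()

# Rows per person, used as the denominator
rows_per_person = toy["Name"].value_counts()

# Divide each count by the matching person's row total and express it as a percent
denominator = counts.index.get_level_values("reaction_name").map(rows_per_person).to_numpy()
percent_df = (counts / denominator).mul(100).rename("percent").reset_index()

print(percent_df)
```

Bo receives two laughs over four rows, so his "Laughed" bar would sit at 50%, even though Ann received the same number of reactions in total.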
One other quick stat I calculated was everyone’s biggest fan. I defined that as who reacted the most to an individual’s messages… And it turned out one person was everyone’s biggest fan!
2023’s Megafan goes to Jean 🪭
2023’s Low-key Stan (who reacted to Jean the most) goes to me❤
x, y = 'reaction_name', 'Name'
# count how often each person reacted, grouped by whose message they reacted to
df2 = df.groupby(x)[y].value_counts()
countDict = df['Name'].value_counts().to_dict()
# normalize it: divide every count by the total rows of the person in reaction_name
df2 = df2 / df2.index.get_level_values(x).map(countDict).to_numpy()
df2 = df2.mul(100)
df2 = df2.rename('percent').reset_index()
# plot time
colors = sns.color_palette("rocket", n_colors=df2["Name"].nunique())
g = sns.catplot(data=df2, x='reaction_name', y='percent', hue='Name', kind='bar', palette=colors).set(title="Biggest Fan Normalized")
g.ax.set_ylim(0, 22)
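If you would rather read the winners off a table than a bar chart, something like the sketch below should work. It uses invented rows and raw counts (not the normalized percentages plotted above), and it assumes `Name` is the person reacting and `reaction_name` is the person whose message was reacted to.

```python
import pandas as pd

# Toy reaction log (invented): who reacted, and to whose message
toy = pd.DataFrame({
    "Name":          ["Cy", "Cy", "Bo", "Cy", "Cy", "Ann", "Ann"],
    "reaction_name": ["Ann", "Ann", "Ann", "Bo", "Bo", "Bo", "Cy"],
})

# Count reactions for every (recipient, reactor) pair
fan_df = (
    toy.groupby("reaction_name")["Name"]
       .value_counts()
       .rename("n")
       .reset_index()
)

# For each recipient, keep the reactor with the highest count
biggest_fan = (
    fan_df.sort_values("n", ascending=False, kind="stable")
          .drop_duplicates("reaction_name")
          .set_index("reaction_name")["Name"]
)

# The "megafan" is whoever shows up as biggest fan most often
megafan = biggest_fan.value_counts().idxmax()

print(biggest_fan)
print("Megafan:", megafan)
```

Ties are broken arbitrarily here, so for a close race the chart is still the better judge.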
Now for some legit NLP. For the next part of this project, I performed the following on the text messages:
- Sentiment analysis
- Topic detection (zero shot and had my friends guess the topics lol)
- GPT3.5-Turbo Summaries
Data Prep
1. import the right packages
import seaborn as sns
import matplotlib.pyplot as plt
from transformers import pipeline
from huggingface_hub import hf_hub_download
import pandas as pd
2. read the df, and subset it to only text messages, no reactions
df = pd.read_csv('engineerd_df.csv')
text_messages = df[(df.reaction.isnull())&(df.body.notnull())].reset_index(drop=True)
text_messages
Sentiment Analysis
For sentiment analysis, I used a fine-tuned model from Hugging Face (proper citation at bottom): https://github.com/pysentimiento/pysentimiento.
# define pipeline
pipe = pipeline("text-classification", model="finiteautomata/bertweet-base-sentiment-analysis")
# list of message bodies to classify
texts = text_messages['body'].astype(str).to_list()
# call on sentiment model
sentiment_results = pipe(texts, truncation=True)
# transform to a dataframe (one row per message, with 'label' and 'score' columns)
sentiment_df = pd.DataFrame(sentiment_results)
# concat with full text message df
result = pd.concat([text_messages, sentiment_df], axis=1)
result
2023’s Neutral Nomad goes to Lan 🌝
2023's Downer Dan goes to Rain 🌈
and 2023’s Positive Pumpkin goes to Jean 🎃
x, y = 'Name', 'label'
# share of each sentiment label per person, as a percent
df1 = result.groupby(x)[y].value_counts(normalize=True)
df1 = df1.mul(100)
df1 = df1.rename('percent').reset_index()
colors = sns.color_palette("coolwarm", n_colors=result["label"].nunique())
g = sns.catplot(data=df1, x='Name', y='percent', hue='label', kind='bar', palette=colors, aspect=2).set(title="Normalized Sentiment")
g.ax.set_ylim(0, 100)
# label each bar with its percentage
for p in g.ax.patches:
    txt = str(round(p.get_height(), 2)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    g.ax.text(txt_x, txt_y, txt)
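The awards can also be pulled straight from the percentage table instead of eyeballing the chart. Here is a small sketch on invented data; I'm assuming the model's labels are `POS`, `NEU` and `NEG`, and the names and messages are made up.

```python
import pandas as pd

# Toy stand-in for `result` (invented): one row per message with its predicted label
toy_result = pd.DataFrame({
    "Name":  ["Ann"] * 4 + ["Bo"] * 4,
    "label": ["POS", "POS", "POS", "NEU", "NEG", "NEG", "NEU", "NEU"],
})

# Percent of each person's messages that fall under each label
shares = (
    toy_result.groupby("Name")["label"]
              .value_counts(normalize=True)
              .mul(100)
              .rename("percent")
              .reset_index()
)

# For each label, pick the person with the highest share
winners = shares.loc[shares.groupby("label")["percent"].idxmax()].set_index("label")["Name"]

print(winners)
```

Because the shares are normalized per person, someone who rarely texts can still win a category.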
Topic Analysis
I used Facebook’s bart-large-mnli with the zero-shot classification pipeline for topic analysis (also on Hugging Face).
What this means in normal English is that I used a really big language model that was never specifically trained on the following categories:
['Significant others', 'life updates', 'Franky', 'politics', 'weekend plans']
I then called on this model to give each text message a probability score (ranging from 0 to 1) for each category.
Pretty friken cool, huh? Anyways, Amelia picked “Significant others”, and that was the most talked about topic. Therefore:
2023’s Philosopher of the Year goes to Amelia 🧐
Amelia may not be the most vocal in the group chat — but she is listening and taking it all in! Here is my code:
from tqdm import tqdm
pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
message_list = text_messages['body'].to_list()
# function to process messages in batches and save each batch to its own CSV
def process_messages_batch(message_list, batch_size=500):
    results_list = []
    # start at 0 to cover every message (use a later offset only when resuming an earlier run)
    for i in range(0, len(message_list), batch_size):
        batch_messages = message_list[i:i + batch_size]
        # iterate through the batch of messages
        for message in tqdm(batch_messages, desc=f"Processing batch {i // batch_size + 1}", unit="message"):
            # get the scores for each candidate label
            result = pipe(message, candidate_labels=["Franky", "weekend plans", "Significant others", "life updates", "politics"])
            # create a dictionary with the message and scores
            entry = {"short_message": message}
            entry.update({label: score for label, score in zip(result['labels'], result['scores'])})
            # append the dictionary to the results list
            results_list.append(entry)
        # save the results for the current batch
        batch_df = pd.DataFrame(results_list)
        batch_df.to_csv(f'results_batch_{i // batch_size + 1}.csv', index=False)
        results_list = []  # clear the list for the next batch
# run on the full list of messages
process_messages_batch(message_list)
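One plausible way to go from per-message scores to a "most talked about" topic is to give each message its highest-scoring label and count the labels. The sketch below does that on invented scores shaped like the batch CSVs; it is an illustration, not necessarily the exact tally behind the award.

```python
import pandas as pd

labels = ["Franky", "weekend plans", "Significant others", "life updates", "politics"]

# Two toy "batches" (invented messages and scores) with the same columns as the saved CSVs
batch_1 = pd.DataFrame([
    {"short_message": "he finally texted back", "Franky": 0.05, "weekend plans": 0.05,
     "Significant others": 0.70, "life updates": 0.15, "politics": 0.05},
    {"short_message": "brunch on saturday?", "Franky": 0.05, "weekend plans": 0.80,
     "Significant others": 0.05, "life updates": 0.05, "politics": 0.05},
])
batch_2 = pd.DataFrame([
    {"short_message": "we are officially dating", "Franky": 0.02, "weekend plans": 0.03,
     "Significant others": 0.75, "life updates": 0.18, "politics": 0.02},
    {"short_message": "i got the job!!", "Franky": 0.02, "weekend plans": 0.03,
     "Significant others": 0.05, "life updates": 0.85, "politics": 0.05},
])

# Stitch the batches back together (with real files, read each CSV first)
scores = pd.concat([batch_1, batch_2], ignore_index=True)

# Assign each message the label with the highest score, then count labels
scores["top_topic"] = scores[labels].idxmax(axis=1)
topic_counts = scores["top_topic"].value_counts()
most_discussed = topic_counts.idxmax()

print(topic_counts)
print("Most talked about:", most_discussed)
```

A stricter version could ignore messages whose top score is low, since short texts like "lol" do not really belong to any topic.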
And last but not least, I gave daddy OpenAI 5 big ones and made some summaries of each month over the past year. These are mostly for my friends to reminisce on, but here's a cute one:
‘February’: Drama, love, and a touch of chaos all wrapped up in one group text! During this month, the group text conversation includes discussions about data protection laws, existential thoughts on the soul and body, a traumatic power outage during a shower, Stormi’s short-lived relationship, plans for a Y2K party and a trip to DC and New York, Adeline’s impressive headshot, and Julia’s adventures in Costa Rica and Cancun. Plus, Jean will be speaking at an event but unfortunately, it’s at 5:30 am.
# import and initialize OpenAI (key holds your OpenAI API key, defined elsewhere)
from openai import OpenAI
client = OpenAI(api_key=key)
# make a column with the name and then the text message
text_messages['combo'] = text_messages['Name'].astype(str) + ':' + text_messages['body']
text_messages
# function to call on api, prompt included
def first_summary_openai(string, month):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a gossip-savvy AI summarizer, skilled in capturing the juiciest details of events with a flair for intrigue."},
            {"role": "user", "content": f"Summarize the following group text message conversation from the beginning of {month} in 1-2 sentences:{string}"}
        ]
    )
    return completion.choices[0].message.content
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
summarys = {}
for i in range(1, 13):
    # subset texts to the current month
    texts = text_messages[text_messages['month_column'] == i]
    # clean texts: join every "name:message" into one long string
    listy = texts['combo'].to_list()
    text_string = ""
    for s in listy:
        text_string += s + '. '
    text_string = text_string.replace('\n', ' ')
    # split text into thirds because of the model's input length limit
    split = len(text_string) // 3
    twoThirds = split + split
    part1 = text_string[:split]
    part2 = text_string[split:twoThirds]
    part3 = text_string[twoThirds:]
    # call openai; middle_summary_openai and end_summary_openai are not shown in this post
    # (presumably they mirror first_summary_openai with prompts for the middle and end of the month)
    part1_sum = first_summary_openai(part1, months[i-1])
    part2_sum = middle_summary_openai(part2, months[i-1])
    part3_sum = end_summary_openai(part3, months[i-1])
    month_total = part1_sum + ' ' + part2_sum + ' ' + part3_sum
    summarys[months[i-1]] = month_total
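Slicing the string by character position can cut a message (or a name) in half right at the boundary. A possible alternative is to split on message boundaries instead, sketched below. `split_into_chunks` is a hypothetical helper, not part of the code above, and the messages are made up.

```python
def split_into_chunks(messages, n_chunks=3):
    """Greedily group whole messages into at most n_chunks pieces of similar character length."""
    total = sum(len(m) for m in messages)
    target = total / n_chunks
    chunks, current, current_len = [], [], 0
    for m in messages:
        # Close the current chunk once it reaches the target size,
        # as long as there is still room to open another chunk
        if current and current_len >= target and len(chunks) < n_chunks - 1:
            chunks.append(current)
            current, current_len = [], 0
        current.append(m)
        current_len += len(m)
    chunks.append(current)
    # Join each group the same way the loop above does, with '. ' between messages
    return [". ".join(chunk) for chunk in chunks]


messages = ["Ann:hi", "Bo:hello there", "Cy:lol", "Ann:brunch?", "Bo:yes", "Cy:omg"]
parts = split_into_chunks(messages)
for part in parts:
    print(part)
```

In the monthly loop, `parts` would stand in for `part1`, `part2` and `part3`. The chunks are only roughly equal in length, so a very long single message can still make one of them larger than the others.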
Hope you enjoyed my 2023 year wrap. Let me know if you have any questions!
Citations:
@misc{perez2021pysentimiento,
title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
year={2021},
eprint={2106.09462},
archivePrefix={arXiv},
primaryClass={cs.CL}
}