Prompt Engineering: Analysing Gmail Inbox with PandasAI and OpenAI

Harmeet Sokhi
7 min read · Jun 30, 2023


Image Credit: Midjourney

With the emergence of GenAI, prompt engineering looks poised to play a crucial role in developers' lives, at least for the foreseeable future. In this blog post, I present my endeavour to analyse my Gmail inbox using PandasAI and OpenAI, highlighting the basics of prompt engineering.

Topics discussed in this blog are:

  • Downloading the Gmail data
  • Using prompts with the PandasAI library and its OpenAI integration to identify
    - Top email senders,
    - Common subject lines, &
    - Trends in email volume over time
  • Using OpenAI's gpt-3.5-turbo to
    - Extract key topics from unread emails,
    - Categorise them into topics and themes, &
    - Identify sentiments from email subjects

Downloading Gmail Data

Start by downloading your Gmail data. I used https://takeout.google.com/ to get the data in mbox format. It didn't take long, and you can also filter the export if you don't want to download your entire inbox.

Analysis Setup

I used the mailbox library to read the data in my Python code:

import mailbox
path = 'PATH OF THE MBOX FILE'
mbox = mailbox.mbox(path)

For my analysis, I extracted details such as the email's subject, message content, sender (from), recipient (to), date, and status. I also formatted the date appropriately, since the variety of formats and timezones required careful handling.

from dateutil.parser import parse  # needed for parse() below

# loop through the mailbox file and extract the required data
mdates, msub, mmsg, mfrom, mto, mstatus = [], [], [], [], [], []

for message in mbox:
    msub.append(message['subject'])
    mfrom.append(message['from'])
    mto.append(message['to'])

    # extract the date and format it
    ts = message['date']
    try:
        mdates.append(str(parse(ts)).split(' ')[0])
    except Exception:
        # some dates carry a trailing "(UTC)"-style suffix; strip it and retry
        try:
            index = ts.find("(")
            ts = ts[:index]
            mdates.append(str(parse(ts)).split(' ')[0])
        except Exception as e2:
            print(message['date'])
            print(e2)
            mdates.append(None)  # keep the lists aligned even when the date can't be parsed

    # extract the message body (concatenate the parts of multipart emails)
    content = ""
    if message.is_multipart():
        for part in message.get_payload():
            content = content + str(part.get_payload(decode=True))
    else:
        content = str(message.get_payload(decode=True))
    mmsg.append(content)

    # extract the read/unread status from the Gmail labels header
    fullstring = message['X-Gmail-Labels']
    substring = "Unread"
    try:
        mstatus.append("Unread" if substring in fullstring else "Opened")
    except Exception:
        mstatus.append("None")

Load into a pandas DataFrame

import pandas as pd
df = pd.DataFrame()
df['Date'] = mdates
df['Subject'] = msub
df['From'] = mfrom
df['To'] = mto
df['Message'] = mmsg
df['Status'] = mstatus

# convert the Date column from string to datetime
df['Date'] = pd.to_datetime(df['Date'])
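Before handing the DataFrame to PandasAI, an optional quick sanity check; this is just a suggestion and assumes the column names created above:

# Optional sanity check of the assembled DataFrame
print(df.shape)                     # number of emails and columns
print(df.dtypes)                    # Date should now be datetime64[ns]
print(df['Date'].isna().sum())      # emails whose date could not be parsed
print(df['Status'].value_counts())  # Unread vs Opened vs None counts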

PandasAI setup

Now for the fun part: prompt engineering. First, create an API key from the OpenAI platform (https://platform.openai.com); you will get $5 of free credit. The code below sets up the LLM along with PandasAI.

from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
open_api_key='Your OpenAI key'
llm = OpenAI(api_token=open_api_key)
pandas_ai = PandasAI(llm)

Prompts

Then I ran a few prompts to identify who is sending me the most emails and what the most common subject lines are, and it worked!!

pandas_ai.run(df, prompt='Which are the top 10 From addresses.List them as a new dataframe')
pandas_ai.run(df, prompt='Which are the most common subject.List them as a new dataframe')
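For reference, here is roughly what these two prompts boil down to in plain pandas (a hedged sketch, not the exact code PandasAI generated for me):

# Rough plain-pandas equivalents of the two prompts above, for comparison
top_senders = df['From'].value_counts().head(10)
common_subjects = df['Subject'].value_counts().head(10)
print(top_senders)
print(common_subjects)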

The next question I had was understanding the month-on-month trend of emails I receive, so I used the prompt below. It didn't give me what I expected, as my prompt was not clear enough:

pandas_ai.run(df, prompt='Create a bar chart with month on month email received, using different colors')

Improved prompt, which gave me the right results:

pandas_ai.run(df, prompt='Create a bar chart with month on month email received for each year, using different colors')

Line graph for the same analysis

pandas_ai.run(df, prompt='Create a line graph with month on month emails received for each year, using different colors')
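Under the hood, PandasAI generates pandas/matplotlib code for these prompts. A rough hand-written equivalent of the month-on-month chart might look like this (a sketch using the Date column created earlier, not the code PandasAI actually produced):

import matplotlib.pyplot as plt

# Count emails per (year, month) and pivot so each year becomes its own series
monthly = (
    df.groupby([df['Date'].dt.year.rename('Year'), df['Date'].dt.month.rename('Month')])
      .size()
      .unstack('Year')
)
monthly.plot(kind='bar', figsize=(12, 5))   # use kind='line' for the line-graph variant
plt.xlabel('Month')
plt.ylabel('Number of emails')
plt.show()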

Number of unread emails

pandas_ai.run(df, prompt='Create a bar graph for year on year  unread  status emails, using different colors')

Comparison of unread vs read emails

pandas_ai.run(df, prompt='Create a stack bar graph for year on year based on status of emails, using different colors')
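Again for comparison, a hand-rolled version of the year-on-year status breakdown could use pd.crosstab (a sketch, not the code PandasAI produced):

import matplotlib.pyplot as plt

# Year-on-year counts of Unread vs Opened emails as a stacked bar chart
status_by_year = pd.crosstab(df['Date'].dt.year, df['Status'])
status_by_year.plot(kind='bar', stacked=True, figsize=(10, 5))
plt.xlabel('Year')
plt.ylabel('Number of emails')
plt.show()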

Analysing Unread Emails

Next, prompts to filter unread emails and create a word cloud, to see if anything extraordinary pops up. The filter prompt worked very well, but as expected the word cloud prompt gave an error:

df_unread = pandas_ai.run(df,prompt="Filter Unread emails from the status")  

pandas_ai.run(df,prompt="Create a wordcloud of subject")
Generated code includes import of wordcloud which is not in whitelist.
'Unfortunately, I was not able to answer your question, because of the following error:\n\nGenerated code includes import of wordcloud which is not in whitelist.\n'

So I used the code below to create the word cloud myself.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Build one long string from the subject lines of the unread emails
# (df_unread is the DataFrame returned by the filter prompt above)
unique_string = " ".join(str(subject) for subject in df_unread['Subject'])

# Create the word cloud from the combined subject text
wc = WordCloud(width=800, height=600, max_words=1000).generate(unique_string)

# Display the word cloud
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Using OpenAI

Extracting key themes from subjects and email addresses

To analyse themes, we will use OpenAI's gpt-3.5-turbo model directly.

import openai

openai.api_key = open_api_key

# Helper around the chat completions endpoint (pre-1.0 openai SDK interface)
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,  # deterministic output
    )
    return response.choices[0].message["content"]

OpenAI's gpt-3.5-turbo has a context length limit of 4,097 tokens, so I filtered the data frame to May 2023, the month in which I had received the most emails.
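The filtering itself isn't shown in the post; here is a minimal sketch of how the include and include_unread_month DataFrames used below might be built, assuming the Date and Status columns created earlier:

# Filter to May 2023 (the busiest month); iterated over below as `include`
include = df[(df['Date'].dt.year == 2023) & (df['Date'].dt.month == 5)]

# For the unread-only analyses further down, apply a similar filter on status
include_unread_month = include[include['Status'] == 'Unread']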

Create the prompt for ChatGPT. It took a few prompt iterations before I got the right response.

# Build a list of {Subject, Email} pairs as the prompt payload
text = ""
for index, row in include.iterrows():
    s = row['Subject']
    f = row['From']
    pair = f"[{{Subject:{s}}},{{Email:{f}}}],"
    text = text + pair

prompt = f"""Determine themes in one word for each pair of \
subject and email address organised as a list of key value pairs delimited by triple backticks. \
Format your response as a python list of items separated by commas.\
```{text}```"""
response_theme = get_completion(prompt)

The response. Not bad!!

['Nursing', 'Infrastructure', 'Data Science', 'Self-Improvement', 'Technology', 'Events', 'Health', 'Food Delivery', 'Networking', 'Data Engineering', 'Car Parts', 'Payment Confirmation', 'Sales', 'Job Search', 'Habit Building', 'Birthday', 'Education',.....]

Which emails should I read!!

Due to the size limitation, I removed the email address from each pair so that I could send a longer prompt.

# Build a list of subjects only (no sender) to fit more emails into the prompt
text = ""
for index, row in include_unread_month.iterrows():
    s = row['Subject']
    pair = f"[{{Subject:{s}}}],"
    text = text + pair

prompt_priority = f"""Determine priority for each \
subject organised as a list of key value pairs delimited by triple backticks.
Give your answer as "Urgent," or "Action Required" or "Meh" \
Format your response as a python list of priorities separated by commas.\
```{text}```"""

response_priority = get_completion(prompt_priority)

The response:

["Meh", "Urgent", "Meh", "Urgent", "Meh", "Meh", "Urgent", "Meh", "Action Required", "Urgent", "Action Required", "Meh", "Action Required", "Meh", "Action Required", "Meh", "Urgent", "Urgent", "Urgent", "Meh", "Meh", "Meh", "Urgent", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh", "Meh",...]

Categorise the emails with themes

The other thing I tried was categorising the emails, so that I could combine the categories with the priorities and filter on both!

prompt_category = f"""Determine theme in one word only for each \
subject organised as a list of key value pairs delimited by triple backticks. \
Give your answer as "Marketing" or "Work" or "Personal" or "Other"\
Format your response as a python list of items separated by commas.\
```{text}```"""
response_category = get_completion(prompt_category)

Here's the response:

['Other', 'Other', 'Work', 'Marketing', 'Work', 'Marketing', 'Other', 'Work', 'Other', 'Work', 'Work', 'Other', 'Other', 'Work', 'Marketing', 'Other', 'Marketing', 'Other', 'Other', 'Other', 'Work', 'Other', 'Work', 'Other', 'Other', 'Work', 'Other', 'Other', 'Other', 'Marketing', 'Other', 'Other', 'Marketing', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',...]

Understanding the sentiments

We can prioritise further by getting sentiments. Note: I have only used subject lines due to the prompt size limitation, but for a more thorough analysis we could use the email content as well; however, please consider the privacy implications of any LLM app you use.

prompt_sentiment = f"""
What is the sentiment of each subject organised as a list of key value pairs, \
which is delimited with triple backticks?

Give your answer as a single word, either "positive" \
or "negative" or "neutral".

Format your response as a python list of sentiments only, without the subjects, separated by commas.\
subject list: '''{text}'''
"""

response_sentiment = get_completion(prompt_sentiment)
['positive', 'positive', 'neutral', 'positive', 'positive', 'positive', 'negative', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'neutral', 'positive', 'positive', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive',...]

Finally, I merged these outcomes and prioritised the unread emails that were work or personal, urgent, and of negative sentiment.
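The merge isn't shown in the post either; below is a minimal sketch of how it might look, assuming each response parses cleanly into a Python list with one entry per row of include_unread_month (real model output may need some cleanup first):

import ast

# Parse the model's string responses into Python lists (assumes well-formed output)
priorities = ast.literal_eval(response_priority)
categories = ast.literal_eval(response_category)
sentiments = ast.literal_eval(response_sentiment)

# Attach the results to the unread-email DataFrame and filter
prioritised = include_unread_month.copy()
prioritised['Priority'] = priorities
prioritised['Category'] = categories
prioritised['Sentiment'] = sentiments

to_read = prioritised[
    prioritised['Category'].isin(['Work', 'Personal'])
    & (prioritised['Priority'] == 'Urgent')
    & (prioritised['Sentiment'] == 'negative')
]
print(to_read[['Date', 'Subject', 'Priority', 'Category', 'Sentiment']])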

Conclusion:

This blog represents my initial foray into prompt engineering and a foundational exploration of its capabilities. There is plenty of scope to improve and refine this further by utilising other LLM features and fine-tuning. There is no doubt that it possesses transformative power, empowering individuals to extract valuable insights from intricate datasets. A key takeaway from my endeavour was the importance of formulating clear prompts to achieve accurate and desired outcomes.
