Twitter bird in Ancient Egyptian Art style with a background presenting mysterious language
Pyramids of the Modern Era? — Image Generated by Bing Image Creator (post-processed)

Cracking the ChatGPT Code: A Deep Dive into 500,000 Tweets using Advanced NLP Techniques

A Real-World Data Science Project Walkthrough: Discovering Trends, Insights, and Sentiments Surrounding the AI Conversation Revolution

Khalid Ansari
27 min read · Apr 19, 2023


You can access the Colab code used for this project [here].

I have also made the dataset available on Kaggle [here].

This allows you to follow along and explore the code and dataset as you read through the article. Now, let’s get started!

A bit about this project walkthrough

Amidst the countless breakthroughs in the realm of AI, ChatGPT consistently stands out for its wide-ranging utility across various aspects of society, from creating personalized content such as emails, poems, and code snippets to enhancing business workflows and educational tools.

In pursuit of a novel, challenging project that would both deepen my understanding of the data science pipeline and allow me to collect data from scratch, I embarked on a journey to collect and analyze 500,000 tweets about ChatGPT, the hottest topic of 2023.

With Twitter buzzing with discussions about Conversational Large Language Models like ChatGPT, BARD, and Alpaca, I knew this platform was the perfect source of massive, temporal text data. The collected data spans January 4th, 2023, to March 29th, 2023 (almost three months!), providing ample time to observe daily, weekly, and quarterly trends.

The data was collected using a Python module called snscrape, which enables scraping massive amounts of data from social media websites with a few lines of code. Snscrape bypasses the data-volume and date limitations of other Twitter collection tools such as the Twitter API.
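As a rough sketch of what that collection step can look like (the keyword and date parameters below are illustrative, not the exact collection script; the scraping loop itself is commented out because it needs network access and the snscrape package):

```python
# Hypothetical parameters; the real collection script may differ.
keyword = "ChatGPT"
since, until = "2023-01-04", "2023-03-29"
query = f"{keyword} since:{since} until:{until}"

# The scraping loop itself (requires the snscrape package and network access):
# import snscrape.modules.twitter as sntwitter
# tweets = []
# for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
#     if i >= 500_000:
#         break
#     tweets.append([tweet.date, tweet.id, tweet.rawContent,
#                    tweet.user.username, tweet.likeCount, tweet.retweetCount])
print(query)
```

The query string uses Twitter's standard `since:`/`until:` search operators, which snscrape passes through to Twitter search.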

To learn how I collected this dataset, check out my Medium guide, “Effortlessly Scraping Massive Twitter Data with Snscrape”.

In this comprehensive analysis, we’ll dive into:

  1. Data Understanding
  2. Data Preprocessing
  3. Exploratory Data Analysis (EDA): likes vs. retweets, timeline and daily/weekly/monthly trends, top hashtags and usernames, most-liked tweets, and most influential users
  4. Impacts on tech stocks from key AI developments
  5. Text Analysis: keyphrase extraction (bigrams and trigrams) and wordclouds
  6. Topic Modeling with LDA
  7. Sentiment Analysis: overall, on topics, and on accounts
  8. Limitations and future work for the project

This analysis will offer valuable insights into public opinion, ChatGPT’s benefits, and potential applications, shedding light on its impact and role in shaping the future of AI-powered conversational technologies. So, buckle up, and let’s dive in!

This project (and many code snippets) was inspired by an article from @Clément Delteil: Unsupervised Sentiment Analysis With Real-World Data: 500,000 Tweets on Elon Musk

A quick note before diving in: I’ll mention key steps/snippets and outcomes for the sake of simplicity and better conceptual understanding. You can find a detailed Google Colab notebook with the entire code at the beginning of this article.

1. Understanding the Data

It’s important to understand the data before doing any other Data Science tasks. This helps us better comprehend what Data Science tasks are possible with this data and the insights we’ll glean after finishing the project. Starting a project with this step helps us adopt a value-driven approach to successfully communicate the findings to appropriate stakeholders.

“Data Understanding lays a solid foundation for the direction of a Data Science project.” ~Author

Let’s check out our dataframe!

import pandas as pd

df = pd.read_csv('/content/my_data_500k.csv')
df.head(10)
Top 10 elements of our dataset — Image by author

They’re really talking about it! The diversity of expressions!

Our dataset has the following columns:

date: Date of the tweet post

id: A unique identifier of the tweet

content: Actual tweet content

username: Username of the Twitter user

like_count & retweet_count: Like and retweet counts for that tweet

Let’s get some basic statistics about the dataset which will help us understand its various features at a glance.

# What's the shape of our Dataframe?
print("Length: ", len(df))
print("Shape: ", df.shape)

# Getting descriptive statistics on likes and retweets:
print(df[['like_count', 'retweet_count']].describe())

# Checking the number of unique values in each column:
for col in df.columns:
    print(col, ":", df[col].nunique())

# Average tweet length in words
count = 0
for i in df['content']:
    count += len(str(i).split())
avg_tweet_length = count / len(df['content'])
print("Average Tweet length is:", avg_tweet_length)
Dimensions of the Dataframe
Descriptive statistics on likes and retweets — Image by Author
Checking unique values in each column — Image by Author
Number of words in an average tweet — Image by Author

Now that we have a 50,000-foot view of our data, let’s see how we can make this dataset more usable by refining it in the Data Processing step!

2. Data Preprocessing

This is perhaps the most important step of the Data Science pipeline and is often overlooked by beginners. Data Preprocessing improves the dataset’s quality, enhancing the accuracy and reliability of the analysis. By cleaning and standardizing data, preprocessing eliminates inconsistencies or errors, leading to more meaningful insights.

2.1 Removing missing values, date extraction

We’ll drop missing values, extract only the date component of the ‘date’ column, check the range of dates, and check unique values in each column again.

# check for missing values
print(df.isnull().sum())

# Remove missing values
df = df.dropna()
print("Length: ",len(df))

# Convert the 'date' column to a datetime object
df['date'] = pd.to_datetime(df['date'])
# Extract the date component and assign it to the 'date' column
df['date'] = df['date'].dt.date
# Again convert this extracted date component to a datetime object
df['date'] = pd.to_datetime(df['date'])

# Checking range of dates
print("Start Date: " ,df['date'].min())
print("End Date: " ,df['date'].max())

# Checking the number of unique values in each column
for col in df.columns:
    print(col, ":", df[col].nunique())
Missing values — Image by Author
New dataframe after dropping missing values — Image by Author
Extracted Datetime object — Image by Author
Date range of date column — Image by Author
Unique values in each column — Image by Author

Total Tweets: 499,974

Unique Tweets: 493,705

Our dataframe is now a little more usable, but there’s more preprocessing to be done on the Twitter content data.

2.2 Removing links, hashtags, mentions, unwanted characters, and lower casing

We’ll preprocess the ‘content’ column to remove <links>, #hashtags, @mentions, and unwanted characters, as they serve no purpose for our text analysis, and make everything lowercase for uniformity.

import re

# One preprocessing function to rule them all (almost!)
def pre_process(text):
    # Remove links
    text = re.sub(r'http\S+', '', text)

    # Decode common HTML entities
    text = re.sub('&amp;?', 'and', text)
    text = re.sub('&lt;?', '<', text)
    text = re.sub('&gt;?', '>', text)

    # Remove new line characters
    text = re.sub(r'[\r\n]+', ' ', text)

    # Remove @mentions and #hashtags
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)

    # [For another use case where the person/topic is not mentioned in the text
    # but only as a #hashtag or a @mention, keep the characters trailing @ and #:]
    # text = re.sub(r'@(\w+)', r'\1', text)
    # text = re.sub(r'#(\w+)', r'\1', text)

    # Collapse multiple whitespace characters
    text = re.sub(r'\s+', ' ', text)

    # Convert to lowercase
    text = text.lower()
    return text

We’ll apply this function to our tweet ‘content’ column.

df['processed_content'] = df['content'].apply(pre_process)
New ‘processed content’ column — Image by Author

Let’s see how many unique tweets there are in the ‘processed_content’ column.

# Checking the number of unique values in each column
for col in df.columns:
    print(col, ":", df[col].nunique())
Unique preprocessed content— Image by Author

That’s about 35,000 fewer tweets. This might be due to repeated or spam tweets containing different links, hashtags, mentions, indentations, or structures.
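To see why deduplicating on the processed text collapses these near-duplicates, here is a minimal, self-contained sketch (simplified regexes, not the full pre_process function used above):

```python
import re

def strip_noise(text):
    """Simplified stand-in for pre_process: drop links, mentions,
    and hashtags; squeeze whitespace; lowercase."""
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[@#]\w+", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

# Two "different" raw tweets that differ only in link and hashtag:
a = "Try ChatGPT now! https://t.co/abc123 #AI"
b = "Try ChatGPT now! https://t.co/XYZ999 #ChatGPT"
print(strip_noise(a) == strip_noise(b))  # True: they collapse to one row
```

Both tweets reduce to the same processed string, so they count as one unique ‘processed_content’ value even though their raw ‘content’ values differ.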

2.3 Removing redundancy and spam

So here’s a thing:

Original Content count: 499,974

Unique Content count: 493,705

Unique Pre_Processed count: 458,210

Focusing on the DataFrame containing unique preprocessed content (458,210 tweets) enables a more accurate analysis, as it accounts for 92% of the original dataset while minimizing the impact of noise and repeated content. This approach allows us to concentrate on extracting distinct insights and ensures a cleaner, more reliable dataset for analysis.

Important: We’ll only keep the content rows with the highest engagement and drop the other copies. To do this we have to:

  1. Sort the dataframe based on the engagement metric (highest likes)
  2. Keep the first copy and drop duplicates based on ‘processed_content’
  3. Sort the dataframe again based on the index

# Sort dataframe by like_count, highest to lowest
df_sorted = df.sort_values(by='like_count', ascending=False)
# Only keep the first copy of each tweet (the one with the highest likes)
df_cleaned = df_sorted.drop_duplicates(subset='processed_content', keep='first')
# Sort dataframe by index
df_final = df_cleaned.sort_index()
Final dataframe size — Image by Author

This reduces the length of our dataframe to 458,210 from nearly 500k. A lot of redundancy and repeated content have been removed. Remember, there are always pros and cons to these steps.

Removing duplicates might introduce some bias into our understanding of the dataset, as we lose information on the popularity or frequency of some topics. But as you can see below, most of the repeated content (keyphrase variants of ‘chat gpt’ and server join links) would only thwart our focus on unique insights, so it’s better to drop the duplicates in this case.

Understanding the repetition in preprocessed content — Image by Author

3. Exploratory Data Analysis:

3.1. Relationship between Likes and Retweets

A high number of likes may not translate into a high number of retweets, and vice versa, indicating that the two metrics represent different types of user engagement on Twitter

The Problem: Our dataset is too large to draw a scatterplot of like vs. retweet count meaningfully, and the distribution of values is extremely non-uniform.

The Solution: Most of the like and retweet counts lie below 2000 and 500 respectively, so we’ll filter our dataframe to only include the data points below the thresholds for like count (2000) and retweet count (500).

We’ll draw a scatter plot of likes against retweet count along with the Ordinary Least Square Regression line.

# Define the maximum thresholds for like count and retweet count
max_like_count = 2000
max_retweet_count = 500

# Filter the DataFrame to include only rows where like_count and retweet_count are below the thresholds
df_filtered = df[(df['like_count'] <= max_like_count) & (df['retweet_count'] <= max_retweet_count)]


# Scatter plot of likes vs. retweets
fig = px.scatter(df_filtered, x='like_count', y='retweet_count', trendline="ols", trendline_color_override="red")
fig.update_layout(title='Likes vs. Retweets', xaxis_title='Likes', yaxis_title='Retweets', height=800, width=1200)
fig.show()
Likes vs Retweets Scatterplot with Regression line —Image by Author

We can observe from the above scatterplot that a high number of likes may not necessarily translate into a high number of retweets, and vice versa. This could indicate that these two metrics represent different types of user engagement on Twitter. Users may engage differently with content depending on factors such as information type, audience preferences, or tweet context. Let’s try to understand each of these.

  • Information Type: Informative content might be more likely to be retweeted to spread awareness while entertaining content could receive more likes due to its enjoyable nature.
  • Audience Preferences: Some users might like tweets if they align with their values while other users might retweet the same content to share with their followers.
  • Tweet Context: Tweets that relate to a timely and relevant topic might accumulate more retweets to enrich the conversation while a personal or emotional context could receive more likes as a form of support or empathy.
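One quick way to quantify this divergence is a rank correlation between the two counts. The numbers below are made up purely for illustration (a heavily liked tweet with few retweets, and vice versa); on the real data you would call the same `.corr` on the dataframe's columns:

```python
import pandas as pd

# Illustrative toy counts, NOT from the dataset: likes and retweets
# deliberately disagree to mimic divergent engagement.
toy = pd.DataFrame({
    "like_count":    [10, 250, 3, 40, 1200, 7],
    "retweet_count": [30, 2, 5, 0, 1, 90],
})
# Spearman is a reasonable choice for heavy-tailed count data.
rho = toy["like_count"].corr(toy["retweet_count"], method="spearman")
print(rho)
```

A rho far from 1 supports the observation that likes and retweets capture different kinds of engagement.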

3.2 Timeline Analysis

ChatGPT tweet volume peaks on news days, with lower activity on weekends and consistent weekday engagement

Note: the data is from January 4th to March 29th, 2023 (~3 months)

Visualizing Tweet volume against the dates will give us richer insights into User Engagement(Tweet volume) on ChatGPT based on major events surrounding the technology.

To do this we’ll draw barplots to glean daily, weekly, and quarterly trends based on the volume of Tweets.

3.2.1 Tweets per day

sns.set_style('darkgrid')

# Number of tweets per day
tweets_by_day = df.groupby(pd.Grouper(key='date', freq='D')).size().reset_index()
tweets_by_day.columns = ['date', 'count']
fig2 = px.bar(tweets_by_day, x='date', y='count', title='Number of Tweets per Day', color = 'count', height=800, width=1500)
fig2.update_xaxes(tickangle=45, tickformat='%Y-%m-%d')
fig2.show()
Tweets/Day — Graphic by author

Understanding the peaks:

Source(Wikipedia):

Timeline of ChatGPT : https://timelines.issarice.com/wiki/Timeline_of_ChatGPT

  1. Feb 7, 2023 (11847 tweets):
  • Google presents its own AI chatbot called Bard, a competitor to ChatGPT.

2. Feb 8, 2023 (9242 tweets): 3 interesting events

  • (Competition) Chinese company Alibaba Group announces that it is developing a rival to OpenAI’s ChatGPT AI chatbot.
  • (Study) A study explores the potential of ChatGPT, a popular AI chatbot, in generating academic essays that can evade plagiarism detection tools.
  • A paper proposes a framework for evaluating interactive language learning models (LLMs) such as ChatGPT using publicly available data sets. The authors evaluate ChatGPT’s performance on 23 data sets covering eight different NLP tasks and find that ChatGPT outperforms other LLMs on most tasks, but has a lower accuracy in reasoning and suffers from hallucination problems.

3. March 15, 2023 (10929 tweets):

  • On March 14th, OpenAI announces GPT-4, the latest and most capable AI language model in its line of language models

4. March 17, 2023 (9873 tweets):

  • Sam Altman (CEO of OpenAI) after the release of GPT-4 appears in an interview with ABC News and says that AI technology will reshape society as we know it, but that it comes with real dangers.

5. March 24, 2023 (8638 tweets):

  • OpenAI announces ChatGPT implementation support for plugins, which are tools designed for language models to access up-to-date information, run computations, or use third-party services with safety as a core principle.

Understanding the Valleys:

In general, Tweet volume on ChatGPT is lowest on Sundays, followed closely by Saturdays. The volume of Tweets jumps on Mondays and remains almost consistent throughout the weekdays.
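The weekday pattern described above can be checked by grouping on the day name. A sketch with toy dates (the column name 'date' matches our dataframe; replace `toy` with `df` on the real data):

```python
import pandas as pd

# Toy stand-in for the tweet dataframe; dates chosen to cover a week.
toy = pd.DataFrame({"date": pd.to_datetime([
    "2023-01-06", "2023-01-07", "2023-01-08",
    "2023-01-09", "2023-01-09", "2023-01-10"])})

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
# Count tweets per weekday, keeping calendar order (missing days become 0).
by_weekday = (toy.groupby(toy["date"].dt.day_name())
                 .size()
                 .reindex(order, fill_value=0))
print(by_weekday)
```

Plotting this series as a bar chart gives the weekday profile directly, confirming (or refuting) the Sunday/Saturday dip at a glance.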

3.2.2 Tweets per Week:

tweets_by_week = df.groupby(pd.Grouper(key='date', freq='W-MON',label='left')).size().reset_index()
tweets_by_week.columns = ['week', 'count']
tweets_by_week['week_start'] = tweets_by_week['week'].dt.strftime('%Y-%m-%d')
tweets_by_week['week_end'] = (tweets_by_week['week'] + pd.Timedelta(days=6)).dt.strftime('%Y-%m-%d')
fig = px.bar(tweets_by_week, x='week_start', y='count', title='Number of Tweets per Week', height=400, width=800, color = 'count')
fig.update_xaxes(tickangle=45)
fig.show()
Tweets/Week — Graphic by author

We can observe from the weekly tweet volume visualization that the first 3 weeks of February had the highest volume, given the Feb 7 & 8 events (Bard, Alibaba, the LLM-evaluation paper).

The 2nd and 3rd weeks of March saw the highest combined tweet volume of any two-week stretch in our timeline, following the release of GPT-4 and, later, its support for ChatGPT plugins.

3.2.3 Tweets per Month:

ChatGPT tweet volume has consistently risen, suggesting sustained popularity and potential growth as OpenAI expands integration through APIs and plugins

tweets_by_month = df.groupby(pd.Grouper(key='date', freq='M',label='left')).size().reset_index(name='count')
fig = px.bar(tweets_by_month, x='date', y='count', title='Number of Tweets by Month', height=400, width=400, color = 'count')
fig.update_xaxes(title_text='Month')
fig.update_yaxes(title_text='Count')
fig.show()
Tweets/Month — Graphic by author

There has been a steady increase in the number of tweets about ChatGPT since January, and perhaps even since its first release on November 30, 2022.

January: ~130K Tweets

February: ~160K Tweets

March: ~172K Tweets

Although about 4 months have passed since ChatGPT’s release, its popularity has aged really well, and this trend is likely to continue as OpenAI integrates it into mainstream domains through the API & plugins.
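A quick back-of-the-envelope on those approximate monthly counts makes the growth rate explicit:

```python
# Month-over-month growth from the approximate counts above.
monthly = {"January": 130_000, "February": 160_000, "March": 172_000}
months = list(monthly)
growth_pct = {m2: (monthly[m2] - monthly[m1]) / monthly[m1] * 100
              for m1, m2 in zip(months, months[1:])}
print(growth_pct)  # roughly +23% Jan→Feb, +7.5% Feb→Mar
```

Growth is positive in both months, though decelerating, consistent with sustained but maturing interest.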

3.3 Top #Hashtags, @Mentioned Users, and Active Users

Top hashtags and mentions highlight ChatGPT’s association with AI, OpenAI, and major tech giants, as well as key organizations and personalities in the field.

This project would be incomplete without understanding the trendy (most used) Hashtags and Most mentioned Organizations & Personalities.

Let’s dive right in!

3.3.1 Top 20 Hashtags

#ChatGPT was too obvious, so I’ve adjusted the plotting index to remove it. This will allow us to better understand the rest of the hashtags.

hashtags = df['content'].str.findall(r'#\w+')
hashtags_count = hashtags.explode().value_counts()
fig_hashtags = px.bar(x=hashtags_count.index[1:21], y=hashtags_count[1:21], title='Top 20 Hashtags',color = hashtags_count[1:21]) #color_discrete_sequence=['#00CC96']
fig_hashtags.update_xaxes(tickangle=45)
fig_hashtags.update_layout(width=1200, height=800)
fig_hashtags.show()
Most mentioned Hashtags — Graphic by author

Top hashtag charts are claimed by variants of Artificial Intelligence, OpenAI, and tech giants such as Google and Microsoft.

3.3.2 Top 20 Mentions

This will shed some light on key Organizations and Personalities the users relate with ChatGPT.

mentions = df['content'].str.findall(r'@\w+')
mentions_count = mentions.explode().value_counts()
fig_mentions = px.bar(x=mentions_count.index[:20], y=mentions_count[:20], title='Top 20 Mentions', color=mentions_count[:20])
fig_mentions.update_xaxes(tickangle=45)
fig_mentions.update_layout(width=1200, height=800)
fig_mentions.show()
Most mentioned accounts — Graphic by author

OpenAI and ChatGPT are the obvious ones, followed by Elon Musk (co-founder of OpenAI), Microsoft (which has heavily invested in OpenAI), YouTube, Google (competitor), @sama (Sam Altman, CEO of OpenAI), etc.

3.3.3 Top 20 Users with the Highest Tweet Count

High tweet volume doesn’t necessarily correlate with influence, as seen in the most influential user analysis

This will uncover users with the highest volume of Tweets on the topic. It may also help us identify spam accounts that tend to have high Tweet volumes.

tweets_by_user = df.groupby('username').size().sort_values(ascending=False)
fig_users = px.bar(y=tweets_by_user.index[:20], x=tweets_by_user[:20], title='Top 20 Active Users', orientation='h', color=tweets_by_user[:20])
fig_users.update_layout(width=1200, height=800)
fig_users.show()
Users with most Tweets — Graphic by author

It’s interesting that none of these users appear in the most influential category in the next part, meaning volume doesn’t always positively correlate with likes/retweets.

3.4 Most Liked Tweets & Most Influential Users

Companies can use analysis of top tweets and influential users to engage with key partners and refine messaging strategies. Public interest is driven by announcements, humor, and information, while influential users succeed with high-impact tweets or consistent, valuable content

By identifying influential users and most liked tweets, companies can engage with relevant key partners, and tailor their messaging to revamp their social media and product strategies.

I had to do some content moderation, as the most-liked tweet was (not so family-friendly) humor 😄, which also gives us insight 🧐 into how users can leverage humor on new topics to attract attention.

3.4.1 Top 11 Most Liked Tweets

df.sort_values(by='like_count', ascending=False)[['date','like_count','username','content']][1:12]
Most liked tweets — Image by Author

Most liked tweets represent the topics and opinions that the public is most interested in. Some interesting mentions from the top 11 most-liked tweets include:

  • @sama aka Sam Altman (CEO of OpenAI) announcing the release of GPT-4 on March 14th, 2023
  • A misinformed tweet by @AlexHarmozi (multimillionaire and YouTuber) contained an image of a dot against a massive circle depicting the training data size of GPT-3 vs. GPT-4. This helped perpetuate a massive hype campaign in favor of OpenAI.
  • @lexfridman, a reputed personality in the field of machine learning and podcasting, announced his podcast episode with Sam Altman (CEO of OpenAI) on March 16, 2023; it was released on March 25, 2023
  • Other tweets about ChatGPT include people expressing a sense of humor, spreading informative tools, and the impacts users see in real-time.

Now, let’s let the visualizations do some talking….

3.4.2 Top 20 Single Tweets by Users with Most Likes:

Let’s see which users had the most impact with just one of their tweets in the ChatGPT conversation.

# Assumed construction of `most_liked_tweets` (top 20 tweets by likes);
# the notebook defines this before plotting
most_liked_tweets = df.sort_values(by='like_count', ascending=False).head(20)

fig_most_liked = px.bar(most_liked_tweets, x='username', y='like_count', text='like_count',
                        title='Top 20 Tweets by users with Most Likes',
                        color_discrete_sequence=['#00CC96'],
                        width=1200, height=800)
fig_most_liked.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig_most_liked.show()
Single Tweets with Most likes— Graphic by author

These are the winners of our single shot Tweet like competition!

Let’s look at the top 4:

  1. MoistCr1TiKaL: American YouTuber, Twitch streamer and musician
  2. johnvianny: Affiliate marketer
  3. rgay: A writer for social change
  4. aaronsiim: talks about web3 • generative ai • biotech.

3.4.3 Top 20 Users by Total Tweet Likes:

This will help us understand if multiple tweets can also be effective in gaining a foothold in the online discourse.

user_likes = df.groupby('username')['like_count'].sum().reset_index()
user_likes_sorted = user_likes.sort_values(by='like_count', ascending=False).head(20)

fig_likes = px.bar(user_likes_sorted, x='username', y='like_count', text='like_count', title='Top 20 Influential Users by Total Likes',
color_discrete_sequence=['#00CC96'],width=1200, height=800)
fig_likes.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig_likes.show()
Users by Total Tweet Likes— Graphic by author

And these are the winners of our multi-shot Tweet like competition!

  1. DataChaz: Developer at Streamlit ( Talks about AI )

It’s very interesting that DataChaz took first place via the accumulation of multiple Tweet likes. Other top users (MoistCr1TiKaL, johnvianny, rgay, aaronsiim) remained the same from the previous section.

DataChaz taking first place through accumulating multiple Tweet likes highlights the importance of consistent engagement and valuable content in social media growth. While other top users gained influence through single high-impact tweets, DataChaz achieved a much higher influence, demonstrating the importance of sharing consistent and insightful content.

3.4.4 Top 20 Users by Total Tweet Retweets:

user_tweets = df.groupby('username')['retweet_count'].sum().reset_index()
user_tweets_sorted = user_tweets.sort_values(by='retweet_count', ascending=False).head(20)

fig_retweets = px.bar(user_tweets_sorted, x='username', y='retweet_count', text='retweet_count', title='Top 20 Influential Users by Total Retweets',
color_discrete_sequence=['#00CC96'],width=1200, height=800)
fig_retweets.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig_retweets.show()
Users by Total Tweet Retweets— Graphic by author

Idea for further analysis: understand how likes and retweets roughly contribute to reach and engagement, normalize both, aggregate them for each user, and then find the ultimate influential figure in the ChatGPT game!
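A hedged sketch of that idea, with made-up per-user totals (min-max normalize each metric, then average them into one influence score; the usernames are real top users from above but the numbers are illustrative):

```python
import pandas as pd

# Illustrative per-user totals, NOT the dataset's actual figures.
totals = pd.DataFrame({
    "username": ["DataChaz", "MoistCr1TiKaL", "johnvianny"],
    "like_count": [900_000, 600_000, 300_000],
    "retweet_count": [40_000, 90_000, 10_000],
})

# Min-max normalize each engagement metric to [0, 1].
for col in ["like_count", "retweet_count"]:
    lo, hi = totals[col].min(), totals[col].max()
    totals[col + "_norm"] = (totals[col] - lo) / (hi - lo)

# Equal-weight average of the two normalized metrics.
totals["influence"] = totals[["like_count_norm", "retweet_count_norm"]].mean(axis=1)
top = totals.sort_values("influence", ascending=False).iloc[0]["username"]
print(top)
```

Note how the ranking can flip once retweets are weighted in: a user who leads on raw likes need not lead on the combined score. The equal weighting is itself an assumption worth revisiting.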

4. Impacts on tech stocks from key AI developments

Major AI events, such as OpenAI’s GPT-4 & plugin releases, impacted tech stocks, with Microsoft and Nvidia benefiting positively. Google, a competitor with its BARD project, saw negative stock effects, while Meta experienced fluctuations. Amazon and IBM showed no strong correlation with these events.

OpenAI, at the moment, is not a publicly traded company. Simply put, there are no public stocks, so we’ll have to rely on extrapolation.

Here, we’ll see how major AI events may correlate with the stock market movements of relevant tech companies (Microsoft, Google, Amazon, Meta, Nvidia, IBM). We’ll examine this by plotting stock prices against tweet volumes. Here are some key mentions:

MICROSOFT (a major investor in OpenAI)

GOOGLE (BARD — Major Competitor to ChatGPT by OpenAI)

META (Competitor)

NVIDIA (its GPUs, used by Microsoft, helped OpenAI train ChatGPT)

import yfinance as yf
import plotly.graph_objects as go

# Fetch stock data for MSFT, Google, and other competitors
start_date = '2023-01-04'
end_date = '2023-03-29'
ticker_symbols = ['MSFT', 'GOOGL', 'AMZN', 'META', 'NVDA', 'IBM']
stocks_df = yf.download(ticker_symbols, start=start_date, end=end_date)['Adj Close']

# Number of tweets per day
tweets_by_day = df.groupby(pd.Grouper(key='date', freq='D')).size().reset_index()
tweets_by_day.columns = ['date', 'count']

# Create a combined plot
fig = go.Figure()

# Add tweet count trace
fig.add_trace(go.Bar(x=tweets_by_day['date'], y=tweets_by_day['count'], name='Tweet Count', opacity=0.5))

# Add stock price traces on a secondary y-axis
for symbol in ticker_symbols:
    fig.add_trace(go.Scatter(x=stocks_df.index, y=stocks_df[symbol], name=symbol, yaxis='y2'))

# Customize the layout
fig.update_layout(
    title='Stock Prices and Tweet Counts',
    xaxis=dict(title='Date', tickangle=45, tickformat='%Y-%m-%d'),
    yaxis=dict(title='Tweet Count', side='left'),
    yaxis2=dict(title='Stock Price', side='right', overlaying='y', position=0.95),
    width=1200, height=800
)
fig.show()
Stock Prices over Tweet Volumes— Graphic by author

We got two major dates. Stock price observations (⤴ up, ↘ down, ➖ no clear effect):

  1. Feb 7, 2023 (Google BARD announced)
  • Microsoft & Nvidia ⤴
  • Meta (Facebook) ↘
  • Google ↘ (BARD complications)
  • Amazon, IBM ➖
  2. Mar 14, 2023 (GPT-4 released)
  • Microsoft & Nvidia ⤴
  • Meta (Facebook) ⤴↘➖
  • Google, Amazon, IBM ➖

Microsoft, a major OpenAI investor, and Nvidia, the supplier of GPUs for ChatGPT/GPT-3/4 training, both benefited from OpenAI’s advancements, which positively impacted their stocks. Google, an OpenAI competitor with its BARD project, showed a negative stock impact due to complications around its release. Meta, another AI competitor, saw negative stock fluctuations on both major dates. Amazon and IBM showed no strong correlation with the above major events.

Please note that correlation doesn’t always imply causation. However, understanding and possibly predicting rough tweet volumes and user engagement based on key dates and stock prices can guide organizations in strategic planning, resource allocation, and marketing efforts.
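As a sanity check before reading too much into the overlay chart, one could correlate daily tweet counts with daily stock returns rather than raw prices (returns remove the trend). The figures below are illustrative only, not values from the dataset:

```python
import pandas as pd

# Illustrative daily tweet counts and closing prices (made-up numbers).
frame = pd.DataFrame({
    "tweets": [7000, 11847, 9242, 8100, 7600],
    "price":  [258.0, 267.0, 266.0, 263.0, 264.0],
})
# Day-over-day percentage returns; correlate against tweet volume.
returns = frame["price"].pct_change()
corr = frame["tweets"].corr(returns)
print(corr)
```

Even a strong value here would only flag co-movement around event days, not a causal link.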

5. Text Analysis

5.1 Top Bigrams and Trigrams

Top bigrams and trigrams reveal users discussing AI, language models, generative AI, and search engines. ChatGPT’s potential impact on search engines and financial markets, as well as misinformation surrounding GPT-4’s capabilities, are notable in the findings

Here we’ll see some common phrases and expressions associated with ChatGPT by finding the most common Bigrams and Trigrams and finally visualizing the top results. We’ll skip the first common bigram as it’s obviously ‘chat gpt’.

from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_ngrams(corpus, n=None, ngram=2):
    vec = CountVectorizer(ngram_range=(ngram, ngram), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Get top 20 bigrams
common_bigrams = get_top_n_ngrams(df['processed_content'], 20, ngram=2)

# Get top 20 trigrams
common_trigrams = get_top_n_ngrams(df['processed_content'], 20, ngram=3)

# make them into dataframes
df_bigrams = pd.DataFrame(common_bigrams, columns=['NgramText', 'count'])
df_trigrams = pd.DataFrame(common_trigrams, columns=['NgramText', 'count'])

# Plot bigrams
fig_bigrams = px.bar(df_bigrams[1:], x='NgramText', y='count', title='Bigram Counts', color = 'count',width=1200, height=800)
fig_bigrams.show()

# Plot trigrams
fig_trigrams = px.bar(df_trigrams, x='NgramText', y='count', title='Trigram Counts', color = 'count',width=1200, height=800)
fig_trigrams.show()
Top Bigrams— Graphic by author

Bigrams are strongly centered around the applications of ChatGPT, such as ‘write’, ‘ask’, ‘tools’, ‘search engine’, etc.

Top Trigrams— Graphic by author

The trigrams tell a similar story, indicating the prevalence of its applications amongst Twitter users.

The Problem: The bigrams and trigrams above don’t give any real insight as words like ‘chat’, ‘gpt’, ‘chatgpt’ are causing a lot of redundancy.

The Solution: We’ll modify our ‘get_top_n_ngrams’ function to exclude the keywords ‘chat’, ‘gpt’, ‘chatgpt’. The modified function takes an additional argument to exclude a provided list of keywords from bigrams and trigrams.

def get_top_n_ngrams(corpus, n=None, ngram=2, exclude_keywords=None):
    if exclude_keywords is None:
        exclude_keywords = []

    vec = CountVectorizer(ngram_range=(ngram, ngram), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    # Exclude n-grams containing specified keywords
    words_freq = [item for item in words_freq
                  if not any(keyword in item[0] for keyword in exclude_keywords)]

    return words_freq[:n]

# Get top 20 bigrams
common_bigrams = get_top_n_ngrams(df['processed_content'], 20, ngram=2, exclude_keywords=['chat', 'gpt', 'chatgpt'])

# Get top 20 trigrams
common_trigrams = get_top_n_ngrams(df['processed_content'], 20, ngram=3, exclude_keywords=['chat', 'gpt', 'chatgpt'])

df_bigrams = pd.DataFrame(common_bigrams, columns=['NgramText', 'count'])
df_trigrams = pd.DataFrame(common_trigrams, columns=['NgramText', 'count'])

# Plot bigrams
fig_bigrams = px.bar(df_bigrams, x='NgramText', y='count', title='Bigram Counts', color='count', width=1200, height=800)
fig_bigrams.show()

# Plot trigrams
fig_trigrams = px.bar(df_trigrams, x='NgramText', y='count', title='Trigram Counts', color='count', width=1200, height=800)
fig_trigrams.show()

Improved results

Updated Top Bigrams— Graphic by author

It’s evident from the top bigrams that users are talking about Artificial Intelligence, Language Models (ChatGPT), Generative AI(such as ChatGPT), AI tools, Search Engine, etc.

‘Search engine’ appears as the 5th most prominent bigram, which makes sense given the significant impact conversational language models like ChatGPT could have on search engine popularity, notably Google’s. Microsoft has recently incorporated GPT-4 chat features and capabilities into its Bing search engine to rival Google’s offering. It’d be an interesting match to watch! A crumbling monopoly?

Updated Top Trigrams — Graphic by author

We observe several expected terms related to LLMs (Large Language Models), such as ‘large language models’, ‘natural language processing’, and ‘ai language model’.

Interestingly, we also came across the terms “500 times powerful” and “times powerful current”, which likely surfaced around the release of GPT-4. People may have believed misinformation or news claiming GPT-4 would be 500 times more powerful than GPT-3.

Another intriguing finding is the appearance of “15 min chart”, “trade 15 min”, “min chart free”, and “chart free train”, which seem to be related to day trading strategies. The 15-minute chart is a favorite tool among day traders for profiting from large price movements throughout the day. This suggests people might be discussing potential applications of ChatGPT in financial markets and day trading.

5.2 WordClouds

Common keywords relating to ChatGPT are: write, use, question, work, create, prompt, make, ask, ai, need, time

Using wordclouds, we can creatively convey a snapshot of the main topics and connections in the ongoing ChatGPT discussions on Twitter.

Unigrams (words): This gives us a visual of the most common topics and applications surrounding ChatGPT.

For unigrams, we’ll first have to lemmatize, which is the process of reducing words to their base form (the lemma), allowing us to group similar word forms together.

Example: the lemma for “run”, “running”, and “ran” is “run”

# Initialize the lemmatizer
wordnet_lem = WordNetLemmatizer()

# Lemmatize each word of the processed text, then join everything into one string
df['content_lem'] = df['processed_content'].apply(
    lambda text: ' '.join(wordnet_lem.lemmatize(word) for word in text.split()))
all_words_lem = ' '.join(df['content_lem'])

Let’s generate some wordclouds. I’ll use the Twitter logo as the wordcloud mask to make the results look relevant to the project. We’ll also define a custom color function and a custom text style (link) to make the wordcloud more interesting. If you want to use a custom font, make sure to download it as a .ttf file and provide the path to the WordCloud generator.

Font: Created by Jonathan Harris (link)

Mask Image: (link)

from random import choice

# Define a custom color function
def custom_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    colors = ['#1DA1F2', '#00CC96', '#FF5733', '#FFC300', '#E91E63', '#9C27B0', '#673AB7']
    return choice(colors)

# Generate a word cloud image
mask = np.array(Image.open("/content/twitter_logo1.jpg"))
stopwords = set(STOPWORDS)

wordcloud_twitter = WordCloud(height=2000, width=2000,
                              background_color="white", mode="RGBA",
                              stopwords=stopwords, mask=mask, color_func=custom_color_func,
                              font_path='/content/SketchBook-B5pB.ttf').generate(all_words_lem)

plt.figure(figsize=[10, 10])
plt.axis('off')
plt.tight_layout(pad=0)
plt.imshow(wordcloud_twitter.recolor(color_func=custom_color_func), interpolation="bilinear")

# Store visualization to file
plt.savefig("twitter_unigram.png", format="png")

plt.show()
Unigram WordCloud — Image by Author

Understandably, some of the most common words that people associate with ChatGPT are write, use, question, work, create, prompt, make, ask, ai, need, etc.

Mentions (accounts): We’ll get a bird’s-eye view of the key players around the ChatGPT buzz.

# Extract mentions and concatenate all mentions in one string
all_mentions = ' '.join([mention[1:] for mentions in df['content'].str.findall(r'@\w+') for mention in mentions])

wordcloud_twitter = WordCloud(height=2000, width=2000,
                              background_color="white", mode="RGBA",
                              stopwords=stopwords, mask=mask, color_func=custom_color_func,
                              font_path='/content/SketchBook-B5pB.ttf').generate(all_mentions)

# Create coloring from the image
plt.figure(figsize=[10,10])
plt.axis('off')
plt.tight_layout(pad=0)
plt.imshow(wordcloud_twitter.recolor(color_func=custom_color_func), interpolation="bilinear")

# Store visualization to file
plt.savefig("twitter_logo_unigram_mentions.png", format="png")

plt.show()
Mentions WordCloud — Image by Author

Apart from the obvious, some notable mentions are Elon Musk, YouTube, Google, Microsoft, Sama (Sam Altman, CEO of OpenAI), Bing, and DataChaz.

6. Topic Modeling with LDA — on top 10,000 most liked tweets

LDA topic modeling on the top 10,000 most liked tweets reveals themes like ChatGPT’s capabilities, API support, learning, AI tools’ impact, and the role of major tech companies in AI development

Now that we’ve seen the most apparent topics, let’s try to uncover some hidden themes and topics. Topic Modeling is a perfect candidate for that purpose.

Latent Dirichlet Allocation (LDA) is an unsupervised machine learning technique used for topic modeling. LDA helps us discover hidden thematic structures within a collection of documents (tweets, in our case).

Let’s discover how topic modeling using LDA can help us reveal key themes and insights by analyzing the top 10,000 most liked tweets about ChatGPT.

To perform LDA, we’ll first have to preprocess the text of the top 10,000 liked tweets by removing stop words, lemmatizing, and tokenizing the words. Then we’ll create a dictionary to represent the words and their frequencies, and convert the text data into a document-term matrix. Finally, we’ll build the LDA model with 10 topics using Gensim’s LdaModel. You can play with parameters such as ‘random_state’, ‘chunksize’, ‘passes’, and ‘alpha’ to obtain better topic modeling results.
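Before diving into the full pipeline, it helps to see what a document-term (bag-of-words) representation actually looks like. Here’s a conceptual, stdlib-only sketch of what Gensim’s dictionary and `doc2bow` produce (Gensim maps each word to an integer id and stores sparse `(word_id, count)` pairs; the sample documents below are made up for illustration):

```python
from collections import Counter

# Two tiny, pre-tokenized example documents (illustrative only)
docs = [
    "chatgpt writes code and poems".split(),
    "openai released the chatgpt api".split(),
]

# A "dictionary" mapping each unique word to an integer id
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

# Each document becomes a sparse list of (word_id, count) pairs,
# which is the corpus format LDA consumes
corpus = [sorted(Counter(vocab[w] for w in d).items()) for d in docs]
print(corpus[0])
```

Gensim’s `corpora.Dictionary` and `doc2bow` do exactly this bookkeeping for us, plus filtering and persistence.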

# Let's sort the dataframe and get top 10000 most liked tweets
df_sorted = df.sort_values(by='like_count', ascending=False)
df_top_10000 = df_sorted.iloc[:10000]

# Text Preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
docs = df_top_10000['processed_content'].apply(lambda x: [lemmatizer.lemmatize(word) for word in nltk.word_tokenize(x.lower()) if word.isalpha() and word not in stop_words])

# Create a dictionary of words and their frequency
dictionary = corpora.Dictionary(docs)

# Create a document-term matrix
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Topic modeling using LDA
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

# Print the topics and their top words
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
    print('\n')
10 topics churned out by LDA — Image by Author

Let’s attempt to interpret some of the topics:

Note: this is an unsupervised machine-learning method and may not always give coherent and understandable answers.

Topic 1: Asking questions and getting answers, with a focus on asking for help or information probably from ChatGPT.

Topic 2: Discussion about the ChatGPT model by OpenAI, its capabilities, and its support for APIs.

Topic 3: Learning and the role of companies like Microsoft in developing AI and large-scale models.

Topic 6: Language models, their quality, and new developments in the field.

Topic 8: ChatGPT features, and the role of GPT in generating text and poems.

Topic 9: AI tools, their use in the workplace, and their potential impact on various tasks such as writing.

7. Sentiment Analysis

lexical-based sentiment analysis reveals predominantly positive sentiment for ChatGPT tweets, with Vader showing a broader sentiment range compared to TextBlob

Here, we have a few options to explore the sentiments of the tweets in our dataset.

  1. Clustering-based approaches: These methods employ unsupervised machine learning techniques such as K-means clustering to group similar text data points together. However, clustering alone doesn’t tell us the sentiment of each group.
  2. Pretrained Transformers from Hugging Face: Models like multilingual XLM-roBERTa-base-sentiment, trained on a massive dataset of ~198M tweets and fine-tuned for sentiment analysis, offer a powerful deep learning solution for sentiment classification. This approach can be computationally expensive but gives the best results.
  3. Lexical-based approaches: Tools like TextBlob and Vader rely on pre-built sentiment wordlists to determine the sentiment of a given text. These approaches are computationally efficient.

We’ll go with the third, lexical-based approach, as it provides a quick and effective way to gauge sentiment without much computational power, making it suitable for our 500k dataset.

TextBlob & Vader

We’ll calculate polarity scores with both Vader and TextBlob, combine them into a single dataframe, and plot a comparative histogram of the scores.

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

# Calculate Polarity using Vader
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
df['vader_polarity'] = df['processed_content'].map(lambda text: sid.polarity_scores(text)['compound'])

# Calculate Polarity using TextBlob
df['blob_polarity'] = df['processed_content'].map(lambda text: TextBlob(text).sentiment.polarity)

# Combine both polarities to make a dataframe
polarity_df = df[['vader_polarity', 'blob_polarity']]
polarity_df = polarity_df.rename(columns={'vader_polarity': 'Vader','blob_polarity': 'TextBlob'})

# Plot a histogram to compare the polarities of both methods
fig = px.histogram(polarity_df, x=['Vader', 'TextBlob'], nbins=40, barmode='group', color_discrete_sequence=['#1DA1F2', '#00CC96'])
fig.update_layout(title='Distributions of sentimental polarities Vader Vs. TextBlob', xaxis_title='Polarity', yaxis_title='Count',width=1200, height=800)
fig.show()
Vader VS. TextBlob Polarity Distribution — Image by author

From the above graph, it’s evident that the two methods, while relying on the same lexical mechanism, give varying results.

Vader’s polarity scores are more spread out across the scale (-1 to 1), while TextBlob assigns more scores close to the center.

Both methods indicate overall positive sentiment in the ChatGPT tweets. For both methods, most tweets are categorized as neutral (a score of 0), with TextBlob assigning ~180k and Vader ~150k neutral scores.

It’s an interesting observation that Vader has assigned more extreme sentiments compared to TextBlob.
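If you want to turn these continuous polarity scores into discrete positive/neutral/negative labels, a common convention (recommended in Vader’s own documentation) is to threshold the compound score at ±0.05; the exact cutoff is ultimately a judgment call:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a polarity score in [-1, 1] to a discrete sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Example: bucket a handful of scores
scores = [0.82, 0.03, -0.41, 0.0, -0.02]
print([label_sentiment(s) for s in scores])
# ['positive', 'neutral', 'negative', 'neutral', 'neutral']
```

Applied to our dataframe, `df['vader_polarity'].map(label_sentiment)` would give a categorical column that is convenient for counts and bar plots.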

7.1 Sentiments on Topics

Now that we have sentiment scores across the board, let’s understand the emotions and public opinion surrounding the key theme and topics related to ChatGPT.

Lexical methods can’t assign sentiment to stopwords, so we’ll simply remove them.

stop_words = nltk.corpus.stopwords.words('english')

def remove_stop_words(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

df['stop_content'] = df['processed_content'].apply(lambda x: remove_stop_words(x))

Let’s define the topics and calculate the average sentiment for each.

# We define a list of topics
topics = ['ai', 'chatgpt', 'elonmusk', 'openai', 'google', 'sama', 'microsoft', 'youtube', 'billgates', 'linkedin', 'bing', 'midjourney']

# We create a new column Topic
df['Topic'] = ""
for topic in topics:
    df.loc[df['stop_content'].str.contains(topic), 'Topic'] = topic

# We create a new DataFrame with columns topic / sentiment / source
data = []
for topic in topics:
    topic_rows = df[df['Topic'] == topic]
    # Average sentiment per topic
    vader_sentiments = topic_rows['vader_polarity'].sum() / topic_rows.shape[0]
    textblob_sentiments = topic_rows['blob_polarity'].sum() / topic_rows.shape[0]
    # Append data
    data.append({'Topic': topic, 'Sentiment': vader_sentiments, 'Source': 'Vader'})
    data.append({'Topic': topic, 'Sentiment': textblob_sentiments, 'Source': 'TextBlob'})

df_new = pd.DataFrame(data)

Now, we’ll visualize the sentiments with a bar plot.

# Plot the sentiment for each topic
fig = px.bar(df_new, x='Topic', y='Sentiment', color='Source', barmode='group',
             color_discrete_sequence=['#1DA1F2', '#00CC96'],
             title='Comparative sentimental analysis by topic', template='plotly_white', width=1200, height=800)
fig.show()
Sentiments on Topics — Graphic by author

Although sentiment is positive overall for both methods, Vader’s scores are stronger than TextBlob’s. We’ve yet to figure out how Bill Gates managed to get some negative press.

MidJourney is one of the best text-to-image generation AI tools out there. Users seem to have positive sentiments about this technology.

7.2 Sentiments on Accounts

Vader and TextBlob both gave the highest positive scores to AIPADTECH, DataChaz, and satyanadella, underscoring the similarity between the two methods. Jordan Peterson received the lowest positive polarity ratings, indicating a sentiment divide among users.

Here, we’ll examine the sentiments associated with the top 20 most mentioned accounts in our dataset. By doing so, we can identify influential users/accounts who shape the conversation around ChatGPT, whether they drive positive engagement or spark controversy and debate.

# Extract top 20 mentions
mentions = df['content'].str.findall(r'@\w+')
mentions_count = mentions.explode().value_counts()
top_mentions = mentions_count[:20].index.tolist()

vader_sentiments = df['vader_polarity'].tolist()
textblob_sentiments = df['blob_polarity'].tolist()
text = df['content'].tolist()

# Create a new column for the username
df['Mention'] = ""
for mention in top_mentions:
    df.loc[df['content'].str.contains(mention), 'Mention'] = mention

# Create a new dataframe with columns for username, sentiment, and sentiment source
data = []
for mention in top_mentions:
    mention_rows = df[df['Mention'] == mention]
    vader_sentiments = mention_rows['vader_polarity'].sum() / mention_rows.shape[0]
    textblob_sentiments = mention_rows['blob_polarity'].sum() / mention_rows.shape[0]
    data.append({'Mention': mention, 'Sentiment': vader_sentiments, 'Source': 'Vader'})
    data.append({'Mention': mention, 'Sentiment': textblob_sentiments, 'Source': 'TextBlob'})
df_new = pd.DataFrame(data)

# Plot the sentiment for each username using Plotly
fig = px.bar(df_new, x='Mention', y='Sentiment', color='Source', barmode='group',
             color_discrete_sequence=['#1DA1F2', '#00CC96'],
             title='Comparative sentimental analysis by accounts', template='plotly_white', width=1500, height=800)
fig.show()
Sentiments On Accounts — Graphic by author

Vader rated AIPADTECH, satyanadella, DataChaz, and OpenAI with the highest positive polarity, while TextBlob gave it to AIPADTECH, DataChaz, satyanadella, and Bing.

While Vader is more decisive in assigning polarity scores, TextBlob seems more timid, with little difference between its highest and lowest polarity scores.

What’s interesting is that even with this difference, both Vader and TextBlob gave the highest positive scores to AIPADTECH, DataChaz, and satyanadella. This agreement strengthens confidence in our top candidates and underscores some similarities between the two methods.

Jordan Peterson received the lowest positive polarity ratings. He’s considered one of the most controversial figures on the internet, and he has also raised concerns about ChatGPT’s bias, especially from a political lens.

Conclusion

Upon conclusion of this analytical deep dive through 500k tweets surrounding ChatGPT, I am amazed at what we were able to glean from the data that I scraped in just about 8 hours. From data collection to sentiment analysis, we have explored one of the many corners of Twitter’s tumultuous landscape and to some extent were able to distill real-world insights.

In this analysis, we removed nearly 50,000 redundant tweets out of 500,000 while preprocessing our massive dataset. Through Exploratory Data Analysis, we found that tweet volume correlated with key AI events and also observed a steady increase in ChatGPT’s popularity since its release. We also identified the most influential users and discovered that higher influence correlates with sharing consistent, insightful content, not necessarily with higher tweet volume.

Our analysis of hashtags, mentions, and user activity revealed the main players and topics in the ChatGPT discussion, while the most-liked tweets offered a glimpse into public opinion and preferences, such as humor, GPT updates, informative tools, and misinformation. We also found that events surrounding ChatGPT impacted the stocks of Microsoft, Nvidia, Google, and Meta. Text analysis (n-grams) and topic modeling helped us uncover common themes and potential applications for ChatGPT, such as its use in financial markets and day trading.

By doing sentiment analysis on our data, we observed an overall positive sentiment towards ChatGPT. The Vader and TextBlob methods differed in scoring, with some surprising similarities. We were also able to understand the prevalent applications of ChatGPT amongst Twitter users, which ranged from writing creative content and humor to enhancing business workflows and improving learning.

Looking at the findings, it’s clear that the real-world applicability of this project is immense, as it provides valuable real-world insights for companies, researchers and policymakers alike. By understanding the trends, public opinion and potential applications of ChatGPT, we can make informed decisions, strategize effectively and shape the future of AI-powered conversational technologies. A future that’s in our hands!

And with that, we’re done! Congratulations on completing the project and thank you for sticking around till the end!

Limitations of this project: This project is more technique-focused, and only the most important observations are expounded due to complexity and time constraints.

Future work & Alternative ideas:

1. A focused deep dive into many of the above analyses will provide even more insights, such as:

a) Running the pipeline for each month/week/day to do a more focused analysis.

b) Filtering the dataframe to understand a specific popular user. What type of content do they put out? And at what times (weekly, daily, time of day)?

c) How are users utilizing ChatGPT for day trading?

d) Understanding the ChatGPT bias controversy that was raised by Jordan Peterson.

2. Trying different approaches for the above analyses, such as:

a) For Topic Modeling you can try NMF (Non-negative Matrix Factorization) instead of LDA.

b) Using pretrained Transformers for Sentiment Analysis instead of TextBlob/Vader for more accurate results, as they also understand context.

May this exploration serve as an insightful voyage into the fascinating and evolving oceans of AI.

References

This project (and many code snippets) was inspired by an article from @Clément Delteil: Unsupervised Sentiment Analysis With Real-World Data: 500,000 Tweets on Elon Musk

Say hi 👋 to me on [LinkedIn]

Checkout 💻👀 my [GitHub]

NYU CS Graduate || NLP | AI/ML/DS | writes codes | creates content. Email: ka2612@nyu.edu