Real-Time Insights from Social Media Data — Data Science Case Study

Vishvdeep Dasadiya
Apr 25 · 8 min read

In this blog:

  • Local and global thought patterns
  • Prettifying the output
  • Finding common trends
  • Exploring the hot trend
  • Digging deeper
  • Frequency analysis
  • Activity around the trend
  • A table that speaks a thousand words
  • Analysing used languages
  • Final thoughts

Local and global thought patterns

While we might not be Twitter fans, we have to admit that it has a huge influence on the world. Twitter data is not only gold in terms of insights, but Twitter-storms are available for analysis in near real-time. This means we can learn about the big waves of thoughts and moods around the world as they arise.

import json
# Load WW_trends and US_trends data into the given variables respectively
WW_trends = json.loads(open('/content/WWTrends.json').read())
US_trends = json.loads(open('/content/USTrends.json').read())

Prettifying the output

Our data was hard to read! Luckily, we can resort to the json.dumps() method to have it formatted as a pretty JSON string.

# Pretty-printing the results. First WW and then US trends.
print("WW trends:", json.dumps(WW_trends, indent=1))
print("\n", "US trends:", json.dumps(US_trends, indent=1))
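As a minimal standalone illustration of what json.dumps() does (using a made-up nested dict in the shape of a trends response, not the actual data):

```python
import json

# A small, made-up structure standing in for a trends response
sample = {"trends": [{"name": "#WeLoveTheEarth", "tweet_volume": 1000}]}

# indent controls the nesting indentation; sort_keys orders keys alphabetically
pretty = json.dumps(sample, indent=2, sort_keys=True)
print(pretty)
```

With indent set, each nested level lands on its own line, which is exactly what makes the raw one-line API response readable.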

Finding common trends

🕵️‍♀️ From the pretty-printed results (output of the previous task), we can observe that:

  • At query time #BeratKandili, #GoodFriday and #WeLoveTheEarth were trending WW.
  • “tweet_volume” tells us that #WeLoveTheEarth was the most popular among the three.
  • Results are not sorted by “tweet_volume”.
  • There are some trends which are unique to the US.
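Since the results are not sorted by “tweet_volume”, we can reorder them ourselves. A quick sketch with a made-up trends list (low-volume trends report None for “tweet_volume”, so the sort key needs a fallback):

```python
# Made-up sample mimicking the shape of a trends entry
trends = [
    {"name": "#BeratKandili", "tweet_volume": 46373},
    {"name": "#GoodFriday", "tweet_volume": 81891},
    {"name": "#WeLoveTheEarth", "tweet_volume": 159698},
    {"name": "QuietTrend", "tweet_volume": None},  # low-volume trends report None
]

# Sort descending by volume, treating None as 0 so the comparison never fails
ranked = sorted(trends, key=lambda t: t["tweet_volume"] or 0, reverse=True)
print([t["name"] for t in ranked])
# → ['#WeLoveTheEarth', '#GoodFriday', '#BeratKandili', 'QuietTrend']
```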
# Extracting all the WW trend names from WW_trends
world_trends = set([trend['name'] for trend in WW_trends[0]['trends']])
# Extracting all the US trend names from US_trends
us_trends = set([trend['name'] for trend in US_trends[0]['trends']])
# Getting the intersection of the two sets of trends
common_trends = world_trends.intersection(us_trends)
# Inspecting the data
print(world_trends, "\n")
print(us_trends, "\n")
print (len(common_trends), "common trends:", common_trends)

Exploring the hot trend

🕵️‍♀️ From the intersection (last output) we can see that, out of the two sets of trends (each of size 50), we have 11 overlapping topics. In particular, there is one common trend that sounds very interesting: #WeLoveTheEarth — so good to see that Twitteratis are unanimously talking about loving Mother Earth! 💚

Image source: Official Music Video Cover: https://welovetheearth.org/video/
# Loading the data
tweets = json.loads(open('/content/WeLoveTheEarth.json').read())
# Inspecting some tweets
tweets[0:2]

Digging deeper

🕵️‍♀️ Printing the first two tweet items makes us realize that there’s a lot more to a tweet than what we normally think of: far more than just a short text!

# Extracting the text of all the tweets from the tweet object
texts = [tweet['text'] for tweet in tweets]
# Extracting screen names of users tweeting about #WeLoveTheEarth
names = [user_mention['screen_name'] for tweet in tweets for user_mention in tweet['entities']['user_mentions']]
# Extracting all the hashtags being used when talking about this topic
hashtags = [hashtag['text'] for tweet in tweets for hashtag in tweet['entities']['hashtags']]
# Inspecting the first 10 results
print (json.dumps(texts[0:10], indent=1),"\n")
print (json.dumps(names[0:10], indent=1),"\n")
print (json.dumps(hashtags[0:10], indent=1),"\n")

Frequency analysis

🕵️‍♀️ Just from the first few results of the last extraction, we can deduce that:

  • A lot of big artists are the forces behind this Twitter wave, especially Lil Dicky.
  • Ed Sheeran was some cute koala in the song — “EdSheeranTheKoala” hashtag! 🐨
# Importing modules
from collections import Counter
# Counting occurrences / getting the frequency distribution of all names and hashtags
for item in [names, hashtags]:
    c = Counter(item)
    # Inspecting the 10 most common items in c
    print(c.most_common(10), "\n")

Activity around the trend

🕵️‍♀️ Based on the last frequency distributions we can build further on our deductions:

  • DiCaprio is not a music artist, but he was involved as well (Leo is an environmentalist so not a surprise to see his name pop up here).
  • We can also say that the video was released on a Friday; very likely on April 19th.
# Extracting (retweet_count, favorite_count, followers_count, screen_name, text)
# for every tweet that is a retweet
retweets = [(tweet['retweet_count'],
             tweet['retweeted_status']['favorite_count'],
             tweet['retweeted_status']['user']['followers_count'],
             tweet['retweeted_status']['user']['screen_name'],
             tweet['text'])
            for tweet in tweets if 'retweeted_status' in tweet]

A table that speaks a thousand words

Let’s manipulate the data further and visualize it in a better and richer way — “looks matter!”

# Importing modules
import matplotlib.pyplot as plt
import pandas as pd
# Create a DataFrame and visualize the data in a pretty and insightful format
df = pd.DataFrame(retweets,
                  columns=['Retweets', 'Favorites', 'Followers', 'ScreenName', 'Text'])
df = df.groupby(['ScreenName', 'Text', 'Followers']).sum().sort_values(by=['Followers'], ascending=False)
df.style.background_gradient()

Analysing used languages

🕵️‍♀️ Our table tells us that:

  • Even if celebrities like Katy Perry and Ellen have a huge Twitter following, their followers hardly reacted; e.g., only 0.0098% of Katy’s followers liked her tweet.
  • While Leo got the most likes and retweets in terms of counts, his first tweet was only liked by 2.19% of his followers.
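The percentages above are just the ratio of favorites to follower count. A minimal sketch of that calculation, using illustrative numbers rather than the real retweet tuples:

```python
# (favorites, followers) pairs -- illustrative values, not the actual data
engagement = {
    "katyperry": (10_500, 107_000_000),
    "LeoDiCaprio": (418_000, 19_000_000),
}

for name, (favs, followers) in engagement.items():
    rate = 100 * favs / followers  # percent of followers who liked the tweet
    print(f"{name}: {rate:.4f}% of followers liked the tweet")
```

This is why raw like counts can be misleading: a smaller but more engaged following can out-react a much larger one in relative terms.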
# Extracting the language of each tweet and appending it to a list of languages
tweets_languages = []
for tweet in tweets:
    tweets_languages.append(tweet['lang'])

# Extracting the source (client app) of each tweet as well
tweets_sources = []
for tweet in tweets:
    tweets_sources.append(tweet['source'])

# Plotting the distribution of languages
%matplotlib inline
plt.hist(tweets_languages)

Final thoughts

🕵️‍♀️ The last histogram tells us that:

  • After English, Polish, Italian and Spanish were the runners-up.
  • There were a lot of tweets with a language alien to Twitter (lang = ‘und’).
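A histogram shows the shape, but a Counter gives the exact breakdown, including the undetermined 'und' bucket. A sketch with a made-up language list standing in for tweets_languages:

```python
from collections import Counter

# Stand-in for the tweets_languages list extracted earlier -- illustrative only
langs = ["en", "en", "pl", "it", "es", "und", "en", "und"]

counts = Counter(langs)
print(counts.most_common())  # language codes sorted by frequency

# Share of tweets whose language Twitter could not determine
und_share = counts["und"] / len(langs)
print(f"undetermined: {und_share:.1%}")
```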
# Plotting the distribution of sources (Twitter clients)
plt.hist(tweets_sources)

MLearning.ai

Data Scientists must think like an artist when finding a solution, when creating a piece of code. Artists enjoy working on interesting problems, even if there is no obvious answer.

Written by Vishvdeep Dasadiya

Masters in Data Science at BITS Pilani | DataRishi | Machine Learning | Deep Learning
