Analyzing Tableau Conference tweets with Tableau Desktop
I had a great time at this year's Tableau Conference in Vegas, which just wrapped up yesterday. I'm always interested in seeing what people at the conference are excited about based on what's being tweeted. As usual, the popular hashtags for this year's event were #data17 and #tc17.
Prepping the data
Using htspy, my small Python script for pulling tweets, I pulled down all tweets containing those two hashtags from Monday through Thursday of the conference week.
I filtered out any retweets and bucketed the rest into 10-minute intervals to cut down on the choppiness. I used NLTK's TweetTokenizer to tokenize the tweets into words and scikit-learn's English stop word list to remove stop words. I also built a list of bigrams (2-grams, with stop words left in), and all of the resulting terms were aggregated and counted. Terms appearing in only one tweet within an interval were omitted. Everything was then exported as a CSV for use in Tableau.
The code for pre-processing is at the bottom of this post.
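As a quick illustration of what that tokenizing step produces, here is a minimal sketch on a single made-up tweet (the full pre-processing script is at the bottom):

from nltk.tokenize import TweetTokenizer
from nltk import ngrams
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Made-up example tweet, just to show what the tokens and bigrams look like.
tweet = "Can't wait for the devs on stage session! #data17 #tc17 @tableau"

tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False)
# Strip the apostrophe, then keep tokens that aren't hashtags and are longer than one character.
tokens = [t for t in tokenizer.tokenize(tweet.replace("'", "")) if t[0] != '#' and len(t) > 1]

unigrams = [t for t in tokens if t not in ENGLISH_STOP_WORDS]   # stop words removed
bigrams = [' '.join(g) for g in ngrams(tokens, 2)]              # stop words kept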
Exploring Data
You can find an interactive version of the data on Tableau Public here. I tried formatting it for mobile, but the viz can still be a little difficult to interact with on a phone; you're better off on a desktop.
As usual, the first keynote by the Tableau CEO generated the most buzz. There were around 218 tweets between 8:30 and 8:39 AM on Tuesday morning. The Devs On Stage session also generated some buzz later that afternoon.
The second keynote, with Levitt & Dubner, brought in fewer tweets, and the Iron Viz competition that afternoon was slightly inflated because the audience was encouraged to tweet for their favorite contestant. What I find interesting in this view is the periodic volume of tweets that correlates with the start and end of the various sessions throughout the day. Surprisingly, Data Night Out didn't garner much tweet volume. Maybe people were busy having fun.
The last day had some intermittent buzz, but the real highlight was the final keynote by Adam Savage. I cut off the Twitter feed a few hours after that ended.
Top Terms
With the dashboard, we can explore the most frequently mentioned terms within a given time window. The blue bars in the charts below show absolute volume. The orange bars show what percentage of all mentions of that term occurred within that window.
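In Tableau these are just an aggregate and a percent-of-total calculation, but for reference, here is a rough pandas sketch of the same two measures, assuming the tc17-cloud.csv produced by the script at the bottom (columns token, created_at, id), using the Tuesday keynote slot as the example window:

import pandas as pd

# Rough sketch of the two measures, computed from the exported CSV instead of in Tableau.
df = pd.read_csv('tc17-cloud.csv', parse_dates=['created_at'])

# Example window: the Tuesday morning keynote bucket.
window = df[(df.created_at >= '2017-10-10 08:30') & (df.created_at < '2017-10-10 08:40')]

volume = window.groupby('token')['id'].nunique()              # blue bars: tweets mentioning the term in the window
total = df.groupby('token')['id'].nunique()                   # tweets mentioning the term overall
pct_in_window = volume / total.reindex(volume.index) * 100    # orange bars: share of all mentions in the window

print(pd.DataFrame({'volume': volume, 'pct_in_window': pct_in_window})
        .sort_values('volume', ascending=False).head(10))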
The CEO keynote is the first chart below. Lots of talk about excitement and getting ready for the conference. Adam Selipsky also talked about the role of AI in data analysis and used myths as a theme throughout his talk, one being that AI will not replace human analysis, a sentiment that was well received in an arena full of human analysts.
In the second part of the same session, the future roadmap was shared with the crowd, including Hyper (HYPER HYPER!), Project Maestro, extensions, and data governance. I was impressed with the speed of Hyper and think it'll serve Tableau well as the world keeps moving to larger data sets. I'm also excited for continued development around data governance and data loading. I feel these are Tableau's weak points right now, and improvements in this area will lead to better adoption by large organizations.
Later that day we had the Devs On Stage session. Tweets about the dashboard grid, density plots, nested sorts, and spatial joins all made the top terms. I'm particularly excited for spatial joins, which I presently have to use ArcGIS for.
Finally, there was the keynote by Mythbuster Adam Savage. I find this one interesting since most people seemed excited to tweet about seeing the keynote, but I don't see much about the content of the talk itself.
Wrap up
Again, you can find the interactive version of the dashboard here.
The code I used to preprocess the tweets is below. This assumes you already have the tweets stored in a MongoDB instance.
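For context, the script below expects each tweet document in the twitter.tableau17 collection to look roughly like the following (field names are taken from the script; the values here are made up):

from datetime import datetime
from pymongo import MongoClient

# Hypothetical example document; however the tweets are collected (e.g. with htspy),
# the script below expects at least these fields in the 'twitter.tableau17' collection.
client = MongoClient()
client['twitter']['tableau17'].insert_one({
    '_id': '917700000000000000',                    # tweet id
    'created_at': datetime(2017, 10, 10, 15, 35),   # UTC timestamp (shifted to Pacific later)
    'is_retweet': False,
    'text': 'Example tweet text #data17',
    'user': {'screen_name': 'example_user'},
})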
# coding: utf-8

from pymongo import MongoClient
import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk import ngrams
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def get_tweets():
    # Pull non-retweets (and a handful of fields) out of MongoDB into a DataFrame.
    client = MongoClient()
    col = client['twitter']['tableau17']
    keep_fields = ['_id', 'created_at', 'is_retweet', 'text', 'user']
    results = list(col.find({'is_retweet': False}, {k: 1 for k in keep_fields}))
    df = pd.DataFrame.from_dict(results)
    df['user'] = df['user'].apply(lambda x: x['screen_name'])
    return df

data = get_tweets()
data.head()


def tokenize_tweets(data):
    df = data.copy()
    # Extra tokens to drop: the conference hashtags themselves, "rt", and stray punctuation.
    stops = ['#tc17', '#data17', 'rt', ':', '…', '...']
    tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False)

    def tokens(x):
        # Tokenize, then drop stop tokens, URLs, other hashtags, and one-character tokens.
        return [t for t in tokenizer.tokenize(x.replace("'", ""))
                if t not in stops and not t[:4] == 'http' and not t[0] == '#' and len(t) > 1]

    # Shift from UTC to Pacific time, then bucket into 10-minute intervals.
    df['created_at'] = df['created_at'] + pd.Timedelta(hours=-7)
    df['created_at_adj'] = df['created_at'].apply(lambda x: x.replace(second=0, minute=(x.minute // 10) * 10))
    df['tokens'] = df['text'].apply(tokens)
    # Unigrams have stop words removed; bigrams keep them.
    df['tokens_stops'] = df['tokens'].apply(lambda x: [t for t in x if t not in ENGLISH_STOP_WORDS])
    df['ngrams'] = df['tokens'].apply(lambda x: list(ngrams(x, 2)) if len(x) > 1 else None)
    return df

data_tokenized = tokenize_tweets(data)
data_tokenized.head()

# Explode tokens and bigrams into one long (token, time bucket, tweet id) table.
tokens = []
times = []
ids = []
for i, r in data_tokenized.iterrows():
    for t in r['tokens_stops']:
        tokens.append(t)
        times.append(r['created_at_adj'])
        ids.append(r['_id'])
    if r['ngrams'] is not None:
        for t in r['ngrams']:
            tokens.append(' '.join(t))
            times.append(r['created_at_adj'])
            ids.append(r['_id'])

df_words = pd.DataFrame({'token': tokens, 'created_at': times, 'id': ids})
df_words.shape

# Keep only terms mentioned by more than one tweet within a given 10-minute bucket.
df_min_count = df_words.groupby(['token', 'created_at'])['id'].nunique().reset_index()
df_min_count = df_min_count[df_min_count['id'] > 1]
df_final = df_words.merge(df_min_count, on=['token', 'created_at'], how='inner').drop(['id_y'], axis=1)
df_final.columns = df_words.columns

# Restrict to the conference window and export for Tableau.
df_final = df_final[(df_final.created_at >= '2017-10-09 00:00:00') & (df_final.created_at < '2017-10-13 17:00:00')]
df_final.head()
df_final.shape
df_final.to_csv('tc17-cloud.csv', index=False)