The hashtag to rule them all 👉 Analysis of company tweets

Data analysis of Twitter hashtags and mentions (with cool functions & plots!)

Marta
Towards Data Science

Photo by Jan Baborák on Unsplash

We rush through things and rarely stop to smell the flowers. The Stoics tell us to ponder the bigger picture of our lives, so I decided to stop and analyse the bigger picture of our brand’s Twitter account.

We can poetically title this data analysis 💐 The Smell of the Twitter Flowers.

Goals

A lot can be done with Twitter data and, trust me, I tortured it in almost every way I could (see the current version of the notebook). For this post, though, we’ll focus on the frequency and use of hashtags and mentions. There will soon be other posts about such delicious topics as correlations, emojis, and the saddest days of the week.

👉 Let’s jam: The dataset

The dataset was downloaded from Twitter Analytics and covers the period from April to September 2020. As much as I’d love to have more, Twitter doesn’t store data going further back or, if it does, it doesn’t make it available to the account owner.

The dataset includes all the tweets sent in this period from the account @makingjam. Most of these glorious, JAMmy tweets have been written by me so any ensuing text analysis will be an insight into my marketing brain and sense of humour — brace yourselves.

😩 Step 1: What a mess

It starts off very smoothly: we read all the separate CSV files into one data frame with this neat script.
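As a rough illustration of that step, here is a minimal sketch using pandas. The function name `read_csv_folder` and the idea that all exports sit in one folder are assumptions made for this example, not details taken from the notebook.

```python
from pathlib import Path

import pandas as pd


def read_csv_folder(folder, pattern="*.csv"):
    """Read every CSV matching `pattern` in `folder` into one data frame.

    Hypothetical helper: the folder layout is an assumption.
    """
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {folder}")
    frames = [pd.read_csv(f) for f in files]
    # ignore_index gives the combined frame a clean 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Sorting the file names keeps the row order reproducible from one run to the next.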

What we get as a result is a messy data frame of 529 rows and 40 columns. I’ll spare you the cleaning; you can see it all in the notebook. What’s important is that we dropped a ton of useless columns.

Let’s play with our “favourite” datatype, datetime.

What we have to do is:

  • create properly formatted date and hour columns, which will be important later.
  • add two columns for days of the week: one with its numerical representation, and one with a human-friendly representation, as a string.
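The two bullet points above could look roughly like this in pandas. This is only a sketch on toy data: the column name `time` and the timestamp format are assumptions about the export, and the new column names are made up for the example.

```python
import pandas as pd

# Toy frame; the "time" column name and its format are assumptions.
df = pd.DataFrame({"time": ["2020-05-14 09:30 +0000", "2020-05-16 17:05 +0000"]})

# Parse the strings into timezone-aware timestamps
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H:%M %z", utc=True)

df["date"] = df["time"].dt.date                # calendar date only
df["hour"] = df["time"].dt.hour                # 0..23
df["weekday_num"] = df["time"].dt.dayofweek    # Monday = 0 ... Sunday = 6
df["weekday_name"] = df["time"].dt.day_name()  # e.g. "Thursday"
```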

Now, the first few rows of the data frame look like this.

Before we proceed, we’ll also add two columns to hold values for number of hashtags and number of mentions in each tweet.
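One way to add those two count columns is a regular expression count per tweet. The sketch below assumes the tweet text lives in a column called `Tweet text`; the names `n_hashtags` and `n_mentions` are hypothetical.

```python
import pandas as pd

# Toy data; the "Tweet text" column name is an assumption.
df = pd.DataFrame({"Tweet text": [
    "Join us at #JAMLondon with @mattlemay! #product",
    "No tags here.",
]})

# \w+ matches letters, digits and underscores after the symbol.
# Note that a plain "@\w+" would also match the tail of an email address.
df["n_hashtags"] = df["Tweet text"].str.count(r"#\w+")
df["n_mentions"] = df["Tweet text"].str.count(r"@\w+")
```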

🐲 Step 2: Enter the monster function

There is a lot we want to extract from the text in the data frame. For both hashtags and mentions, separately, we want:

  • a list of all of them
  • the number of unique hashtags or mentions
  • a set of uniques
  • a data frame* with the frequency of use
  • top 10 most frequent ones

*could be a dictionary. Feel free to rewrite my monster function to include a dictionary and a tuple to complete the list of returned datatypes.👌

To retrieve all this^ I wrote a monster of a function. A monster by my standards, that is.

Bonus points for me for instantiating a class and object along with it. If the way I did it, or the fact that it’s within a function, is something atrocious to those more versed in Python, tell me. It was honestly the first time I’ve done it outside of a tutorial.
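For readers who want something to experiment with, here is a much smaller sketch that returns the five items listed above. It is not the monster function from the notebook: the names `TagStats` and `tag_stats` are hypothetical, and the class is defined at module level rather than inside the function.

```python
import re
from collections import Counter
from typing import NamedTuple

import pandas as pd


class TagStats(NamedTuple):
    """Container for the five results (hypothetical name)."""
    all_tags: list
    n_unique: int
    unique: set
    frequency: pd.DataFrame
    top10: pd.DataFrame


def tag_stats(texts, symbol="#"):
    """Collect stats on '#' hashtags or '@' mentions in an iterable of texts."""
    pattern = re.compile(re.escape(symbol) + r"(\w+)")
    # Tags are kept as written; lower-case them here if case should be ignored.
    all_tags = [tag for text in texts for tag in pattern.findall(text)]
    counts = Counter(all_tags)
    # most_common() sorts by count, highest first
    frequency = pd.DataFrame(counts.most_common(), columns=["tag", "count"])
    return TagStats(all_tags, len(counts), set(counts), frequency, frequency.head(10))
```

Calling `tag_stats(df["Tweet text"], symbol="@")` would give the same stats for mentions, assuming the text column is named that way.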

Now we can get all the stats about the hashtags. What joy!

Sharing a screenshot instead of a series of short gists. I hope you forgive me.

🍬 Plot!

Remember, we also have a whole data frame with the frequency of the hashtags. What does it mean? A treat for pyplot and seaborn fans — we can plot them!
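A bar chart of the most frequent tags can be sketched as follows. This version uses only matplotlib's pyplot and assumes a frequency data frame with columns named `tag` and `count`; both the column names and the function name `plot_top_tags` are made up for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_tags(frequency, n=10):
    """Draw a horizontal bar chart of the n most frequent tags.

    Assumes `frequency` has "tag" and "count" columns (hypothetical names).
    """
    top = frequency.nlargest(n, "count")
    fig, ax = plt.subplots(figsize=(8, 5))
    # Reverse the order so the most frequent tag ends up at the top
    ax.barh(top["tag"].tolist()[::-1], top["count"].tolist()[::-1])
    ax.set_xlabel("Number of uses")
    ax.set_title(f"Top {len(top)} tags")
    fig.tight_layout()
    return ax
```

In a notebook the chart renders on its own; in a script, call `plt.show()` afterwards.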

Output:

Guess what the brand Making Jam is about!

Yup, the top hashtags we use reflect the truth: Making Jam is about product management (and events). And, the events we were tweeting about in the period that the data frame covers were JAM London, The Remote PM, and JAM Barcelona.

🔎 Step 3: Zoom in on the hashtag

What if we’d like to plot the use of one specific hashtag over a specified period of time?

Be my guest!

First, we’ll have to create a column holding all hashtags used in each tweet. To do that we create a function with a name like a title of a film with Sylvester Stallone. 💪
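A minimal sketch of such a column-building step is shown below. The function name `extract_tags` is a placeholder (sadly less Stallone-like than the original), and the `Tweet text` column name is an assumption.

```python
import re

import pandas as pd


def extract_tags(text, symbol="#"):
    """Return the hashtags (or mentions) in one tweet, without the symbol."""
    return re.findall(re.escape(symbol) + r"(\w+)", text)


# Toy data; the "Tweet text" column name is an assumption.
df = pd.DataFrame({"Tweet text": ["#product at #JAMLondon", "plain tweet"]})
df["hashtags"] = df["Tweet text"].apply(extract_tags)
```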

To plot the use of a chosen hashtag, we’ll need to vectorise the text — create a sparse matrix reflecting which hashtag is present in which tweet. To do it we only need the text column with hashtags, but we’ll need the date to later plot the results.
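To make the idea concrete, here is one possible way to build such a presence matrix. It swaps in plain pandas (`str.get_dummies`) rather than a dedicated vectoriser, so the result is a dense data frame, not a true sparse matrix. The column names are assumptions.

```python
import pandas as pd

# Toy data: one list of hashtags per tweet (column names are assumptions).
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-13", "2020-05-14", "2020-05-14"]),
    "hashtags": [["product"], ["TheRemotePM", "product"], []],
})

# Join each list into one "|"-separated string, then build one 0/1 column per tag
indicators = df["hashtags"].str.join("|").str.get_dummies(sep="|")

# Keep the date next to the indicators so usage can be plotted over time
vectorized = pd.concat([df[["date"]], indicators], axis=1)
```

A tweet without hashtags simply becomes a row of zeros.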

Output:

Yes, the data frame is mostly zeros. Looks pretty boring, but it’s very useful, because now we can plot the frequency of use of a chosen hashtag in a time period. 🎉

You probably noticed I’m writing most of these functions in a way that can later be reused to get the same stats about the mentions. 🤔 THINKING FORWARD 🤔

Let’s plot the use of the hashtag that stood for the event The Remote PM, which took place in May 2020.

plot_usage(hahstags_vectorized, 'TheRemotePM', date1='2020-04-20', date2='2020-05-21')
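The body of `plot_usage` lives in the notebook, so the following is only a sketch of the counting that a function like it needs to do. `daily_usage` is a hypothetical helper, and it assumes a `date` column holding plain dates plus one 0/1 column per hashtag.

```python
import pandas as pd


def daily_usage(vectorized, tag, date1, date2):
    """Count tweets per day that use `tag` between date1 and date2 (inclusive).

    Assumes a "date" column without a time-of-day part (an assumption).
    """
    dates = pd.to_datetime(vectorized["date"])
    in_range = (dates >= pd.Timestamp(date1)) & (dates <= pd.Timestamp(date2))
    # Summing the 0/1 indicator per day gives the number of tweets using the tag
    return vectorized.loc[in_range].groupby("date")[tag].sum()
```

Calling `.plot()` on the returned series draws the usage over time.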

🙋‍♀️ Question for 5-year-olds: Can you guess which day The Remote PM took place?

Output:

🎖 Most used hashtag

We might also be interested in knowing the most used hashtag in a specified time period.

most_used_h_or_m(hahstags_vectorized, '2020-06-01', '2020-06-30')

Output:

('product', 14)
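A function returning a pair like the one above can be sketched in a few lines. This is not the notebook's `most_used_h_or_m`; `most_used_between` is a hypothetical stand-in that assumes a `date` column plus one 0/1 column per tag.

```python
import pandas as pd


def most_used_between(vectorized, date1, date2):
    """Return (tag, count) for the most frequent tag between two dates (inclusive)."""
    dates = pd.to_datetime(vectorized["date"])
    in_range = (dates >= pd.Timestamp(date1)) & (dates <= pd.Timestamp(date2))
    # Column sums over the selected rows give the number of uses per tag
    totals = vectorized.loc[in_range].drop(columns="date").sum()
    return totals.idxmax(), int(totals.max())
```

If two tags tie, `idxmax` returns whichever column comes first.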

And the most used hashtag in the whole set.

The variables `earliest_tweet` and `latest_tweet` were defined at the beginning of the notebook.

Output: The most used hashtag for the whole dataset is: product, with 127 mentions.

💁‍♀️ Conclusion: Most of these tweets are about product. Product productum et omnia product, to paraphrase a famous gloomy quote.

💙 Step 4: Our Twitter’s best friend

This section is going to be shorter because we already have all the functions and can just re-use them to handle mentions. If you’ve read this far, I’m sure you simply cannot wait to learn all the stats about the mentions used in this dataset.

Let’s deploy the monster function on the @.

Sharing a screenshot so you can see both the code and the output, otherwise there’d be too many short snippets.

Some cool people and groups here! Shoutouts to Matt LeMay, Mathilde Leo, Triangirls, Gibson Biddle, Susana Videira Lopes, Sofia Quintero, Tim Herbig, Kosta Kolev, and Alexis Odysseos.

Next up, the “Sylvester Stallone function” enters again to create a column with a list of mentions as a preparatory step for vectorisation.

Vectorise!

Output:

There is a clear resemblance between this plot and the hashtag plot for the same time period. 🤔

What did Matt LeMay have to do with The Remote PM?

💁‍♀️ He was the host!

Last step:

Output: The most used mention in the whole dataset is mattlemay, with 37 mentions.

I admit the phrasing here is a bit objectifying, sorry Matt. But, we have also identified you as our Twitter’s best friend! At least for the time period the dataset covers.

👀 What’s next?

I’ll be expanding the dataset each month by downloading a new CSV from Twitter Analytics. I’m pretty sure there is room for automation in there. And perhaps, once I master Tableau (watch this space!), a corresponding dashboard might appear.

This is one of a series of posts inspired by our brand’s Twitter dataset. You can read more here:

📈 Aspiring data scientist. Rationality fan. EA. Vegan. Working to improve global mental health at MindEase.io