The Twitter Age Conundrum

Why working with real tweets isn’t always a yellow brick road to understanding public sentiment.

Aboli Marathe
Omdena


Recently, I had the pleasure of joining Omdena, a global platform committed to AI for Social Good, on one of their exciting real-world AI challenges. Working alongside the venerable Fondation Botnar, 50 brilliant collaborators were summoned from all over the world to bring their best wits and save the world, not unlike the Aurors in Harry Potter. Here, the magic was AI.

Twitter! Source here

Unlike quotidian problem statements, Omdena challenges allow the collaborators to explore their creative side. There are no limitations to hinder the potential of the researchers; the only target is to do social good. Our particular challenge was multifaceted: To Understand the Sentiments, Thoughts, and Aspirations of Young People.

Bytes for Breakfast

To start off, our team brainstormed about which media the youth use most widely today. We chose to scrape Twitter data because recent surveys indicated that about one in three teenagers has a Twitter account.

After scraping over a million tweets on youth-related topics like college, Justin Bieber, school, and bullying, we realized the muddle we were in. We had tons of data and no method of knowing which tweet belonged to which age category!

Team Twitter’s hilarious motto! Source here

The Final Problem

Estimating the age of the people whose tweets made up our data set raised innumerable issues. Even if it is possible to guess a user's age from a sizable corpus, identifying it from just a few words or hashtags is a complex task. Furthermore, if you actually analyze people and their personalities, you will realize that mental and physical age are different concepts and sometimes far apart. We can never be sure that a 20-year-old woman has written a tweet representative of her age, because her mental age may differ from her physical one.

Solving for age! Source here

To solve this problem, we found a very relevant paper showing that certain terms like “school” are used more frequently by younger Twitter users. So we focused on terminology and selected a few terms in different domains that could give us a good idea of what the youth are thinking about.

Some major words popular among the youngsters. Source: Omdena.com

But just selecting words wasn’t enough, as these categories could still contain tweets from older users. Even as a minority, those tweets were undesirable. We developed some methods of separating the tweets by age:

  • Use of youth-specific hashtags like #teen, #graduation, and #school
  • Use of slang words. Examples of youth slang collected by our team: Sic, Thirsty, TBH, BTW, YOLO, Bruh, Bae
  • High frequency of emojis in text
  • Greater use of informal contractions like “kinda” and “sorta”
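
As a rough illustration, the four signals above could be combined into a simple filter. This is a minimal sketch rather than the team's actual pipeline: the word lists are copied from the bullets, while the emoji ranges, the function names `youth_signals` and `looks_youthful`, and the `min_signals` threshold are assumptions made for the example.

```python
import re

# Hashtags, slang, and informal words come from the list above.
# Everything else (emoji ranges, threshold) is an illustrative assumption.
YOUTH_HASHTAGS = {"#teen", "#graduation", "#school"}
YOUTH_SLANG = {"sic", "thirsty", "tbh", "btw", "yolo", "bruh", "bae"}
INFORMAL_WORDS = {"kinda", "sorta"}

# Rough emoji matcher covering common emoji code-point blocks (not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# A token is a word, optionally prefixed with '#'.
TOKEN_RE = re.compile(r"#?\w+")


def youth_signals(tweet: str) -> dict:
    """Count how often each of the four heuristic signals appears in a tweet."""
    tokens = [t.lower() for t in TOKEN_RE.findall(tweet)]
    return {
        "hashtags": sum(t in YOUTH_HASHTAGS for t in tokens),
        "slang": sum(t in YOUTH_SLANG for t in tokens),
        "emojis": len(EMOJI_RE.findall(tweet)),
        "informal": sum(t in INFORMAL_WORDS for t in tokens),
    }


def looks_youthful(tweet: str, min_signals: int = 2) -> bool:
    """Flag a tweet when at least `min_signals` different signal types fire."""
    signals = youth_signals(tweet)
    return sum(count > 0 for count in signals.values()) >= min_signals


print(youth_signals("tbh kinda nervous for #graduation 😭😭"))
print(looks_youthful("Quarterly earnings report is due on Monday."))
```

Requiring more than one signal type is a deliberate choice in this sketch: a single hashtag like #school is just as likely to come from a parent or teacher.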

Methodology

For topic modeling, we explored methods like NMF (Non-negative Matrix Factorization), LDA (Latent Dirichlet Allocation), and TF-IDF analysis. To analyze the sentiment, we ran the cleaned tweets through deep learning models and a Python library called TextBlob.
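
To give a feel for the TF-IDF part, here is a minimal standard-library sketch of the weighting itself. It is not the code used in the challenge, where a library implementation would be the more likely choice. It assumes one common variant (term frequency normalized by tweet length, multiplied by the unsmoothed log inverse document frequency), and the function names and sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(doc_weights: dict[str, float], k: int = 3) -> list[str]:
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical, already-cleaned tweets.
tweets = [
    ["school", "exam", "stress"],
    ["school", "prom", "dress"],
    ["school", "bullying", "stress"],
]
weights = tf_idf(tweets)
print(top_terms(weights[0], k=2))
```

A term that appears in every tweet, such as "school" here, gets a weight of zero, which is why TF-IDF is useful for surfacing what distinguishes one group of tweets from another.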

Our Observations

We plotted some interesting visualizations to help present our insights using Seaborn, Matplotlib, Tableau, HTML, and other tools. Using the tweets about popular celebrities, we plotted word clouds to find the most popular celebrities among the youth.

Celebrity Word Clouds Source: Omdena.com

We wanted to find out how the youth felt about important topics like Youth Activism and Racism, so we created bar plots showing how many tweets fell into each sentiment category.

Source: Omdena.com
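
The counts behind bar plots like these can be sketched as follows. This is an illustrative example, not the team's code. It assumes each tweet already has a polarity score between -1 and 1 (the range TextBlob reports), and the `neutral_band` cutoff, function names, and sample scores are hypothetical.

```python
from collections import Counter


def sentiment_label(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment category.

    The width of the neutral band is an assumption for this sketch.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def sentiment_counts(polarities_by_topic: dict) -> dict:
    """Count tweets per sentiment category for every topic."""
    return {
        topic: Counter(sentiment_label(p) for p in scores)
        for topic, scores in polarities_by_topic.items()
    }


# Hypothetical polarity scores, for illustration only.
sample = {
    "youth activism": [0.6, 0.2, 0.0, -0.4],
    "racism": [-0.7, -0.3, 0.02, 0.5],
}
counts = sentiment_counts(sample)
print(counts["racism"])
```

Each resulting `Counter` maps directly onto one group of bars: one bar per sentiment category, with the count as its height.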

The polarity of the sentiment found in the tweets was also used in a special GIS analysis workflow, using the coordinates of the tweets obtained from the dataset. We computed the standard deviation of sentiment polarity for 10 countries over the decade from 2010 to 2020, which shows how far apart the extremes in sentiment were.

Std Deviation in Sentiment over region Source: Omdena.com

The above plot shows the country-wise standard deviation in sentiment polarity. France had the largest deviation, and thus the most change in sentiment over the period.
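
A per-country spread like this can be computed by grouping polarity scores by country and taking the standard deviation of each group. The sketch below is only an illustration: it uses the population standard deviation (whether the original analysis used the population or sample form is not stated), and the country codes and scores are made up.

```python
from collections import defaultdict
from statistics import pstdev


def polarity_spread(rows) -> dict:
    """Population standard deviation of polarity for each country.

    `rows` is an iterable of (country, polarity) pairs.
    """
    by_country = defaultdict(list)
    for country, polarity in rows:
        by_country[country].append(polarity)
    return {country: pstdev(scores) for country, scores in by_country.items()}


# Hypothetical (country, polarity) pairs, for illustration only.
rows = [("FR", -0.8), ("FR", 0.8), ("IN", 0.1), ("IN", 0.3)]
spread = polarity_spread(rows)

# The country whose sentiment varies the most.
print(max(spread, key=spread.get))
```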

The recent COVID-19 pandemic has caused a global shift in sentiment, particularly affecting the youth. We created a sentiment heat map to visualize this change in sentiment.

Source: Omdena.com

We can see that a lot of negative sentiment was floating around Twitter in June and January, with the maximum negativity in June.

We also wanted to observe the overall positive sentiment across the globe, to examine which country had the happiest youth population. Using Folium, we created heat maps of positive sentiment from the cleaned youth tweets.

Heat Maps in Positive Sentiment over region Source: Omdena.com

We observed that Europe and South Africa are major centers for positive sentiment.

Conclusion

From this challenge, we drew some very exciting conclusions that provided enormous insight into the sentiment of the youth:

  1. Among the influencers, Taylor Swift's online presence is surrounded by largely positive tweets.
  2. From 2013 to 2020, students expressed increasing negativity about school, which is a concern and may be due to the increasing number of school shootings, declining education standards, and bullying in general.
  3. From 2018 to 2020, tweets about (or awareness of) racism increased rapidly, which may be attributed to the Black Lives Matter movement, which began in the USA in 2013 and spread around the world in 2020.

Overall, we can see that the future leaders of the world are well-informed individuals with strong opinions about the different aspects of their lives. Addressing their concerns will help us improve their lives and let them reach their fullest potential.

Sources

  1. Morgan-Lopez AA, Kim AE, Chew RF, Ruddle P (2017) Predicting age groups of Twitter users based on language and metadata features. PLoS ONE 12(8): e0183537. https://doi.org/10.1371/journal.pone.0183537
  2. Faiyaz Al Zamal, Wendy Liu, and Derek Ruths. 2012. Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors. In Proceedings of ICWSM.
  3. https://towardsdatascience.com/twitter-demographics-user-age-inference-82ad7bf65229

Acknowledgment

This challenge was one of the most exciting AI problems I have ever dealt with, and working with Omdena helped me discover my path in AI for social good. As the team lead for the Twitter task, I developed strong leadership skills while processing challenging real-world datasets. My brilliant team was a devoted bunch of researchers, constantly striving to share innovative ideas and strengthen our conclusions. I would like to thank Rudradeb Mitra for this amazing platform that allows budding scientists to explore AI in a healthy collaborative environment.

Machine Learning Engineer @ Omdena| AI for Social Good | Writer @ The Innovation