Decoding Generations Through Social Media: A Deep Dive into Data Science, Deep Learning, and AI

Chen Jun Ming
Published in AI Palette
3 min read · Jun 11, 2023
Three generations of people on their phones, 3d render, created with DALL·E 2

Introduction

Imagine being able to accurately predict the generation of the author behind any given social media post. It’s a challenging task, even for humans. However, with the evolving field of data science and the ever-growing capabilities of deep learning and artificial intelligence, it’s a goal we decided to tackle.

The project’s objective was straightforward: to predict whether a social media post was written by a member of Gen Z, a Millennial, or Gen X+ using Natural Language Processing (NLP).

The Challenges

This ambitious endeavour was not without its challenges. One of the significant obstacles was the lack of readily available datasets. Twitter, one of the main social media platforms, does not publicly share its users’ ages, so there is no direct way to collect social media posts labelled by age.

To complicate things further, there was the challenge of adhering to the General Data Protection Regulation (GDPR). The GDPR places strict limits on storing and processing personally identifiable information, so we ruled out collecting data directly from user profiles.

The Innovative Solution

Faced with these challenges, we needed to employ a creative solution. We decided to scrape Twitter for ‘Happy Birthday’ posts that contained a numerical value, indicating the user’s age. We manually validated each post to ensure its authenticity and excluded any posts related to businesses or influencers. We also discarded posts that contained inaccurate information, such as a person celebrating the 20th anniversary of their 25th birthday.

This process allowed us to collect tweets and label them by the user’s age without storing the username in the dataset, thereby avoiding a potential GDPR violation.
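The age-extraction step can be illustrated with a minimal Python sketch. It is not our production scraper: the regular expression, the `extract_age` and `to_record` names, and the 13 to 100 plausibility bounds are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: an ordinal such as "21st" directly before "birthday" or "bday".
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})(?:st|nd|rd|th)\s+(?:birthday|bday)\b", re.IGNORECASE
)


def extract_age(text, min_age=13, max_age=100):
    """Return the age mentioned in a birthday post, or None if no plausible age is found.

    The min_age and max_age bounds are illustrative assumptions.
    """
    match = AGE_PATTERN.search(text)
    if match is None:
        return None
    age = int(match.group(1))
    return age if min_age <= age <= max_age else None


def to_record(username, text):
    """Build a dataset row that keeps the text and the age label but drops the username."""
    age = extract_age(text)
    if age is None:
        return None
    # The username is deliberately not copied into the stored record.
    return {"text": text, "age": age}
```

A pattern like this would still return 25 for a post about "the 20th anniversary of my 25th birthday", which is why a manual validation pass remains necessary.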

Once we collected the data, we categorized the ages into their respective generations: Gen Z, Millennials, and Gen X+. The final step was to train a deep learning NLP model, XLM-RoBERTa, to classify which generation the social media post originated from. The advantage of XLM-RoBERTa is that it’s multilingual (it was pretrained on text in about 100 languages) and can handle emojis, which are frequently used in social media communication.
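As a rough sketch of the categorization step, the function below maps an age and the year of the birthday post to one of the three labels. The birth-year boundaries (1981 for the first Millennials, 1997 for the first Gen Z) follow a commonly cited convention and are an assumption here, as are the constant and function names.

```python
# Assumed birth-year boundaries; this article does not state the exact cutoffs used.
MILLENNIAL_FIRST_BIRTH_YEAR = 1981
GEN_Z_FIRST_BIRTH_YEAR = 1997


def generation_from_age(age, post_year):
    """Map the age announced in a birthday post to a generation label.

    A birthday post states the age the person turns in post_year,
    so the birth year is simply post_year - age.
    """
    birth_year = post_year - age
    if birth_year >= GEN_Z_FIRST_BIRTH_YEAR:
        return "Gen Z"
    if birth_year >= MILLENNIAL_FIRST_BIRTH_YEAR:
        return "Millennial"
    # Gen X, Baby Boomers, and older are merged into a single class.
    return "Gen X+"
```

Working from the birth year rather than fixed age bands keeps the labels stable when posts from different years are mixed in one dataset.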

Design Choices

We wanted to keep the system simple and decided to focus on three generations: Gen Z, Millennials, and Gen X+. The Gen X+ category combines Gen X and Baby Boomers, so it spans a wider age range than the other two classes, which skews the age distribution.

Gen Z, on the other hand, is underrepresented because there are fewer children under the age of 16 on social media platforms.

Outcome and Potential Improvements

The model achieved an accuracy of about 67% on the test data. When tested on data from our own platform, the accuracy was consistent, ranging between 60% and 70% for various groupings of data.
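Because the three classes are not equally represented, an accuracy figure is easier to interpret next to a majority-class baseline and a per-class breakdown. The sketch below shows one way to compute all three with the standard library; the function names are hypothetical and this is not the evaluation code we ran.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted generation matches the true one."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    if not y_true:
        raise ValueError("y_true must be non-empty")
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its posts that were classified correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}
```

Comparing `accuracy` against `majority_baseline_accuracy` shows how much the model adds beyond always guessing the largest class, and `per_class_recall` would reveal whether an underrepresented class such as Gen Z is being classified poorly.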

While this is a substantial achievement, we see potential for improvement. In retrospect, we realized that other factors, such as race and geography, significantly influence the linguistic patterns of a social media post. For future enhancements, we plan to incorporate region filters into the dataset when focusing on a specific geographical area.

Conclusion

Our journey into decoding generations through social media posts using data science, deep learning, and artificial intelligence has been a challenging but rewarding endeavour. We’ve managed to develop an innovative solution that not only respects data protection regulations but also achieves impressive predictive accuracy.

As we continue to refine the model and incorporate other factors such as race and geography, we’re excited to see how much closer we can get to unraveling the complex linguistic tapestry of social media, one generation at a time.

This article was written by Jun Ming, who is a Junior Data Scientist at AI Palette.
