TweetEval — A Standardized Evaluation Setup for Diverse Tweet Classification Tasks

Mohamad Mahmood · Published in Lexiconia · Jun 29, 2024

TweetEval is a unified benchmark and comparative evaluation framework for tweet classification. Its key objective is to provide a standardized evaluation setup that makes it possible to compare different tweet classification models on an equal footing. The benchmark consolidates and curates tweets from various existing datasets, covering seven common tweet classification tasks: emoji prediction, emotion recognition, hate speech detection, irony detection, offensive language identification, sentiment analysis, and stance detection. In total, the TweetEval dataset contains well over 150,000 labeled tweets, with dataset sizes that vary considerably from task to task, covering a diverse range of topics and styles representative of real-world Twitter data.

The TweetEval paper defines a common evaluation protocol across the seven tasks, reporting a task-specific metric, macro-averaged F1 in most cases, for each. The paper also provides baseline results for a range of models, from linear classifiers such as SVMs and FastText up to transformer language models such as RoBERTa, including a RoBERTa variant further pre-trained on Twitter data. The benchmarking results demonstrate that there is still room for improvement: even the strongest baselines leave a clear gap to ceiling performance on several of the tasks.
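The repository ships a reference evaluation script, but the core of the protocol is simply comparing predicted labels against gold labels with the task's metric. Here is a minimal sketch of that scoring step using scikit-learn; the file names are placeholders for your own gold and prediction files, and this is not the repository's official script.

```python
# Minimal sketch of a TweetEval-style scoring step, using scikit-learn.
# The file names below are placeholders, not files shipped by the repo.
from sklearn.metrics import accuracy_score, f1_score

def load_labels(path):
    """Read one label per line, as in the TweetEval *_labels.txt files."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

gold = load_labels("test_labels.txt")      # gold labels for one task
pred = load_labels("my_predictions.txt")   # your system's predictions

print("accuracy :", accuracy_score(gold, pred))
print("macro-F1 :", f1_score(gold, pred, average="macro"))
```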

By serving as a comprehensive, standardized evaluation framework, TweetEval is a significant contribution to research on tweet classification. It allows researchers to compare the performance of different models on the same data under the same protocol, facilitating progress in understanding and moderating online discourse. The dataset and evaluation protocol are publicly available, encouraging further research and the development of effective tweet classification systems.

The cardiffnlp/tweeteval repository on GitHub (https://github.com/cardiffnlp/tweeteval/tree/main) is a comprehensive resource for the TweetEval benchmark, a unified evaluation framework for seven Twitter-related natural language processing tasks: emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection. The repository provides the datasets for these tasks, a leaderboard showcasing the performance of various models, and instructions for evaluating your own system. This makes it a valuable resource for researchers and practitioners working on social media analysis and on understanding user-generated content on platforms like Twitter.
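The same data is also mirrored on the Hugging Face Hub as the tweet_eval dataset (a detail beyond the repository description above), which offers a convenient way to load any single task. A minimal sketch:

```python
# Load one TweetEval task through the Hugging Face `datasets` library,
# which mirrors the GitHub data under the dataset name "tweet_eval".
from datasets import load_dataset

# Other configs include "hate", "irony", "offensive", "sentiment",
# "emoji", and per-topic stance configs such as "stance_abortion".
emotion = load_dataset("tweet_eval", "emotion")

print(emotion)              # DatasetDict with train / validation / test splits
example = emotion["train"][0]
print(example)              # {'text': '...', 'label': ...}

# Map the integer label back to its human-readable name.
label_names = emotion["train"].features["label"].names
print(label_names[example["label"]])
```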

The repository contains the following two parts: (1) TweetEval: The Benchmark and (2) TweetEval: Leaderboard (Test set).

TweetEval: The Benchmark

These are the seven datasets of TweetEval, with their corresponding labels (more details about the format in the datasets directory):

- Emoji prediction, 20 labels (the twenty most frequent emojis)
- Emotion recognition, 4 labels: anger, joy, optimism, sadness
- Hate speech detection, 2 labels: hateful, not hateful
- Irony detection, 2 labels: irony, not irony
- Offensive language identification, 2 labels: offensive, not offensive
- Sentiment analysis, 3 labels: positive, neutral, negative
- Stance detection*, 3 labels: favour, none, against

Note 1*: For stance there are five different target topics (Abortion, Atheism, Climate change, Feminism and Hillary Clinton), each of which contains its own training, validation and test data.

Note 2*: The sentiment dataset has been updated as of 17 December 2020. The update has been minimal and it was intended to fix a small number of sentences that were cropped.
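In a local clone of the repository, each task lives in its own datasets/&lt;task&gt;/ directory with plain-text files per split ({split}_text.txt and {split}_labels.txt) plus a mapping.txt from integer labels to names. Assuming that layout, a minimal loader might look like this:

```python
# Sketch: read one TweetEval task from a local clone of the GitHub repo.
# Assumes the layout datasets/<task>/{train,val,test}_text.txt, matching
# *_labels.txt files, and a tab-separated mapping.txt of id -> label name.
from pathlib import Path

def load_split(task_dir: Path, split: str):
    texts = (task_dir / f"{split}_text.txt").read_text(encoding="utf-8").splitlines()
    labels = (task_dir / f"{split}_labels.txt").read_text(encoding="utf-8").splitlines()
    return list(zip(texts, (int(l) for l in labels)))

task_dir = Path("tweeteval/datasets/emotion")  # path to your local clone
mapping = dict(
    line.split("\t")
    for line in (task_dir / "mapping.txt").read_text(encoding="utf-8").splitlines()
    if line.strip()
)

train = load_split(task_dir, "train")
text, label = train[0]
print(f"{text!r} -> {mapping[str(label)]}")
```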

TweetEval: Leaderboard (Test set)

The leaderboard itself is maintained in the repository README, which reports each system's test-set score for every task alongside an overall average used for the ranking.
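Many of the stronger entries come from Twitter-adapted RoBERTa models that the Cardiff NLP group released on the Hugging Face Hub alongside the benchmark. As an illustrative (not article-provided) example, one of those fine-tuned checkpoints can be tried directly with the transformers pipeline:

```python
# Try a TweetEval fine-tuned checkpoint from the Hugging Face Hub
# (published under the cardiffnlp organization alongside the benchmark).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-emotion",
)

result = classifier("I can't believe the match got cancelled again...")
print(result)
# Depending on the model config, labels may print as names (e.g. "sadness")
# or as generic ids (LABEL_0..LABEL_3); mapping.txt in the repo gives the names.
```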

