TweetEval — A Standardized Evaluation Setup for Diverse Tweet Classification Tasks
TweetEval is a unified benchmark and comparative evaluation framework for tweet classification. Its key objective is to provide a standardized evaluation setup that facilitates comparisons between different tweet classification models. The benchmark consolidates and curates tweets from various existing datasets, covering seven common tweet classification tasks: emoji prediction, emotion recognition, hate speech detection, irony detection, offensive language identification, sentiment analysis, and stance detection. In total, the TweetEval dataset contains over 150,000 tweets, with most tasks having between 5,000 and 35,000 labeled tweets that cover a diverse range of topics and styles representative of real-world Twitter data.
The TweetEval paper defines a common evaluation protocol across the seven tasks, using standard metrics such as accuracy and macro-averaged F1. It also provides baseline results for popular models, including simple classifiers and transformer-based language models such as RoBERTa. The benchmarking results demonstrate that there is still room for improvement, with the best models achieving 60–80% F1-scores on average across the different tasks.
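To make the metric concrete, here is a minimal pure-Python sketch of macro-averaged F1, the headline metric for most TweetEval tasks (some tasks use a task-specific variant). The toy labels below are illustrative, not taken from the benchmark.

```python
# Minimal sketch of macro-averaged F1: the unweighted mean of per-class F1.
def macro_f1(gold, pred):
    """Macro-averaged F1 over all classes seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy 3-class example (e.g. sentiment: positive / neutral / negative).
gold = ["positive", "neutral", "negative", "neutral", "positive"]
pred = ["positive", "negative", "negative", "neutral", "neutral"]
print(round(macro_f1(gold, pred), 4))  # → 0.6111
```

Because macro-F1 weights every class equally, it rewards models that handle rare classes well — a sensible choice for imbalanced Twitter datasets.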
By serving as a comprehensive, standardized evaluation framework, TweetEval is a significant contribution that advances research in tweet classification. It allows researchers to compare the performance of different models on the same dataset, thereby facilitating progress in this important area of understanding and moderating online discourse. The dataset and evaluation protocol are publicly available, encouraging further research and development of effective tweet classification systems.
The cardiffnlp/tweeteval repository on GitHub (https://github.com/cardiffnlp/tweeteval/tree/main) is a comprehensive resource for the TweetEval benchmark, a unified evaluation framework for seven Twitter-related natural language processing tasks: emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection. The repository provides the datasets for these tasks, a leaderboard showcasing the performance of various models, and instructions for evaluating your own system, making it a valuable resource for researchers and practitioners working on social media analysis and user-generated content on platforms like Twitter.
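Each task in the repository is distributed as parallel plain-text files (one tweet per line, one label id per line). The exact file names below (`train_text.txt`, `train_labels.txt`, etc.) are an assumption based on the repo's datasets directory; check the repository for the authoritative layout. A minimal loader sketch:

```python
# Sketch: load one split of a TweetEval task from the repo's plain-text layout.
# File naming ({split}_text.txt / {split}_labels.txt) is an assumption — verify
# against the datasets directory in the cardiffnlp/tweeteval repository.
from pathlib import Path

def load_split(task_dir, split):
    """Return parallel lists of tweets and integer label ids for one split."""
    base = Path(task_dir)
    texts = (base / f"{split}_text.txt").read_text(encoding="utf-8").splitlines()
    labels = (base / f"{split}_labels.txt").read_text(encoding="utf-8").splitlines()
    assert len(texts) == len(labels), "text and label files must be parallel"
    return texts, [int(label) for label in labels]
```

Keeping texts and labels in separate line-aligned files makes the format trivially parseable in any language, at the cost of requiring the assertion above to catch misaligned files.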
The repository is organized into two main sections: (1) TweetEval: The Benchmark and (2) TweetEval: Leaderboard (Test set).
TweetEval: The Benchmark
These are the seven datasets of TweetEval, with their corresponding labels (more details about the format are in the datasets directory):
- Emotion Recognition: SemEval 2018 — Emotion Recognition (Mohammad et al., 2018) — 4 labels: anger, joy, sadness, optimism
- Emoji Prediction: SemEval 2018 — Emoji Prediction (Barbieri et al., 2018) — 20 labels: ❤️, 😍, 😂 ... 🌲, 📷, 😜
- Irony Detection: SemEval 2018 — Irony Detection (Van Hee et al., 2018) — 2 labels: irony, not irony
- Hate Speech Detection: SemEval 2019 — Hateval (Basile et al., 2019) — 2 labels: hateful, not hateful
- Offensive Language Identification: SemEval 2019 — OffensEval (Zampieri et al., 2019) — 2 labels: offensive, not offensive
- Sentiment Analysis*: SemEval 2017 — Sentiment Analysis in Twitter (Rosenthal et al., 2019) — 3 labels: positive, neutral, negative
- Stance Detection*: SemEval 2016 — Detecting Stance in Tweets (Mohammad et al., 2016) — 3 labels: favour, neutral, against
Note 1*: For stance, there are five different target topics (Abortion, Atheism, Climate Change, Feminism, and Hillary Clinton), each of which has its own training, validation, and test data.
Note 2*: The sentiment dataset was updated on 17 December 2020. The update was minimal and intended to fix a small number of tweets that had been cropped.
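The same seven tasks are also mirrored on the Hugging Face Hub under the `tweet_eval` dataset id, where — reflecting Note 1 above — stance is exposed as one configuration per target topic. The config names below are assumptions based on the hub listing; the `load_dataset` calls need network access, so they are shown commented out.

```python
# Sketch: TweetEval task names as Hugging Face dataset configs (assumed names —
# check the "tweet_eval" page on the Hugging Face Hub for the exact list).
STANCE_TOPICS = ["abortion", "atheism", "climate", "feminist", "hillary"]
CONFIGS = (["emoji", "emotion", "hate", "irony", "offensive", "sentiment"]
           + [f"stance_{topic}" for topic in STANCE_TOPICS])

# from datasets import load_dataset                       # requires network
# emotion = load_dataset("tweet_eval", "emotion")         # train/val/test splits
# stance = load_dataset("tweet_eval", "stance_abortion")  # one topic at a time

print(len(CONFIGS))  # 11: six single-dataset tasks + five stance topics
```

Loading stance one topic at a time mirrors the repository layout, where each target has its own train/validation/test split rather than a single merged dataset.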
TweetEval: Leaderboard (Test set)