TweetEval — A Standardized Evaluation Setup for Diverse Tweet Classification Tasks

Mohamad Mahmood · Published in Lexiconia · Jun 29, 2024

TweetEval is a unified benchmark and comparative evaluation framework for tweet classification. Its key objective is to provide a standardized evaluation setup that makes it possible to compare different tweet classification models on an equal footing. The benchmark consolidates and curates tweets from various existing datasets, covering seven common tweet classification tasks: emoji prediction, emotion recognition, hate speech detection, irony detection, offensive language identification, sentiment analysis, and stance detection. In total, the TweetEval dataset contains well over 150,000 labeled tweets, with dataset sizes that vary considerably from task to task, covering a diverse range of topics and styles representative of real-world Twitter data.

The TweetEval paper defines a common evaluation protocol across the seven tasks, reporting a task-specific metric, macro-averaged F1 in most cases, for each. The paper also provides baseline results for a range of models, from linear classifiers such as SVMs and FastText up to transformer language models such as RoBERTa, including a RoBERTa variant further pre-trained on Twitter data. The benchmarking results demonstrate that there is still room for improvement: even the strongest baselines leave a clear gap to ceiling performance on several of the tasks.
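The repository ships a reference evaluation script, but the core of the protocol is simply comparing predicted labels against gold labels with the task's metric. Here is a minimal sketch of that scoring step using scikit-learn; the file names are placeholders for your own gold and prediction files, and this is not the repository's official script.

```python
# Minimal sketch of a TweetEval-style scoring step, using scikit-learn.
# The file names below are placeholders, not files shipped by the repo.
from sklearn.metrics import accuracy_score, f1_score

def load_labels(path):
    """Read one label per line, as in the TweetEval *_labels.txt files."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

gold = load_labels("test_labels.txt")      # gold labels for one task
pred = load_labels("my_predictions.txt")   # your system's predictions

print("accuracy :", accuracy_score(gold, pred))
print("macro-F1 :", f1_score(gold, pred, average="macro"))
```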

By serving as a comprehensive, standardized evaluation framework, TweetEval is a significant contribution to research on tweet classification. It allows researchers to compare the performance of different models on the same data under the same protocol, facilitating progress in understanding and moderating online discourse. The dataset and evaluation protocol are publicly available, encouraging further research and the development of effective tweet classification systems.

The cardiffnlp/tweeteval repository on GitHub (https://github.com/cardiffnlp/tweeteval/tree/main) is a comprehensive resource for the TweetEval benchmark, a unified evaluation framework for seven Twitter-related natural language processing tasks: emotion recognition, emoji prediction, irony detection, hate speech detection, offensive language identification, sentiment analysis, and stance detection. The repository provides the datasets for these tasks, a leaderboard showcasing the performance of various models, and instructions for evaluating your own system. This makes it a valuable resource for researchers and practitioners working on social media analysis and on understanding user-generated content on platforms like Twitter.
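The same data is also mirrored on the Hugging Face Hub as the tweet_eval dataset (a detail beyond the repository description above), which offers a convenient way to load any single task. A minimal sketch:

```python
# Load one TweetEval task through the Hugging Face `datasets` library,
# which mirrors the GitHub data under the dataset name "tweet_eval".
from datasets import load_dataset

# Other configs include "hate", "irony", "offensive", "sentiment",
# "emoji", and per-topic stance configs such as "stance_abortion".
emotion = load_dataset("tweet_eval", "emotion")

print(emotion)              # DatasetDict with train / validation / test splits
example = emotion["train"][0]
print(example)              # {'text': '...', 'label': ...}

# Map the integer label back to its human-readable name.
label_names = emotion["train"].features["label"].names
print(label_names[example["label"]])
```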

The repository contains the following two parts: (1) TweetEval: The Benchmark and (2) TweetEval: Leaderboard (Test set).

TweetEval: The Benchmark

These are the seven datasets of TweetEval, with their corresponding labels (more details about the format in the datasets directory):

- Emoji prediction, 20 labels (the twenty most frequent emojis)
- Emotion recognition, 4 labels: anger, joy, optimism, sadness
- Hate speech detection, 2 labels: hateful, not hateful
- Irony detection, 2 labels: irony, not irony
- Offensive language identification, 2 labels: offensive, not offensive
- Sentiment analysis, 3 labels: positive, neutral, negative
- Stance detection*, 3 labels: favour, none, against

Note 1*: For stance there are five different target topics (Abortion, Atheism, Climate change, Feminism and Hillary Clinton), each of which contains its own training, validation and test data.

Note 2*: The sentiment dataset has been updated as of 17 December 2020. The update has been minimal and it was intended to fix a small number of sentences that were cropped.
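In a local clone of the repository, each task lives in its own datasets/&lt;task&gt;/ directory with plain-text files per split ({split}_text.txt and {split}_labels.txt) plus a mapping.txt from integer labels to names. Assuming that layout, a minimal loader might look like this:

```python
# Sketch: read one TweetEval task from a local clone of the GitHub repo.
# Assumes the layout datasets/<task>/{train,val,test}_text.txt, matching
# *_labels.txt files, and a tab-separated mapping.txt of id -> label name.
from pathlib import Path

def load_split(task_dir: Path, split: str):
    texts = (task_dir / f"{split}_text.txt").read_text(encoding="utf-8").splitlines()
    labels = (task_dir / f"{split}_labels.txt").read_text(encoding="utf-8").splitlines()
    return list(zip(texts, (int(l) for l in labels)))

task_dir = Path("tweeteval/datasets/emotion")  # path to your local clone
mapping = dict(
    line.split("\t")
    for line in (task_dir / "mapping.txt").read_text(encoding="utf-8").splitlines()
    if line.strip()
)

train = load_split(task_dir, "train")
text, label = train[0]
print(f"{text!r} -> {mapping[str(label)]}")
```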

TweetEval: Leaderboard (Test set)

The leaderboard itself is maintained in the repository README, which reports each system's test-set score for every task alongside an overall average used for the ranking.
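Many of the stronger entries come from Twitter-adapted RoBERTa models that the Cardiff NLP group released on the Hugging Face Hub alongside the benchmark. As an illustrative (not article-provided) example, one of those fine-tuned checkpoints can be tried directly with the transformers pipeline:

```python
# Try a TweetEval fine-tuned checkpoint from the Hugging Face Hub
# (published under the cardiffnlp organization alongside the benchmark).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-emotion",
)

result = classifier("I can't believe the match got cancelled again...")
print(result)
# Depending on the model config, labels may print as names (e.g. "sadness")
# or as generic ids (LABEL_0..LABEL_3); mapping.txt in the repo gives the names.
```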

