The Sentiment140 Dataset: A Benchmark for Sentiment Classification

Published in Lexiconia · 2 min read · Jun 21, 2024

Sentiment140 is a widely used dataset of 1.6 million tweets labeled for sentiment, created in 2009 by Stanford researchers to help develop and evaluate sentiment analysis models. The training set contains 800,000 positive and 800,000 negative tweets, labeled automatically from the emoticons they contained (distant supervision). A separate, much smaller test set of 498 manually labeled tweets also includes a neutral class. The dataset has become a standard benchmark for sentiment classification research in natural language processing.

[1] Get dataset from Hugging Face website

Get dataset

# Both files are named 0000.parquet on the server, so save them under distinct names
test_url='https://huggingface.co/datasets/stanfordnlp/sentiment140/resolve/refs%2Fconvert%2Fparquet/sentiment140/test/0000.parquet'
!wget -O test.parquet {test_url}

train_url='https://huggingface.co/datasets/stanfordnlp/sentiment140/resolve/refs%2Fconvert%2Fparquet/sentiment140/train/0000.parquet'
!wget -O train.parquet {train_url}

Load into dataframe

import pandas as pd

# Load the training split (1.6 million rows)
df_raw = pd.read_parquet('train.parquet')
df_raw.info()
df_raw

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   text       1600000 non-null  object
 1   date       1600000 non-null  object
 2   user       1600000 non-null  object
 3   sentiment  1600000 non-null  int32 
 4   query      1600000 non-null  object
dtypes: int32(1), object(4)
memory usage: 54.9+ MB
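
To check how the classes are balanced, the integer codes in the sentiment column can be mapped to names and counted. The sketch below assumes the encoding used by the original Stanford release (0 = negative, 2 = neutral, 4 = positive) also holds for the Hugging Face copy. `label_distribution` and the three-row sample frame are hypothetical, used only for illustration.

```python
import pandas as pd

# Assumed encoding, taken from the original Stanford release.
LABEL_NAMES = {0: "negative", 2: "neutral", 4: "positive"}

def label_distribution(df, column="sentiment"):
    """Return the number of rows per sentiment name as a dict."""
    names = df[column].map(LABEL_NAMES)
    return names.value_counts().to_dict()

# Made-up sample rows standing in for the real dataframe.
sample = pd.DataFrame({
    "text": ["love it", "hate it", "so good"],
    "sentiment": [4, 0, 4],
})
print(label_distribution(sample))
```

On the real data the call would be `label_distribution(df_raw)`.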

[2] Get dataset from StanfordNLP website

Get dataset

url='https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip'
!wget {url}
!unzip trainingandtestdata.zip
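
The `!wget` and `!unzip` lines are IPython shell escapes, so they only work inside a notebook. Outside a notebook, the same two steps could be done with the Python standard library, as in the sketch below. `extract_archive`, `fetch_and_extract`, and the `data` directory are hypothetical names chosen for illustration.

```python
import urllib.request
import zipfile
from pathlib import Path

URL = "https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip"

def extract_archive(zip_path, dest_dir="data"):
    """Unpack a zip archive into dest_dir and return the archived file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()

def fetch_and_extract(url=URL, zip_path="trainingandtestdata.zip", dest_dir="data"):
    """Download the archive unless it is already present, then unpack it."""
    if not Path(zip_path).exists():
        urllib.request.urlretrieve(url, zip_path)
    return extract_archive(zip_path, dest_dir)

# Uncomment to download and unpack the real archive:
# print(fetch_and_extract())
```

With this variant the CSV files end up under `data/`, so the path passed to `pd.read_csv` would need the same prefix.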

Load into dataframe

import pandas as pd

# The CSV has no header row, so the column names are supplied here
cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']
df_processed = pd.read_csv('training.1600000.processed.noemoticon.csv',
                           header=None, names=cols, encoding='latin-1')
df_processed.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column        Non-Null Count    Dtype 
---  ------        --------------    ----- 
 0   sentiment     1600000 non-null  int64 
 1   id            1600000 non-null  int64 
 2   date          1600000 non-null  object
 3   query_string  1600000 non-null  object
 4   user          1600000 non-null  object
 5   text          1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB
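
The two sources use slightly different layouts: the CSV frame has an extra `id` column and names the query column `query_string`, while the parquet frame calls it `query` and stores sentiment as int32. If the frames need to be compared or concatenated, one option is to bring the CSV frame to the parquet layout, as sketched below. `to_parquet_layout` is a hypothetical helper, and the one-row sample is made up.

```python
import pandas as pd

# Column order reported by df_raw.info() for the parquet file.
PARQUET_COLUMNS = ["text", "date", "user", "sentiment", "query"]

def to_parquet_layout(df_csv):
    """Rename, reorder, and downcast a CSV-style frame to match the parquet columns."""
    out = df_csv.rename(columns={"query_string": "query"})
    out = out[PARQUET_COLUMNS].copy()  # selecting columns drops the extra id column
    out["sentiment"] = out["sentiment"].astype("int32")
    return out

# Made-up row in the CSV layout, for illustration only.
sample_csv = pd.DataFrame({
    "sentiment": [0],
    "id": [1],
    "date": ["Mon Apr 06 22:19:45 PDT 2009"],
    "query_string": ["NO_QUERY"],
    "user": ["example_user"],
    "text": ["example tweet"],
})
print(to_parquet_layout(sample_csv).dtypes)
```

On the real data the call would be `to_parquet_layout(df_processed)`.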

Reference:

Go, A., Bhayani, R., and Huang, L. (2009). “Twitter Sentiment Classification using Distant Supervision.” https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

Written by Mohamad Mahmood