The Sentiment140 Dataset: A Benchmark for Sentiment Classification

Mohamad Mahmood
Published in Lexiconia
Jun 21, 2024

Sentiment140 is a widely used dataset of 1.6 million tweets labeled for sentiment, created in 2009 by Stanford researchers to help develop and evaluate sentiment analysis models. The training set contains 800,000 positive and 800,000 negative tweets, labeled automatically from the emoticons they originally contained (distant supervision); it has no neutral tweets. A separate, much smaller test set of 498 manually labeled tweets adds a neutral class. Sentiment is encoded as 0 (negative), 2 (neutral), or 4 (positive). Over the past decade the dataset has become a standard benchmark for sentiment classification research in natural language processing.
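To make the automatic labeling concrete, here is a minimal sketch of emoticon-based distant supervision in the spirit of the paper listed under Reference. The emoticon lists and the function names distant_label and strip_emoticons are illustrative assumptions, not the authors' actual code.

```python
from typing import Optional

# Illustrative emoticon sets; the exact lists are an assumption for this sketch.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def distant_label(tweet: str) -> Optional[int]:
    """Return 4 (positive) or 0 (negative) based on emoticons.

    Returns None when the tweet has no emoticon or has both kinds,
    since neither case gives a usable noisy label.
    """
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        return None
    return 4 if has_pos else 0


def strip_emoticons(tweet: str) -> str:
    """Remove the emoticons so a model cannot simply memorize them."""
    for e in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        tweet = tweet.replace(e, "")
    # Collapse any whitespace left behind by the removal.
    return " ".join(tweet.split())


print(distant_label("great day :)"), strip_emoticons("great day :)"))
```

The label is derived from the emoticon and the emoticon is then removed from the text, which is why the training file name ends in "noemoticon".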

[1] Get dataset from the Hugging Face website

Get dataset

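# Both Parquet files are named 0000.parquet on the Hub, so save each one
# under a distinct local name; otherwise wget stores the second download
# as 0000.parquet.1 and the load step below reads the wrong file.
test_url='https://huggingface.co/datasets/stanfordnlp/sentiment140/resolve/refs%2Fconvert%2Fparquet/sentiment140/test/0000.parquet'
!wget -O sentiment140_test.parquet {test_url}

train_url='https://huggingface.co/datasets/stanfordnlp/sentiment140/resolve/refs%2Fconvert%2Fparquet/sentiment140/train/0000.parquet'
!wget -O sentiment140_train.parquet {train_url}

Load into dataframe

import pandas as pd

# Load the training split (1.6 million rows)
df_raw = pd.read_parquet('sentiment140_train.parquet')
df_raw.info()
df_raw.head()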

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   text       1600000 non-null  object
 1   date       1600000 non-null  object
 2   user       1600000 non-null  object
 3   sentiment  1600000 non-null  int32
 4   query      1600000 non-null  object
dtypes: int32(1), object(4)
memory usage: 54.9+ MB
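The sentiment column holds integer codes rather than names. A minimal sketch of mapping the codes to readable labels and checking the class balance follows; the helper add_label_column and the toy frame are hypothetical stand-ins so the snippet runs without the download, and on the real data you would pass df_raw instead.

```python
import pandas as pd

# Sentiment140 polarity codes: 0 = negative, 2 = neutral, 4 = positive.
LABEL_NAMES = {0: "negative", 2: "neutral", 4: "positive"}


def add_label_column(df: pd.DataFrame, code_col: str = "sentiment") -> pd.DataFrame:
    """Return a copy of df with a human-readable 'label' column."""
    out = df.copy()
    out["label"] = out[code_col].map(LABEL_NAMES)
    return out


# Toy stand-in for df_raw (made-up rows, same column names).
toy = pd.DataFrame(
    {
        "text": ["ugh, Monday", "love this song", "best day ever", "at the airport"],
        "sentiment": [0, 4, 4, 2],
    }
)

labeled = add_label_column(toy)
counts = labeled["label"].value_counts()
print(counts)
```

On the training split, the same count should show only the negative and positive classes, with 800,000 rows each.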

[2] Get dataset from the Stanford website

Get dataset

url='https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip'
!wget {url}
!unzip trainingandtestdata.zip

Load into dataframe

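import pandas as pd

# The CSV has no header row, so supply the column names explicitly.
cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']

# The file is not valid UTF-8; latin-1 avoids decoding errors.
df_processed = pd.read_csv('training.1600000.processed.noemoticon.csv',
                           header=None, names=cols, encoding='latin-1')
df_processed.info()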

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column        Non-Null Count    Dtype
---  ------        --------------    -----
 0   sentiment     1600000 non-null  int64
 1   id            1600000 non-null  int64
 2   date          1600000 non-null  object
 3   query_string  1600000 non-null  object
 4   user          1600000 non-null  object
 5   text          1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB
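The two sources differ slightly in layout: the CSV has an extra id column, names the query column query_string, and stores sentiment as int64, while the Parquet version has five columns and int32. If you want code written for one layout to work on the other, a sketch like the following could align them. The helper to_hf_layout and the toy rows are hypothetical; on the real data you would pass df_processed.

```python
import pandas as pd

# Column order of the Parquet (Hugging Face) version, as shown by info() above.
HF_COLUMNS = ["text", "date", "user", "sentiment", "query"]


def to_hf_layout(df_csv: pd.DataFrame) -> pd.DataFrame:
    """Rename, reorder, and downcast the CSV frame to match the Parquet layout."""
    out = df_csv.rename(columns={"query_string": "query"})
    out = out[HF_COLUMNS].copy()  # selecting these columns also drops 'id'
    out["sentiment"] = out["sentiment"].astype("int32")
    return out


# Toy stand-in for df_processed (made-up rows, same column names).
toy_csv = pd.DataFrame(
    {
        "sentiment": [0, 4],
        "id": [1, 2],
        "date": ["Mon Apr 06 22:19:45 PDT 2009", "Tue Apr 07 08:00:00 PDT 2009"],
        "query_string": ["NO_QUERY", "NO_QUERY"],
        "user": ["user_a", "user_b"],
        "text": ["ugh, Monday", "love this song"],
    }
)

aligned = to_hf_layout(toy_csv)
aligned.info()
```

Dropping id loses the tweet identifier, so keep the original frame around if you need to trace rows back to specific tweets.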

Reference:

Go, A., Bhayani, R., and Huang, L. (2009). “Twitter Sentiment Classification using Distant Supervision.” CS224N Project Report, Stanford. https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

Mohamad Mahmood
Lexiconia

Programming (Mobile, Web, Database and Machine Learning). Studies at the Center For Artificial Intelligence Technology (CAIT), FTSM, UKM, Malaysia.