Loading and Exploring a Dataset from Hugging Face

Yuan An, PhD
3 min read · Sep 27, 2023


This is part of a series of short tutorials on using Hugging Face. The table of contents is here.

In this lesson, we will learn how to work with Hugging Face datasets.

About Hugging Face Datasets

In addition to the “transformers” library, Hugging Face also offers the “datasets” library, which simplifies loading and processing datasets for machine learning and NLP tasks.

The library provides access to a large number of datasets from various domains, letting researchers and developers use them without the hassle of manual downloading and preprocessing.

One of the significant advantages of the “datasets” library is its unified API. No matter the original format or structure of the data, users can load and manipulate datasets in a consistent manner.

The library is built on Apache Arrow, so datasets are memory-mapped from disk rather than loaded wholesale into RAM, which keeps loading efficient and fast even for large datasets.
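For very large datasets, the library also offers a streaming mode that reads records on demand instead of downloading everything up front. Here is a minimal sketch, using the tweet dataset we load later in this lesson (and assuming it supports streaming):

from datasets import load_dataset

# Stream records on demand instead of downloading the full dataset first
streamed = load_dataset("SetFit/tweet_sentiment_extraction", split="train", streaming=True)

# Fetch and inspect just the first record
print(next(iter(streamed)))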

The library offers tools for dataset preprocessing, such as tokenization, which is crucial for NLP tasks. It integrates well with the Hugging Face “transformers” library, making it seamless to prep data for model training.
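As a quick preview of how this fits together, a typical tokenization pass uses map() over the dataset; the checkpoint “bert-base-uncased” below is just an illustrative choice, and the “text” column is the one from the tweet dataset used in this lesson:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("SetFit/tweet_sentiment_extraction")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Tokenize the "text" column in batches for speed
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)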

In addition to public datasets, the library also allows users to load their own custom datasets and process them in the same consistent manner.
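For example, a local CSV file can be loaded through the same load_dataset() function; the file name “my_tweets.csv” below is just a placeholder:

from datasets import load_dataset

# Load a hypothetical local CSV file with the same API used for Hub datasets
custom_dataset = load_dataset("csv", data_files="my_tweets.csv")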

It supports dataset versioning, ensuring that users can specify and access particular versions of datasets when needed.
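For datasets on the Hub, a particular version can be pinned with the revision argument, which accepts a branch name, tag, or commit hash (“main” below is simply the default branch):

from datasets import load_dataset

# Pin the dataset to a specific revision (branch, tag, or commit hash)
dataset = load_dataset("SetFit/tweet_sentiment_extraction", revision="main")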

To get started with the Hugging Face “datasets” library, it can be installed using pip:

!pip install datasets

Load a Tweet Dataset for Sentiment Analysis

To find a dataset, we visit the Hugging Face Datasets page and type ‘tweet sentiment’ in the search box.

There is a list of datasets matching our search criteria. We will explore the ‘SetFit/tweet_sentiment_extraction’ dataset. We first import the load_dataset() function from ‘datasets’ and then load the dataset:

from datasets import load_dataset

dataset = load_dataset("SetFit/tweet_sentiment_extraction")

List the Metadata and Content of the Dataset

We can list the metadata of the object we just loaded as:

dataset

The result shows:

DatasetDict({
    train: Dataset({
        features: ['textID', 'text', 'label', 'label_text'],
        num_rows: 27481
    })
    test: Dataset({
        features: ['textID', 'text', 'label', 'label_text'],
        num_rows: 3534
    })
})

The object we just loaded is a DatasetDict, a dictionary-like container with two entries: the key “train” maps to a Dataset with 27,481 rows, and the key “test” maps to a Dataset with 3,534 rows.

Both the “train” and “test” splits share the same set of features (columns). We can rename a column using the rename_column() method:

dataset = dataset.rename_column('label_text', 'label_name')
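We can confirm the rename by listing the column names of a split:

# 'label_text' has been renamed to 'label_name'
print(dataset['train'].column_names)
# ['textID', 'text', 'label', 'label_name']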

If you prefer to explore the “train” and “test” sets with Pandas, you can convert a split into a DataFrame:

import pandas as pd

train_df = pd.DataFrame(dataset["train"])
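The “test” split can be converted in the same way, and the usual DataFrame tools then apply:

test_df = pd.DataFrame(dataset["test"])

# Peek at the first few training examples
train_df.head()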

Index and Slice a Dataset

To retrieve a single record, we can index into a split:

# To retrieve the second record from the train dataset
dataset['train'][1]

We can slice a range of records using standard Python slice syntax:

# To retrieve the last 50 records from the test dataset
dataset['test'][-50:]

Create a Train and Test Set from a Dataset

This dataset already comes with “train” and “test” splits. In other cases, we need to split a given dataset into “train” and “test” sets ourselves. We can use the train_test_split() method:

# We split the original "train" set into additional "train" and "test" sets
dataset_split = dataset['train'].train_test_split(test_size=0.2)

As a result, dataset_split[“train”] contains the following features and number of records/rows:

Dataset({
    features: ['textID', 'text', 'label', 'label_name'],
    num_rows: 21984
})

And dataset_split[“test”] contains the following features and number of records/rows:

Dataset({
    features: ['textID', 'text', 'label', 'label_name'],
    num_rows: 5497
})
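If the split needs to be reproducible across runs, train_test_split() also accepts a seed argument (the value 42 below is arbitrary):

# Fix the random seed so the same records land in "train" and "test" every time
dataset_split = dataset['train'].train_test_split(test_size=0.2, seed=42)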

The Colab notebook is available here:
