Loading and Exploring a Dataset from Hugging Face
This is a series of short tutorials about using Hugging Face. The table of contents is here.
In this lesson, we will learn how to work with Hugging Face datasets.
About Hugging Face Datasets
In addition to the “transformers” library, Hugging Face also offers the “datasets” library, which aims to facilitate the process of loading and processing datasets for machine learning and NLP.
The library provides access to a large number of datasets from various domains, making it easier for researchers and developers to access and use them without the hassles of manual data downloading and preprocessing.
One of the significant advantages of the “datasets” library is its unified API. No matter the original format or structure of the data, users can load and manipulate datasets in a consistent manner.
The library allows for efficient and fast data loading, especially with large datasets.
The library offers tools for dataset preprocessing, such as tokenization, which is crucial for NLP tasks. It integrates well with the Hugging Face “transformers” library, making it seamless to prep data for model training.
In addition to public datasets, the library also allows users to load their own custom datasets and process them in the same consistent manner.
It supports dataset versioning, ensuring that users can specify and access particular versions of datasets when needed.
To get started with the Hugging Face “datasets” library, it can be installed using pip:
!pip install datasets
Load a Tweet Dataset for Sentiment Analysis
To find a dataset, we go to the Hugging Face datasets hub (https://huggingface.co/datasets) and type ‘tweet sentiment’ in the search box.
There is a list of datasets matching our search criteria. We will explore the ‘SetFit/tweet_sentiment_extraction’ dataset. We first import the load_dataset() function from ‘datasets’ and then load the dataset:
from datasets import load_dataset
dataset = load_dataset("SetFit/tweet_sentiment_extraction")
List the Metadata and Content of the Dataset
We can list the metadata of the object we just loaded by simply evaluating it:
dataset
The result shows:
DatasetDict({
train: Dataset({
features: ['textID', 'text', 'label', 'label_text'],
num_rows: 27481
})
test: Dataset({
features: ['textID', 'text', 'label', 'label_text'],
num_rows: 3534
})
})
It indicates that the object we just loaded is a DatasetDict, a dictionary-like container of splits: the key “train” maps to a Dataset of 27,481 rows and the key “test” maps to a Dataset of 3,534 rows.
Both the “train” and “test” splits share the same set of features (columns). We can rename a column with the rename_column() function, which applies the change to every split:
dataset = dataset.rename_column('label_text', 'label_name')
If you prefer to explore the “train” or “test” split with Pandas, you can convert it to a DataFrame:
import pandas as pd
train_df = pd.DataFrame(dataset["train"])
Index and Slice a Dataset
To retrieve a single record from a dataset, we can index a split with an integer:
# To retrieve the second record from the train dataset
dataset['train'][1]
We can retrieve a range of records using Python slice syntax:
# To retrieve the last 50 records from the test dataset
dataset['test'][-50:]
Create a Train and Test Set from a Dataset
This dataset already comes split into “train” and “test” sets. In other cases, we need to split a dataset ourselves. We can use the train_test_split() function, where test_size=0.2 sends 20% of the records to the new “test” split:
# We split the original "train" set into additional "train" and "test" sets
dataset_split = dataset['train'].train_test_split(test_size=0.2)
As a result, dataset_split[“train”] contains the following features and number of records/rows:
Dataset({
features: ['textID', 'text', 'label', 'label_name'],
num_rows: 21984
})
And dataset_split[“test”] contains the following features and number of records/rows:
Dataset({
features: ['textID', 'text', 'label', 'label_name'],
num_rows: 5497
})
The colab notebook is available here:
The table of contents of the entire course is here: https://medium.com/@anyuanay/tutorials-on-working-with-hugging-face-models-and-datasets-a01dea1f1a81