How to load text datasets before you’re in trouble with them

Hiroki Nakayama
Published in chakki
2 min read · May 22, 2017

As you know, natural language processing starts with data preparation. That means searching for datasets on the web, downloading them, and writing scripts to load them.

Unfortunately, you have had to write a different script for each dataset, because each one has its own directory structure, data format, and so on.

Now, We Don’t Need to Do It.

We developed a tool that makes text datasets easy to handle. It is called chazutsu (tea canister). Its features are: downloading datasets, loading them, accessing them easily as pandas objects, and splitting them into train/test files.

If you star the repository, it will be very encouraging!

Installation

To install chazutsu, simply:

$ pip install chazutsu

Satisfaction, guaranteed.

Usage

You need only one line (besides the import) to download a dataset. As an example, let’s download the Movie Review Data. All you have to do is call the download method:

import chazutsu
r = chazutsu.datasets.MovieReview.polarity().download()

While downloading, you will see the following log. When the download is complete, you will find that “review_polarity_train.txt” and “review_polarity_test.txt” have been created:

Make directory for downloading the file to /your/current/directory
Begin downloading the Moview Review Data dataset from http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz.
The dataset file is saved to /your/current/directory/data/moview_review_data/review_polarity.tar.gz
...
Done all process! Make below files at /your/current/directory/data/moview_review_data
review_polarity_test.txt
review_polarity_train.txt
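
For comparison, here is a minimal sketch of what those steps look like when written by hand with only the standard library. It is not chazutsu's implementation: the `pos`/`neg` directory layout, the helper names (`make_tiny_archive`, `read_labeled_rows`, `split_rows`) and the 25% test size are all assumptions made for illustration, and a tiny in-memory archive stands in for the real download.

```python
import io
import random
import tarfile


def make_tiny_archive():
    """Build a small in-memory .tar.gz with an assumed pos/neg directory layout."""
    samples = {
        "reviews/pos/a.txt": "a wonderful film",
        "reviews/pos/b.txt": "great acting throughout",
        "reviews/neg/c.txt": "a dull plot",
        "reviews/neg/d.txt": "crass and derivative",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in samples.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_labeled_rows(fileobj):
    """Return (label, text) pairs; the label is taken from the parent directory."""
    rows = []
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            directory = member.name.split("/")[-2]
            label = 1 if directory == "pos" else 0
            text = archive.extractfile(member).read().decode("utf-8").strip()
            rows.append((label, text))
    return rows


def split_rows(rows, test_size=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = read_labeled_rows(make_tiny_archive())
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Every dataset with a different layout or format would need its own variant of `read_labeled_rows`, which is the repetitive part a shared tool can take over.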

Of course, chazutsu also supports data loading. You only need to call the train_data/test_data methods to access the data as a pandas object. You can also split the data into labels and text:

>>> r.train_data().head(5)
   polarity                                             review
0         0  plot : a little boy born in east germany ( nam...
1         0  when i arrived in paris in june , 1992 , i was...
2         0  idle hands is distasteful , crass and deriva...
3         0  phaedra cinema , the distributor of such never...
4         0  one-sided " doom and gloom " documentary about...
>>> target, data = r.train_data(split_target=True)
>>> target.head(3)
0    0
1    0
2    0
Name: polarity, dtype: int64
>>> data.head(3)
                                              review
0  plot : a little boy born in east germany ( nam...
1  when i arrived in paris in june , 1992 , i was...
2  idle hands is distasteful , crass and deriva...
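
To make the shapes above concrete, here is a rough pandas-only sketch of what a label/text split amounts to. The helper `split_target` and the three-row sample frame are hypothetical and only mirror the `polarity` and `review` columns shown above; this is not how chazutsu implements `split_target=True` internally.

```python
import pandas as pd

# A tiny stand-in frame with the same column names as the output above.
frame = pd.DataFrame(
    {
        "polarity": [0, 0, 1],
        "review": ["a dull plot", "crass and derivative", "a wonderful film"],
    }
)


def split_target(frame, target_column="polarity"):
    """Return the label column as a Series and the remaining columns as a DataFrame."""
    target = frame[target_column]
    data = frame.drop(columns=[target_column])
    return target, data


target, data = split_target(frame)
print(target.name)
print(list(data.columns))
```

As in the session above, the labels come back as a Series named after the target column, and the text stays in a DataFrame.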

There are other useful features. For more information, check the repository.

Supported Datasets

The currently supported datasets are as follows:

Sentiment Analysis

Text Classification

Language Modeling

By using chazutsu as shown above, you can focus on developing natural language processing models without having to prepare the data yourself. I hope chazutsu helps your work go well!

If you star the repository, it will be very encouraging!

Hiroki Nakayama
Open source developer. Interested in machine learning and natural language processing. GitHub: https://github.com/Hironsan