How to load text datasets before you’re in trouble with them

Published in chakki · 2 min read · May 22, 2017

As you know, natural language processing starts with data preparation: searching for datasets on the web, downloading them, and writing scripts to load them.

Unfortunately, you have had to write a different script for each dataset, because each one has its own directory structure, data format, and so on.

Now, we don’t need to do that anymore.

We developed a tool that makes text datasets easy to handle. It is called chazutsu (tea canister). Its features are: downloading datasets, loading them, giving easy access to them as pandas objects, and splitting them into train/test files.

chakki-works/chazutsu

chazutsu - The tool to make NLP datasets ready to use

github.com

If you star the repository, it would be very encouraging!

Installation

To install chazutsu, simply:

$ pip install chazutsu

Satisfaction, guaranteed.

Usage

You need only one line (besides the import) to download a dataset. As an example, let’s download the Movie Review Data. All you have to do is call the download method:

import chazutsu
r = chazutsu.datasets.MovieReview.polarity().download()

During the download, you will see the following log. When the download completes, “review_polarity_train.txt” and “review_polarity_test.txt” have been created:

Make directory for downloading the file to /your/current/directory
Begin downloading the Moview Review Data dataset from http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz.
The dataset file is saved to /your/current/directory/data/moview_review_data/review_polarity.tar.gz
...
Done all process! Make below files at /your/current/directory/data/moview_review_data
 review_polarity_test.txt
 review_polarity_train.txt
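
If a later script depends on these files, it can be handy to confirm they exist before loading them. The following is a minimal sketch using only the Python standard library; it does not call chazutsu. The helper name `missing_files` is hypothetical, the file names are taken from the log above, and the demonstration uses a temporary directory rather than a real download.

```python
from pathlib import Path
import tempfile

# File names taken from the log above; where they live on disk depends on your setup.
EXPECTED_FILES = ["review_polarity_train.txt", "review_polarity_test.txt"]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Hypothetical helper: return the expected file names not found in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration against a temporary directory instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    # Create only the train file, so the test file is reported as missing.
    (Path(tmp) / "review_polarity_train.txt").write_text("placeholder\n")
    print(missing_files(tmp))
```

An empty list means every expected file is present.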

Of course, chazutsu also supports loading the data. You only need to call the train_data/test_data method to access it as a pandas object. You can also split the data into labels and text:

>>> r.train_data().head(5)
   polarity                                             review
0         0  plot : a little boy born in east germany ( nam...
1         0  when i arrived in paris in june , 1992 , i was...
2         0   idle hands  is distasteful , crass and deriva...
3         0  phaedra cinema , the distributor of such never...
4         0  one-sided " doom and gloom " documentary about...
>>> target, data = r.train_data(split_target=True)
>>> target.head(3)
0    0
1    0
2    0
Name: polarity, dtype: int64
>>> data.head(3)
                                              review
0  plot : a little boy born in east germany ( nam...
1  when i arrived in paris in june , 1992 , i was...
2   idle hands  is distasteful , crass and deriva...
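
To make the shape of the two return values concrete, here is a minimal pandas sketch of an equivalent split. It does not call chazutsu: `split_target_sketch` and the toy rows are assumptions for illustration only, and chazutsu’s own implementation may differ.

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by r.train_data(); the rows are made up.
df = pd.DataFrame({
    "polarity": [0, 1, 0],
    "review": ["a dull plot", "a charming film", "crass and derivative"],
})


def split_target_sketch(frame, target_column="polarity"):
    """Hypothetical helper: separate the label column from the remaining columns."""
    target = frame[target_column]               # pandas Series holding the labels
    data = frame.drop(columns=[target_column])  # DataFrame with everything else
    return target, data


target, data = split_target_sketch(df)
print(target.tolist())
print(list(data.columns))
```

As in the session above, the labels come back as a Series named after the label column, while the text stays in a DataFrame.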

There are other useful features. For more information, check the repository.

Supported Datasets

The currently supported datasets are as follows:

Sentiment Analysis

Text classification

Language Modeling

By using chazutsu as shown above, you can focus on developing natural language processing models without spending time on data preparation. I hope chazutsu helps your work go well!



Written by Hiroki Nakayama