How to Load Text Datasets Before They Give You Trouble
As you know, natural language processing starts with data preparation. That means searching for datasets on the web, downloading them, and writing a script to load them.
Unfortunately, you have had to write a different script for each dataset, because each one has its own directory structure, data format, and so on.
Now you don't need to do that anymore.
We developed a tool that makes text datasets easy to handle. It is called chazutsu (tea canister). Its features are: downloading datasets, loading them, giving easy access to them as pandas objects, and splitting them into train/test files.
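To make the train/test splitting feature concrete, here is a minimal sketch of what such a split does to a list of lines. It is not chazutsu's implementation: the function name split_train_test and its test_size and seed parameters are hypothetical, as is the tab-separated sample data.

```python
import random


def split_train_test(lines, test_size=0.3, seed=0):
    """Shuffle a copy of `lines` and cut it into (train, test) lists.

    Hypothetical helper for illustration only, not part of chazutsu.
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(lines)                 # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up samples: a label and a text separated by a tab (an assumed format).
samples = [f"1\treview text {i}" for i in range(10)]
train, test = split_train_test(samples, test_size=0.3)
```

Fixing the seed means the same samples land in the test set on every run, which keeps later evaluation results comparable.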
If you star the repository, it would be very encouraging!
Installation
To install chazutsu, simply:
$ pip install chazutsu
Satisfaction, guaranteed.
Usage
You need only one line (apart from the import) to download a dataset. As an example, let’s download the Movie Review Data. All you have to do is call the download method:
import chazutsu
r = chazutsu.datasets.MovieReview.polarity().download()
While downloading, you will see the following log. When the download is complete, “review_polarity_train.txt” and “review_polarity_test.txt” will have been created:
Make directory for downloading the file to /your/current/directory
Begin downloading the Moview Review Data dataset from http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz.
The dataset file is saved to /your/current/directory/data/moview_review_data/review_polarity.tar.gz
...
Done all process! Make below files at /your/current/directory/data/moview_review_data
review_polarity_test.txt
review_polarity_train.txt
Of course, chazutsu also supports data loading. You only need to call the train_data/test_data methods to access the data as a pandas object. You can also split the data into labels and text:
>>> r.train_data().head(5)
polarity review
0 0 plot : a little boy born in east germany ( nam...
1 0 when i arrived in paris in june , 1992 , i was...
2 0 idle hands is distasteful , crass and deriva...
3 0 phaedra cinema , the distributor of such never...
4 0 one-sided " doom and gloom " documentary about...
>>> target, data = r.train_data(split_target=True)
>>> target.head(3)
0 0
1 0
2 0
Name: polarity, dtype: int64
>>> data.head(3)
review
0 plot : a little boy born in east germany ( nam...
1 when i arrived in paris in june , 1992 , i was...
2 idle hands is distasteful , crass and deriva...
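Once the labels and texts are separated like this, a typical next step is to check the label balance and build a vocabulary. The sketch below shows one way to do that with the standard library only. The helpers label_counts and build_vocab and the min_freq parameter are hypothetical and not part of chazutsu, and the three sample reviews are stand-ins rather than real data. Since the reviews shown above are already space-separated, splitting on whitespace is assumed to be enough for tokenization.

```python
from collections import Counter


def label_counts(labels):
    """Count how many examples carry each label."""
    return Counter(labels)


def build_vocab(texts, min_freq=1):
    """Map each whitespace-separated token to an integer id, most frequent first.

    Hypothetical helper for illustration only, not part of chazutsu.
    """
    freq = Counter(token for text in texts for token in text.split())
    kept = [tok for tok, n in freq.most_common() if n >= min_freq]
    return {tok: i for i, tok in enumerate(kept)}


# Stand-in data shaped like the output above: integer labels and
# pre-tokenized, lower-cased review strings.
labels = [0, 0, 1]
reviews = [
    "plot : a little boy",
    "when i arrived in paris",
    "a little gem of a film",
]

counts = label_counts(labels)
vocab = build_vocab(reviews, min_freq=2)  # keep tokens seen at least twice
```

Both helpers accept any iterable, so a pandas Series such as target or data["review"] from the session above should work in place of the plain lists.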
There are other useful features. For more information, check the repository.
Supported Datasets
The currently supported datasets are as follows:
Sentiment Analysis
Text classification
Language Modeling
By using chazutsu as shown above, you can focus on developing natural language processing models without having to prepare the data yourself. I hope chazutsu helps your work go well!