TorchText: A Not-So-Popular Library
What is TorchText?
If you have done some basic preprocessing tasks in NLP, you are probably aware of NLTK, spaCy, TextBlob, and Stanford NLP; all of these are powerful libraries for text preprocessing. In the PyTorch ecosystem, however, there is a dedicated library that takes care of every text preprocessing step, from loading the data to embedding the text.
from torchtext.legacy.data import Field , TabularDataset
from torchtext.legacy.data import BucketIterator
Field specifies how the preprocessing should be done, TabularDataset loads datasets from different file formats such as JSON, CSV, TSV, and plain text, and BucketIterator handles batching and padding for a given dataset. Let's take an example to understand the library. Let's get started!
#overview of data
import pandas as pd
data=pd.read_csv('train.csv')
data.head()
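If you want to follow along without the original files, a tiny train.csv and test.csv can be created with the standard library. This is hypothetical toy data; the column names comments and score are assumptions chosen to match the fields dictionary used below.

```python
import csv

# Hypothetical toy data; column names match the fields dict used later.
rows = [
    {"comments": "this movie was great", "score": 1},
    {"comments": "terrible plot and bad acting", "score": 0},
    {"comments": "a fun watch overall", "score": 1},
]

for name in ("train.csv", "test.csv"):
    with open(name, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["comments", "score"])
        writer.writeheader()  # first row: comments,score
        writer.writerows(rows)
```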
tokens = lambda x: x.split()
comment = Field(sequential=True, use_vocab=True, tokenize=tokens, lower=True)
score = Field(sequential=False, use_vocab=False)
fields = {'comments': ('c', comment), 'score': ('s', score)}
If sequential is True, the field's text will be tokenized; otherwise tokenization won't be performed. The keys of the fields dictionary must match the column names exactly as they appear in the dataset ('comments' and 'score' here), while the first element of each tuple ('c', 's') becomes the attribute name on each loaded example.
train_set,test_set=TabularDataset.splits(path='/content',
format='csv',
train='train.csv',
test='test.csv',
fields=fields)
With the help of TabularDataset.splits I loaded the train and test sets from their respective CSV files (note that splits here loads two separate files rather than splitting a single dataset).
comment.build_vocab(train_set,
                    max_size=100,
                    min_freq=2)
Here max_size lets us cap the vocabulary size: we build the vocabulary for the 100 most frequently occurring words only, and min_freq=2 restricts it further to words that occur at least twice in the dataset.
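What build_vocab does can be sketched in plain Python: count the tokens, drop those below min_freq, and keep at most the max_size most frequent. The function and variable names here are illustrative, not torchtext's internals.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, max_size=100, min_freq=2):
    """Toy sketch of vocabulary building (not torchtext's actual code)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep tokens meeting min_freq, most frequent first, capped at max_size.
    frequent = [tok for tok, c in counts.most_common() if c >= min_freq]
    itos = ["<unk>", "<pad>"] + frequent[:max_size]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

texts = [["the", "movie", "was", "great"],
         ["the", "plot", "was", "bad"],
         ["bad", "acting"]]
stoi, itos = build_vocab_sketch(texts, max_size=100, min_freq=2)
print(itos)  # the two specials plus every token occurring at least twice
```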
train_iter, test_iter = BucketIterator.splits((train_set, test_set),
                                              batch_size=2)
for batch in train_iter:
    print(batch.c, batch.s)
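Conceptually, BucketIterator groups examples of similar length into the same batch and pads each batch only up to its own longest sequence, which minimizes wasted padding. A hedged plain-Python sketch of that idea (not torchtext's implementation):

```python
def bucket_batches(examples, batch_size=2, pad_token="<pad>"):
    """Toy sketch: sort by length so similar-length texts share a batch,
    then pad each batch only to its own longest sequence."""
    ordered = sorted(examples, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        max_len = max(len(seq) for seq in batch)
        batches.append([seq + [pad_token] * (max_len - len(seq))
                        for seq in batch])
    return batches

examples = [["a", "b"], ["a"], ["a", "b", "c", "d"], ["a", "b", "c"]]
for batch in bucket_batches(examples):
    print(batch)
```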
Conclusion:-
That's all from my side. Please leave your valuable suggestions in the comment box. Full credit for this blog goes to Aladdin Persson.