TorchText: A Not-So-Popular Library
What is TorchText?
If you have done some basic preprocessing tasks in NLP, you are probably aware of NLTK, spaCy, TextBlob, and Stanford NLP; all of these are powerful libraries for text preprocessing. In the PyTorch ecosystem, however, there is a dedicated library that takes care of every text preprocessing step, from loading the data to embedding the text.
from torchtext.legacy.data import Field , TabularDataset
from torchtext.legacy.data import BucketIterator
Field specifies how the preprocessing should be done, TabularDataset loads datasets from different file formats such as JSON, CSV, TSV, and plain text, and BucketIterator handles batching and padding for a given dataset. Let's take an example to understand the library. Let's get started!
#overview of data
import pandas as pd
data=pd.read_csv('train.csv')
data.head()
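If you want to follow along without the original files, a tiny train.csv and test.csv can be created with the standard library. This is hypothetical toy data; the column names comments and score are assumptions chosen to match the fields dictionary used below.

```python
import csv

# Hypothetical toy data; column names match the fields dict used later.
rows = [
    {"comments": "this movie was great", "score": 1},
    {"comments": "terrible plot and bad acting", "score": 0},
    {"comments": "a fun watch overall", "score": 1},
]

for name in ("train.csv", "test.csv"):
    with open(name, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["comments", "score"])
        writer.writeheader()  # first row: comments,score
        writer.writerows(rows)
```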
tokens = lambda x: x.split()
comment = Field(sequential=True, use_vocab=True, tokenize=tokens, lower=True)
score = Field(sequential=False, use_vocab=False)
fields = {'comments': ('c', comment), 'score': ('s', score)}
If sequential is True, the field's text will be tokenized; otherwise tokenization won't be performed. The keys of the fields dictionary must match the column names exactly as they appear in the dataset ('comments' and 'score' here), while the first element of each tuple ('c', 's') becomes the attribute name on each loaded example.
train_set,test_set=TabularDataset.splits(path='/content',
format='csv',
train='train.csv',
test='test.csv',
fields=fields)
With the help of TabularDataset.splits I loaded the train and test sets from their respective CSV files (note that splits here loads two separate files rather than splitting a single dataset).
comment.build_vocab(train_set,
                    max_size=100,
                    min_freq=2)
Here max_size lets us cap the vocabulary size: we build the vocabulary for the 100 most frequently occurring words only, and min_freq=2 restricts it further to words that occur at least twice in the dataset.
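What build_vocab does can be sketched in plain Python: count the tokens, drop those below min_freq, and keep at most the max_size most frequent. The function and variable names here are illustrative, not torchtext's internals.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, max_size=100, min_freq=2):
    """Toy sketch of vocabulary building (not torchtext's actual code)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep tokens meeting min_freq, most frequent first, capped at max_size.
    frequent = [tok for tok, c in counts.most_common() if c >= min_freq]
    itos = ["<unk>", "<pad>"] + frequent[:max_size]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

texts = [["the", "movie", "was", "great"],
         ["the", "plot", "was", "bad"],
         ["bad", "acting"]]
stoi, itos = build_vocab_sketch(texts, max_size=100, min_freq=2)
print(itos)  # the two specials plus every token occurring at least twice
```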
train_iter, test_iter = BucketIterator.splits((train_set, test_set),
                                              batch_size=2)
for batch in train_iter:
    print(batch.c, batch.s)
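Conceptually, BucketIterator groups examples of similar length into the same batch and pads each batch only up to its own longest sequence, which minimizes wasted padding. A hedged plain-Python sketch of that idea (not torchtext's implementation):

```python
def bucket_batches(examples, batch_size=2, pad_token="<pad>"):
    """Toy sketch: sort by length so similar-length texts share a batch,
    then pad each batch only to its own longest sequence."""
    ordered = sorted(examples, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        max_len = max(len(seq) for seq in batch)
        batches.append([seq + [pad_token] * (max_len - len(seq))
                        for seq in batch])
    return batches

examples = [["a", "b"], ["a"], ["a", "b", "c", "d"], ["a", "b", "c"]]
for batch in bucket_batches(examples):
    print(batch)
```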
Conclusion:-
That's all from my side. Please leave your valuable suggestions in the comment box. Full credit for this blog goes to Aladdin Persson.