TorchText: A Not-So-Popular Library
What is TorchText?
If you have done any basic preprocessing task in NLP, you are probably aware of NLTK, spaCy, TextBlob, and Stanford NLP; all of these are powerful libraries for text preprocessing. In the case of PyTorch, we have a dedicated library, TorchText, which takes care of every text-preprocessing step, from loading the data to embedding the text.
from torchtext.legacy.data import Field , TabularDataset
from torchtext.legacy.data import BucketIterator
Field specifies how the preprocessing should be done.
TabularDataset loads a dataset from different file formats such as JSON, CSV, or TSV, and
BucketIterator handles batching and padding for a given dataset. Let's take an example to understand the library. Let's get started…
# overview of the data
import pandas as pd
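The original post only shows the import, so here is a minimal, self-contained sketch of the data-overview step. The file name, column names, and rows are assumptions, since the actual dataset is not shown in the post:

```python
import pandas as pd

# hypothetical two-column quotes dataset; the original file is not shown in the post
df = pd.DataFrame({
    "quote": ["i like pizza", "i hate math", "dogs are great"],
    "score": [1, 0, 1],
})
df.to_csv("quotes.csv", index=False)  # saved so it could later be loaded with TabularDataset

# quick overview of the data
print(df.head())
print(df.shape)  # (3, 2)
```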
sequential: if this is True, the text will be tokenized; otherwise tokenization is not performed. The columns listed in the variable
fields must be in the same order as they appear in the dataset.
With the help of
TabularDataset I loaded the dataset and split it into a train and a test set.
max_size limits the vocabulary size: here we build the vocabulary from the top 100 most frequent words only, and
min_freq=2 gives the flexibility to include only those words which occur at least twice in the given dataset.
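The effect of max_size and min_freq can be sketched without TorchText at all, using a plain Counter; this mirrors the filtering that build_vocab performs when it counts words in the training set:

```python
from collections import Counter

tokens = "i like pizza i like dogs cats are ok".split()
counts = Counter(tokens)

max_size = 100  # keep at most the 100 most frequent words
min_freq = 2    # a word must occur at least twice to be kept

# most frequent first, drop rare words, cap the total size
kept = [w for w, c in counts.most_common(max_size) if c >= min_freq]
print(kept)  # ['i', 'like'] -- every other word occurs only once
```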
for batch in train_iter:
    print(batch)
That's all from my side. Please leave your valuable suggestions in the comment box. Full credit for this blog goes to Aladdin Persson.