Tokenizing, Sequencing and Padding a Sarcasm dataset

DataOil St.
Published in Analytics Vidhya
2 min read · Jul 10, 2020

In the previous articles we discussed Tokenizing, Sequencing and Padding sentences… now we will apply those methods to a real dataset.

News Headlines Dataset For Sarcasm Detection —

Each record consists of three attributes:

  1. is_sarcastic: 1 if the record is sarcastic, otherwise 0

2. headline: the headline of the news article

3. article_link: link to the original news article, useful for collecting supplementary data

A single item from the dataset
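The figure here showed one record from the file; an illustrative record in that shape (the field values below are representative, not quoted from the file, and the link is truncated) looks like:

```json
{
  "is_sarcastic": 0,
  "headline": "former versace store clerk sues over secret 'black code' for minority shoppers",
  "article_link": "https://www.huffingtonpost.com/..."
}
```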

Follow this link to learn more about the dataset on Kaggle.

Now we shall see how to apply the methods we have learned.

  1. Loading the dataset and creating 3 lists to store the ‘is_sarcastic’, ‘headline’ and ‘article_link’ values from each data point.
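A minimal sketch of step 1, assuming the Kaggle file `Sarcasm_Headlines_Dataset.json` (which stores one JSON object per line) has been downloaded locally; the file name and the helper function are illustrative:

```python
import json

def load_sarcasm(path):
    """Parse the dataset file (one JSON object per line) into three lists."""
    sentences, labels, urls = [], [], []
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            sentences.append(item["headline"])
            labels.append(item["is_sarcastic"])
            urls.append(item["article_link"])
    return sentences, labels, urls

# Example usage (the path is an assumption about where the file was saved):
# sentences, labels, urls = load_sarcasm("Sarcasm_Headlines_Dataset.json")
```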

2. Tokenizing, Sequencing and Padding the sentences list.
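Step 2 can be sketched with the Keras preprocessing utilities covered in the earlier articles. The `oov_token` value and `padding` mode are assumptions, and the two stand-in headlines below merely substitute for the full `sentences` list gathered in step 1:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stand-in headlines; the article runs this over the full headline list.
sentences = [
    "former versace store clerk sues over secret 'black code' for minority shoppers",
    "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
]

tokenizer = Tokenizer(oov_token="<OOV>")  # reserve an index for out-of-vocabulary words
tokenizer.fit_on_texts(sentences)         # builds word_index from the corpus
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)  # words -> integer ids
padded = pad_sequences(sequences, padding="post")    # append zeros up to the longest sequence

print(len(word_index))  # reaches 29657 on the full dataset
print(padded.shape)     # (number of headlines, length of the longest sequence)
```

With `padding="post"` the zeros go after each sequence; the default pads at the front instead.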

The length of word_index is 29657. We can see that each padded sentence has size 40, i.e. the longest sequence has length 40.

In the upcoming article we will work on exploring BBC News Archive!
