Tokenizing, Sequencing and Padding a Sarcasm dataset

DataOil St.
Published in Analytics Vidhya
2 min read · Jul 10, 2020

In the previous articles we discussed Tokenizing, Sequencing and Padding sentences… now we will apply those methods to a real dataset.

News Headlines Dataset For Sarcasm Detection —

Each record consists of three attributes:

  1. is_sarcastic: 1 if the record is sarcastic, otherwise 0

2. headline: the headline of the news article

3. article_link: link to the original news article, useful for collecting supplementary data

A single item from the dataset
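The figure here showed one record from the file; an illustrative record in that shape (the field values below are representative, not quoted from the file, and the link is truncated) looks like:

```json
{
  "is_sarcastic": 0,
  "headline": "former versace store clerk sues over secret 'black code' for minority shoppers",
  "article_link": "https://www.huffingtonpost.com/..."
}
```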

Follow this link to learn more about the dataset on Kaggle.

Now we shall see how to apply the methods we have learned.

  1. Loading the dataset and creating 3 lists to store the ‘is_sarcastic’, ‘headline’ and ‘article_link’ values from each data point.
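A minimal sketch of step 1, assuming the Kaggle file `Sarcasm_Headlines_Dataset.json` (which stores one JSON object per line) has been downloaded locally; the file name and the helper function are illustrative:

```python
import json

def load_sarcasm(path):
    """Parse the dataset file (one JSON object per line) into three lists."""
    sentences, labels, urls = [], [], []
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            sentences.append(item["headline"])
            labels.append(item["is_sarcastic"])
            urls.append(item["article_link"])
    return sentences, labels, urls

# Example usage (the path is an assumption about where the file was saved):
# sentences, labels, urls = load_sarcasm("Sarcasm_Headlines_Dataset.json")
```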

2. Tokenizing, Sequencing and Padding the sentences list.
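Step 2 can be sketched with the Keras preprocessing utilities covered in the earlier articles. The `oov_token` value and `padding` mode are assumptions, and the two stand-in headlines below merely substitute for the full `sentences` list gathered in step 1:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stand-in headlines; the article runs this over the full headline list.
sentences = [
    "former versace store clerk sues over secret 'black code' for minority shoppers",
    "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
]

tokenizer = Tokenizer(oov_token="<OOV>")  # reserve an index for out-of-vocabulary words
tokenizer.fit_on_texts(sentences)         # builds word_index from the corpus
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)  # words -> integer ids
padded = pad_sequences(sequences, padding="post")    # append zeros up to the longest sequence

print(len(word_index))  # reaches 29657 on the full dataset
print(padded.shape)     # (number of headlines, length of the longest sequence)
```

With `padding="post"` the zeros go after each sequence; the default pads at the front instead.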

The length of word_index is 29657. We can see that each padded sentence has size 40, i.e. the longest sequence has length 40.

In the upcoming article we will work on exploring BBC News Archive!
