Akshay Sharma
Published in The HumAIn Blog
Feb 1, 2021


Brief Introduction to N-gram and TF-IDF | Tokenization

INTRODUCTION

In this article we will analyze a data set first with TF-IDF and then with N-grams, walk through the implementation in Python, and compare the two approaches while building a simple automatic text generator.

TF-IDF

TF-IDF is a method that assigns each word a numerical weight reflecting how important that word is to a document in a corpus. A corpus is a collection of documents. TF stands for term frequency, and IDF for inverse document frequency. This method is often used for information retrieval and text mining.
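Concretely, the IDF used in the code below follows scikit-learn's smoothed formulation: idf(t) = ln((1 + N) / (1 + df(t))) + 1, where N is the number of documents and df(t) is the number of documents containing the term t. The term frequency tf(t, d) is simply how often t occurs in document d (the exact TF variant is an assumption here, since the article only shows the IDF part), and the TF-IDF weight is the product tf(t, d) × idf(t).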

We will take four reviews as our documents and store them in a list, which serves as our data corpus.
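The original post does not show the review texts themselves, so the four documents below are placeholders with the same shape:

# Four short reviews; each string is one document in the corpus.
corpus = [
    'this pasta is very tasty and affordable',
    'this pasta is not tasty and is affordable',
    'this pasta is delicious and cheap',
    'pasta is tasty and pasta tastes good',
]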

import math

def IDF(corpus, unique_words):
    """Return a dict mapping each unique word to its smoothed IDF value."""
    idf_dict = {}
    N = len(corpus)
    for i in unique_words:
        # Count how many documents contain this word.
        count = 0
        for sen in corpus:
            if i in sen.split():
                count = count + 1
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        idf_dict[i] = math.log((1 + N) / (count + 1)) + 1
    return idf_dict

We define a function IDF that takes the corpus and the list of unique words as parameters and returns each word's IDF value.

Next, a fit function builds the vocabulary and uses IDF to compute the IDF value of every unique word in the corpus.

def fit(whole_data):
    """Build a sorted vocabulary and the IDF values of all unique words."""
    unique_words = set()
    if isinstance(whole_data, (list,)):
        for x in whole_data:
            for y in x.split():
                # Discard words shorter than two characters.
                if len(y) < 2:
                    continue
                unique_words.add(y)
        unique_words = sorted(list(unique_words))
        # Map each word to a column index.
        vocab = {j: i for i, j in enumerate(unique_words)}
        Idf_values_of_all_unique_words = IDF(whole_data, unique_words)
    return vocab, Idf_values_of_all_unique_words

Vocabulary, idf_of_vocabulary = fit(corpus)

We have initialized ‘unique_words’ as a set to collect all the unique tokens; words shorter than two characters are discarded. We also call the IDF function inside the fit function, which returns the IDF values of all the unique words as a dictionary.

The fit function returns the vocabulary and the IDF values of its words, respectively.
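The article stops at the IDF values, but turning documents into full TF-IDF vectors needs only one more step: multiply each word's term frequency by its IDF. Here is a minimal transform sketch, not from the original post, that follows scikit-learn's convention of L2-normalizing each row:

import numpy as np
from collections import Counter

def transform(dataset, vocab, idf_values):
    """Convert each document into a TF-IDF vector (one row per document)."""
    rows = np.zeros((len(dataset), len(vocab)))
    for row, document in enumerate(dataset):
        tokens = document.split()
        for word, count in Counter(tokens).items():
            if word in vocab:
                tf = count / len(tokens)  # term frequency
                rows[row, vocab[word]] = tf * idf_values[word]
        norm = np.linalg.norm(rows[row])
        if norm > 0:
            rows[row] = rows[row] / norm  # L2 normalization per row
    return rows

tfidf_matrix = transform(corpus, Vocabulary, idf_of_vocabulary)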

N-Gram

Wikipedia defines an N-Gram as “A contiguous sequence of N items from a given sample of text or speech”. Here an item can be a character, a word or a sentence and N can be any integer. When N is 2, we call the sequence a bigram. Similarly, a sequence of 3 items is called a trigram, and so on.
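For instance, a quick word-level sketch of what bigrams and trigrams look like:

sentence = 'the quick brown fox jumps'
tokens = sentence.split()

# Bigrams: every pair of consecutive words.
bigrams = [' '.join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']

# Trigrams: every run of three consecutive words.
trigrams = [' '.join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
# ['the quick brown', 'quick brown fox', 'brown fox jumps']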

We will create a word-level N-gram model in this section.

  1. First, create a dictionary that maps each word N-gram (a trigram here, since we set words = 3) to the list of words that occur after it in the text.
  2. Store each N-gram in the ngrams dictionary and its count in a frequency dictionary, which will be helpful later for sorting the dictionary values.
  3. Iterate through all the tokens and join each run of 3 consecutive words to form a trigram key; the tokenization setup this loop relies on is sketched just below.
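The loop below assumes a words_tokens list and empty ngrams and frequency dictionaries already exist. The original post does not show that setup, so here is a minimal sketch; the file name input.txt is hypothetical, and NLTK's word_tokenize could be swapped for a plain split():

import random
import nltk

nltk.download('punkt')  # tokenizer models, downloaded once

text = open('input.txt').read().lower()  # hypothetical source text
words_tokens = nltk.word_tokenize(text)

ngrams = {}      # trigram -> list of words seen right after it
frequency = {}   # trigram -> number of occurrences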
words = 3  # trigram length

for i in range(len(words_tokens) - words):
    # Join 3 consecutive tokens into a trigram key.
    seq = ' '.join(words_tokens[i:i + words])
    if seq not in ngrams.keys():
        ngrams[seq] = []
        frequency[seq] = 0
    frequency[seq] += 1
    # Record the word that follows this trigram.
    ngrams[seq].append(words_tokens[i + words])
  1. For each position, we check whether the word trigram already exists as a key in the ngrams dictionary.
  2. If the trigram doesn’t already exist, we insert it into the ngrams dictionary as a key with an empty list, then append the word that follows it.
  3. Finally, we create an automatic text generator using the word trigrams that we just created.
# Seed the generator with the first trigram of the text.
curr_sequence = ' '.join(words_tokens[0:words])
output = curr_sequence

for i in range(50):
    if curr_sequence not in ngrams.keys():
        break
    # Pick a random word seen after the current trigram.
    possible_words = ngrams[curr_sequence]
    next_word = possible_words[random.randrange(len(possible_words))]
    output += ' ' + next_word
    # Slide the window: the last 3 generated words form the next key.
    seq_words = output.split()
    curr_sequence = ' '.join(seq_words[-words:])

print(output)

In this way we can generate text in which each next word is suggested by the words that precede it.

If you set the words variable to 4 (that is, use 4-grams) to generate text, the output will typically look even more coherent.

Unlike representations that treat words independently, the N-gram model captures local context, since it preserves the order of N consecutive words in a sentence.

Conclusion

From building this automatic text generator, the key observations are:

  1. In the TF-IDF approach, words that are frequent in one document and rare in the rest receive higher weights. However, because every word is treated individually and converted to a number on its own, context information is lost, whereas N-grams help us retain context.
  2. TF-IDF can be used to generate a list of the most relevant words in the corpus, and that list can later serve as input to the N-gram model to generate context among these words (see the sketch below).
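As a rough illustration of that second point, here is a sketch, not from the original post, that keeps only the highest-IDF words (using IDF values as a stand-in for full TF-IDF weights, with an arbitrary cut-off of 20 words) and builds trigrams over the filtered token stream:

# Keep the 20 highest-IDF words as the 'relevant' vocabulary (arbitrary cut-off).
relevant = set(sorted(idf_of_vocabulary, key=idf_of_vocabulary.get, reverse=True)[:20])

# Filter the token stream to relevant words before building N-grams.
filtered_tokens = [w for w in words_tokens if w in relevant]

ngrams, frequency = {}, {}
for i in range(len(filtered_tokens) - words):
    seq = ' '.join(filtered_tokens[i:i + words])
    ngrams.setdefault(seq, [])
    frequency[seq] = frequency.get(seq, 0) + 1
    ngrams[seq].append(filtered_tokens[i + words])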
