How to list the most common words in a text corpus using scikit-learn?

We often want to know which words are the most common in a text corpus, since frequent words can reveal patterns.


from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer().fit(corpus)

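Here we get a bag-of-words model that lowercases the text and splits it into tokens of two or more alphanumeric characters, which drops punctuation and other non-alphanumeric characters. Note that stop words are not removed by default; to filter them, pass stop_words="english" to CountVectorizer.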
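
To see what the vectorizer actually keeps, it helps to inspect its vocabulary on a tiny example. This is only a sketch: the sentence is made up, and the variable names are mine. With default settings the text is lowercased and single-character tokens such as "a" are dropped, while stop-word filtering only happens when stop_words is passed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence, made up for illustration only.
docs = ["The 2017 Ford is a truck, and it is red!"]

# Default settings: lowercase, tokens of 2+ word characters, no stop-word list.
default_vec = CountVectorizer().fit(docs)
# Same data, but with scikit-learn's built-in English stop-word list.
english_vec = CountVectorizer(stop_words="english").fit(docs)

default_vocab = sorted(default_vec.vocabulary_)
english_vocab = sorted(english_vec.vocabulary_)
print(default_vocab)
print(english_vocab)
```

Comparing the two printed lists shows which words the stop-word list removes.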

bag_of_words = vec.transform(corpus)

bag_of_words is a sparse matrix where each row represents a text in the corpus and each column represents a word in the vocabulary, that is, every word found in the corpus. Note that bag_of_words[i, j] is the number of occurrences of word j in text i.
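
A small example makes the row and column layout concrete. The three-text corpus below is invented for illustration; the column order comes from the fitted vocabulary_ mapping, which assigns each word its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus: three short "texts".
toy_corpus = ["red car", "blue car", "red red truck"]

toy_vec = CountVectorizer().fit(toy_corpus)
toy_bag = toy_vec.transform(toy_corpus)

# Order the words by their column index in the matrix.
columns = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
dense = toy_bag.toarray()  # fine for a toy example; avoid on large corpora

print(columns)
print(dense)
```

Each row of the printed array is one text, and each entry counts how often the column's word appears in it; for instance, "red" appears twice in the third text.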

sum_words = bag_of_words.sum(axis=0)

sum_words is a 1 × N matrix, with one entry per vocabulary word, that holds the total number of occurrences of each word across all texts in the corpus. In other words, we add up the elements of each column of the bag_of_words matrix.
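
Here is a minimal sketch of the column sum on an invented toy corpus. Flattening the result with NumPy is my own addition, not part of the steps in this post; it just makes the totals easier to read.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus; the vocabulary is blue, car, red, truck.
toy_corpus = ["red car", "blue car", "red red truck"]
toy_bag = CountVectorizer().fit_transform(toy_corpus)

# Sum over rows (axis=0): one total per vocabulary word.
toy_sums = toy_bag.sum(axis=0)

# Flatten to a plain 1-D array for easier handling.
flat = np.asarray(toy_sums).ravel()
print(flat)
```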

words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

Finally, we build a list of (word, count) tuples and sort it by count in descending order.
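
The steps above can be wrapped into a single helper. This is a sketch, not a definitive implementation: the function name get_top_n_words and its n parameter are my own choices, and the toy corpus is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer


def get_top_n_words(corpus, n=None):
    """Return the n most frequent words in corpus as (word, count) tuples.

    With n=None, every word in the vocabulary is returned.
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    # int() turns the NumPy integer into a plain Python int.
    words_freq = [(word, int(sum_words[0, idx]))
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]


# Invented toy corpus, for illustration only.
top = get_top_n_words(["red car", "blue car", "red red truck"], n=2)
print(top)
```

On this toy corpus, "red" occurs three times and "car" twice, so those are the two tuples returned.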

Use case

I have a list of titles from car-for-sale ads, each made up of the year of manufacture, the manufacturer and the model. You can download the dataset from here.

with open("cars_for_sell.txt") as f:
    cars_for_sell = [line.rstrip("\n") for line in f]
['2017 GMC Sierra 1500', '2010 Toyota Sienna', '2016 Volkswagen Beetle', '2011 Dodge Ram', '2003 Land-Rover Range Rover']

Now I want to get the 20 most common words:

ford 200
chevrolet 181
2013 84
2014 78
toyota 67
2015 67
2012 66
dodge 60
2010 58
2011 55
2016 54
2008 54
2007 53
2006 49
nissan 49
bmw 46
ram 44
2009 44
silverado 42
honda 42

It looks like we found some interesting things:

  • Ford and Chevrolet are by far the most frequently listed manufacturers.
  • Among the years of manufacture, 2013 and 2014 appear most often.

Hey! Are you up for plotting a histogram?
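
One possible way to do it is a bar chart with matplotlib. This is a rough sketch under a few assumptions of mine: matplotlib is installed, only the first five (word, count) pairs from the listing are plotted, and the output file name common_words.png is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

# First five (word, count) pairs copied from the listing in this post.
common_words = [("ford", 200), ("chevrolet", 181), ("2013", 84),
                ("2014", 78), ("toyota", 67)]
words, counts = zip(*common_words)

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(words, counts)  # one bar per word
ax.set_xlabel("word")
ax.set_ylabel("occurrences")
ax.set_title("Most common words in the ad titles")
fig.tight_layout()
fig.savefig("common_words.png")  # hypothetical output file name
```

To plot all 20 words, pass the full list of tuples instead and consider rotating the x-axis labels so they do not overlap.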