How to list the most common words from a text corpus using Scikit-Learn?
We often want to know which words are the most common in a text corpus, since they can reveal patterns in the data.
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer().fit(corpus)
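Here we get a Bag of Words model that has cleaned the text: it lowercases everything and drops punctuation and single-character tokens. Note that CountVectorizer does not remove stop words by default; pass stop_words='english' if you want them filtered out.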
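To see what the vectorizer keeps and drops, here is a minimal sketch on a toy corpus (the two sentences are made up for illustration and are not part of the cars dataset). It compares the default CountVectorizer with one built with stop_words="english":

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
toy_corpus = ["The cat and the dog.", "A dog is not a cat!"]

# Default settings: lowercasing, punctuation and 1-character tokens dropped,
# but stop words such as "the" are kept.
default_vec = CountVectorizer().fit(toy_corpus)

# Same corpus, this time filtering scikit-learn's built-in English stop word list.
no_stop_vec = CountVectorizer(stop_words="english").fit(toy_corpus)

default_vocab = sorted(default_vec.vocabulary_)
no_stop_vocab = sorted(no_stop_vec.vocabulary_)

print(default_vocab)
print(no_stop_vocab)
```

The first list still contains "the", while the second does not. Neither contains "a", because the default token pattern only keeps tokens of two or more characters.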
bag_of_words = vec.transform(corpus)
bag_of_words is a matrix where each row represents a specific text in the corpus and each column represents a word in the vocabulary, that is, all the words found in the corpus. Note that bag_of_words[i, j] is the number of occurrences of word j in text i.
sum_words = bag_of_words.sum(axis=0)
sum_words is a vector that contains the total number of occurrences of each word across all texts in the corpus. In other words, we add up the elements in each column of the bag_of_words matrix.
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
Finally, we sort the list of (word, count) tuples by count in descending order, so the most frequent words come first.
I have a list of titles from cars-for-sale ads, each composed of the year of manufacture, the car manufacturer and the model. You can download the dataset from here.
cars_for_sell = [line.replace("\n", "") for line in open("cars_for_sell.txt")]
print(cars_for_sell[:5])
['2017 GMC Sierra 1500', '2010 Toyota Sienna', '2016 Volkswagen Beetle', '2011 Dodge Ram', '2003 Land-Rover Range Rover']
Now I want to get the top 20 most common words:
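A minimal sketch of this step, assuming the lines above are wrapped in a helper (the name get_top_n_words and its n parameter are illustrative choices). Since the full file may not be at hand, the five sample titles printed above stand in for the dataset here:

```python
from sklearn.feature_extraction.text import CountVectorizer


def get_top_n_words(corpus, n=None):
    """Return the n most frequent words in corpus as (word, count) tuples.

    Hypothetical helper that wraps the steps described above.
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, int(sum_words[0, idx]))
                  for word, idx in vec.vocabulary_.items()]
    # Sort by count (second tuple element), most frequent first.
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]


# Stand-in data: the five sample titles shown above, not the full dataset.
sample_titles = [
    "2017 GMC Sierra 1500",
    "2010 Toyota Sienna",
    "2016 Volkswagen Beetle",
    "2011 Dodge Ram",
    "2003 Land-Rover Range Rover",
]

common_words = get_top_n_words(sample_titles, 20)
for word, freq in common_words:
    print(word, freq)
```

With the real data, pass cars_for_sell instead of sample_titles. On these five titles the only repeated token is "rover", because the hyphen in "Land-Rover" splits it into "land" and "rover".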
It seems that we found some interesting things:
- The cars for sale are mostly Fords and Chevrolets.
- More of the cars for sale were manufactured in 2013 and 2014 than in other years.