Term Frequency Text Summarization

Interview Hacking via NLP: The Basics

Austin Robinson
Published in Analytics Vidhya
4 min read · Aug 6, 2020


I was recently doing interview prep for a series of interviews for a position with Collective Health, a San Francisco-based healthcare technology start-up.

They run a series on their blog titled “Meet The Collective,” consisting of interviews with a broad cross-section of their employees. I was interested in the common threads among those employees: why did they come to CH? What do they enjoy about it? What are their unifying experiences?

Reading a dozen lengthy interviews wasn’t the most attractive proposition, though. Instead, I decided to implement a very simple approach to text summarization, weighted frequency of occurrence, to pare the interviews down to the most important bits. In short, we find how often each word occurs in the text, weight every sentence by the frequencies of the words it contains, and return only the n highest-weighted sentences. In this article, I’m going to walk you through that process.

Imports

We’ll be using Requests to pull the information we want from the blog; BeautifulSoup and regex to clean the text; NLTK to process the text; WordCloud and Matplotlib to visualize; and heapq to pull the highest-scoring results from our NLTK output.
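Here’s roughly what that import block looks like; the two nltk.download() calls are one-time setup for the tokenizer models and the stopword list we’ll need later.

import re
import heapq

import requests
from bs4 import BeautifulSoup
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('punkt')      # models for sentence and word tokenization
nltk.download('stopwords')  # stopword list, used when we build the word cloud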

Since the focus here is on the NLP, not the web scraping, we’ll be iterating over some links I’ve already gathered: a collection of “Meet The Collective” articles.

Preprocessing

For every link:

  • We use BeautifulSoup and requests to pull only the actual text of the interview (contained within a certain layer of HTML <p> tags) out of the webpage.
  • We use regex to fix and unify some erroneous punctuation; this helps prevent sentences from bleeding into each other later.
  • We add the text from that link to our list of article texts.

Finally, we join our list of article texts into a single body of text.
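A minimal sketch of that loop, with placeholder URLs standing in for the real links and a plain <p>-tag selection standing in for the notebook’s more targeted HTML filtering:

links = [
    # placeholders; the real "Meet The Collective" URLs live in the notebook
    "https://example.com/meet-the-collective-1",
    "https://example.com/meet-the-collective-2",
]

article_texts = []

for link in links:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, "html.parser")

    # Pull just the interview text out of the page's <p> tags
    text = " ".join(p.get_text() for p in soup.find_all("p"))

    # Fix and unify punctuation so sentences don't bleed into each other later
    text = re.sub(r"[“”]", '"', text)
    text = re.sub(r"[‘’]", "'", text)
    text = re.sub(r"\.(?=[A-Z])", ". ", text)  # space after periods that run into the next sentence

    article_texts.append(text)

corpus = " ".join(article_texts)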

NLP!

Now that our corpus is prepared, we can begin in earnest.

Tokenizing

Here we’re using NLTK to tokenize our text in two different ways: by word and by sentence.
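With the punkt models downloaded, that’s two one-liners against the corpus from the previous step:

sentence_tokens = nltk.sent_tokenize(corpus)
word_tokens = nltk.word_tokenize(corpus)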

Weighted Frequency

NLTK allows us to easily count the number of times each word occurs in our text with nltk.FreqDist(); dividing a given word’s count by the total number of words in the text then gives us that word’s weighted frequency.
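A sketch of that calculation; lowercasing and dropping non-alphabetic tokens are simplifications of my own, so the notebook may differ:

words = [w.lower() for w in word_tokens if w.isalpha()]

freq_dist = nltk.FreqDist(words)
total = len(words)

# Each word's count divided by the total number of words
weighted_frequencies = {word: count / total for word, count in freq_dist.items()}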

Scoring and Summarization

This is where the magic really happens!
For every sentence:

  • Only consider the sentence if it has fewer than 30 words (we’re summarizing, keep it snappy!);
  • Tokenize every word in that sentence;
  • Sum the weighted frequencies of every tokenized word in the sentence.

Once that’s done, we’re using heapq to pull the sentences with the largest scores. In this case, the fourteen highest-scored sentences happen to be the questions CH asks in every “Meet The Collective” interview, so we’re pulling the top twenty-nine sentences: the common questions, plus a fifteen-sentence summary of the interviews.
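Putting the scoring and the extraction together, with variable names carried over from the sketches above:

sentence_scores = {}

for sentence in sentence_tokens:
    words_in_sentence = nltk.word_tokenize(sentence.lower())

    # Only score sentences of fewer than 30 words; we're summarizing, keep it snappy
    if len(words_in_sentence) < 30:
        sentence_scores[sentence] = sum(
            weighted_frequencies.get(word, 0) for word in words_in_sentence
        )

# Top 29: the 14 recurring interview questions plus a 15-sentence summary
summary = heapq.nlargest(29, sentence_scores, key=sentence_scores.get)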

Success!

We can now see what we’ve learned; specifically, we can see what questions CH asks in every employee interview, and a summary of what those employees tend to say. I’ve put a selection of both below; the full results are available in my notebook here.

Questions

As a kid, what did you want to be when you grew up and how does that inform what you do today?

What’s one of the most important lessons you’ve learned in your career?

What happened in your career that led you to Collective Health?

How would you explain what you do to a 5-year-old?

What excites you most about where Collective Health is going?

Summary

Employees’ minds tend to go to the perks of a company, but culture shows up in the minutiae of the day to day.

The fact that we can’t live up to that is pretty sad, so I love the idea of working for a company that’s going to make it better.

Coming from one of the most valued and impactful companies in the world, I wanted to work on something that I felt was destined to make a mark.

Some of the things that attracted me were the people I met during my interview process and the mission of the company.

I think it’s really important to get into the practice of taking time for yourself so that you can bring your best self to work and do great work.

Bonus: Visualizing!

Here’s a quick, neat little visualization of our word frequencies.

We’re using NLTK to help filter our words and find the frequencies, and heapq to pull the forty most common words; these get fed into WordCloud to generate our image.
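A sketch of that pipeline, assuming NLTK’s English stopword list is what’s doing the filtering (and reusing the lowercased words list from the frequency step):

stop_words = set(nltk.corpus.stopwords.words("english"))
filtered = [w for w in words if w not in stop_words]

filtered_freq = nltk.FreqDist(filtered)
top_forty = dict(heapq.nlargest(40, filtered_freq.items(), key=lambda item: item[1]))

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(top_forty)

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()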

Which gives us…

[Word cloud of the forty most common words across the interviews]

This visualization, together with our summary, gives us quick and valuable insight into what employees at Collective Health are saying about the company and what brought them there; people, culture, mission, and impact.

Thanks for reading! If you’d like to explore my work further, the notebook for this project can be found here, and my GitHub is here.

