
Soup to Nuts Analytics with GoogleNews and the 2020 Election

Matt Wheeler
Published in Slalom Data & AI
Nov 2, 2020


GoogleNews

None of us has ample time to sift through all of the real (?) news and fake/bot news relating to the 2020 Presidential Election, and frankly, who would want to?! However, did you know that with a bit of code, we can use the googlenews Python package to collect news headlines for specific topics, for instance "Biden" or "Trump"? Let me show you how.

The googlenews Python package is much simpler to use than competing APIs (e.g., newsapi) and data-scraping options (e.g., selenium and beautifulsoup). To pull news titles across multiple dates, we can use pagination, i.e., specify the number of pages to scan per date. I typically pull the top four or five pages to capture the most salient headlines. The code below shows a simple method for looping through dates and paginating the results.

from GoogleNews import GoogleNews
import pandas as pd

df_days = []
for date in date_range:
    googlenews = GoogleNews(start=date, end=date)
    googlenews.search(search_text)
    print("Search Date =", date)
    for i in range(0, num_pages):
        print("Executing GoogleNews call #", i + 1)
        googlenews.getpage(i)  # fetch results page i for this date
        result_next = googlenews.result()
        print("Total records returned:", len(result_next))
    df = pd.DataFrame(result_next)
    df["date_calendar"] = date
    df_days.append(df)
appended_data = pd.concat(df_days)

The API extractions are quite quick, so retrieving a year's worth of headlines shouldn't be too cumbersome, but for heavier content pulls, consider multi-threading the page requests. googlenews will pull fields including title, media, date, desc, and link (with the "date_calendar" field generated in the for-loop).
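For the multi-threading suggestion above, one way to sketch it is with `concurrent.futures.ThreadPoolExecutor`, mapping a per-date worker over the date range. The `fetch_day` function below is a hypothetical stand-in for the per-date GoogleNews pull (it returns fake records so the sketch is self-contained); in practice its body would create a GoogleNews client, search, and paginate as in the loop above.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_day(date):
    # Placeholder for the per-date GoogleNews pull shown above;
    # here it just returns a fake list of headline records.
    return [{"title": f"headline for {date}", "date_calendar": date}]

date_range = ["2020-10-01", "2020-10-02", "2020-10-03"]

# pool.map preserves input order, so results line up with date_range
with ThreadPoolExecutor(max_workers=4) as pool:
    results_by_day = list(pool.map(fetch_day, date_range))

print(len(results_by_day))  # one result list per date
```

Using one client object per worker (rather than sharing one across threads) sidesteps any thread-safety questions in the underlying library.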

Processing

Now that we have sufficient headline data, it's time for some natural language processing with NLTK. First, we need to perform some basic corpus cleanup. I prefer to perform sentiment analysis on both the headlines and the keywords from our corpus. In this next section of code, the headlines are added to a Python list, then tokenized to create bigrams. I also remove tokens that are meaningless for analysis (e.g., "2020" and "election") along with a few conjunctions. NLTK has many methods for text removal and other corpus manipulations.

import nltk
from nltk import word_tokenize, ngrams
from collections import Counter

headlines = df.title.tolist()
headlines_string = (' '.join(filter(None, headlines))).lower()
tokens = word_tokenize(headlines_string)

# Remove single-letter tokens
tokens_sans_singles = [i for i in tokens if len(i) > 1]

# Remove stop words, plus a few domain-specific tokens
stopwords = nltk.corpus.stopwords.words('english')
new_words = ("s'", "'s", "election", "2020", "n't", "wo", "…")
for i in new_words:
    stopwords.append(i)
tokens_sans_stop = [t for t in tokens_sans_singles if t not in stopwords]

# Get bigrams and frequencies
bi_grams = list(ngrams(tokens_sans_stop, 2))
counter = Counter(bi_grams)
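To see what the `Counter` step produces, here is a tiny self-contained sketch with made-up tokens, using `zip(tokens, tokens[1:])` in place of NLTK's `ngrams` (the two yield the same bigram tuples):

```python
from collections import Counter

# Toy token list standing in for tokens_sans_stop
tokens = ["biden", "leads", "polls", "biden", "leads", "debate"]

# zip(tokens, tokens[1:]) yields the same bigram tuples as ngrams(tokens, 2)
bi_grams = list(zip(tokens, tokens[1:]))
counter = Counter(bi_grams)

print(counter.most_common(1))  # → [(('biden', 'leads'), 2)]
```

Each key is a `(word1, word2)` tuple and each value is its frequency, which is exactly the structure joined against sentiment scores later.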

There are a few more operations performed to prepare the data for sentiment analysis, not detailed above, which can be viewed in my repo.

NLTK Sentiment

I'm using the basic NLTK sentiment analyzer (VADER), but there are many options out there. In this next section of code, SentimentIntensityAnalyzer calculates sentiment scores for all headlines and for all bigrams; these scores are then joined back to the main dataframe.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Score each joined bigram; keep only the compound score
bigrams_scores = counter_df_sort['bigram_joined'].apply(analyzer.polarity_scores).tolist()
df_bigrams_scores = (pd.DataFrame(bigrams_scores)
                     .drop(['neg', 'neu', 'pos'], axis=1)
                     .rename(columns={"compound": "sentiment_compound"}))
bigrams_freq_and_scores = counter_df_sort.join(df_bigrams_scores, rsuffix='_right')

Soup to Nuts

Now for the front end! Within my script, I specify the minimum date as October 1, 2020 and the maximum date as datetime.today() (November 2, 2020 for this blog release), then execute the script for both search terms, "Biden" and "Trump", separately. After some fun in Tableau, we can visualize the average headline sentiment over time as well as the most negative and positive bigrams from these headlines. Please keep in mind that this is purely data-driven and I am not claiming support for either candidate.
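The "average headline sentiment over time" series feeding Tableau can be produced with a simple pandas groupby over the per-headline scores. A minimal sketch with toy data (the column names `date_calendar` and `sentiment_compound` mirror the fields built earlier; the values are made up):

```python
import pandas as pd

# Toy frame mirroring the headline-level output: one compound score per headline
df = pd.DataFrame({
    "date_calendar": ["2020-10-01", "2020-10-01", "2020-10-02"],
    "sentiment_compound": [0.4, -0.2, 0.6],
})

# Average headline sentiment per day, ready for export to Tableau
daily_sentiment = (df.groupby("date_calendar", as_index=False)["sentiment_compound"]
                     .mean())
print(daily_sentiment)
```

Exporting `daily_sentiment` to CSV (or a database table) gives Tableau a tidy one-row-per-day source for the time-series view.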

Matt Wheeler is a Data Scientist in Slalom’s Data and Analytics practice.

Slalom is a modern consulting firm focused on strategy, technology, and business transformation.

Special thanks to Wale Llori!
