Using NLP to Get Inside Warren Buffett’s Mind, Part 2

Jair Neto
Analytics Vidhya


Frequency and Sentiment Analysis.

In part 1, I used transformers trained on the question-answering task to explore the question: “Can a machine learning model answer questions about finance and the economy?” If you haven’t read that article yet, don’t waste any more time and click here to read it.

In this article, I will use NLP techniques to answer two questions:

1. Which terms does Buffett use most, and have those terms changed over time?

2. How did he feel about the economy and the stock market over the years?

Which terms does Buffett use most, and have those terms changed over time?

I was curious whether Buffett’s language changed over time or always stayed the same, so I analyzed how the words he wrote in the shareholder letters changed over the years.

First, I combined the text of all the letters from 1977 to 2020 and plotted a word cloud. A word cloud is a visual representation in which the size of each word indicates its frequency in the text.

Word Cloud

Buffett letters word cloud
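The word sizes in a word cloud come from plain word counts. Below is a minimal sketch of how such counts could be computed, not the code used for the figure. The stop-word list and the sample sentence are made up for illustration; a package such as wordcloud can then render the counts through its generate_from_frequencies method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "our", "we", "is", "that", "for"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lower-cased alphabetic tokens, skipping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Toy stand-in for the combined letters (not real letter text).
sample = "Our insurance earnings grew. Insurance float and stock gains helped earnings."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```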

The words earning and stock stand out in the letters, as expected, since Berkshire’s core business is making money from stocks. Other interesting words that also stand out are insurance and Geico. This shows how much Buffett likes the insurance sector and is consistent with this interview with Forbes, in which he said that Geico was the number 1 investment of his life.

Other words that caught my attention were CEO, which suggests that Warren invests not only in great companies but also in great people, and shareholder, which suggests that Buffett is always concerned with the well-being of his shareholders.

Frequency Heatmap

Based on the most used words in the letters, I plotted a heatmap to see whether Warren’s language changed over the years or remained constant.

Heatmap of word frequency. The x-axis shows the letter years and the y-axis shows the words; cells closer to white indicate a lower frequency and cells closer to green indicate a higher frequency.
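The data behind a heatmap like this is a word-by-year table. Here is a minimal sketch of how it could be built. It assumes the letters are available as a dictionary mapping each year to its text, and it uses each word’s share of that year’s tokens; whether the original plot used raw counts or shares is not stated, so treat that choice as an assumption. The two toy letters are invented.

```python
import re
from collections import Counter

def relative_frequency_by_year(letters, words):
    """Return {word: {year: share of that year's tokens}}."""
    table = {w: {} for w in words}
    for year, text in sorted(letters.items()):
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty letter
        for w in words:
            table[w][year] = counts[w] / total
    return table

# Invented toy letters, only to show the shape of the result.
letters = {
    2000: "insurance gain gain stock",
    2001: "loss loss insurance loss",
}
table = relative_frequency_by_year(letters, ["loss", "gain"])
print(table)
```

Each inner dictionary is one row of the heatmap, so the table can be passed to a plotting library after converting it to a matrix.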

We can see that the words insurance, shareholder, gain, loss, and stock were present in the letters throughout the years analyzed. The word CEO started to appear consistently only from the mid-90s.

Looking at the row for the word loss, we see that it was used heavily in 2001, probably because of the attack on the World Trade Center, which caused the insurance industry to suffer its worst loss in history up to that point.

Another point that stands out is 1995 in the stock row, when the creation of Berkshire’s Class B shares was discussed. Since this was an important decision that would affect all Berkshire investors, he had to explain how it would work, which is why the term stock was used more that year.

In the bond row, 1984 and 2008 stand out. In 1984, he wrote about his rationale for buying bonds. In 2008, he wrote about the tax-exempt bond insurance market.

Because this heatmap did not show any strong pattern, I grouped the data into 5-year periods to see whether a pattern would emerge.

Heatmap of word frequency grouped into 5-year periods
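Grouping yearly counts into 5-year periods can be sketched as below. The bucket boundaries (starting at 1977, the first letter year) are an assumption, and the counts are invented for illustration.

```python
from collections import defaultdict

def group_by_period(counts_by_year, start_year=1977, width=5):
    """Sum yearly counts into consecutive periods of `width` years."""
    grouped = defaultdict(int)
    for year, count in counts_by_year.items():
        period_start = start_year + ((year - start_year) // width) * width
        label = f"{period_start}-{period_start + width - 1}"
        grouped[label] += count
    return dict(grouped)

# Invented counts for one word, not taken from the letters.
ceo_counts = {1977: 0, 1981: 1, 1982: 2, 1986: 4, 1987: 3}
grouped = group_by_period(ceo_counts)
print(grouped)
```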

Even with the grouped heatmap, the only visible trend is in the CEO row, where the frequency increased over the years.

But why did Warren Buffett begin to use the word CEO more frequently in the letters?

Digging deeper into Buffett’s life, I found that he started out with an investment strategy of buying cigar butt companies.

“If you buy a stock at a sufficiently low price, there will usually be some hiccup in the fortunes of the business that gives you a chance to unload at a decent profit, even though the long-term performance of the business may be terrible. I call this the “cigar butt” approach to investing. A cigar butt found on the street that has only one puff left in it may not offer much of a smoke, but the “bargain purchase” will make that puff all profit.”

The CEOs of cigar butt companies did not play a strong role in the investment, since Buffett’s intention with those kinds of companies was only to make a short-term profit. But his partner Charlie Munger convinced Buffett to change his investment strategy and focus on value investing (a strategy that involves picking stocks that appear to be trading for less than their intrinsic or book value). For value investors, who focus on long-term performance, who runs the company plays a big role in whether or not to invest in it.

This change in Warren’s investment mindset occurred in the late 80s, as he wrote in the 1989 letter:

“But now, when buying companies or common stocks, we look for first-class businesses accompanied by first-class managements.”

Thus, the likely reason for Buffett’s increased use of the word CEO was the change in his investment strategy, from buying cigar butt companies to buying first-class businesses with first-class management.

After that, I tried to find Buffett’s thoughts on some topics that are hot nowadays, such as bitcoin, blockchain, crypto, forex, options, AI, ESG, Tesla, and FAANG. But those terms never appeared in his letters, which is consistent with his “circle of competence” concept: an investor sticks to the areas they know when deciding which companies to invest in.
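A check like this one can be sketched with a whole-word, case-insensitive search, which matters for short terms such as AI that would otherwise match inside words like “said”. The letters dictionary and its contents below are invented placeholders.

```python
import re

def years_mentioning(letters, term):
    """Return the sorted years whose letter contains `term` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return sorted(year for year, text in letters.items() if pattern.search(text))

# Invented placeholder letters, not real letter text.
letters = {2019: "Our insurance float grew, as we said.", 2020: "We repurchased stock."}
terms = ["bitcoin", "AI", "stock"]
report = {t: years_mentioning(letters, t) for t in terms}
print(report)
```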

As we can see, Buffett’s language doesn’t seem to change much over time, apart from one-off occasions when he focuses on a specific area or on an event that is important to his shareholders. The one lasting change is that he talks more about CEOs after changing his strategy from buying cigar butts to buying value companies.

How did he feel about the economy and the stock market over the years?

The last thing I was curious about was whether he shows his emotions in the letters or sets them aside and keeps a more neutral tone.

For this analysis, I again took advantage of the power of transformers, this time using the pre-trained ‘sentiment-analysis’ pipeline. I also used the Sentiment Intensity Analyzer based on VADER (Valence Aware Dictionary and sEntiment Reasoner) to see whether the two techniques gave different results.

What is Sentiment analysis?

Sentiment analysis, or opinion mining, is a subfield of NLP that tries to determine whether a text is positive, negative, or neutral (some sentiment analysis models only classify text as positive or negative).

Sentiment analysis using Transformers

Using a pre-trained transformer for sentiment analysis in Python is easy. With 3 lines of code, you can start classifying your sentences as negative or positive.

>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')
>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

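The output of the sentiment analysis pipeline is a list with one dictionary per input sentence. Each dictionary holds the predicted label (positive or negative) and the score of that label; the scores of the two labels add up to 1.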
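Because each result carries only the predicted label and its score, it can be handy to turn it into a single signed number before aggregating. The helpers below are a sketch of that idea, not part of the transformers library. They assume the labels are 'POSITIVE' and 'NEGATIVE', as in the example output above.

```python
def signed_score(result):
    """Map one pipeline result to [-1, 1]: +score if positive, -score if negative."""
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

def probability_positive(result):
    """Recover P(positive), given that the two label scores add up to 1."""
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score

# Hand-written example result, shaped like one element of the pipeline output.
example = {"label": "NEGATIVE", "score": 0.9}
print(signed_score(example), probability_positive(example))
```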

Before using the sentiment analysis pipeline, I did some preprocessing on the text to get better results. This process is summarised in the diagram below.

I used a sentence tokenizer to split the text of the letters into sentences, then computed a cumulative sum of the sentiment scores of the sentences, and after that normalized the results.

Finally, I drew the results in a heatmap plot.
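The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from the repository: the regex splitter stands in for a real sentence tokenizer, the normalization divides the final cumulative sum by the number of sentences (the exact normalization used is not given), and toy_classifier is a hypothetical stub so the sketch runs without downloading a model.

```python
import re
from itertools import accumulate

def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (stand-in for a real tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def letter_sentiment(text, classify):
    """Cumulative signed sentiment of a letter, divided by its sentence count.

    `classify` takes one sentence and returns {"label": ..., "score": ...}.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    signed = [
        r["score"] if r["label"] == "POSITIVE" else -r["score"]
        for r in map(classify, sentences)
    ]
    running = list(accumulate(signed))  # cumulative sum, sentence by sentence
    return running[-1] / len(sentences)  # assumed normalization: mean signed score

def toy_classifier(sentence):
    """Hypothetical stub: negative if the sentence mentions a loss, else positive."""
    label = "NEGATIVE" if "loss" in sentence.lower() else "POSITIVE"
    return {"label": label, "score": 1.0}

# Invented text, not taken from the letters.
text = "We had a good year. Insurance posted a loss. Earnings grew! Float grew too."
score = letter_sentiment(text, toy_classifier)
print(score)
```

To use a real model, classify could wrap the pipeline, for example lambda s: classifier(s)[0] with the classifier created earlier.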

Heatmap from Sentiment Analysis using pre-trained Transformers

This heatmap surprised me, as the model rated most years as negative. The years 2001 and 2008 had a more negative feeling, probably due to 9/11 in 2001 and the financial crisis in 2008.

Another interesting fact is that the 2020 letter was rated more positive than negative, despite the pandemic. As we can see today, Buffett’s feeling was right: from the bottom in stock prices on March 23, 2020, until now, the S&P 500 has already risen more than 70%, so ‘Never bet against America’.

Warren Buffett at his 2020 annual meeting

Ngrams Sentiment Analysis

For the ngrams sentiment analysis, I used the nltk package to get the 20 most frequent unigrams, bigrams, trigrams, and quadgrams from the text of all the letters, and used the transformer to classify the ngrams as positive or negative.
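Counting the most frequent ngrams can be sketched with the standard library alone, as below; nltk offers an equivalent ngrams helper in nltk.util. The sample text is invented, and the simple regex tokenizer is an assumption.

```python
import re
from collections import Counter

def top_ngrams(text, n, k=20):
    """Return the k most frequent n-grams as (ngram_string, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = zip(*(tokens[i:] for i in range(n)))  # sliding window of width n
    return Counter(" ".join(g) for g in grams).most_common(k)

# Invented sample, not real letter text.
sample = "net worth per share grew and net worth per share matters"
top_bigrams = top_ngrams(sample, 2, k=3)
top_quadgram = top_ngrams(sample, 4, k=1)
print(top_bigrams, top_quadgram)
```

Each ngram string can then be passed to the sentiment classifier like any other sentence.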

You can see the result in the chart below. The positive ngrams are on the left in blue and the negative ones are on the right in red. It’s interesting to note that, except for the ngrams that contain the word loss, the ngrams rated as negative do not seem to be negative, like the unigrams geico and stock.

Ngrams sentiment analysis bar charts: the positive ngrams are on the left and the negative ngrams are on the right. To plot these bar charts, I used a function implemented in this Kaggle notebook.

Sentiment analysis using Sentiment Intensity Analyzer VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

The model using the Sentiment Intensity Analyzer (SIA) with VADER classifies texts into 3 categories: negative, positive, and neutral. The analysis of the letters followed the same steps as with the transformer, changing only the technique used to calculate the sentiment. The result is in the heatmap below.
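A minimal sketch of the three-way classification: VADER returns a compound score between -1 and 1, and a common convention is to call a text positive at 0.05 or above, negative at -0.05 or below, and neutral in between. Whether the analysis here used exactly these cut-offs is an assumption. The vader_label helper needs nltk and its vader_lexicon resource installed, so it is defined but not called.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a VADER compound score into one of three labels."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def vader_label(text):
    """Score `text` with NLTK's VADER; requires nltk.download('vader_lexicon')."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
    return label_from_compound(compound)

# Hand-picked compound values, only to show the three outcomes.
labels = [label_from_compound(c) for c in (0.4, 0.0, -0.3)]
print(labels)
```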

Heatmap from Sentiment Analysis using SIA

Using the SIA, the text for every year was considered neutral, unlike in the transformer analysis. This is potentially because VADER’s lexicon and rules are attuned to social media text. Given how different social media text is from Buffett’s informative tone in the letters, it is no surprise that SIA using VADER classified all the letters as neutral.

These conflicting results were a useful lesson for my future sentiment analyses using pre-trained models, because they showed that the data a model was built on plays an important role in its performance.


  1. In this post, we did not find any strong pattern of change over the years in the words Buffett used to write his letters.
  2. The letters were rated as negative or neutral in sentiment, and the two pre-trained models gave conflicting results, showing that we need to find a pre-trained model whose training data is similar to the data we want to classify.

You can check the code used to write this post in my GitHub repository. Feel free to reach out with any comments on my LinkedIn account, and thank you a million for reading this post.

If you like what you read, be sure to 👏 it below, share it with your friends, and follow me so you don’t miss this series of posts.





ML engineer / Analytics engineer | UCI & UFCG Alumni