Are changes in a company’s annual report (10k) correlated to changes in its stock price?

Swaraj Patankar
May 2 · 6 min read

A company's annual report provides valuable information about its current state and strategic direction. The content of an annual report can also change dramatically from year to year.

Let’s find out if changes in an annual report are correlated to a company's stock price:

The first task is to identify a company for this analysis. I decided to go with Whirlpool because it is an American company, is in a relatively stable industry, and has been public for a long time.

It is also a company which has been in trouble, especially in high-growth foreign markets (e.g. Brazil). I decided to only use the textual content of the 10k. Numbers change every year, and drastic changes in the numbers on a 10k (specifically in the financial statements) would intuitively signal a change in the stock price. By analyzing only the text of the 10k, I should in theory be able to see whether changes in the information related to Whirlpool’s strategy, M&A activity, and risks (among other things) are related to changes in the stock price.

Getting Text Data From a 10k

I downloaded the 10k’s for Whirlpool between 2001–2019 from the SEC EDGAR database:

***Note: make sure to specify the Filing Type as “10-k” in the search bar

Now you might be wondering: how can I scrape text from a PDF using Python?

I decided to go with the Tika library. Tika makes scraping text data from a PDF surprisingly easy. First, I imported the package and created a list of the names of all my 10-k’s. Make sure the 10-k’s are stored in the same directory as your python code.

from tika import parser

# A list of all the names of the 10-ks I used
statements = ['whirlpool_2005.pdf', 'whirlpool_2010.pdf', 'whirlpool_2015.pdf',
              'whirlpool_2001.pdf', 'whirlpool_2006.pdf', 'whirlpool_2011.pdf', 'whirlpool_2016.pdf',
              'whirlpool_2002.pdf', 'whirlpool_2007.pdf', 'whirlpool_2012.pdf', 'whirlpool_2017.pdf',
              'whirlpool_2003.pdf', 'whirlpool_2008.pdf', 'whirlpool_2013.pdf', 'whirlpool_2018.pdf',
              'whirlpool_2004.pdf', 'whirlpool_2009.pdf', 'whirlpool_2014.pdf', 'whirlpool_2019.pdf']
# Sort by filename so the years are in order; the year-to-year comparison later depends on it
statements.sort()

Now we are going to parse each 10k individually and append the contents to a list called “raw”:

import re

raw = []
for k in statements:
    temp = parser.from_file(k)
    temp = temp['content']
    # Raw strings matter here: in a normal string "\b" is a backspace, not a word boundary
    temp = re.sub(r"\b\d+\b", " ", temp) # remove digits
    temp = re.sub(r"\.", " ", temp) # remove dots
    temp = re.sub(r"\([A-Z]\)", " ", temp) # remove (LETTER)
    temp = re.sub(r"\([0-9]\)", " ", temp) # remove (NUMBER)
    temp = re.sub(r"\n", " ", temp) # remove "\n"
    raw.append(temp)

You may have noticed that I used some regular expressions to clean up the text data. I took the following steps:

  • removed digits
  • removed dots
  • removed parentheses with a letter between them
  • removed parentheses with a number between them
  • removed newline characters
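To see what these steps do to a piece of text, here is a minimal sketch that applies the same five substitutions to a made-up sentence. The helper name clean_text and the sample string are my own assumptions for illustration, not part of the original code.

```python
import re

def clean_text(text):
    """Apply the five cleaning steps in the same order as the list above."""
    text = re.sub(r"\b\d+\b", " ", text)    # standalone runs of digits
    text = re.sub(r"\.", " ", text)         # dots
    text = re.sub(r"\([A-Z]\)", " ", text)  # one uppercase letter in parentheses
    text = re.sub(r"\([0-9]\)", " ", text)  # one digit in parentheses
    text = re.sub(r"\n", " ", text)         # newline characters
    return text

# Made-up sample text, loosely in the style of a 10k
sample = "Item 1A. Risk Factors (A) See note (2)\nSales rose 12 percent."
cleaned = clean_text(sample)
tokens = cleaned.split()
print(tokens)
```

Two things stand out when running this. "1A" survives because the digit is attached to a letter, so it is not a standalone number. And "(2)" loses its digit in the first step, so by the time the fourth step runs there is nothing left for it to match and the empty parentheses stay in the text. That is harmless for word counting, since the vectorizer ignores punctuation, but the order of the steps does matter.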

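Getting the text and cleaning it up was relatively straightforward; the challenge is coming up with a metric to actually compare the 10k’s. I decided to use cosine similarity, which treats each document as a vector of word counts and measures the cosine of the angle between two such vectors. The more words two documents share, and the more similar the proportions in which they use them, the closer the score is to 1. You can find more information here: .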
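To make the metric concrete, here is a small sketch that computes cosine similarity by hand: the dot product of two word-count vectors divided by the product of their lengths. It uses only the standard library and splits on whitespace, which is a simplification of what sklearn's tokenizer does. The function name and the sample sentences are assumptions for illustration.

```python
import math
from collections import Counter

def cosine_from_counts(doc_a, doc_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Only words that appear in both documents contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

same = cosine_from_counts("risk factors and competition", "risk factors and competition")
partial = cosine_from_counts("risk factors and competition", "risk factors and tariffs")
disjoint = cosine_from_counts("risk factors", "appliance sales")
print(same, partial, disjoint)
```

Identical texts score 1, texts with no words in common score 0, and the pair that shares three of its four words lands at 0.75.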

I created two functions using the sklearn package. The function “get_vectors” creates vectors containing the count of each word in the text, and “get_cosine_sim” returns the matrix of cosine similarities between those vectors.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_sim(*strs):
    # One count vector per input string, then all pairwise similarities
    vectors = [t for t in get_vectors(*strs)]
    return cosine_similarity(vectors)

def get_vectors(*strs):
    text = [t for t in strs]
    # No arguments here: the first parameter of CountVectorizer is not the text to fit
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

I proceeded to create a list to store the cosine similarity scores. Then, I iterated through my list (“raw”) containing the text of all the 10k’s and calculated the cosine similarity between each year and the next until I reached the end of the list.

sim_scores = []
# Pair each 10k with the following year's 10k and score the pair
for current_year, next_year in zip(raw, raw[1:]):
    sim_scores.append(get_cosine_sim(current_year, next_year)[0][1])
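As an alternative to fitting a new vectorizer for every pair, the sketch below fits one vectorizer on all documents and reads the year-to-year scores off the similarity matrix. The scores should match the pair-by-pair approach, because a word that appears in neither document of a pair adds zero to both the dot product and the vector lengths. The function name and toy documents are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def consecutive_similarities(docs):
    """Cosine similarity between each document and the one that follows it."""
    counts = CountVectorizer().fit_transform(docs)  # one row of word counts per document
    pairwise = cosine_similarity(counts)            # square matrix covering every pair
    return [pairwise[i, i + 1] for i in range(len(docs) - 1)]

# Toy stand-ins for three consecutive annual reports
toy_docs = [
    "we sell washers and dryers",
    "we sell washers and dryers",
    "we sell washers and refrigerators",
]
scores = consecutive_similarities(toy_docs)
print(scores)
```

A list of n documents yields n - 1 scores, which is why 19 annual reports give 18 similarity scores.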

Now we have a list of cosine similarity scores from 2002–2019, but what we are looking for is the delta of the scores. Let’s first put the scores in a DataFrame.

import numpy as np
import pandas as pd

sim_scores = pd.Series(sim_scores)
years = np.arange(2002,2020)
years = pd.Series(years)
df = pd.DataFrame({'Year' : years, 'Similiarity Score' : sim_scores})
df = df[['Year', 'Similiarity Score']]
df.head()
DataFrame with Cosine Similarity Scores

What we actually want is the year-to-year change in the text of the 10k. To get it, let's use the pandas function pct_change on our DataFrame. We first have to set the year as the index, since we don’t want the percent change of that column.

df.set_index('Year', inplace = True)
df = df.pct_change()
df = df.iloc[1:,:]
df.reset_index(inplace = True)
df['Similiarity Score'] = df['Similiarity Score'].abs()
df.head()
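If pct_change is new to you, this small sketch runs the same steps on made-up numbers so the arithmetic is easy to check. The values are invented for illustration and are not Whirlpool's scores.

```python
import pandas as pd

# Made-up similarity scores, not real data
toy = pd.DataFrame({"Year": [2002, 2003, 2004, 2005],
                    "Similiarity Score": [0.80, 0.88, 0.66, 0.99]})

change = (toy.set_index("Year")  # keep Year out of the calculation
             .pct_change()       # (current - previous) / previous
             .dropna()           # the first year has no previous value
             .abs()              # keep only the size of the change
             .reset_index())
print(change)
```

Going from 0.80 to 0.88 is a change of 0.10, from 0.88 to 0.66 is 0.25 once the sign is dropped, and from 0.66 to 0.99 is 0.50. The first year disappears because there is nothing before it to compare against.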

Getting Whirlpool Stock Data

I have to admit that I cheated here and copy/pasted the year end prices for Whirlpool from Yahoo! Finance. Since their API is deprecated, it is much harder to get stock price information. You could also use pd.read_html but sometimes, using Microsoft Excel is faster!

A CSV version of the year end stock price data can be downloaded here:

Let’s clean up and format the stock price data, take the year-to-year delta, and merge it with our original DataFrame containing the deltas of the cosine similarity scores.

prices = pd.read_csv('whp_closing_prices.csv')
prices = prices.sort_values(by = 'Date')
prices = prices.iloc[1:,:]
prices.head()

If you look at the index, you’ll notice that it starts at 17 since it was sorted in ascending order. This is essential to fix since this DataFrame will be merged with our original DataFrame containing the cosine similarity scores. To fix this, we need to do the following:

prices.columns = ['Date', 'Close']
prices.reset_index(inplace = True)
prices.drop('index', axis = 1, inplace = True)

We now need to get the percent change of the stock price data from year to year.

prices.drop('Date', axis = 1, inplace = True)
prices = prices.pct_change()
prices = prices.iloc[1:,:]
prices.reset_index(inplace = True)
prices.drop('index', axis = 1, inplace = True)

We’re almost done preparing our data. Some stock price deltas are negative and need to be changed to be positive. This is because all we care about is the magnitude of the change. I also multiplied the cosine similarity and stock price data by 100 to make it easier to read.

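prices['Close'] = prices['Close'].abs()
prices['Close'] = prices['Close'].apply(lambda x: x*100)
# Assign the column itself; the rows line up because both indexes were reset to start at 0
df['Close'] = prices['Close']
df['Similiarity Score'] = df['Similiarity Score'].apply(lambda x : x*100)
df.head(10)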
We now have a nicely formatted DataFrame!
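The steps above line the two data sets up by row position, which works only while both are sorted the same way and have the same number of rows. A more defensive option is to join on the year itself. The sketch below uses made-up values, and the date format in the price data is an assumption, since the actual layout of the CSV is not shown here.

```python
import pandas as pd

# Made-up values; column names mirror the ones used above
sim = pd.DataFrame({"Year": [2003, 2004, 2005],
                    "Similiarity Score": [10.0, 25.0, 50.0]})
px = pd.DataFrame({"Date": ["2005-12-30", "2003-12-31", "2004-12-31"],  # assumed format
                   "Close": [4.0, 12.0, 7.5]})

# Derive the year from the date, then join on it so row order no longer matters
px["Year"] = pd.to_datetime(px["Date"]).dt.year
merged = sim.merge(px[["Year", "Close"]], on="Year", how="inner")
print(merged)
```

Even though the price rows are deliberately out of order, each year ends up next to its own price change.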

Lastly, let's use the “corr” function to get the correlation between the cosine similarity deltas and the stock price deltas.

df['Similiarity Score'].corr(df['Close'])

It seems like there is a slight correlation between the % change of the cosine similarity scores and the % change of the company’s stock price. This correlation is fairly low, but it would be interesting to do this analysis with other companies using metrics other than the change in the company's stock price (maybe moving average or volatility?).
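For reference, the “corr” function on a pandas Series computes the Pearson correlation coefficient by default. Here is a minimal sketch on made-up numbers that checks the pandas result against the textbook definition. The two series are invented for illustration.

```python
import pandas as pd

# Made-up series, not the Whirlpool data
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0])

r_pandas = x.corr(y)  # Pearson by default

# Definition: sum of products of deviations, divided by the product
# of the square roots of the summed squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() ** 0.5 * (dy ** 2).sum() ** 0.5)
print(r_pandas, r_manual)
```

Both come out at 0.6 for this toy data. The coefficient ranges from -1 to 1, and because we took absolute values of both deltas, a positive value here means bigger text changes tend to go with bigger price moves in either direction.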

Next Steps

  • Get data for more companies, not just Whirlpool
  • Use older data
  • Use the moving average, volatility, volume, and other metrics derived from the company's stock price
  • Use specific sections of the 10k (Risks, M&A Activity, Competitive Position etc.). Specific sections can be found at this link: .
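As a starting point for the moving average and volatility idea, here is a rough sketch using pandas rolling windows on made-up prices. The window of 3 and the price values are arbitrary assumptions. With real data you would pick a window that matches the reporting period you care about.

```python
import pandas as pd

# Made-up closing prices, for illustration only
toy_close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 104.0], name="Close")

returns = toy_close.pct_change()                # period-to-period return
moving_avg = toy_close.rolling(window=3).mean() # average of the last 3 prices
volatility = returns.rolling(window=3).std()    # standard deviation of the last 3 returns

summary = pd.DataFrame({"Close": toy_close,
                        "Moving Avg": moving_avg,
                        "Volatility": volatility})
print(summary)
```

The first rows are empty because a window needs three values before it can produce a result, and volatility starts one row later than the moving average since the first return is itself missing.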

Link to code/relevant documents

*** I couldn’t upload the 10k’s to GitHub because of file size restrictions, but the year end stock prices are there. Please use the CSV version.

The Data Guy

Finding cool insights with data

Swaraj Patankar

Written by

Insights & Data Consultant | swaraj276@gmail.com
