U.S. stock price analysis in the COVID-19 pandemic
What happens when two curious minds with an interest in data and global events, still on the lookout for their future destinations, join forces for a data-driven project?
In our case, it’s been a whirlwind of new concepts, trial and error, and lots of googling, with a steep — sometimes dizzying — learning curve. The outcome? An analysis of stock price movements during the coronavirus pandemic, coded in Python end to end.
The project’s goals and our own ambitions
Since the start, our project has been exploratory in nature. A few preliminary assumptions served as a stepping stone, and more specific goals crystallized along the way. Overall, we aimed to analyze US stock price movements in relation to the growing number of coronavirus cases, taking into account differences among industries and individual companies. Eventually, we expanded the scope to include stock news sentiment analysis.
Safe to say, we decided early on that our own learning and opportunity to practice new methods and skills would come first. Our main shared ambition? To build a project on our own, without the need for commercial tools, from data collection to visualization. In agreement with our mentors, we picked Python as our weapon of choice.
To practice the language as much as possible, we wanted to become comfortable working with a number of libraries for data scraping, cleaning, transformation, (statistical) analysis and visualization, e.g. pandas, numpy, matplotlib and seaborn. At the less realistic end of our goals, we also wished to get our hands on a bit of machine learning and toyed with the idea of creating a predictive model. Familiarity with Git and GitHub emerged as a necessity.
What data did we use?
- Stock prices by indices scraped from Yahoo Finance (via yfinance and yahoo_fin libraries)
- Coronavirus data sourced from Kaggle + scraped from Worldometer (via beautifulsoup)
- Stock news scraped from investing.com + sentiment data (via vaderSentiment)
Developer tools used
- Python 3 — used libraries: pandas, numpy, yahoo_fin, yfinance, datetime, os, path, glob, matplotlib, seaborn, requests, lxml, sklearn, fuzzysearch
- Git, GitHub
First, we use the yfinance and yahoo_fin libraries to scrape stock price data from January 2019 to the end of October 2020 for companies listed on some of the major US-based indices. We make sure the code is reusable so that we only need to change one line of code, the name of the index, to access the relevant data. Similarly, the time period can easily be changed if needed.
Data on coronavirus cases is sourced from a comprehensive dataset found on Kaggle which includes detailed counts of new cases, total cases, deaths etc. for every country from the start of the pandemic.
Stock news is scraped from investing.com.
Getting our data in shape
Next, we transform the collected stock price data into data frames according to our needs, merging them with information about individual companies (name, sector, industry etc.) to be used as filters in analysis and visualization. The initial steps, from scraping to transformation, are tested on the Dow index (which includes only 30 companies) before moving to the S&P 500, which we then use throughout the project.
Since we focus on US-based indices, we also narrow down the available coronavirus data, creating a separate data frame for the United States covering the period from early March 2020 (when the first cases were reported in the country) until the end of October 2020.
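The merge step described above can be illustrated with a toy example; the frames here are hypothetical stand-ins for the full S&P 500 data:

```python
import pandas as pd

# Hypothetical long-format prices and a company lookup table
prices = pd.DataFrame({
    "ticker": ["AAPL", "BA", "AAPL"],
    "date": ["2020-03-02", "2020-03-02", "2020-03-03"],
    "open": [70.57, 282.00, 75.92],
})
companies = pd.DataFrame({
    "ticker": ["AAPL", "BA"],
    "name": ["Apple Inc.", "Boeing Co."],
    "sector": ["Technology", "Industrials"],
})

# Left-join company metadata onto every price row so that sector
# and industry can be used as filters in charts and statistics
merged = prices.merge(companies, on="ticker", how="left")
```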
Sector overview: let’s normalize it
As the next step, often revised and closely intertwined with statistical analysis, we dive into visualization. Soon enough, we can see the first trends, the ups and downs, hidden among the rows and columns of the massive datasets. Each chart offers new insights and guides our steps forward. Basic line charts lead to multiple layers and sources of data displayed at once, and original stock prices lead to normalization.
We realize that stock prices across sectors differ too much to be comparable. That is why we opt for min-max normalization (rescaling each series between its minimum and maximum price within the chosen period), which makes it possible to draw relevant insights from the visuals, and we adhere to normalized values throughout the rest of the project.
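The normalization itself is plain min-max scaling; a minimal sketch with made-up prices:

```python
import pandas as pd

def min_max_normalize(s: pd.Series) -> pd.Series:
    """Rescale a price series to the 0-1 range within the chosen period."""
    return (s - s.min()) / (s.max() - s.min())

prices = pd.Series([100.0, 150.0, 125.0, 200.0])
normalized = min_max_normalize(prices)
# The minimum maps to 0.0 and the maximum to 1.0, so series with very
# different price levels become directly comparable in one chart.
```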
The advantage of this approach is clear from the following charts. The first one shows original stock prices across sectors compared to the number of new covid cases, the second one uses normalized values that give us a clearer picture of the rises and falls.
The above charts provide a good first overview of the situation, revealing a common fall and similar patterns, with some sectors outperforming others. However, when 500 listed companies are grouped into 11 sectors, too many details and differences remain hidden in the data. For example, the Industrials sector is quite diverse: while it seems to be doing well overall, the situation in the Airlines industry, one of its components, is very different. That is why we proceeded to explore specific industries within the sectors, followed by individual companies later on.
Closer look at industries
Once again using normalized values ranging from 0 to 1, a quick look at a few selected industries reveals clear differences in their price movements. The chart below, which allows for year-on-year comparison, confirms early assumptions one may have about the stock market’s behaviour in the time of covid, when travel is restricted, people stay at home, and their work and leisure time is spent online.
While the Airlines, Travel Services and Aerospace & Defense industries did well before covid reached the US and the stock market took a plunge, they have shown extremely slow recovery, if any. On the other hand, Internet Retail, Electronic Gaming & Multimedia and Diagnostics & Research seem to be the winners of the pandemic, faring much better in the past months than before.
…and individual companies!
Finally, it’s time to look at the (big) names! Thanks to granular stock price data for each company and each trading day, we can drill down further from industries to the companies within them. Now we can see even more variance, with some industries being more consistent in their behaviour (as in the example of Online Retail below)…
Correlation between new cases and normalized opening price: industry average vs. individual stocks (Online Retail)
… than others (such as Drug Manufacturers — Specialty & Generic in which case individual companies span a much wider range and one of the companies, Mylan, falls far behind the others).
Correlation between new cases and normalized opening price: industry average vs. individual stocks (Drug Manufacturers — Specialty & Generic)
The scatter plots above offer another important insight on top of the previous charts: the correlation between stock price changes and new coronavirus cases, which goes beyond a simple linear time series by presenting case numbers regardless of the date on which they occurred.
Correlation: Who’s in step with covid?
Once we broke free from the linear timeline, a host of ideas opened up and we set off to explore. What is the connection or relationship between cases and prices? Do we focus on industries again or one company at a time? Shall we use new cases or total cases or something else altogether?
A fair share of trial, error and reasoning later, we saw the greatest value and potential in focusing on individual companies, and the question became: whose behaviour is most in step with newly reported cases, and who carries on unaffected? We computed the Pearson correlation coefficient between new covid cases and opening prices for all collected stocks. When we selected the 10 companies with the strongest and the 10 with the weakest correlation, we noticed a few familiar names.
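The correlation step can be sketched with toy data; in the real analysis the inputs are the daily new-case counts and one normalized opening-price column per S&P 500 ticker, and the column values below are invented:

```python
import pandas as pd

# Toy frame: daily new covid cases plus normalized opening prices
df = pd.DataFrame({
    "new_cases": [100, 300, 500, 800, 1200],
    "AMZN": [0.10, 0.25, 0.45, 0.70, 0.95],   # rises with cases
    "BA":   [0.90, 0.60, 0.40, 0.30, 0.20],   # falls as cases rise
})

# Pearson r of each stock's price against the new-case count
corr = df.drop(columns="new_cases").apply(lambda col: col.corr(df["new_cases"]))

# Rank to find the stocks most and least in step with new cases
strongest = corr.sort_values(ascending=False).head(10)
weakest = corr.sort_values().head(10)
```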
Let’s take a look at a few selected companies. Among those with the strongest correlation appeared the names of Amazon, Netflix and EA, further confirming the previously seen industry trends while also adding the notion of a strong relationship with new case numbers in particular. The examples of Amazon and Netflix are plotted below. Looking at the scatter plot and the line chart to its right, we can see that, in these two cases in particular, as the number of new cases goes up, so does the stock price.
On the other hand, some of the weakest correlations appeared in the case of Boeing and a number of US-based regional banks, such as U.S. Bancorp. In both cases, the timeline charts show a much steeper fall at the beginning of the covid period and fairly low, stagnant prices even as the number of new coronavirus cases kept rising.
Sentiment analysis: what does the news say?
To capture factors other than COVID in our stock price analysis, we decided to conduct a sentiment analysis of financial news scraped from investing.com, a platform that provides real-time data, quotes, charts and historical news across multiple exchanges.
Note on the source: While one might think there is a myriad of platforms that store historical financial news data, unless you are willing to pay for a premium membership (e.g. $49.90 for a one-month subscription to Yahoo Finance or even $250 for Factiva), the options are rather limited. Investing.com is the only platform we found that publicly displays historical stock news for free for an extended period of time.
The scraping was fairly straightforward, as the source code of investing.com is very coherent and consistent. However, there are two things to watch out for while scraping:
1) Sponsored articles and ads: these may pop up but are fortunately easy to distinguish from regular financial news, as they have no date attached.
2) Current date as the start date: no matter the time frame for which you wish to extract the news, you always need to start crawling from today’s date. The reason is that the website is structured as a chronological list of news broken down into several web pages, with no option to filter by news date (making it hard to know on which page to start scraping).
Once we scraped the news data and put it into a pandas dataframe (columns: article headline, text and date), we applied sentiment analysis (using vaderSentiment: https://github.com/cjhutto/vaderSentiment.git), which outputs four scores for each article: positive, negative, neutral and compound.
The real struggle came when we tried to match our S&P 500 companies to the articles based on the news headlines. After some fairly lengthy internet research, we realized that there is no single recommended way to do this, but rather several approaches, each with its own benefits and costs. We mainly experimented with the following two:
1) TF-IDF + cosine similarity: A failed attempt
In a nutshell, TF-IDF is a technique used to calculate the importance of each word in a document based on the relative frequency with which that word or n-gram (a short sequence of characters or words) appears in the document and in the whole corpus. Part of this exercise is breaking down the text documents (in our case the series of article headlines) and the search terms (S&P 500 companies) into n-grams and vectorizing them.
The next step is to apply cosine similarity — a technique used to measure the angle between two vectors (ngrams) to indicate how similar they are.
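The two steps above can be sketched with scikit-learn; the headlines and company names below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headlines = [
    "Apple unveils new iPhone amid supply concerns",
    "Boeing deliveries slump as airlines defer orders",
]
companies = ["Apple Inc.", "Boeing Co."]

# Character n-grams make matching robust to suffixes like "Inc."
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
doc_vectors = vectorizer.fit_transform(headlines)
term_vectors = vectorizer.transform(companies)

# Rows: headlines, columns: companies; values: similarity in [0, 1]
similarity = cosine_similarity(doc_vectors, term_vectors)
```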
The result of this step was a list (or rather a list of lists) of similarity scores indicating the likelihood of a match, which we converted into a pandas dataframe (rows: article headlines, columns: companies).
With just four lines of code, we were impressed by the sheer simplicity of this method. Yet after looking at the output, we realized that it is not THAT easy. Since the article headlines are of different lengths than the company names, the analysis needs to be brought down to the level of individual words or short sequences of words. For each company name, we would have to parse each article headline, break the text down into unique word combinations (with the number of words in each combination determined by the number of words in the company name), and find matches at this level.
This made the method rather cumbersome. Given that our company names closely matched the names mentioned in the articles, the main benefit of this method, high matching effectiveness, was outweighed by its costs: higher complexity and the substantial processing power needed to execute it. That is why we decided to work around it with fuzzy search.
2) Fuzzy search: A convenient workaround
We turned to a second method: fuzzy search (the fuzzysearch library’s find_near_matches function). This also took only a few lines of code and very little time to run.
With this method, one can adjust max_l_dist, the maximum number of characters (the Levenshtein distance) by which the search term and the matched text may differ. In our case, the search terms closely matched those used throughout the article headlines, so we received the best results when we set the distance to 0.
Once again, we received the output as a list of lists, but this time with no similarity scores, only a binary indication of a match (empty string or “match”), which we converted into a dataframe (rows: article headlines, columns: company names).
Browsing through the results, however, we realized that the number of articles for each company is very limited and hardly enough for a conclusive analysis (for illustration, during our time window Apple, a company with substantial media attention, was mentioned in just 13 articles). For this reason, we looked at sentiment only at the level of the whole S&P 500 index (i.e. only at articles which mentioned the term S&P).
What was the S&P 500 sentiment during COVID?
For dates with more than one article, we calculated the mean of the compound scores and plotted it in a bar chart:
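The aggregation itself is a simple groupby; a toy example with hypothetical scores:

```python
import pandas as pd

# Toy article-level sentiment: several articles can share a date
articles = pd.DataFrame({
    "date": ["2020-03-09", "2020-03-09", "2020-03-10"],
    "compound": [-0.8, -0.4, 0.6],
})

# One sentiment value per day: the mean of that day's compound scores
daily_sentiment = articles.groupby("date")["compound"].mean()
```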
Sentiment above zero indicates positive news, below zero negative, and around zero neutral. It is probably little surprise that sentiment is significantly skewed towards the extremes (either very positive or very negative), since article headlines need to be captivating enough to attract readers. What may come as a surprise, however, is the substantial predominance of positive news during the COVID period. Browsing through a few of these articles, we arrive at a potential explanation: while there are obvious differences within the index in how companies and industries cope with COVID, putting these intra-index differences aside, the S&P 500 index as a whole performs better than the overall market. However, additional analysis is needed to confirm this hypothesis.
Is sentiment in sync with index growth?
In addition to the sentiment evolution, we wanted to see whether sentiment corresponds to the actual index growth, so we plotted the S&P 500 sentiment value against the actual index growth rate. We normalized both parameters and smoothed the sentiment value (due to high polarity and missing values) by calculating a monthly moving average. The results indicate that both parameters move in tandem and that the news follows rather than predicts the index growth, a reasonable finding given that the majority of the index news is descriptive in nature (i.e. provides a post factum description of the index performance).
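The smoothing and normalization can be sketched as follows; the sentiment series here is synthetic, standing in for the real daily values with gaps:

```python
import numpy as np
import pandas as pd

# Synthetic daily sentiment with missing days, as in the real data
dates = pd.date_range("2020-03-01", periods=120, freq="D")
sentiment = pd.Series(np.sin(np.linspace(0, 6, 120)), index=dates)
sentiment.iloc[::7] = np.nan

# ~Monthly moving average smooths the high polarity and bridges gaps
smoothed = sentiment.rolling(window=30, min_periods=5).mean()

# Min-max normalization makes sentiment and index growth comparable
normalized = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min())
```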
In addition, it can be seen that after the initial drop in March 2020, the index grew back, reaching its initial levels in June 2020. Growth continued until the end of August 2020, when the second wave of the pandemic set in, yet this time the index dropped with a much smaller magnitude. Sentiment followed a similar path, albeit with a much more volatile trajectory; given the limited amount of data, however, this finding is inconclusive.
Limitations
While working on the project, we encountered several limitations that shaped its final outcome. These can be seen as lessons or starting points for other projects.
- Not enough data: Our ambitious idea of creating a predictive model failed due to a lack of data, especially comprehensive data covering the entire period of the pandemic, which is still ongoing.
- Not enough text data: For the sentiment analysis, more text content on the index and individual companies (e.g. tweets, comments) would be needed for a sound analysis.
- Not enough resources: As stock market news is itself a commodity and financial news websites carefully guard their content, better results would be achieved with, for example, access to premium Yahoo Finance features, including individual company news and commentary, and/or other premium databases (e.g. Hoovers, Factiva).
Value of the project
- Insights gained from bringing together stock prices, covid cases and stock news data
- Opportunity for small investors to do their own analysis without the need to access expensive commercial platforms (especially in case of expanding the scope to more indices, countries or stock news)
- An inspiration to any student new to coding who wants to embark on a similar learning trajectory
Since the main goal of our project was to learn and try out as many things as we could, it’s safe to say we succeeded, and in some ways even exceeded our expectations.
Our roles & final notes
From the beginning we worked on the project together, discussing our ideas, next steps and more often than not also the issues we encountered along the way. Luckily, many of our interests and grievances aligned and we could support each other throughout the project. We took turns sharing our progress and writing code. At first, I may have been more familiar with Git and GitHub, so I created the project repository and helped out when I could. Though we shared our tasks, sometimes we interpreted the next steps agreed with mentors differently and each came up with a different approach. I mainly contributed with visualizations, ideas and annoying questions of “what’s the point of this” and “what does that really say”, hopefully for the benefit of greater clarity, not to the detriment of the project. I am extremely grateful to Y., who did a great deal of work in the final two weeks, especially on sentiment analysis, when I was overwhelmed with other commitments, and I admire her new-found coding talent as well as her ability to quickly grasp new concepts and use new libraries.
This has been an amazing learning journey! I am very grateful for the great chemistry we had and how we complemented one another. When one of us was struggling and did not understand something, the other one often did and could explain it to her. The project did not have a clearly defined path, as we agreed upfront that we would adjust it organically along the way to best suit our learning needs and interests. Yet we always agreed on the next steps and aligned on the analytical techniques we wanted to experiment with. Sometimes we split tasks, sometimes we worked independently on the same exercise and then shared our findings with one another. This was always interesting, since we often approached problems differently and the best output often came when we combined our insights. My main contribution was scraping the financial news and carrying out the sentiment analysis, alongside the shared endeavours on the regression and correlation plots, data merging and transformations. I am also very grateful that, thanks to the neverending patience of our mentors, I managed to satisfy my geeky wish to work in a remote environment and set up an AWS instance.
Finally the biggest thank you goes to our mentors for their great energy, everlasting support and patience.
The full code can be found on GitHub at https://github.com/veronikahalamkova/da_projekt.git