“Newstrace” — A method to analyze the impact of news articles on stock prices
Many aspects of the real world are covered by news articles published online. This data is therefore a valuable source for measuring and analyzing the impact of real-world events on stock prices. Many methods and tools try to squeeze as much information as possible out of these text documents and use it for insights and predictions.
Recent advances in machine learning and natural language processing have brought a number of new possibilities. Most of them use sentiment analysis to find positive and negative signals in the texts. A nice introduction can be found in the article https://medium.com/@chengweizhang2012/simple-stock-sentiment-analysis-with-news-data-in-keras-1478b96dd693.
Earlier this month, the interesting post “Using the latest advancements in deep learning to predict stock price movements” (https://towardsdatascience.com/aifortrading-2edd6fac689d) built a model for stock prediction using methods that cover all of today’s popular buzzwords:
- Generative Adversarial Network
- Recurrent Neural Network
- Convolutional Neural Network
- Bayesian optimisation
- Deep Reinforcement learning
- stacked autoencoders
- NLP with BERT
and so on. Together, this forms a huge “black box” of technologies trying to find signals in the data.
In contrast to this, we want to develop a different method, one that uses machine learning techniques to guide a human analyst in gaining insights into the data and the market. Interactive graphics will help to understand and interpret the data. The method consists of the following steps:
- Start with a published news article whose possible impact on a stock price we want to analyze.
- Search for similar news articles from the past with machine learning methods, building the so-called “Newstrace”.
- Check the similarity of the documents on the “Newstrace”.
- Get the stock data and plot the time series in the inspected timespan.
- Calculate and plot the differences of the daily stock price changes between days on the “Newstrace” and all other days.
- Investigate the differences with histograms and “Kernel Density Estimate” plots.
- Calculate the significance of the difference of the means.
- Analyze the sentiment of the news documents with machine learning.
- Interpret the results.
Start with a news article
To demonstrate the method, we use an article about Amazon and the Amazon stock price (“AMZN”) as an example. The starting point of the analysis is the following news article, published on 18.1.2019:
Amazon pushing hard into ocean shipping, making it easier for Chinese goods to get to you
Quietly and below the radar, Amazon has been ramping up its ocean shipping service, sending close to 4.7 million cartons of consumer goods from China to the United States over the past year, records show.
This marks a significant move into what many believe is the company’s overall strategy of eventually controlling much of its transportation network, from trucks to airplanes and now to ships. …
(Source: news.yahoo.com)
Find similar news data in the past
We have to find the news articles from the past 2 or 3 years that are most similar to the given article. This defines the “Newstrace” on the time axis of this article: the subset of days in the past on which similar news was published.
We can use different methods and tools to find news similar to the given article. One option is software based on machine learning models that were trained on large amounts of news data. For example, we can use the “NewsBot” plugin for Google Chrome (https://getnewsbot.com).
But there are several other tools you could use for this task.
For the news about Amazon in our example, we found 20 news posts and stored the information about them in a CSV file. (You can download it from GitHub: https://github.com/astoeckl/newstrace)
The file contains information about:
- Link to the article
We load the data with “Pandas” into a DataFrame:
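A minimal sketch of the loading step (the column names `Date`, `Title`, and `Link` are assumptions based on the file description, not the actual header of the CSV on GitHub; in practice you would pass the downloaded file name to `pd.read_csv`):

```python
import io
import pandas as pd

# Hypothetical sample mirroring the assumed structure of the CSV file;
# the real file would be loaded with pd.read_csv("newstrace.csv") instead.
csv_data = io.StringIO(
    "Date,Title,Link\n"
    "2017-02-09,Amazon ocean shipping news,https://example.com/a\n"
    "2018-06-14,Amazon logistics expansion,https://example.com/b\n"
)
news = pd.read_csv(csv_data, parse_dates=["Date"])
print(news.shape)  # (2, 3)
```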
Check similarity of documents on the Newstrace
First, we look at all the articles to check whether they are similar enough to form a valid “Newstrace” for our further analysis. You could do this by reading all the articles and checking them for consistency. For a quick check, we also want a numerical measure of the similarity, which we can calculate with software.
There are different solutions you could use. In this example, we use the “Dandelion” API for “Semantic Text Analytics” as a service (https://dandelion.eu).
The API provides an endpoint for testing the pairwise similarity of two documents (https://dandelion.eu/semantic-text/text-similarity-demo/).
After registering and getting an API token, we call the API with the URLs of two news articles to compare them:
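A minimal sketch of such a call with the standard library (the endpoint path and the `similarity` field of the JSON response follow the Dandelion documentation at the time of writing; `YOUR_DANDELION_TOKEN`, `base_article_url`, and the loop at the end are placeholders):

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.dandelion.eu/datatxt/sim/v1"
TOKEN = "YOUR_DANDELION_TOKEN"  # placeholder, obtained after registering

def build_request(url1, url2, token=TOKEN):
    """Build the GET request URL for the text-similarity endpoint."""
    params = urllib.parse.urlencode({"url1": url1, "url2": url2, "token": token})
    return API_URL + "?" + params

def similarity(url1, url2):
    """Call the API and return the similarity score of two article URLs."""
    with urllib.request.urlopen(build_request(url1, url2)) as resp:
        return json.load(resp)["similarity"]

# Looping over the "Newstrace" would then look like:
# scores = [similarity(base_article_url, link) for link in news["Link"]]
```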
We got a value of 0.4211 for these two documents. Now we can loop over all documents in the “Newstrace” and compare them with the article we are analyzing to get the values of similarity:
[0.7653, 0.3319, 0.7373, 0.4211, 0.7717, 0.7009, 0.6418, 0.6803, 0.3319, 0.7443, 0.7509, 0.7652, 0.7881, 0.7252, 0.7472, 0.8107, 0.4211, 0.7412, 0.4211, 0.3319]
Based on this list of similarity scores, we want to define a single-number measure for the quality of the whole “Newstrace”. One possibility is the mean of the list; others are the root mean square or the minimum of the list.
We got 0.631 in this example. This number can serve as a measure of how accurately the “Newstrace” tracks the history of the news article. To know which value is high enough, we have to build up some experience by manually checking the news posts and calculating the quality measure of the “Newstrace” for some more examples. For the purposes of the example in this post, the similarity is high enough.
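The three candidate measures can be computed directly from the score list above; the mean reproduces the 0.631 quoted in the text:

```python
import math

scores = [0.7653, 0.3319, 0.7373, 0.4211, 0.7717, 0.7009, 0.6418, 0.6803,
          0.3319, 0.7443, 0.7509, 0.7652, 0.7881, 0.7252, 0.7472, 0.8107,
          0.4211, 0.7412, 0.4211, 0.3319]

mean_quality = sum(scores) / len(scores)
rms_quality = math.sqrt(sum(s * s for s in scores) / len(scores))  # root mean square
min_quality = min(scores)  # the most pessimistic measure

print(round(mean_quality, 3))  # 0.631
```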
Get stock data in the desired time span
In the example, we use the stock prices of Amazon (“AMZN”) and analyze the movement on the “Newstrace”.
In the example, we only use the columns for volume and adjusted close price, so we drop the other columns and rename the remaining ones. The prices are converted to numeric data.
The daily changes are calculated, stored in the column “change”, and rounded.
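The cleaning steps can be sketched as follows. The raw column names follow a typical Yahoo-style export, and the “change” column is computed here as the day-over-day percentage change of the close; both are assumptions, and absolute differences would work the same way:

```python
import pandas as pd

# Hypothetical raw frame standing in for the downloaded stock data
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-15", "2019-01-16", "2019-01-17"]),
    "Open": ["1632.00", "1684.22", "1680.00"],
    "Volume": ["7122400", "6366900", "4208900"],
    "Adj Close": ["1674.56", "1683.78", "1693.22"],
})

# Keep only volume and adjusted close, rename, and convert to numeric data
stock = raw[["Date", "Volume", "Adj Close"]].rename(
    columns={"Volume": "volume", "Adj Close": "close"})
stock[["volume", "close"]] = stock[["volume", "close"]].apply(pd.to_numeric)

# Daily change of the closing price in percent, rounded
stock["change"] = (stock["close"].pct_change() * 100).round(2)
```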
Plot the time series for price
We use the Plotly library (https://plot.ly/) for displaying the stock data and adding interactivity. An introduction to using Plotly can be found in the article https://towardsdatascience.com/introduction-to-interactive-time-series-visualizations-with-plotly-in-python-d3219eb7a7af
The plot shows the time series of the closing price in the timespan of the news articles.
Add the Newstrace to the plot
We mark the days where similar news articles were published in the plot — the so-called “Newstrace”.
Show the daily changes on the “Newstrace”
We annotate the daily changes on the days of the Newstrace and colour them red and green to get a first impression of whether any patterns are visible.
Before plotting, the two data frames for stock prices and news data are merged with Pandas on their “Date” columns, keeping only one entry per day.
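A sketch of this merge step with illustrative frame contents; `drop_duplicates` enforces the one-entry-per-day rule before the left join on “Date”:

```python
import pandas as pd

stock = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-15", "2019-01-16", "2019-01-17"]),
    "change": [0.31, 0.55, 0.56],
})
news = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-16", "2019-01-16"]),  # two posts, one day
    "Title": ["Post A", "Post B"],
})

# Keep one news entry per day, then attach it to the price series
news_daily = news.drop_duplicates(subset="Date")
merged = stock.merge(news_daily, on="Date", how="left")
print(merged.shape)  # (3, 3)
```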
To the plot of the time series we add the shapes for the Newstrace as above, but coloured according to whether the change on that day is positive or negative. The amount of the change is added as an annotation near the shape.
From this picture, it is not clear whether the movement on the “Newstrace” is more up or down, or whether there is a pattern.
In the plot, we can also see the distribution of the publication dates over the investigated time period. The dates look like random points distributed almost uniformly over time, except that no similar news was published in the last half year.
Histograms of the daily changes
First, we restrict the time series to the timespan under observation.
To analyze whether the daily changes differ, we plot histograms of the values on the “Newstrace” and on the other days. The means of the two samples are also added to the plot.
Another possible visualization — the KDE Plot
A histogram can be thought of as a scheme in which a unit “block” is stacked above each point on a regular grid. The choice of grid for these blocks can lead to wildly divergent ideas about the underlying shape of the density distribution. If we instead center each block on the point it represents, we get an estimate of the density distribution. This is called kernel density estimation with a “top hat” kernel. The idea can be generalized to other kernel shapes, for example the Gaussian kernel density estimate (https://en.wikipedia.org/wiki/Kernel_density_estimation).
We use a Gaussian kernel density estimate in the example:
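With SciPy this is only a few lines (the sample values are illustrative; `gaussian_kde` picks its bandwidth via Scott's rule by default):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative daily price changes
changes = np.array([0.55, -0.8, 1.2, 0.3, -0.4, 0.9, -1.1, 0.2])

kde = gaussian_kde(changes)       # Gaussian kernel, bandwidth via Scott's rule
grid = np.linspace(-3, 3, 200)
density = kde(grid)               # estimated density evaluated on the grid
```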
Test the differences of means
The distributions and means in the histogram and the KDE plot show only a small difference between the stock price movements on and off the “Newstrace”. We now want to test whether this difference is significantly different from zero.
First, we select the two sets of days: on and off the “Newstrace”.
Calculate the Student T-test statistic (https://en.wikipedia.org/wiki/Student%27s_t-test) for the two samples of daily price changes:
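With SciPy this looks as follows. The two sample arrays are illustrative stand-ins for the changes on and off the “Newstrace”; `ttest_ind` with its default `equal_var=True` is the classic Student's t-test for two independent samples:

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative daily changes on the "Newstrace" days and on all other days
on_trace = np.array([0.55, -0.8, 1.2, 0.3, -0.4, 0.9, -1.1, 0.2])
off_trace = np.array([0.1, -0.3, 0.6, -0.7, 0.4, 0.0, -0.2, 0.8, -0.5, 0.3])

t_stat, p_value = ttest_ind(on_trace, off_trace)  # Student's t-test

# A large p-value means the difference of the means could well be random
print(p_value > 0.05)  # True for these samples
```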
This high p-value shows that the hypothesis that the difference between the two samples is due to chance cannot be rejected.
So, as the histogram and the density plot suggest, there is no significant difference. The analysis of this news article does not show an impact on the stock price of Amazon.
In a future post, we will show an example that suggests an impact from the “Newstrace” on the price movement of some stocks.
Sentiment analysis of the news
Studies have shown that the sentiment of news articles has an impact on stock prices. So we want to add the sentiment of each article on the “Newstrace” as additional information and analyze this data. Machine learning will be used to calculate a sentiment measure for each text document.
This will also be covered in a future article.