Predicting Stock Prices in 50 lines of Python.

In this blog post we’re going to build a stock price prediction graph using scikit-learn in just 50 lines of Python.

Is there something we can do to predict future stock prices given a dataset of past prices? Yes: with the power of machine learning, this becomes a data science problem. Keep in mind, though, that according to the efficient market hypothesis the stock market is random and unpredictable.

We’re going to build three different predictive models that predict the price of Apple stock, then plot them all on a graph to compare their results.

Steps:

  1. Install Dependencies
  2. Collect Dataset
  3. Write Script
  4. Analyze Graph

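These are our four dependencies. Three of them are installed with pip; the csv module is part of Python’s standard library and needs no installation.

pip install numpy
pip install scikit-learn
pip install matplotlib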

The csv module will allow us to read data from the CSV file of stock prices.

NumPy will let us perform calculations on our dataset.

scikit-learn will let us build a predictive model.

matplotlib will let us plot our data points with our model on a graph for us to analyze.

Let’s collect our dataset from Google Finance. Type Apple into the search box, click Historical Data, then click Download.


The next step is to write our script.

We import our dependencies at the top of the script, and we’ll use the given names to reference them throughout our code.
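The imports for a script like this might look as follows. This is only a sketch: the aliases np and plt are common conventions assumed here, and SVR is scikit-learn’s support vector regression class.

```python
import csv                       # standard library: reads the CSV file of prices
import numpy as np               # reshaping and calculations on the dataset
from sklearn.svm import SVR      # support vector regression models
import matplotlib.pyplot as plt  # plotting the data points and the models
```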


We start by initializing two empty lists, dates and prices. Then we write a function called get_data that fills them both with the relevant data. Its argument is the name of our stock prices CSV file.

We use a with block to open the file and assign it to the csvfile variable. The ‘r’ parameter opens the file for reading. Next we create a file reader variable, which the csv module builds for us using its reader method with our CSV file as the parameter. This lets us iterate over every row in the file, and the next function returns one row at a time. We call next once up front to skip the first row, since it only holds the column names.

Now, for each row in our CSV file reader, we add the date and price values to their respective lists. The append method adds an item to the end of a list. We only want the day of the month, so we take the first column of the row, which is at index zero, use the split method to break it apart at the dashes between its three values, and keep the first value in the resulting list, which is the day. The return statement at the end finishes the function.
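A minimal sketch of get_data along these lines is shown here. It assumes the date sits in the first column in a day-month-year form such as 26-Feb-16 and that the price we want is in the second column; adjust the column index to match your downloaded file.

```python
import csv

dates = []   # day of the month for each row
prices = []  # price for each row


def get_data(filename):
    """Fill the module-level dates and prices lists from a CSV file."""
    with open(filename, 'r', newline='') as csvfile:
        csv_file_reader = csv.reader(csvfile)
        next(csv_file_reader)  # skip the header row of column names
        for row in csv_file_reader:
            # Assumed date format: day-month-year separated by dashes
            dates.append(int(row[0].split('-')[0]))
            # Assumed: the price of interest is in the second column
            prices.append(float(row[1]))
    return
```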


Let’s move ahead and write our second and last helper function, predict_price, which builds our predictive models and graphs them as well. First we use NumPy’s reshape function to format our dates list into an n-by-1 matrix. Its parameters are the list we want to reshape, the new shape, which here is a single column with as many rows as our dates list, and optionally the order of elements. Next we create three models, each of which is a type of support vector machine.
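To make the reshape step concrete, here is a small sketch using a hypothetical list of four days. scikit-learn expects one row per sample and one column per feature, which is why the flat list becomes a column.

```python
import numpy as np

dates = [26, 25, 24, 23]  # hypothetical days of the month

# Turn the flat list into an n-by-1 matrix: one row per sample, one feature each
dates_matrix = np.reshape(dates, (len(dates), 1))

print(dates_matrix.shape)  # (4, 1)
```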

A support vector machine (SVM) is, in its basic form, a linear separator: it takes data that has already been classified and uses it to predict the classes of unclassified data. SVMs can be used for regression as well. Support vector regression (SVR) is a type of SVM that treats a band around the fitted function as a margin of error, ignores deviations that fall inside it, and uses the fit to predict the most likely next point in a dataset.
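As a rough illustration, the three models could be created and fitted as below. The choice of linear, polynomial, and RBF kernels, the values of C, degree, and gamma, and the toy prices are all assumptions made for this sketch, not tuned settings.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical toy data: day of the month and a made-up price for each day
dates = np.reshape([1, 2, 3, 4, 5, 6, 7, 8], (8, 1))
prices = [100.0, 101.5, 101.0, 102.5, 104.0, 103.5, 105.0, 106.5]

# Three SVR models that differ in kernel; hyperparameters are assumed values
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3, degree=2)
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)

for model in (svr_lin, svr_poly, svr_rbf):
    model.fit(dates, prices)

# predict expects a 2-D array of samples, so day 9 is wrapped as [[9]]
next_day = np.array([[9]])
print(svr_lin.predict(next_day)[0])
print(svr_poly.predict(next_day)[0])
print(svr_rbf.predict(next_day)[0])
```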


It’s time to create our graph. We plot the initial data points as black dots with a data label, then plot each of our models as well. For each one we call the predict method of the SVR object in scikit-learn with the dates matrix as its parameter. Each model gets a different color and a distinct label. We then label the x-axis and y-axis accordingly and add a title and a legend.

The show function displays the graph on screen, and we return the predictions from each of our models. Now we can call our get_data function on our CSV file and create a variable called predicted_price to store the predictions.
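Putting the fitting and plotting steps together, predict_price might be sketched as follows. The kernels, hyperparameters, colors, labels, and the third parameter x (the day to predict) are assumptions for illustration, and this version returns a dictionary keyed by model label rather than any particular format from the original script.

```python
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt


def predict_price(dates, prices, x):
    """Fit three SVR models, plot them, and return each prediction for day x."""
    dates = np.reshape(dates, (len(dates), 1))  # n-by-1 matrix of samples

    # Kernel choices, hyperparameters, and colors are assumed for this sketch
    models = {
        'Linear model': (SVR(kernel='linear', C=1e3), 'green'),
        'Polynomial model': (SVR(kernel='poly', C=1e3, degree=2), 'blue'),
        'RBF model': (SVR(kernel='rbf', C=1e3, gamma=0.1), 'red'),
    }

    # Original data points as black dots
    plt.scatter(dates, prices, color='black', label='Data')

    predictions = {}
    for label, (model, color) in models.items():
        model.fit(dates, prices)
        # Draw the fitted curve over the same dates used for training
        plt.plot(dates, model.predict(dates), color=color, label=label)
        # predict expects a 2-D array, so the single day is wrapped as [[x]]
        predictions[label] = model.predict(np.array([[x]]))[0]

    plt.xlabel('Date')
    plt.ylabel('Price')
    plt.title('Support Vector Regression')
    plt.legend()
    plt.show()
    return predictions
```

With the lists filled by get_data, a call could look like predicted_price = predict_price(dates, prices, 29), where day 29 is only a placeholder.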


Run python script.py, and now let’s analyze our graph. Each of our models shows up in the graph, and the RBF model seems to fit our dataset the best.


You can find the full code for this tutorial on my GitHub.