Quant Post 1: Building an equity data warehouse in PostgreSQL and Python

Antonio Constandinou
4 min read · Jul 15, 2018


This is the first project in my quantitative finance series, a series meant to bring together my software development skills and quantitative hedge fund experience across various challenging and interesting projects.

Ever since reaching a great stopping point in my software developer journey (you can read more about that here), I’ve wanted to build a data warehouse of stock market data that would be used as a basis for future projects.

Combining these new programming skills with my previous Python experience allows me to tackle basic but interesting quantitative projects. This series is also an outlet for me to share my passion for the quantitative space while learning new subject matter. This PostgreSQL database is the first step in that pursuit.

GitHub Repository

For access to the code base that I will be discussing in this blog post, here is a link to the GitHub repository. I welcome any feedback or comments on my code and approach (please be nice).

Approach and Implementation

Before tackling this project, I had learnt PostgreSQL through my online coursework. Our back-end course used Ruby and the command line interface to teach the fundamentals of SQL, and I used the psql console and its many meta-commands to create and interact with tables across various databases.

However, I wanted to integrate that knowledge into Python and automate as much of the process as possible.

My Python code handles the following:

  1. Building the database schema which includes all our tables and columns (i.e.: column names, types, constraints, foreign keys). Link to code.
  2. Web scraping Wikipedia for a list of S&P 500 listed stocks. Link to code.
  3. Web scraping Yahoo Finance for stock market data. Data collected includes Date, Open, High, Low, Close, Adjusted Close and Volume. Link to code.
  4. Outputting a text file for any stock data that we were unable to collect.
  5. Outputting a text file listing every stock we did collect, along with the first and last dates of data collected for each.

Some great libraries I got to learn and use in this project were psycopg2, fix_yahoo_finance and bs4.
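As an illustration of step 2, here is a minimal sketch of how the Wikipedia scrape can be done with requests and bs4. The URL, the use of requests and the cell index are my own simplified example, not a copy of the repository's script, and the page's table layout may have changed since 2018.

```python
# A simplified sketch of the Wikipedia scrape (step 2); not the repository's
# actual script, and the table layout may differ from what it expects.
import requests
from bs4 import BeautifulSoup

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

def fetch_sp500_tickers():
    """Return a list of ticker symbols parsed from the Wikipedia table."""
    response = requests.get(WIKI_URL)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    tickers = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if cells:
            tickers.append(cells[0].text.strip())  # ticker is the first column
    return tickers

if __name__ == "__main__":
    print(fetch_sp500_tickers()[:10])
```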

To accomplish the above requirements, I decided to split each key component into its own Python script. You can find all steps and explanations in the accompanying README.md file in the GitHub repository.

“Conceptual” Level ERD

Arrows represent “many” in a relationship. For example: exchange has a (1:M) relationship with symbol.

Here we can see the following relationships:

  • exchange has a (1:M) relationship with symbol.
  • symbol has a (1:M) relationship with daily_data.
  • data_vendor has a (1:M) relationship with daily_data. The daily_data table could later be extended to allow price data for identical dates to come from multiple data vendors; that would establish a (M:M) relationship and require a third ‘cross-reference’ table.

“Physical” Level ERD

Legend — PK = Primary Key, FK = Foreign Key, N = Not Null
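To make the schema concrete, here is an illustrative sketch of the kind of DDL the build script issues through psycopg2. The exact column names and types below are my own simplified guess based on the ERD, not a copy of the repository's schema, and the connection details are placeholders.

```python
# Illustrative DDL for the four tables in the ERD, issued through psycopg2.
# Column names/types are a guess based on the diagram; connection details
# are placeholders, not values from the repository.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS exchange (
    id      SERIAL PRIMARY KEY,
    abbrev  TEXT NOT NULL,
    name    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS data_vendor (
    id    SERIAL PRIMARY KEY,
    name  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS symbol (
    id           SERIAL PRIMARY KEY,
    exchange_id  INTEGER NOT NULL REFERENCES exchange (id),
    ticker       TEXT NOT NULL,
    name         TEXT
);
CREATE TABLE IF NOT EXISTS daily_data (
    id               SERIAL PRIMARY KEY,
    data_vendor_id   INTEGER NOT NULL REFERENCES data_vendor (id),
    symbol_id        INTEGER NOT NULL REFERENCES symbol (id),
    date_price       DATE NOT NULL,
    open_price       NUMERIC,
    high_price       NUMERIC,
    low_price        NUMERIC,
    close_price      NUMERIC,
    adj_close_price  NUMERIC,
    volume           BIGINT
);
"""

conn = psycopg2.connect(host="localhost", dbname="stock_db",
                        user="postgres", password="secret")
with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    cur.execute(DDL)
conn.close()
```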

Yahoo Finance Data — The ‘easy’ days are over

For the purposes of a “non-institutional” grade market data database, and with a starting budget of $0, I decided to use the fix_yahoo_finance library as a quick and easy way of getting market data. There are other approaches to collecting stock market data, but I figured I’d follow the path of least resistance with a resource I’ve used in the past.
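For reference, here is a minimal sketch of pulling a single ticker with fix_yahoo_finance. The real download script loops over the full symbol list, logs failures and writes results into PostgreSQL; the ticker below is just an example.

```python
# A minimal sketch of pulling one ticker with fix_yahoo_finance; the
# ticker is an example, not a statement about the repository's script.
import fix_yahoo_finance as yf

# Returns a pandas DataFrame indexed by Date, with Open, High, Low, Close,
# Adj Close and Volume columns.
df = yf.download("AAPL", start="2004-12-30", end="2017-12-01")
print(df.head())
```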

As you’ll see further below, this approach did result in missing some of the expected stock data.

For my personal development purposes, I don’t see this as a big issue. Although I do plan on using this data for future quantitative research, I won’t be 100% confident in the results. The data itself possesses various biases and limitations and I will highlight some of those biases in future blog posts.

PostgreSQL Bulk Load Efficiencies

When dealing with loading financial data into my PostgreSQL table daily_data, I wanted to utilize a bulk load approach for each stock.

Our start date was 12/30/2004 and our end date was 12/1/2017. These dates are arbitrary; I simply wanted a good range of market data for future use. The range covers approximately 3,245 trading days, and with 505 stocks listed as constituents of the S&P 500, we would expect roughly 1,638,725 rows (3,245 × 505).

Total number of rows in our daily_data table after stock data was loaded.

The actual result of 1,506,514 rows comes down to two issues:

  1. We were unable to acquire data for a small number of stocks. A total of 13 symbols appeared in the output text file “failed_symbols.txt”.
  2. Some of the data we did pull did not go all the way back to our start date of 12/30/2004; data for some stocks only begins at a later date.

Before refactoring any code, I wanted a simple approach to bulk INSERT a stock’s market data.

Referencing the psycopg2 documentation on cur.executemany(), we see that it is not faster than simply calling execute() in a loop.
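For completeness, here is a sketch of that simple executemany() bulk load, with the daily_data columns assumed from the ERD above. The sample row, foreign-key ids and connection details are placeholders, not values from the repository.

```python
# A sketch of a simple bulk load via cur.executemany(); column names assume
# the illustrative daily_data schema above, and all values are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="stock_db",
                        user="postgres", password="secret")

insert_sql = """
    INSERT INTO daily_data
        (data_vendor_id, symbol_id, date_price, open_price, high_price,
         low_price, close_price, adj_close_price, volume)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
"""

rows = [
    (1, 42, "2017-12-01", 10.0, 10.5, 9.8, 10.2, 10.1, 1000000),
    # ... one tuple per trading day for the stock
]

with conn, conn.cursor() as cur:
    cur.executemany(insert_sql, rows)
conn.close()
```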

For any future need to insert larger amounts of data, there are other solutions I was able to read about. Here is one of a few interesting posts discussing bulk SQL INSERT options. (Caution: the post dates back to 2015. I have no association with the writer or website, nor have I tested the options it explains.)
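As one example of the kind of faster option such posts discuss (my own illustration, not something used in this project), psycopg2.extras.execute_values() folds many rows into a single multi-row INSERT statement:

```python
# Illustration only: execute_values() batches rows into multi-row INSERTs,
# which is generally faster than executemany() for large loads. Column names
# and connection details follow the placeholder schema above.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(host="localhost", dbname="stock_db",
                        user="postgres", password="secret")

insert_sql = """
    INSERT INTO daily_data
        (data_vendor_id, symbol_id, date_price, open_price, high_price,
         low_price, close_price, adj_close_price, volume)
    VALUES %s
"""

rows = [
    (1, 42, "2017-12-01", 10.0, 10.5, 9.8, 10.2, 10.1, 1000000),
    # ... many more tuples
]

with conn, conn.cursor() as cur:
    execute_values(cur, insert_sql, rows, page_size=1000)
conn.close()
```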

Last thoughts

This project turned out really well, in my opinion. Although my online course taught me PostgreSQL within the confines of Ruby and the command line interface, I was able to apply those well-learnt fundamentals to a “Pythonic” approach to building a data warehouse.

We now have the data needed to tackle various equity-focused quantitative research projects.


Antonio Constandinou

A finance professional who is passionate about programming, big data and quantitative research.