Engineering an Advanced ELT Pipeline for Optimizing Stock Portfolios

Ty Rawls
Published in The Deep Hub · 9 min read · Apr 7, 2024
PostgreSQL blue elephant standing on coins while overseeing the stock market. Money is flying in the air and three businessmen are holding briefcases while looking up at the elephant | Source: Author using Microsoft Designer

Table of Contents

· Introduction
· The Imperative for Efficiency
· Architectural Blueprint
∘ — Why ELT?
∘ — Data Extraction
∘ — Data Loading
∘ — Data Quality & Integrity
∘ — Data Transformation
· Stock Prediction Model
∘ — Mean-Variance Optimization (MVO)
∘ — Input Parameters
∘ — Demonstration
∘ — Results
· Conclusion

Introduction

In today’s ever-changing financial markets, optimizing stock portfolios has become more important than ever. As an investor, I frequently find it challenging not only to choose which assets to invest in but also to determine the appropriate allocation for each one. In this article, I’ll walk you through the detailed process of constructing a robust Extract, Load, Transform (ELT) pipeline specifically aimed at enhancing portfolio performance by optimizing asset allocation. By the end of this article, you’ll understand what I’m doing, why I’m doing it, and how it can help you with your investments. Let’s get into it!

The Imperative for Efficiency

In the realm of portfolio management, the efficacy of decision-making hinges upon the timely availability and accuracy of data. Recognizing this imperative, the development of an ELT pipeline assumes pivotal significance. In conceptualizing this pipeline, my aim was to divide it into two distinct segments, each homing in on a specific objective.

  • Data Focused (upstream process): Retrieve pertinent stock data from a dependable data source and securely store it in a database.
  • Analysis Focused (downstream process): Leverage stored data to identify the most advantageous asset allocation for an investment portfolio, customized to the chosen stocks.

Architectural Blueprint

I architected this pipeline with the flexibility to utilize either local storage or cloud storage, providing adaptability to suit the specific needs and preferences of the project. This design choice empowers seamless integration into existing infrastructure setups while accommodating cost, scalability, and accessibility requirements.

ELT architecture for ingesting stock data, storing it in either a cloud or local database, and performing portfolio optimization | Source: Author using Excalidraw

Why ELT?

Instead of adhering to the conventional Extract, Transform, Load (ETL) pipeline, I opted for an Extract, Load, Transform (ELT) architecture, which helps with the following:

  • Scalability: ELT leverages the processing power of the target database, allowing it to handle large volumes of data more efficiently.
  • Simplified Architecture: ELT simplifies the data pipeline architecture by consolidating the transformation process within the database environment. This can lead to easier maintenance and management of the data pipeline.
  • Cost-Effectiveness: ELT often proves to be more cost-effective because it utilizes existing database infrastructure for transformation tasks as opposed to managing separate transformation servers or expensive ETL tools.

With the vast amount of stock data I was handling, my aim was to move data into the target database as fast as possible. To achieve this, I needed to steer clear of external tools for data transformation, which often consume significant time and resources and involve moving large volumes of data, slowing down the process. By utilizing ELT as opposed to ETL, data movement is reduced because data is quickly transferred to the target database, where subsequent transformation takes place. Furthermore, ELT makes it easy to track where the data comes from and how it has changed over time, because that lineage information lives directly within the database rather than in an external ETL tool. By establishing clear and transparent data lineage, organizations can maximize the value of their data assets, mitigate risks, and drive informed business outcomes.

Data Extraction

The data extraction process, conducted using Python, relied on two freely available financial APIs: Financial Modeling Prep (FMP) and Yahoo Finance (YFinance). These APIs were instrumental in gathering the necessary financial data for analysis and processing within this system.

Blue python being used to extract financial data | Source: Author using Microsoft Designer

FMP was used to extract company information (i.e., company name, exchange, CEO, sector, industry, and market cap), which I used predominantly to build a lookup table. This table serves as a comprehensive overview of the assets present within the database, facilitating efficient and accurate data retrieval and analysis.

YFinance was used to extract historical daily price data (i.e., open, high, low, close, and volume), which serves as a foundational component for the analysis. The historical range for each ticker was set to 10 years from the date of inquiry, allowing for a more robust statistical analysis and the identification of long-term trends and patterns, including historical returns, volatility, correlations, and other metrics.
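
To make the extraction step concrete, here is a simplified sketch of what the two calls could look like. The FMP endpoint and field names follow its documented v3 company-profile response at the time of writing, and the helper names are my own; treat the details as assumptions rather than the project's actual code.

```python
import os

import requests
import yfinance as yf

FMP_API_KEY = os.environ["FMP_API_KEY"]  # free key from financialmodelingprep.com (assumed setup)


def get_company_info(ticker: str) -> dict:
    """Company profile from FMP (name, exchange, CEO, sector, industry, market cap)."""
    url = f"https://financialmodelingprep.com/api/v3/profile/{ticker}"
    resp = requests.get(url, params={"apikey": FMP_API_KEY}, timeout=30)
    resp.raise_for_status()
    profile = resp.json()[0]  # the profile endpoint returns a one-element list
    return {
        "ticker": ticker,
        "company_name": profile.get("companyName"),
        "exchange": profile.get("exchange"),
        "ceo": profile.get("ceo"),
        "sector": profile.get("sector"),
        "industry": profile.get("industry"),
        "market_cap": profile.get("mktCap"),
    }


def get_price_history(ticker: str):
    """Ten years of daily OHLCV data from Yahoo Finance via yfinance."""
    history = yf.Ticker(ticker).history(period="10y")
    return history[["Open", "High", "Low", "Close", "Volume"]]


if __name__ == "__main__":
    print(get_company_info("NVDA"))
    print(get_price_history("NVDA").tail())
```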

Data Loading

During the data loading process, data was initially loaded into a staging area, serving as an intermediate step before being transferred to the final destination in the database.

As previously mentioned, the pipeline I designed offers the flexibility to choose between utilizing local storage or cloud storage solutions:

  • Local Storage: this process loads data into a designated local staging directory. Subsequently, the data is transferred to a local archived directory for safekeeping. Finally, the data is loaded into a local PostgreSQL database for storage and analysis. This sequential flow ensures data integrity and accessibility within the local environment, facilitating efficient data management and analysis processes.
  • Cloud Storage via Amazon Web Services (AWS): this process utilizes Boto3 to upload data into Amazon Simple Storage Service (S3) for staging. The upload triggers an event monitored by AWS Lambda, an event-driven service, which transfers the data from S3 to the PostgreSQL instance hosted on Amazon Relational Database Service (RDS). This integration ensures efficient data flow from storage to the database, leveraging AWS services for streamlined data management and processing; a minimal sketch of this upload-and-trigger flow follows below.
Representation of AWS cloud platform | Source: Author using Microsoft Designer
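
As a rough illustration of that cloud path, the sketch below stages a file in S3 with Boto3 and shows a minimal Lambda handler that reacts to the resulting s3:ObjectCreated event and writes the rows into the RDS-hosted PostgreSQL instance. The bucket name, CSV layout, and RDS_DSN environment variable are assumptions; a real deployment would also package psycopg2 as a Lambda layer and pull credentials from Secrets Manager.

```python
import csv
import io
import os

import boto3
import psycopg2


def stage_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload a staged extract (e.g., a CSV file) to the S3 staging bucket."""
    boto3.client("s3").upload_file(local_path, bucket, key)


def lambda_handler(event, context):
    """Lambda entry point, triggered by an s3:ObjectCreated event on the staging bucket."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Read the newly staged CSV out of S3 (header row is skipped).
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    rows = list(csv.reader(io.StringIO(body.decode("utf-8"))))[1:]

    # Append the rows to PostgreSQL on RDS; duplicates are ignored via the primary key.
    conn = psycopg2.connect(os.environ["RDS_DSN"])
    with conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO daily_stock_data (ticker, date, open, high, low, close, volume) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            rows,
        )
    conn.close()
    return {"loaded_rows": len(rows)}
```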

Data Quality & Integrity

Ensuring data quality and integrity is essential for informed decision-making, building trust and credibility, and enhancing operational efficiency within an organization.

The staging area plays a crucial role in enforcing the schema, ensuring data integrity and consistency before data is integrated into the database structure. When retrieving company information or historical daily stock data from their respective APIs, incremental loading is used: new records are appended to (or existing ones updated in) the data already stored, rather than reloading the entire dataset each time.
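
A minimal sketch of that incremental pattern, assuming the daily_stock_data table shown in the ERD below: look up the most recent date already stored for the ticker, pull only newer rows from yfinance, and append them.

```python
import psycopg2
import yfinance as yf


def load_incremental(ticker: str, dsn: str) -> int:
    """Append only the price rows that are newer than what the database already holds."""
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        # Latest trading day already stored for this ticker (None on the first load).
        cur.execute("SELECT MAX(date) FROM daily_stock_data WHERE ticker = %s", (ticker,))
        last_date = cur.fetchone()[0]

        history = yf.Ticker(ticker).history(period="10y")
        if last_date is not None:
            history = history[history.index.date > last_date]

        rows = [
            (ticker, idx.date(), float(r["Open"]), float(r["High"]),
             float(r["Low"]), float(r["Close"]), int(r["Volume"]))
            for idx, r in history.iterrows()
        ]
        cur.executemany(
            "INSERT INTO daily_stock_data (ticker, date, open, high, low, close, volume) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            rows,
        )
    conn.close()
    return len(rows)
```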

Another crucial step in ensuring data integrity and consistency involved the meticulous design of the Entity Relationship Diagram (ERD).

An Entity Relationship Diagram (ERD) showing how the company_info table relates to the historical daily_stock_data table | Source: Author using Miro

The company_info table uses the stock ticker as its primary key, which guards data integrity by preventing duplicate entries. The ticker also appears as a foreign key in the daily_stock_data table, giving the relationship a one-to-many cardinality: one company maps to many daily price records.
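
Translated into a schema, the relationship looks roughly like the DDL below; the data types, any columns not mentioned in the article, and the composite primary key on daily_stock_data are my assumptions rather than the project's exact definitions.

```python
import psycopg2

# Schema sketch matching the ERD: ticker is the primary key of company_info and
# a foreign key in daily_stock_data (one-to-many). Types and the composite key
# on (ticker, date) are assumptions, not the project's exact definitions.
DDL = """
CREATE TABLE IF NOT EXISTS company_info (
    ticker       TEXT PRIMARY KEY,
    company_name TEXT,
    exchange     TEXT,
    ceo          TEXT,
    sector       TEXT,
    industry     TEXT,
    market_cap   BIGINT
);

CREATE TABLE IF NOT EXISTS daily_stock_data (
    ticker  TEXT NOT NULL REFERENCES company_info (ticker),
    date    DATE NOT NULL,
    open    NUMERIC,
    high    NUMERIC,
    low     NUMERIC,
    close   NUMERIC,
    volume  BIGINT,
    PRIMARY KEY (ticker, date)  -- one row per ticker per trading day
);
"""


def create_schema(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
    conn.close()
```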

Example of data stored within the PostgreSQL database

NVIDIA Corporation (NVDA) data from the company_info table that is stored in the PostgreSQL database | Source: Author
A very small snippet of NVIDIA Corporation (NVDA) data in the daily_stock_data table that is stored in the PostgreSQL database. This data actually spans from 2014 to 2024 | Source: Author

Data Transformation

Using the stored data, I utilized dbt (data build tool) to generate supplementary data tables with broader time intervals. By leveraging the daily_stock_data table sourced from YFinance, I conducted transformations to produce four additional tables encompassing weekly, monthly, quarterly, and yearly stock information. The rationale behind this approach was to ensure that the data is readily available in the required format for analysis, thereby minimizing the need for external transformations. This strategy streamlines the analysis process and enhances efficiency by leveraging the database’s capabilities to handle data transformation tasks effectively.
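
dbt models are essentially SELECT statements that get materialized inside the warehouse. As a rough sketch of the kind of rollup the weekly model might express (the actual project defines this as a dbt model; the table name weekly_stock_data and the column logic are my assumptions), the same transformation can be written as plain SQL executed in PostgreSQL:

```python
import psycopg2

# Stand-in for the dbt weekly model: the week's open is the first daily open,
# the close is the last daily close, and high/low/volume are aggregated.
WEEKLY_ROLLUP = """
CREATE TABLE IF NOT EXISTS weekly_stock_data AS
SELECT
    ticker,
    date_trunc('week', date)::date            AS week_start,
    (array_agg(open  ORDER BY date ASC))[1]   AS open,
    MAX(high)                                 AS high,
    MIN(low)                                  AS low,
    (array_agg(close ORDER BY date DESC))[1]  AS close,
    SUM(volume)                               AS volume
FROM daily_stock_data
GROUP BY ticker, date_trunc('week', date);
"""


def build_weekly_table(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute(WEEKLY_ROLLUP)
    conn.close()
```

The monthly, quarterly, and yearly tables would follow the same pattern, swapping date_trunc('week', ...) for 'month', 'quarter', or 'year'.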

dbt models created from the daily_stock_data table | Source: Author using Miro

Stock Prediction Model

Predicting stock prices is essential for financial planning purposes, such as retirement planning or wealth management. Accurate predictions help individuals and financial advisors make informed decisions about asset allocation, saving, and investment strategies to achieve long-term financial objectives. There are several techniques to do this, but I chose Mean-Variance Optimization (MVO) because it is flexible and adaptable to various investment objectives, constraints, and preferences.

Mean-Variance Optimization (MVO)

MVO is a technique used in finance to find the optimal allocation of assets in an investment portfolio. It helps investors build portfolios that aim to maximize returns while minimizing the risk of losing money.

Here’s a simple breakdown:

  • Mean: refers to the expected return of an asset, which is essentially the average return you can expect over a period of time (Y-axis).
  • Variance (volatility): measures how much the returns of an asset fluctuate or deviate from the mean return; strictly, volatility is the standard deviation, the square root of variance. Higher variance means higher risk (X-axis).
  • Optimization: is the process of finding the best solution given certain constraints.

MVO aims to construct portfolios that maximize expected return for a given level of risk, or minimize risk for a given level of expected return. It does this by identifying the efficient frontier: the set of portfolios offering the highest expected return at each level of risk (equivalently, the lowest risk at each level of expected return). Investors can then select the portfolio along that frontier with the risk-return tradeoff that suits them.
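
In standard notation, with weight vector w, expected-return vector μ, covariance matrix Σ, and risk-free rate r_f, the textbook mean-variance problem (not necessarily the app's exact objective) can be written as:

```latex
\begin{aligned}
\max_{w}\quad & w^{\top}\mu
  && \text{maximize expected portfolio return}\\
\text{s.t.}\quad & w^{\top}\Sigma\,w \le \sigma_{\text{target}}^{2}
  && \text{risk (variance) stays within a target}\\
& \mathbf{1}^{\top}w = 1
  && \text{weights sum to 100\% of the portfolio}\\
& w \ge 0
  && \text{only when shorting is not allowed}
\end{aligned}
```

When the goal is the best risk-adjusted return rather than raw return, the objective becomes the Sharpe ratio, (wᵀμ - r_f) / sqrt(wᵀΣw), which is where the risk-free rate input enters.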

Input Parameters

In order to conduct MVO on a portfolio, I developed a Streamlit application that accepts four key parameters:

Key input parameters used to perform MVO | Source: Author using Sphinx
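
As a generic illustration of how those parameters could feed a mean-variance optimizer (this is not the application's actual code, and it only sketches the Sharpe-maximizing case), one option is scipy's SLSQP solver:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize


def optimize_portfolio(
    prices: pd.DataFrame,          # daily close prices, one column per ticker
    risk_free_rate: float = 0.04,
    shorting_allowed: bool = True,
) -> pd.Series:
    """Mean-variance sketch: weights that maximize the portfolio's Sharpe ratio."""
    returns = prices.pct_change().dropna()
    mu = returns.mean().values * 252      # annualized expected returns
    cov = returns.cov().values * 252      # annualized covariance matrix
    n = prices.shape[1]

    def neg_sharpe(w: np.ndarray) -> float:
        port_return = w @ mu
        port_vol = np.sqrt(w @ cov @ w)
        return -(port_return - risk_free_rate) / port_vol

    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]   # fully invested
    bounds = None if shorting_allowed else [(0.0, 1.0)] * n          # long-only if no shorting
    w0 = np.full(n, 1.0 / n)                                         # equal-weight starting point

    result = minimize(neg_sharpe, w0, method="SLSQP", bounds=bounds, constraints=constraints)
    return pd.Series(result.x, index=prices.columns).round(4)
```

A "Maximize Returns" objective like the one chosen in the demonstration below would swap in a different objective function (typically with an explicit risk constraint), but the overall structure stays the same.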

Demonstration

The deployed interface exposes the four key parameters along with a few supplementary features. Stock tickers can be entered individually or uploaded as a CSV file containing the desired symbols. Users can also choose to update the database only, without running portfolio optimization, by selecting the designated checkbox in the Database Settings section.
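
A hedged sketch of how such inputs might be collected in Streamlit is shown below; the widget labels, the CSV layout (tickers in the first column), the "Minimize Risk" option, and the run_pipeline entry point are illustrative assumptions, not the app's actual source.

```python
import pandas as pd
import streamlit as st

st.title("Stock Portfolio Optimization")

# Tickers: typed manually or uploaded as a CSV (first column assumed to hold tickers).
manual = st.text_input("Stock tickers (comma-separated)", "TSLA, NVDA, META, AAPL, AMZN")
uploaded = st.file_uploader("...or upload a CSV of tickers", type="csv")
tickers = (
    pd.read_csv(uploaded).iloc[:, 0].astype(str).str.upper().tolist()
    if uploaded is not None
    else [t.strip().upper() for t in manual.split(",") if t.strip()]
)

# Database Settings: refresh the database without running the optimization.
update_only = st.checkbox("Only update the database (skip portfolio optimization)")

risk_free_rate = st.number_input("Risk-free rate", value=0.04, step=0.01)
shorting_allowed = st.checkbox("Shorting allowed", value=True)
objective = st.selectbox("Portfolio optimization", ["Maximize Returns", "Minimize Risk"])

if st.button("Run"):
    st.write("Selected tickers:", tickers)
    # run_pipeline(tickers, update_only, risk_free_rate, shorting_allowed, objective)  # hypothetical entry point
```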

In this demonstration, I have selected the following options:

  • List of stock tickers (manually): TSLA, NVDA, META, AAPL, AMZN
  • Data Settings: Chose NOT to ONLY update the database (box unchecked)
  • Risk-Free Rate: 0.04
  • Shorting Allowed: True
  • Portfolio Optimization: Maximize Returns
Stock portfolio optimization API deployed using Streamlit | Source: Author

Results

The expected return is 46.24% (after 1 year) with a risk of 8.10% if assets are allocated as follows:

  • TSLA: 3.52%
  • NVDA: 71.63%
  • META: -4.11%*
  • AAPL: 22.71%
  • AMZN: 6.26%

*The negative sign denotes shorting (betting against an asset)

Asset allocation results using the portfolio optimization API via Streamlit | Source: Author
The resulting plot after performing MVO on the chosen assets using the portfolio optimization API via Streamlit | Source: Author

I have also deployed a live demonstration of the application, which you can access by clicking here 👉🏾 Portfolio Optimization Demo.

Conclusion

In the complex world of portfolio optimization, building a dependable and efficient pipeline is crucial for success. By adopting a simplified ELT architecture, I managed to cut down on data transfers, enhance data quality and integrity, and bolster data lineage prior to conducting portfolio optimization. As financial markets keep changing, the need for flexibility and innovation in handling data remains crucial. This highlights the importance of a well-designed ELT pipeline, which plays a key role in gaining a competitive edge and providing concrete benefits to all involved parties.

For more details on how this project was designed, feel free to check out my GitHub repository. You can also connect with me on LinkedIn — I’m excited to engage with fellow enthusiasts in the dynamic field of Data Engineering. Thanks for your time, and I look forward to interacting with you in the comments section.

Ty Rawls

Data Engineer | Certified in Data Analytics, Data Science & AI | Python & SQL Aficionado | ETL/ELT Pipeline Architect | Love playing video games