Extracting Yahoo Finance Stock Data: Building a Simple ETL Script

Saurabh Joshi
Plumbers Of Data Science
4 min readJul 24, 2023

In today’s fast-paced financial landscape, data plays a pivotal role in driving informed decision-making and delivering timely insights. As financial markets continue to evolve, the need for accurate, reliable, and up-to-date information becomes increasingly critical.

In this blog, we will delve into the workings of Yahoo Finance’s historical data and explore Extract, Transform, Load (ETL) script that powers this pipeline.

Pre-requisites

Yahoo Finance API
Yahoo Finance is a comprehensive online platform that provides users with a wide range of financial information, including real-time stock quotes, news, portfolio management tools, and historical data, empowering individuals to make informed decisions in the world of finance.

Requests library
Requests is an HTTP client library for the Python programming language. It allows us to send HTTP requests using Python. The HTTP request returns a Response Object with all the response data.

Why do we need an ETL pipeline setup 🤔?

In today’s digital age, data is a valuable asset for businesses. ETL, which stands for Extract, Transform, and Load, is a crucial process that helps companies make sense of their data. It allows organizations to:

  • Extract data from various sources, such as databases, websites, and files.
  • Transform the data into a usable format by cleaning, filtering, and reorganizing it.
  • Load the transformed data into a centralized system for easy access and analysis.

Project Setup

Follow the GitHub link, https://github.com/Mega-Barrel/yfin-etl, and clone the repository. Once this step is done, follow the process as mentioned in the README.md file.

ETL pipeline:
The project’s tech stack includes Python, SQLite, Pandas, and the requests library. The Logger module is also used to log all background processes running.
Before starting, make sure you have the following things installed on your system.

  • Python (latest version)
  • VS Code or any Text Editor / IDE
  • Git configured on your local system.

Extract
Before starting with any sort of analysis, data is very important! The data is collected from Yahoo Finance ( yfinance package). For the analysis, we are interested in downloading historical data for the ticker specified.

Extract function.

The above script downloads historical data for the specified ticker into a Pandas Data Frame and returns the ticker_df.

Transform
This is where all the magic happens. Once data is received from the extract function, we calculate the previous day’s adjacent close %. To calculate “% Change” we simply divide the current day value by its previous day’s value, and round it off by 4 digits decimal place.

Transform function.

What if there’s no data passed to the transform function 🤔? To handle this case, it first checks if the Data Frame is not empty, if an empty Data Frame is found, it skips the transformation steps, and the process continues, and a no-load operation is performed.

Load
Once the transformation process is completed, and there is some data inside the Data Frame, it’s time for loading the modified Data Frame to an SQLite database. For this project, we save the data locally.

Load function

In the above code, the load function calls the “insert_ticker_data(self.symbol, data_frame)” function which is the YfinDB class method.

Overview of sqlite.py file

Misc

Logger Module
The logger module is implemented to capture and store essential information regarding data extraction, transformation, and loading processes. It provides detailed logs for each step, including timestamps, data source, transformation’s applied, and any encountered errors, facilitating effective monitoring and troubleshooting during the ETL workflow.

Logger configuration

Decorator — time_it
The time_it function is a Python decorator that is applied to functions to log the total time taken for their execution. It uses the time module to measure the elapsed time and the yfin.common.yfin_logger module for logging the time information.

Congratulation, you just build yourself a Simple Yahoo Finance ETL Script, from Scratch 👏🏻🎉.

Congratulations

This was a Simple ETL pipeline. Most of the data pipelines running in real world are far more complex, and contains data quality checks, pipeline testing, etc. This article was on a beginner side, and the pipeline can be improved with implementations such as

  • Automating the daily run, using Airflow
  • Saving the data to S3 buckets, of Cloud Storage, etc.
  • A Simple UI connected with data, for basic reporting.

📔Project GitHub link: Mega-Barrel/yfin-etl: Yahoo Finance ETL script (github.com)

Thank you for reading! Keep exploring and learning. Happy journey! 🚀

--

--