Finding the right mutual fund using Spark ML

Lijo Abraham
7 min read · Jan 31, 2023


Mutual fund recommendation using Spark ML

Photo by Edu Grande on Unsplash

About mutual funds

Mutual funds are a popular investment option for those looking to grow their money over time. They are a type of professionally managed investment vehicle that pools money from multiple investors to purchase a diversified portfolio of securities.

One of the main benefits of investing in mutual funds is that they offer diversification. By investing in a mutual fund, you are effectively buying a small piece of a large portfolio of securities, rather than putting all your money into one individual stock or bond. This can help to spread out risk and potentially reduce the overall volatility of your portfolio.

There are two main types of mutual funds: equity funds and debt funds. Equity funds invest primarily in stocks, while debt funds invest primarily in bonds and other fixed-income securities. Debt mutual funds are a popular investment option for those looking for a steady income stream and low risk.

Here we focus on debt mutual funds. Selecting the right debt mutual fund mainly depends on the duration of the investment and on diversification. In India, there are about 16 categories and 250+ debt funds, which invest across different durations and in different portfolios of bonds spanning sectors, credit ratings, and maturity profiles.

Background

This was one of my side projects. It was previously only partially done, and I have now decided to build it into a fully functional system using only open-source software and products. The project can also be reused for other use cases, such as building our own mutual fund portfolio system.

Overview

We will build a system that collects different data points, analyzes them, and visualizes them in a dashboard. Below are the tools used for this project.

  • Data point collection: Scraping using Python (Beautiful Soup)
  • Analyzing the collected data: Using Spark ML capabilities
  • Visualizing the data: Using Apache Superset
  • Scheduling the whole process: Using Apache Airflow
  • Hosting: Using Docker

Architecture

Image from Author

The entire system is implemented within a Docker environment. We have configured a standalone Spark cluster with a master and worker node in separate Docker containers. Additionally, we have set up Airflow within its own Docker container, with the database being Postgres, which is also in its own container. For Apache Superset, we have created three separate containers, one for Superset, one for MySQL as the database, and one for Redis, which is used as a cache.

We collect data from one of the mutual fund websites using the Beautiful Soup library in Python and store it in MySQL. The data is then fetched from MySQL, transformed and analyzed using Spark ML, and loaded back into another MySQL table along with the calculated rank. Apache Superset then fetches the final results for visualization.

Airflow setup

In Airflow, we have one DAG with two tasks: one for scraping the mutual fund data and the other for analyzing it and calculating the rank using Spark.

DAGs

Image from Author

Tasks

Image from Author

About mutual fund data

Below is a sample record showing the data points we collect for each fund.

{
  "name": "UTI Banking & PSU Debt Dir",
  "link": "/funds/23934/uti-banking--psu-debt-fund-direct-plan",
  "rating": 0,
  "category": "Debt: Banking and PSU",
  "category_id": 131,
  "category_code": "DT-BK & PSU",
  "asset_value": 519,
  "expense_ratio": 0.24,
  "fund_house": "UTI Mutual Fund",
  "risk": "Moderate",
  "risk_grade": "Low",
  "return_grade": "Low",
  "fund_growth": {
    "3m": 1.83,
    "6m": 3.52,
    "1y": 10.57,
    "3y": 7.44,
    "5y": 5.6,
    "7y": 6.58,
    "10y": 0
  },
  "cat_growth": {
    "3m": 1.64,
    "6m": 3.07,
    "1y": 3.7,
    "3y": 6.0,
    "5y": 6.99,
    "7y": 7.4,
    "10y": 0
  },
  "credit_rating": {
    "aaa": 69.05,
    "a1plus": 0,
    "sov": 27.15,
    "cash_equivalent": 3.75,
    "aa": 0,
    "a_and_below": 0,
    "unrated_and_others": 0
  },
  "cat_credit_rating": {
    "aaa": 69.05,
    "a1plus": 27.15,
    "sov": 3.75,
    "cash_equivalent": 0.05,
    "aa": 0.05,
    "a_and_below": 0.05,
    "unrated_and_others": 0.05
  },
  "top_holdings": ["GOI Sec 7.38 20/06/2027", "National Housing Bank Debenture 7.34 07/08/2025", "Power Finance Corporation Ltd SR-172 NCD 7.74 29/01/2028", "Axis Bank Ltd SR 5 NCD 7.65 30/01/2027", "National Bank For Agriculture & Rural Development SR-1 NCD 8.22 25/02/2028", "Export-Import Bank Of India SR-T-06 Bonds 7.62 01/09/2026", "Indian Railway Finance Corporation Ltd SR-124 Debenture 7.54 31/10/2027", "Small Industries Devp. Bank of India Ltd SR I Debenture 7.15 02/06/2025", "REC Ltd SR 190A Debenture 6.88 20/03/2025", "ICICI Bank Ltd SR DJU21LB Debenture 6.45 15/06/2028", "HDFC Bank Ltd Bonds/Deb 7.95 21/09/2026"],
  "fund_risk": {
    "mean": 7.19,
    "std_dev": 3.43,
    "sharpe": 1.04,
    "sortino": 3.59,
    "beta": -2.22,
    "alpha": 1.06
  },
  "category_risk": {
    "mean": 5.76,
    "std_dev": 2.19,
    "sharpe": 1.01,
    "sortino": 2.02,
    "beta": 1.82,
    "alpha": 4.2
  },
  "portfolio_agg": {
    "num_securities": 23,
    "modified_duration": 3.33,
    "average_maturity": 4.12,
    "ytm": 7.39,
    "avg_cr": 0
  },
  "cat_portfolio_agg": {
    "num_securities": 58,
    "modified_duration": 2.07,
    "average_maturity": 3.16,
    "ytm": 7.29,
    "avg_cr": 0
  }
}
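
A nested record like the one above usually has to be flattened before it fits into a single table row or a DataFrame column set. The sketch below shows one way this could be done in plain Python. The `flatten` helper, the underscore-joined column names, and the choice to store lists as JSON strings are assumptions for illustration, not necessarily how the project stores its data.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that fund_risk.sharpe becomes fund_risk_sharpe.
            flat.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            # Lists (such as top_holdings) are kept as a JSON string so the
            # value still fits in a single text column (an assumption).
            flat[new_key] = json.dumps(value)
        else:
            flat[new_key] = value
    return flat


# A trimmed copy of the sample record shown above.
sample = json.loads("""
{
  "name": "UTI Banking & PSU Debt Dir",
  "category": "Debt: Banking and PSU",
  "expense_ratio": 0.24,
  "fund_growth": {"3m": 1.83, "1y": 10.57},
  "fund_risk": {"std_dev": 3.43, "sharpe": 1.04},
  "top_holdings": ["GOI Sec 7.38 20/06/2027"]
}
""")

row = flatten(sample)
print(row["fund_growth_1y"], row["fund_risk_sharpe"])
```

Flat keys such as `fund_growth_1y` and `fund_risk_sharpe` can then be used directly as column names.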

Calculating Rank

There are certain data points we should consider while selecting a debt fund (that is, while calculating a rank). Many more data points could be considered, but for now we will select from the data we have.

  • Fund’s credit quality: The fund should invest in debt instruments with high credit ratings to minimize the risk of default.
  • Fund’s duration: Duration measures the sensitivity of a bond’s price to changes in interest rates. Longer-duration bonds are more sensitive than shorter-duration bonds. We calculate the rank within each category, so we do not need to account for this separately.
  • Fund’s expense ratio: The annual fee the fund charges for management and other expenses. A lower expense ratio means more of your money is invested, rather than being used to cover expenses.
  • Fund’s performance history: Look at the fund’s past performance to get an idea of how it has performed in different market conditions.
  • Diversification: The fund should diversify its investments across sectors, maturities, and credit ratings.
  • Total Asset Value: A fund with a higher TAV typically has a larger and more diversified portfolio, which can be an advantage in terms of risk management.
  • Sharpe ratio: A measure of risk-adjusted return, which compares the returns of an investment to its volatility. A larger Sharpe ratio is generally considered to be better.
  • Alpha: A measure of a mutual fund’s or portfolio’s performance in relation to a benchmark index. A larger alpha is generally considered to be better.
  • Beta: A measure of a mutual fund’s or portfolio’s volatility in relation to a benchmark index. A lower beta is generally considered to be better.
  • Standard Deviation: A measure of the volatility or risk of a mutual fund. It is a statistical measure that shows how much the return on investment varies from its average return over a certain period of time. A lower standard deviation is generally considered to be better.
  • Mean rank: A measure of a fund’s relative performance compared to other funds in the same category. A larger mean rank is generally considered to be better.
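
A few of the measures listed above can be made concrete with small helper functions. The following is a minimal sketch in plain Python, not the calculation used by the data source. The function names, the sample returns, the risk-free rate, and the choice of which rating buckets count as "high quality" are all assumptions made for illustration.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free_rate):
    """Mean excess return divided by the sample standard deviation of returns."""
    return (mean(returns) - risk_free_rate) / stdev(returns)


def approx_price_change_pct(modified_duration, yield_change_pct):
    """First-order estimate: price change (%) is roughly
    -modified_duration * yield change (in percentage points)."""
    return -modified_duration * yield_change_pct


def high_quality_share(credit_rating,
                       buckets=("aaa", "sov", "a1plus", "cash_equivalent")):
    """Sum the portfolio share (%) held in the given rating buckets.
    Which buckets count as high quality is an assumption of this sketch."""
    return sum(credit_rating.get(bucket, 0) for bucket in buckets)


# Hypothetical annual returns (%) and risk-free rate (%), for illustration only.
returns = [6.0, 8.0, 7.0, 9.0, 5.0]
print(round(sharpe_ratio(returns, risk_free_rate=4.0), 2))

# modified_duration of 3.33 comes from the sample record: a rise of one
# percentage point in yields implies roughly a 3.33% fall in price.
print(approx_price_change_pct(3.33, 1.0))

# credit_rating values from the sample record.
credit_rating = {"aaa": 69.05, "a1plus": 0, "sov": 27.15,
                 "cash_equivalent": 3.75, "aa": 0, "a_and_below": 0,
                 "unrated_and_others": 0}
print(round(high_quality_share(credit_rating), 2))
```

The duration estimate is only a linear approximation, so it becomes less accurate for large yield changes.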

Using these data points, we calculate a rank for each fund within its category, so that the best fund in each category can be selected.
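
To illustrate the idea of ranking within a category, here is a minimal sketch in plain Python. It is not the project's actual Spark logic: the metric list, the equal weighting, the `rank_within_category` helper, and the four sample funds are all hypothetical. Each fund is scored on every metric relative to its category peers (the best fund gets the largest score), and the scores are summed into a total rank.

```python
from collections import defaultdict

# (metric name, True if a higher value is better). The selection and the equal
# weighting are assumptions for this sketch, not the article's exact logic.
METRICS = [
    ("sharpe", True),
    ("alpha", True),
    ("expense_ratio", False),
    ("std_dev", False),
]


def rank_within_category(funds):
    """Return {fund name: total rank}; a higher total means a better fund."""
    by_category = defaultdict(list)
    for fund in funds:
        by_category[fund["category"]].append(fund)

    totals = defaultdict(int)
    for members in by_category.values():
        for metric, higher_is_better in METRICS:
            # Sort worst-to-best so the best fund gets the largest score.
            # Ties keep their input order (no special tie handling here).
            ordered = sorted(members, key=lambda f: f[metric],
                             reverse=not higher_is_better)
            for position, fund in enumerate(ordered, start=1):
                totals[fund["name"]] += position
    return dict(totals)


# Hypothetical funds, for illustration only.
funds = [
    {"name": "Fund A", "category": "Category X", "sharpe": 1.2, "alpha": 1.6,
     "expense_ratio": 0.3, "std_dev": 2.0},
    {"name": "Fund B", "category": "Category X", "sharpe": 0.8, "alpha": 0.5,
     "expense_ratio": 0.5, "std_dev": 3.0},
    {"name": "Fund C", "category": "Category X", "sharpe": 1.0, "alpha": 1.5,
     "expense_ratio": 0.2, "std_dev": 2.5},
    {"name": "Fund D", "category": "Category Y", "sharpe": 0.9, "alpha": 0.7,
     "expense_ratio": 0.4, "std_dev": 2.2},
]

totals = rank_within_category(funds)
best_in_x = max((f["name"] for f in funds if f["category"] == "Category X"),
                key=totals.get)
print(totals, best_in_x)
```

Because scores are assigned per category, totals are only comparable between funds of the same category. In Spark, the same idea would typically be expressed with window functions partitioned by category.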

Visualization and Dashboard

After calculating the rank using Spark ML, we display the details of the funds in a dashboard created in Apache Superset.

Dashboard — Homepage

Image from Author

Fund detail page

Image from Author

Verification of results

For comparison and verification of the results, I checked the mutual fund data on the Value Research website.

Let’s take the category Debt: Banking and PSU. In the finder on the Value Research website, ABSL Banking & PSU Debt Dir is the best fund in this category according to the ranking and other data points.

Image from Author

Let’s see the data for the same category in our dashboard. We will filter based on the category from the left side filters and check the All funds table.

Image from Author

Not bad. As we can see, our logic also puts the same fund at the top, with the highest total rank.

Let’s take another category, Debt: Dynamic Bond. In the finder on the Value Research website, ABSL Dyn Bond Dir is the best fund in this category according to the ranking and other data points.

Let’s see the data for the same category in our dashboard. Here too, our logic gives the same fund the highest total rank.

Note: This is not a foolproof method for discovering the top mutual funds, but a proof of concept for finding the best funds using the available data.

The code can be found at the GitHub link.

