Exploring the Similarity Between ETFs

Published in

INST414: Data Science Techniques

6 min read · Mar 28, 2024

Exchange-traded funds (ETFs) are investment vehicles that usually contain a diverse set of investments. They help investors manage risk by letting at least part of a portfolio track a collection of securities. For example, the largest ETF by assets under management is the SPDR S&P 500 ETF Trust (symbol: SPY), which seeks to track the performance of the S&P 500, one of the most popular stock market indices. Although SPY is the largest ETF, other ETFs also seek to track the performance of the S&P 500. When deciding whether to invest in an ETF such as SPY, investors may want to know which other ETFs offer similar returns along with better features such as lower expense ratios, higher dividend yields, or less risk. An investor’s time horizon may also affect the decision. For example, an investor seeking to buy and hold an ETF for many years could be more interested in an ETF with more volatility than the S&P 500 and a lower expense ratio, whereas an investor with a shorter time horizon may be more interested in a less volatile ETF.

The data comes from etfdb.com, a website that hosts an ETF database. To obtain it, I searched for the “Large-Cap” ETF theme under “Asset Class Size” in the site’s ETF directory and exported the results to a CSV file. The dataset contains 921 ETFs, sorted in descending order by the total dollar value of the assets held by each ETF.

The fields of the dataset that I use to measure similarity are ‘Total Assets,’ ‘YTD Price Change,’ ‘Avg. Daily Volume,’ ‘Expense Ratio,’ ‘Beta,’ and ‘# of Holdings.’ ‘Total Assets’ is the value of the assets held by an ETF. It is relevant because an ETF’s assets scale with its popularity, and investors may prefer more widely known ETFs for reasons including higher liquidity and greater familiarity with what they are investing in. ‘YTD Price Change’ is the performance of an ETF from the beginning of the year to the current date. It is relevant because an investor may choose an ETF based on its recent performance. ‘Avg. Daily Volume’ is the average number of shares traded each day. It is relevant as a measure of liquidity, since investors may prefer ETFs whose shares are easier to buy or sell. ‘Expense Ratio’ is the fee paid to invest in an ETF. It is relevant because an investor might avoid ETFs with high expense ratios to reduce the cost of investing. ‘Beta’ is a measure of an ETF’s volatility compared to the overall market. It is relevant because an investor may want a more volatile or a less volatile ETF depending on how much risk they are willing to take on for potentially better returns. ‘# of Holdings’ is the number of securities held by an ETF. It is relevant because a higher number of holdings indicates a more diversified ETF. Each of these features contributes to the overall similarity assessment by letting ETFs be compared on the reasons an investor might choose a particular ETF.
For example, investors who want ETFs with a large number of holdings, so that they can easily diversify their portfolio and manage risk, might look for ETFs that are similar to the Vanguard Total Stock Market ETF. To measure similarity, I use Euclidean distance: the smaller the distance between two ETFs’ feature values, the more similar the ETFs are.
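
To make the metric concrete, here is a minimal sketch of a Euclidean distance comparison using SciPy. The three vectors are made-up, already-normalized feature values for hypothetical ETFs, not rows from the etfdb.com dataset, and this is not necessarily how the linked repository computes the distance.

```python
from scipy.spatial.distance import euclidean

# Hypothetical normalized feature vectors (illustrative values only,
# not taken from the etfdb.com dataset).
query_etf = [0.0, 0.0]
close_etf = [0.3, 0.4]
far_etf = [0.9, 0.8]

# Euclidean distance is the square root of the summed squared differences.
dist_close = euclidean(query_etf, close_etf)
dist_far = euclidean(query_etf, far_etf)

# A smaller distance means the two ETFs are more similar.
print(f"close: {dist_close:.3f}, far: {dist_far:.3f}")
```

Because every feature contributes its raw difference to the sum, features on larger scales would dominate the distance unless the values are normalized first.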

The first ETF that I queried was the SPDR S&P 500 ETF Trust (SPY). The ten ETFs most similar to SPY are IVV, VOO, QQQ, VTI, VUG, VTV, IEFA, XLF, VEA, and IWF. The two most similar, IVV and VOO, also track the performance of the S&P 500.

The second ETF that I queried was the Technology Select Sector SPDR Fund (XLK), which tracks the performance of the technology sector in the S&P 500 index. The ten ETFs most similar to XLK are SCHD, VGT, IVW, QUAL, VIG, XLV, VYM, SPLG, IWF, and SCHG. What is interesting about these results is that the most similar ETF is not another technology-sector ETF but the Schwab US Dividend Equity ETF (SCHD), which tracks the Dow Jones U.S. Dividend 100 Index. The second most similar ETF is the Vanguard Information Technology ETF (VGT), which tracks stocks in the information technology sector.

The third ETF that I queried was the Vanguard High Dividend Yield Index ETF (VYM), which tracks the FTSE High Dividend Yield Index and seeks to provide its investors with a high dividend yield. The ten ETFs most similar to VYM are SCHD, VV, SCHX, VIG, VGT, DGRO, IVE, QUAL, IWD, and SPYV. The most interesting result is that the most similar ETF to VYM is SCHD, which also seeks to provide its investors with a high dividend yield.

The software that I used for this analysis includes pandas to build dataframes and SciPy to measure the similarity between each queried ETF and all of the other ETFs in the dataset.
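
As a rough illustration of how pandas and SciPy can work together for this kind of query, the sketch below ranks every other row by its Euclidean distance to a chosen symbol. The helper name `most_similar`, the column names, the symbols, and the values are all hypothetical; the actual code in the linked repository may be organized differently.

```python
import pandas as pd
from scipy.spatial.distance import cdist


def most_similar(features: pd.DataFrame, symbol: str, k: int = 10) -> pd.Series:
    """Return the k rows closest to `symbol` by Euclidean distance.

    `features` is assumed to be indexed by symbol and already normalized.
    """
    query = features.loc[[symbol]]          # one-row dataframe for the queried ETF
    others = features.drop(index=symbol)    # every ETF except the queried one
    distances = cdist(query.to_numpy(), others.to_numpy(), metric="euclidean")[0]
    return pd.Series(distances, index=others.index).sort_values().head(k)


# Made-up, normalized values for hypothetical symbols (not real ETF data).
features = pd.DataFrame(
    {
        "total_assets": [1.0, 0.9, 0.1, 0.5],
        "expense_ratio": [0.0, 0.1, 1.0, 0.4],
        "beta": [0.5, 0.5, 0.9, 0.6],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)

nearest = most_similar(features, "AAA", k=2)
print(nearest)
```

Sorting the distances in ascending order and taking the first ten entries would give a "ten most similar ETFs" list like the ones reported above.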

To clean the data, I loaded the CSV file into a dataframe and copied the useful features to a new dataframe, so that only the columns needed to measure similarity were included. For example, some of the columns that I left out were ‘Asset Class,’ ‘Previous Closing Price,’ ‘Lower Bollinger,’ and ‘Support 1,’ which likely would not help measure similarity. I then renamed the ‘Expense Ratio’ and ‘YTD Price Change %’ columns to make them easier to read and removed every row that contained NA values. Next, I removed special characters from the values in the ‘Total Assets,’ ‘YTD Price Change %,’ ‘Expense Ratio,’ and ‘Annual Dividend Yield’ columns and changed their data types to either integer or float. Finally, I copied all of the columns except ‘Symbol’ and ‘ETF Name’ to a new dataframe and applied min-max normalization to it. A pitfall that others might encounter with this dataset is that its numeric values are not stored with a numeric data type and must be converted before the data can be analyzed.
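
The cleaning steps described above can be sketched roughly as follows. The raw rows here are invented, and the exact characters stripped ("$", "," and "%") are an assumption about how the exported values are formatted, so treat this as an outline and not as the repository's code.

```python
import pandas as pd

# Hypothetical raw rows; the string formats are assumed, not copied from etfdb.com.
raw = pd.DataFrame(
    {
        "Symbol": ["AAA", "BBB", "CCC"],
        "Total Assets": ["$1,000", "$500", None],
        "Expense Ratio": ["0.10%", "0.50%", "0.30%"],
    }
)

# Drop every row that contains a missing value.
clean = raw.dropna().copy()

# Strip the special characters, then convert the strings to floats.
for column in ["Total Assets", "Expense Ratio"]:
    clean[column] = clean[column].str.replace(r"[$,%]", "", regex=True).astype(float)

# Keep only the numeric features, indexed by symbol.
numeric = clean.set_index("Symbol")

# Min-max normalization: rescale each column to the range [0, 1].
normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print(normalized)
```

One caveat with this formula is that a column in which every value is identical would produce a division by zero, so such columns would need to be dropped or handled separately.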

This analysis is limited by the size of the dataset and by what the data covers. Including more ETFs could lead to more accurate results. The analysis could also have focused on other types of ETFs, such as those that track specific asset classes or specific sectors.

The dataset for this analysis can be found here:

Large-Cap ETF List

etfdb.com

The code for this analysis can be found here:

https://github.com/JasonRahimi2/INST414_ETFs?source=post_page-----fa263da34050

Written by Jason Rahimi