Finding the best Iranian mutual fund to invest your money — Part 2: Python and Flask

Fariba Hashemi
9 min read · Jun 27, 2020


Photo by rupixen on Unsplash
  • This is the second part of “Finding the best Iranian mutual fund to invest your money”. Here is the link to the first part.

In the previous part, the data of different mutual funds were crawled. Now, it is time to explore the data to find interesting information.

Keep in mind that doing a data science project has a workflow in which there is no strict boundary between the steps. It is possible to modify a step or add another step to the workflow after processing the data. For instance, in the data exploration step, we will need to do some feature engineering. In fact, it is an iterative procedure.

At first, it is necessary to know our data to have a better understanding of how it can help us to make our decision. Let’s start!

Understanding the Data

The first step is importing the data that our spider crawled from the FIPIRAN website. Python’s Pandas library is really helpful for this purpose. It can read data from several kinds of sources: local CSV or JSON files and SQL tables. As you may remember from Part 1, we stored the data in a CSV file (you can download the data from this link; the last crawl date is 2019-10-7).

import pandas as pd
data = pd.read_csv("mutual_fund.csv")

It is important to know what kind of information could be retrieved from the data, so we start to get familiar with it by taking a quick look at its content.

numberOfSamples = data.shape[0]
numberOfFeatures = data.shape[1]
print("- Number of data points: {0}\n- Number of features: {1}".format(numberOfSamples, numberOfFeatures))

There are 191 mutual funds and 29 features for each of them. Here is what the first rows of the data look like:

data.head(5)

The next step is to get information about each column of the DataFrame:

data.info()
  • The red rectangle shows that the index range is from 0 to 190 (zero-based).
  • The purple arrows point to the columns that have some null or None values. For instance, the “Guaranteed Liquidity” column has 158 non-null values, which means that 33 rows (191 − 158 = 33) have null as their value for this column.
  • The blue rectangle demonstrates that one column is float64, the types of three columns are int64, and 25 columns have object types.
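The same facts can also be read off programmatically instead of from the screenshot. The snippet below is a minimal sketch on a made-up three-row frame (the fund names and values are hypothetical, not taken from the crawled data); it counts missing values per column and the number of columns of each dtype.

```python
import pandas as pd

# Toy frame standing in for the crawled data (all values are made up).
toy = pd.DataFrame({
    "Name": ["Fund A", "Fund B", "Fund C"],
    "Guaranteed Liquidity": ["Bank X", None, None],
    "Units": [100, 250, 75],
})

# Number of missing (null/None/NaN) values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# How many columns there are of each dtype.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```

Running the same two calls on `data` should reproduce the counts highlighted in the `data.info()` output.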

We have the following information about each mutual fund (29 columns):

data.columns

Exploratory Data Analysis (EDA)

This step contains techniques to prepare data to use in the next phase. Since most of the time, the real data is inconsistent or incomplete, EDA is considered as an important phase in a data science workflow. It includes exploring, cleaning the data, and visualization. Do not forget:

Garbage In, Garbage Out (GIGO)

If the input data is not preprocessed and has inconsistency, the extracted information from that and the whole process will not be valid and useful.

As a new investor in stocks, it seems to me that “Guaranteed Liquidity” and “Type/Size” are the main factors. Let’s take a quick look at them.

  • Type/Size: These mutual funds are in different types and sizes. To understand how many types/sizes they have, we should get the unique values of the column:
data['Type/Size'].value_counts().sort_index()
Each mutual fund has exactly one of these Type/Size values.

Each line of the output gives us this information: the type of the mutual fund / the size of the mutual fund (space) the number of mutual funds in this group. For instance, there are 65 mutual funds whose “Size” is Large and whose “Type” is Fixed Income. Hence, the possible values for this column are as below:

Type: [Fixed Income, In Stock, Mixed]/Size: [Large, Small]

The bar chart below shows the same information, the distribution of mutual funds based on their Type/Size, as percentages:

The distribution of different Types and Sizes in mutual funds dataset.
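The percentages behind a chart like this can be obtained with `value_counts(normalize=True)`. This is a minimal sketch on made-up rows; the exact string format of the real “Type/Size” values is an assumption here.

```python
import pandas as pd

# Toy "Type/Size" column; the real values and counts will differ.
toy = pd.DataFrame({"Type/Size": [
    "Fixed Income/Large",
    "Fixed Income/Large",
    "In Stock/Small",
    "Mixed/Small",
]})

# normalize=True returns fractions instead of counts; multiply by 100 for percent.
percentages = toy["Type/Size"].value_counts(normalize=True).sort_index() * 100
print(percentages.round(2))
```

Calling `.plot(kind="bar")` on the resulting Series is one way to draw the chart, assuming matplotlib is installed.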

I want to show you an interesting observation! First, it is better to separate each of the following columns into two columns:

  • Type/Size → “Type” and “Size”
data[['Type','Size']] = data['Type/Size'].str.split(expand=True,pat='/')
  • The number of individual investors-percent owned → “The number of individual investors” and “The owned percentage by individual investors”
  • The number of institutional investors-percent owned → “The number of institutional investors” and “The owned percentage by institutional investors”
data[['The number of institutional investors','The owned percentage by institutional investors']] = data['The number of institutional investors-percent owned'].str.split(expand=True,pat='-')
data[['The number of individual investors','The owned percentage by individual investors']] = data['The number of individual investors-percent owned'].str.split(expand=True,pat='-')

↪ expand: A boolean value. If it is True, the method returns a DataFrame with one column per piece; otherwise, it returns a Series whose elements are lists of strings.

pat: String or regular expression to split on. If this parameter is not specified, the method splits on whitespace by default. Here we want to split on / and -, respectively.
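The difference between the two settings of `expand` is easiest to see on a tiny example. The two strings below are made up for illustration.

```python
import pandas as pd

s = pd.Series(["In Stock/Large", "Mixed/Small"])

# Without expand: a Series whose elements are lists of strings.
as_lists = s.str.split(pat="/")
print(as_lists)

# With expand=True: a DataFrame with one column per piece (columns 0 and 1).
as_frame = s.str.split(pat="/", expand=True)
print(as_frame)
```

Because the expanded result is a DataFrame with two columns, it can be assigned directly to two new column names, which is what the statements above rely on.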

After separating these columns, we can remove redundant columns:

data.drop(columns =["The number of institutional investors-percent owned","The number of individual investors-percent owned"], inplace = True)

inplace : If True, do the operation in place and return None, otherwise performs the operation and returns a copy of the object.

Then, these columns should be converted from Object type to int64:

data[['The number of institutional investors','The number of individual investors']] = data[['The number of institutional investors','The number of individual investors']].applymap(lambda num: eval(num))

eval: This built-in function parses the string passed to it and evaluates it as a Python expression within the program.

It means that if we pass a number as a string, the result of eval function on it will be that number, for example:
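A minimal illustration of this behaviour is sketched below. It also shows `pd.to_numeric`, which is not what this article uses but is a common alternative when the strings come from an external source, since `eval` will run any expression it is given. The sample strings are made up.

```python
import pandas as pd

print(eval("42"))    # the string "42" becomes the integer 42
print(eval("3.5"))   # the string "3.5" becomes the float 3.5

# Caveat: eval treats a thousands separator as a tuple separator.
print(eval("1,234"))  # evaluates to the tuple (1, 234), not the number 1234

# Alternative sketch: convert a whole column of numeric strings at once.
counts = pd.Series(["12", "340", "7"])
converted = pd.to_numeric(counts)
print(converted.tolist())
```

Note also that recent pandas versions deprecate `DataFrame.applymap` in favour of `DataFrame.map`, so the `applymap` calls in this article may emit a warning there.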

Now look at the following bar chart:

It shows how many investors put their money in each type of mutual fund. There are two kinds of investors: individual and institutional. The blue and red bars correspond to individual and institutional investors, respectively. Although the total number of In Stock mutual funds (40+53 = 93) is greater than the number of Fixed Income ones (65+12 = 77), most investors clearly did not dare to put their hard-earned money into higher-risk mutual funds (either In Stock or Mixed). The pie chart illustrates this more clearly:

  • Note: The numbers do not imply the percentage of unique investors. An investor can purchase several shares from different types of mutual funds.

More than 90% of investors preferred to invest in Fixed Income mutual funds, while fewer than 1% of individual investors and fewer than 10% of institutional investors were brave enough to take more risk. This happens even though the profit of Fixed Income mutual funds is much lower than that of the other types, as will be shown.
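Shares like these can be derived by summing the investor counts per type and dividing by the column totals. The sketch below uses made-up numbers, so its output does not reproduce the figures quoted above; only the column names follow the ones created earlier.

```python
import pandas as pd

# Toy data: investor counts are invented for illustration only.
toy = pd.DataFrame({
    "Type": ["Fixed Income", "In Stock", "Mixed", "Fixed Income"],
    "The number of individual investors": [900, 40, 10, 50],
    "The number of institutional investors": [80, 15, 5, 0],
})

investor_cols = [
    "The number of individual investors",
    "The number of institutional investors",
]

# Total investors of each kind per fund type.
totals = toy.groupby("Type")[investor_cols].sum()

# Share of each fund type within each kind of investor, in percent.
shares = totals / totals.sum() * 100
print(shares.round(2))
```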

Before comparing the profit rates, some preprocessing is needed. The columns holding the profit rates for different periods of time have the Object data type, so they must be converted to float64. This can be done with Python’s eval function.

eval_cols = ['Last 365 days (%)','Last 180 days (%)',
'Last 90 days (%)','Last 30 days (%)',
'Last 7 days (%)','Last 1 day (%)']
data[eval_cols] = data[eval_cols].applymap(lambda num: eval(num))

The following table presents the average performance of each type over five time intervals.

The In-Stock mutual funds have the lead!
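A table of average performance per type can be built with a `groupby` followed by `mean`. This is a sketch on invented profit rates, with only two of the profit columns, so the numbers are not the ones in the table above.

```python
import pandas as pd

# Toy profit rates in percent (made up for illustration).
toy = pd.DataFrame({
    "Type": ["Fixed Income", "Fixed Income", "In Stock", "In Stock"],
    "Last 365 days (%)": [20.0, 22.0, 60.0, 80.0],
    "Last 90 days (%)": [5.0, 5.5, 25.0, 35.0],
})

profit_cols = ["Last 365 days (%)", "Last 90 days (%)"]

# Average profit rate of each fund type for each period.
mean_profit = toy.groupby("Type")[profit_cols].mean()
print(mean_profit)
```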

If an investor puts their money into In Stock mutual funds for 3 months, the average profit would be greater than that of investing in Fixed Income funds for 1 year. This is incredible! Taken together, the investor charts and this table show that the majority of investors prefer Fixed Income over In Stock or Mixed funds and, as a result, earn a lower return.

Most people (including me!) choose low-risk mutual funds because the fear of losing their money outweighs the prospect of earning a profit. So, having a reliable guarantee is considered an important feature.

If we look at the “Guaranteed Liquidity” column, it seems that different words and phrases (such as ‘no’, ‘no? no’, ‘Unspecified’, and so forth) were used to show that there is no guarantee. The picture below illustrates some of them:

To resolve this inconsistency, all of these phrases should be unified into a single value. These values, which imply no guarantee, will be replaced by the string "None":

no_guarantee_list = ['no', 'no? no', 'guarantor',
'Liquidity Guarantee','no -','no no',
'Unspecified',
'This type of fund does not exist',
'Uncertain Uncertain','unspecified']
data.loc[data['Guaranteed Liquidity'].isin(no_guarantee_list),'Guaranteed Liquidity'] = "None"

This column has NaN as the value for several entries:

data['Guaranteed Liquidity'].isna().any()

True

They should also be replaced by the string "None":

data['Guaranteed Liquidity'] = data['Guaranteed Liquidity'].fillna("None")

For now, it does not matter which bank or institution guarantees each mutual fund; only the existence of a guarantee matters. This leads us to add a new feature, “has_guarantee”, which is False if “Guaranteed Liquidity” is "None" and True otherwise.

data['has_guarantee'] = data['Guaranteed Liquidity']!="None"
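For reference, the three cleaning steps above can also be folded into one pass. The sketch below is an alternative formulation on a made-up column with a shortened, hypothetical list of “no guarantee” phrases; it is not the code used in this article.

```python
import pandas as pd

# Toy column: "Bank X" is an invented guarantor name.
toy = pd.DataFrame({"Guaranteed Liquidity": ["Bank X", "no", None, "Unspecified"]})

# Shortened, hypothetical list of phrases that mean "no guarantee".
no_guarantee = ["no", "Unspecified"]

col = toy["Guaranteed Liquidity"]

# Replace both the listed phrases and the missing values with the string "None".
toy["Guaranteed Liquidity"] = col.mask(col.isin(no_guarantee) | col.isna(), "None")

# A fund has a guarantee whenever the cleaned value is not "None".
toy["has_guarantee"] = toy["Guaranteed Liquidity"] != "None"
print(toy)
```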

Now we can perform more exploration! In this dataset, the distribution of mutual funds with or without guarantee is as below:

More than half of mutual funds do not have any guarantee to give back the money to investors

It can be seen that there is no guarantee for most of the mutual funds. I was curious to know the percentage of mutual funds in each type which will give an assurance to return your initial money if it is requested.

Although only 33.77% of Fixed Income mutual funds guarantee to give back the initial investment, this percentage is higher than for the other types. This may be one reason why people avoid risk and prefer to invest their money in mutual funds with lower risk and lower profit.
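Because “has_guarantee” is a boolean column, its mean within each group is the fraction of guaranteed funds in that group. The sketch below shows the idea on made-up rows, so the resulting percentages are not the ones reported above.

```python
import pandas as pd

# Toy data: one guaranteed fund out of three Fixed Income funds, none In Stock.
toy = pd.DataFrame({
    "Type": ["Fixed Income", "Fixed Income", "Fixed Income", "In Stock", "In Stock"],
    "has_guarantee": [True, False, False, False, False],
})

# Mean of a boolean column is the fraction of True values; scale to percent.
guarantee_share = toy.groupby("Type")["has_guarantee"].mean() * 100
print(guarantee_share.round(2))
```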

Money without financial intelligence is money soon gone ― Robert T. Kiyosaki, Rich Dad, Poor Dad

As I mentioned earlier, the goal of this article is to find the best Iranian mutual fund In Stock for beginners in this area who want to take more risk than investing in Fixed Income ones, but not too much risk! So, we want to choose the mutual fund which gives a guarantee and also makes more profit for a long time (more than five years).

Criteria

In the stock world, there are many parameters for evaluating the performance of a share, such as Total Return, Beta, and Year Return. However, I used a simple formula to choose the best mutual fund, because using the other criteria requires more professional knowledge and an understanding of how they relate to a stock’s performance (I would like to cover them in the future). The assessment is based on how much the initial investment grows in a particular mutual fund. The following steps are performed to compute the interest of each mutual fund as a means of comparing them:

  • First, compute the interest of one unit: the redemption price − the bid price
  • Then, calculate the number of units that the investor could buy with their initial invested money
  • Finally, multiply the number of bought units by the interest of one unit to calculate the total interest

By doing this, we can find out the total interest in different mutual funds in the given time period.
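The three steps can be sketched as a small function. The function name and parameters are hypothetical, and two assumptions are made that the steps above do not spell out: units are bought at the bid price, and only whole units can be bought.

```python
def total_interest(initial_investment, bid_price, redemption_price):
    """Total interest earned on an initial investment in one mutual fund.

    Assumptions (not stated in the article): units are bought at the bid
    price, and only whole units can be purchased.
    """
    # Step 1: interest earned by a single unit.
    interest_per_unit = redemption_price - bid_price
    # Step 2: number of whole units the initial money can buy.
    units_bought = initial_investment // bid_price
    # Step 3: total interest over all bought units.
    return units_bought * interest_per_unit


# Made-up prices: 100 units bought, each earning 2,500.
print(total_interest(1_000_000, 10_000, 12_500))  # 250000
```

Applying the same function to every fund with the same initial investment gives comparable totals for the chosen time period.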

Final point

Try to know the data before utilizing an impressive complex algorithm.

From the outside, data science is often thought to consist wholly of advanced statistical and machine learning techniques. However, there is another key component to any data science endeavor that is often undervalued or forgotten: exploratory data analysis (EDA). At a high level, EDA is the practice of using visual and quantitative methods to understand and summarize a dataset without making any assumptions about its contents. It is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results — Ref.
