Finding the best Iranian mutual fund to invest your money — Part 1: Web scraping with Scrapy in Python

Fariba Hashemi
14 min read · Sep 22, 2019


“The poor and the middle-class work for money. The rich have money work for them.”
Robert T. Kiyosaki, Rich Dad, Poor Dad

As a new data scientist, I wanted to build my portfolio. However, it was really hard to start a project from scratch without any idea for its subject! It is always recommended that a portfolio project be about something you care about and are interested in. So I took the advice and decided to create a project that helps my husband and me invest our money better. We rarely take risks with our investments and were always afraid of putting our money into stocks, until we became familiar with mutual funds. Here is the definition from Investopedia:

A mutual fund is a type of financial vehicle made up of a pool of money collected from many investors to invest in securities such as stocks, bonds, money market instruments, and other assets. Mutual funds are operated by professional money managers, who allocate the fund’s assets and attempt to produce capital gains or income for the fund’s investors. A mutual fund’s portfolio is structured and maintained to match the investment objectives stated in its prospectus.

Although a mutual fund seems to carry a lower risk than investing in stocks directly, I had no idea how to choose between the different mutual funds! There are some useful websites that give information about them individually, but comparing them and finding the mutual fund with lower risk and higher return is the main problem.

This project is divided into different parts. In this article (the first part) I will show you how to gather the data using the Scrapy library in Python for further processing in the next parts.

Part 1: Web Scraping

At first, I needed data! Since I could not find a useful dataset, I started scraping data to build my own dataset of various mutual funds. The required information includes Name, Guarantee Liquidity, Redemption Price and Issue Price since the starting date, and so forth.

I was not sure which Python framework to use for this purpose: Scrapy or BeautifulSoup? I found a good comparison between them on DataCamp, which helped me choose Scrapy for developing the web crawlers.

Scrapy is the complete package for downloading web pages, processing them, and saving them in files and databases, while BeautifulSoup is basically an HTML and XML parser and requires additional libraries such as requests or urllib2 to open URLs and store the results.

  • I also used BeautifulSoup as an HTML parser in another part of the project.

Install Scrapy

To start using Scrapy, you can install it using the Python package installer, pip:

pip install scrapy

Scrapy Shell

The Scrapy shell is an interactive shell provided for testing purposes, where you can try and debug your scraping code very quickly without having to run the spider. To launch the shell, use the following command:

scrapy shell

Using the command line, you will see something like this after the above command:

The output of scrapy shell command

Consider Financial Information Processing of IRAN (FIPIRAN) as the target website for scraping. Now, you can get data from the web page using the fetch command.

fetch('http://fipiran.com/Fund/MFAll')

The spider or crawler extracts corresponding data and metadata of the given URL and returns them as a response object which can be used with the following commands:

  • view(response) : Open the given URL in your local browser.
  • print(response.text) : Print HTML source code of the web page.
The result of the response returned by the fetch method

Usually, we do not need all of this data; only some of it is valuable enough to be extracted. For instance, if you want to get a list of mutual fund names (the text inside the blue rectangles), the following steps should be taken (the numbers show the corresponding order of the steps in the image below):

1 — Find the table which contains the required data (for example, by its class name, using @class)

2 — For each row in that table (tr)

3 — Select the second cell (column) (td[2])

4 — Get the corresponding hyperlink (the <a> tag)

5 — Select the text inside it with text()

Extract elements by XPath selector

XPath is a language for selecting nodes in XML documents, which can also be used with HTML.

To navigate HTML documents using XPath, it is better to get familiar with its nodes and syntax. Let us continue the previous example of extracting the names of mutual funds with an XPath selector. For this purpose, here is some useful XPath syntax:

  • // : Selects nodes in the document from the current node that match the selection no matter where they are.
  • * : Matches any element node
  • [] : Predicates which are used to find a specific node or a node that contains a specific value are always embedded in square brackets.
  • @ : Selects attributes
  • / : Selects from the root node

So, for selecting the mutual funds’ names, we should use the following command:

response.xpath('//*[@class="table table-striped tablesorter"]//tr/td[2]/a/text()').extract()

We want to get the text inside the <a> tag, which is a child node of the <td> tag. Also, its ancestor is a <tr> node which is a child of a table having the class table table-striped tablesorter. The result will be something like the following image:

Extracting the list of mutual fund names

Creating a New Scrapy project with custom Spider

As mentioned before, the Scrapy shell is mostly used for testing purposes. In this project we need to compare different mutual funds, so we should write a spider to crawl their websites and store the extracted data in a CSV file.

For creating a new Scrapy project, go to the desired folder and use the following command in the command line (mutual_fund is the name of the project):

scrapy startproject mutual_fund
The result of starting a new Scrapy project.

It will create a directory with the following contents:

Contents of the Scrapy project.
mutual_fund/
    scrapy.cfg            # deploy configuration file
    mutual_fund/          # project's Python module, you'll import your code from here
        __init__.py       # initialization file
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py   # initialization file

Now, we can create a custom spider using scrapy genspider <name> <domain>, as follows:

Create a new spider
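The command would have looked something like the following (the domain argument here is only a guess, since allowed_domains is not used later anyway):

scrapy genspider mutual_fund_data fipiran.com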

This will create a new template spider, named mutual_fund_data, under the spiders/ directory of the project. The basic template will look like the one below:

The template of a new spider.
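Since the gist itself is not embedded here, the basic template generated by scrapy genspider typically looks roughly like this (the class name and the placeholder start_urls are what Scrapy generates by default, assuming the command above):

import scrapy


class MutualFundDataSpider(scrapy.Spider):
    name = 'mutual_fund_data'
    allowed_domains = ['fipiran.com']
    start_urls = ['http://fipiran.com/']

    def parse(self, response):
        pass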

The template has some attributes and methods:

  • name: Identifies the Spider. It must be unique within a project.
  • allowed_domains (optional) : A list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled. Since each mutual fund has its own website with a different domain, we do not use this attribute.
  • start_urls : A list of URLs where the spider will begin to crawl from when no particular URLs are specified. So, the first pages downloaded will be those listed here. The subsequent Request will be generated successively from data contained in the start URLs.
  • parse(response) : This method is in charge of processing the response and returning scraped data and/or more URLs to follow.

Let us get back to our project! First, we want to extract each mutual fund's webpage. As we saw in the previous section, on the starting URL, each row of the table contains some information about one mutual fund, such as its name, type, guarantee liquidity, manager, and website. You can access its equivalent HTML code by right-clicking on the row and selecting the Inspect option:

Each row of the table shows the information of one mutual fund.

We need the first link (FundDetails?regno=10765 in the above example) in the selected region for each mutual fund. Initially, in the parse method, we select each row of the table:

response.xpath('//*[@class="table table-striped tablesorter"]//tr')

Then, for each row, we get the URL from the <a> tag inside the second <td> cell.

mutual_fund_url = row.xpath('td[2]//a/@href').extract_first()

In the next step, we can make a Request for the extracted link in the callback. Since the link in the href attribute is a relative link, we have to append it (if it is not None) to the base URL (http://www.fipiran.com).

if mutual_fund_url is not None:
    mutual_fund_url = self.base_url + mutual_fund_url

Each item also has a specific and unique regno in its href link, which will be extracted for use in the next step. We can use the parse_qs method from the urllib.parse library to get the components of the URL, such as the query, params, etc. The regno is stored as the first element in the query component of the URL.

# assuming urllib.parse is imported under the alias urlparse
from urllib import parse as urlparse

parsed = urlparse.urlparse(mutual_fund_url)
fund_number = urlparse.parse_qs(parsed.query)['regno'][0]

After extracting the URL for each specific mutual fund, the next step is to make a callback to the corresponding website. The Scrapy Request object will help us follow and crawl these links.

Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.

scrapy.Request(mutual_fund_url, callback=self.parse_mutual_fund, meta={"item":fund_number})

The Request object has some parameters:

  • url : The URL of the link which we want to crawl in the next step
  • callback : The function that will be called with the response of this request (once it is downloaded) as its first parameter. If a Request doesn’t specify a callback, the spider’s parse() method will be used.
  • method : The HTTP method of this request (e.g. 'GET' or 'POST'); it defaults to 'GET'.
  • meta : A dictionary that contains arbitrary metadata for this request.

In our project, the fund_number (regno) will be sent as the meta of the request, and the response of the request will be processed by the parse_mutual_fund method. Therefore, the parse function should be modified as below:

Processing the crawled data in the parse method.
  • In contrast to the return statement, the yield keyword returns a generator. A generator function is defined like a normal function, but it generates values only when they are needed. A function with a yield statement resumes execution right after the last yield it ran, while a return statement causes the function to exit and terminates the loop. yield is used when we want to iterate over a sequence but do not want to store the entire sequence in memory.
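Since the gist itself is not embedded here, a sketch of the modified parse method, assembled from the snippets above, could look roughly like this (the start URL and base_url are taken from the earlier steps, and the spider class is the one generated above):

import scrapy
from urllib import parse as urlparse


class MutualFundDataSpider(scrapy.Spider):
    name = 'mutual_fund_data'
    start_urls = ['http://fipiran.com/Fund/MFAll']
    base_url = 'http://www.fipiran.com'

    def parse(self, response):
        # Iterate over the rows of the mutual funds table.
        for row in response.xpath('//*[@class="table table-striped tablesorter"]//tr'):
            # Relative link to the fund's details page, taken from the second cell.
            mutual_fund_url = row.xpath('td[2]//a/@href').extract_first()
            if mutual_fund_url is not None:
                mutual_fund_url = self.base_url + mutual_fund_url
                # The unique registration number sits in the query component of the URL.
                parsed = urlparse.urlparse(mutual_fund_url)
                fund_number = urlparse.parse_qs(parsed.query)['regno'][0]
                # Follow the fund's page; regno travels along in the request meta.
                yield scrapy.Request(mutual_fund_url,
                                     callback=self.parse_mutual_fund,
                                     meta={"item": fund_number})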

After getting the specific link of each mutual fund, the corresponding webpage should now be crawled (in the callback method). FIPIRAN provides a specific page for each mutual fund which shows its information, like below (in this image, the original page has been translated to English for better understanding):

The web page of “Kiyan Stocks Fund” provided by FIPIRAN

We want to gather all of the selected information.

1 — In the first part (the purple rectangle), we have to extract data from a Highchart, an interactive JavaScript chart on the webpage. The NAV (Net Asset Value) and the number of Units (y-axis) for different dates (x-axis) are shown by the blue and red curves, respectively. Therefore, each point of the chart has two values for each date (NAV, Unit).

The NAV and Unit for the specific date (date: 09/09/1397, NAV = 3048703, Unit= 278590)

The script in the page source starts with:

The Highchart script

The date is in timestamp format, and the NAV and Unit values are provided in the series attribute. series is a list with two elements (one for NAV and the other for Unit), both of which are of type dict. Each element has three attributes: data, name, and color. data is a list of dicts, each containing x and y values. name and color specify the label and color of the NAV or Unit series.

In order to get the timestamps, NAVs, and Units, a regular expression is needed. We search for everything that matches the pattern data:(.*?) }\] ; it will find everything between data: and }] at every occurrence of the pattern. Python's re library is used for this purpose. This module provides regular expression matching operations.

info_series = re.findall(r'data:(.*?) }\]', response.text)

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.

It will find two matches: a list of x, y pairs for NAV and another for Unit. Each match is of type str, that is, a string representation of a list. Here is an example of info_series[0][0:100]:

'[{ x: 1395862200000, y: 1048079 }, { x: 1395948600000, y: 1048662 }, { x: 1396035000000, y: 1049084'

However, we need to convert this string into a valid list of dictionaries. Do not worry! Python's ast (Abstract Syntax Tree) module provides the literal_eval method to safely evaluate an expression node or a string containing a Python literal or container display.

Before the conversion step, we should prepare our string so that it has valid Python syntax for a list of dictionaries. Since the keys of the dictionaries are of type str, we should put them between single or double quotes.

We also lost the trailing }] characters of each match when applying the pattern, so it is necessary to add them back before the next step.

The function to prepare and convert the string representation of a dictionary to a dict
Convert str data of NAV and Unit to valid Python lists
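Those gists are not visible here, but a minimal sketch along the lines just described, using a hypothetical helper named prepare_series, could be:

import ast

def prepare_series(raw):
    # Quote the dictionary keys so the string becomes valid Python syntax,
    # and restore the trailing '}]' that the regex pattern stripped off.
    prepared = raw.replace('x:', "'x':").replace('y:', "'y':") + ' }]'
    # Safely evaluate the string into a real list of dictionaries.
    return ast.literal_eval(prepared)

NAV_series = prepare_series(info_series[0])
UNIT_series = prepare_series(info_series[1])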

There is some redundant data here: for each timestamp, there is one NAV value and one number of Units, yet the timestamp is stored twice, as the x variable in both the NAV and Unit lists. It is a good idea to combine this information into aggregated data such as below:

{timestamp:{'NAV' : NAV_Value, 'Unit' : Unit_Value}}

  • First, combine NAV_series and Unit_series together such that the two elements of each resulting tuple share the same timestamp.

((timestamp, NAV_Value), (timestamp, Unit_Value))

We make use of the zip function in Python.

zip makes an iterator that aggregates elements from each of the iterables. It returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables.

NAV_Unit = list(zip(NAV_series,UNIT_series))
  • Second, remove one of the timestamps (x) from them. For this purpose, the timestamp will be kept as the key of the dictionary. A Python dictionary comprehension is useful here:
info = {nav_unit[0]['x']//1000:{'NAV': nav_unit[0]['y'], 'UNIT':nav_unit[1]['y']} for nav_unit in NAV_Unit}

Then, we should make a callback to another link, explained in the next section, to gather the fund specification.

All in all, the final parse_mutual_fund will be the same as below:

The parse_mutual_fund function
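Again, the embedded gist is not shown here, so here is a sketch of how parse_mutual_fund could look, assembled from the snippets above. It uses the hypothetical prepare_series helper sketched earlier, assumes import re at the top of the module, and takes the specification URL and the meta keys fund_number and nav_unit from the next section:

def parse_mutual_fund(self, response):
    fund_number = response.meta["item"]

    # Pull the two "data: [...]" series (NAV and Unit) out of the Highcharts script.
    info_series = re.findall(r'data:(.*?) }\]', response.text)
    NAV_series = prepare_series(info_series[0])
    UNIT_series = prepare_series(info_series[1])

    # Pair NAV and Unit values that share the same timestamp, then keep the
    # timestamp (converted from milliseconds to seconds) once, as the key.
    NAV_Unit = list(zip(NAV_series, UNIT_series))
    info = {nav_unit[0]['x'] // 1000: {'NAV': nav_unit[0]['y'], 'UNIT': nav_unit[1]['y']}
            for nav_unit in NAV_Unit}

    # Request the fund specification page, carrying the processed data in meta.
    spec_url = 'http://www.fipiran.com/Fund/MFwithRegNo?regno=' + fund_number
    yield scrapy.Request(spec_url,
                         callback=self.parse_mutual_fund_specification,
                         meta={"fund_number": fund_number, "nav_unit": info})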

2 — For the second part (the red rectangle), the fund specification is obtained by sending a GET request to another link which contains the fund_number (regno). Here is the corresponding script in the page source.

The script of getting Kiyan Stocks Fund specifications with regno=’11477’.
  • That is the reason for making a Request in the last line of the parse_mutual_fund function. All other processed data related to each mutual fund will be sent as meta argument in the Request object.

We have to send a GET request to the following link, with fund_number (regno) appended to the end of it:

http://www.fipiran.com/Fund/MFwithRegNo?regno=
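For Kiyan Stocks Fund (regno=11477), for example, the complete URL would be:

http://www.fipiran.com/Fund/MFwithRegNo?regno=11477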

The response for Kiyan Stocks Fund with regno=11477 will be:

The response of GET request for getting the fund specification

As represented by the blue rectangles, we need to inspect this HTML response to gather the required information. This requires parsing the HTML source of the response, which will be done with the Beautiful Soup library.

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

We should import BeautifulSoup from the bs4 library and create a new BeautifulSoup object using Python's html.parser. Then, we tell it to find all elements with the div tag and class=col-md-4 col-xs-12 col-sm-4 col-ms-12.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
divs = soup.findAll("div", {"class": "col-md-4 col-xs-12 col-sm-4 col-ms-12"})

Now, a dict is needed to store extracted data as key-value pairs.

We will also perform some post-processing to remove \u200c (the zero-width non-joiner), which gets attached to some Persian characters, while adding this information to the dictionary (the reason is explained in this link):

The dirty crawled data!

We take advantage of Python's dictionary comprehension feature to build it:

fund_info = {div.contents[0].replace('\u200c', '').replace(" : ",""):div.contents[1].text.strip() for div in divs}

Finally, all of the crawled data will be stored in a .csv file, in the following format for each mutual fund:

{"fund_number":response.meta["fund_number"], **fund_info, "nav_unit":response.meta["nav_unit"]}
  • ** : This is generally considered a trick in Python where a single expression is used to merge two dictionaries and store the result in a third dictionary. See the following example:
The example of using * and ** for unpacking iterable and dictionary, respectively. [Source]

According to PEP 448, in Python * is the iterable unpacking operator and ** is the dictionary unpacking operator; they allow unpacking in more positions, an arbitrary number of times, and in additional circumstances: specifically, in function calls, in comprehensions and generator expressions, and in displays.
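As a quick illustration (standing in for the image above), here is what * and ** unpacking look like in practice, using values taken from the fund example earlier:

# * unpacks iterables into a new list; ** unpacks dictionaries into a new dict.
first = {'fund_number': '11477'}
second = {'NAV': 3048703, 'Unit': 278590}

merged = {**first, **second}   # {'fund_number': '11477', 'NAV': 3048703, 'Unit': 278590}
combined = [*range(3), *'ab']  # [0, 1, 2, 'a', 'b']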

To sum up, the parse_mutual_fund_specification method will be as below:

The parse_mutual_fund_specification function to process the fund specification data
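Since this gist is not visible either, a sketch of the method, built from the snippets above, might look like this (it assumes from bs4 import BeautifulSoup at the top of the spider module):

def parse_mutual_fund_specification(self, response):
    # Parse the returned HTML fragment with Beautiful Soup.
    soup = BeautifulSoup(response.text, 'html.parser')
    divs = soup.findAll("div", {"class": "col-md-4 col-xs-12 col-sm-4 col-ms-12"})

    # Build {label: value} pairs, stripping the zero-width non-joiner (\u200c)
    # and the " : " separator from the labels.
    fund_info = {div.contents[0].replace('\u200c', '').replace(" : ", ""): div.contents[1].text.strip()
                 for div in divs}

    # Yield one item per fund, merging the specification with the NAV/Unit data
    # passed along in the request's meta.
    yield {"fund_number": response.meta["fund_number"],
           **fund_info,
           "nav_unit": response.meta["nav_unit"]}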

Storing data

For further analysis of the crawled data, we need to export it to a database or to a CSV or JSON file.

  • To save data in CSV format, open the settings.py file in the main directory of the project and add the following lines to the end of it:
FEED_FORMAT = "csv"
FEED_URI="mutual_fund.csv"
  • To save data in JSON format, add the following lines at the end of the same file:
FEED_FORMAT = "json"
FEED_URI="mutual_fund.json"
  • Note: Whenever you run the spider, it will append to the end of the file; it will not create a new one.
  • FEED_FORMAT : The format to be used for storing the crawled data. Possible values are JSON, JSON lines, CSV, and XML.
  • FEED_URI : The location of the output file. You can store the file on your local file system or on an FTP server as well.

It will add a file (in the specified format) at the given URI.
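With these settings in place, running the spider from the project directory produces the export file (the spider name is the one created earlier):

scrapy crawl mutual_fund_data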

Final Point

Web scraping is not a generic task, and you cannot write a single spider that gets information from all websites! The reason is that every website has its own structure. Keep in mind that while crawling the data can be a lot of work, the extracted data still has not been cleaned enough to perform analysis on it.

This was the first part of my “Finding the best mutual fund to invest your money” project, in which I gathered the required data on different mutual funds by scraping it with the Scrapy library in Python. I will publish further parts in the near future to find out which mutual fund is the better one to invest our money in. The source code can be found in my GitLab repository.

Create your own dataset

This was my first scraping experience! I wanted to start my own project, but I could not find any existing dataset of Iranian mutual funds with which to compare them. Therefore, I decided to crawl the required data myself, and it was an interesting experience.

“The question you should be asking isn’t, “What do I want?” or “What are my goals?” but “What would excite me?”– Tim Ferriss
