Finding the best Iranian mutual fund to invest your money — Part 1: Web scraping with Scrapy in Python
As a new data scientist, I wanted to build my portfolio. However, it was really hard to start a project from scratch without any idea for its subject! It is always recommended that a portfolio project should be about something you care about and are interested in. Therefore, I took the advice and decided to create a project that helps my husband and me invest our money better. We rarely take risks with our investments and were always afraid of putting our money in stocks, until we became familiar with mutual funds. Here is the definition from Investopedia:
A mutual fund is a type of financial vehicle made up of a pool of money collected from many investors to invest in securities such as stocks, bonds, money market instruments, and other assets. Mutual funds are operated by professional money managers, who allocate the fund’s assets and attempt to produce capital gains or income for the fund’s investors. A mutual fund’s portfolio is structured and maintained to match the investment objectives stated in its prospectus.
Although it seems to carry a lower risk than investing in stocks directly, I had no idea how to choose between different mutual funds! There are some useful websites which give information about each fund individually, but comparing them and finding the mutual fund with lower risk and higher return is the main problem.
This project is divided into several parts. In this article (the first part) I will show you how to gather the data using the Scrapy library in Python, for further processing in the next parts.
Part 1: Web Scraping
At first, I needed data! Since I could not find a useful dataset, I started to scrape the data and build my own dataset of various mutual funds. The required information includes Name, Guarantee Liquidity, Redemption Price and Issue Price since the starting date, and so forth.
I was not sure which Python framework to use for this purpose — Scrapy or BeautifulSoup? I found a good comparison between them on Datacamp, which helped me choose Scrapy for developing the web crawlers.
Scrapy is a complete package for downloading web pages, processing them, and saving the results in files and databases, while BeautifulSoup is basically an HTML and XML parser and requires additional libraries such as requests or urllib2 to open URLs and store the results.
- I also used BeautifulSoup as an HTML parser in another part of the project.
Install Scrapy
To start using Scrapy, you can install it with the Python package installer pip:
pip install scrapy
Scrapy Shell
The Scrapy shell is provided as an interactive shell for testing purposes, where you can try and debug your scraping code very quickly without having to run the spider. To launch the shell, use the following command:
scrapy shell
In the command line, you will see something like this after running the above command:
Consider Financial Information Processing of IRAN (FIPIRAN) as the target website for scraping. Now, you can get data from the web page using the fetch command:
fetch('http://fipiran.com/Fund/MFAll')
The spider or crawler extracts corresponding data and metadata of the given URL and returns them as a response object which can be used with the following commands:
- view(response): Opens the given URL in your local browser.
- print(response.text): Prints the HTML source code of the web page.
Usually, we do not need all of this data; only some of it is valuable enough to be extracted. For instance, if you want to get a list of mutual fund names (the text inside the blue rectangles), the following steps should be taken (the numbers show the corresponding order of the steps in the below image):
1 — Find the table which contains the required data (for example, by its class name, using @class)
2 — For each row in that table (tr)
3 — Select the second cell (column) (td[2])
4 — Get the corresponding hyperlink (the <a> tag)
5 — Select the text inside it with text()
Extract elements by XPath selector
XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
For navigating HTML documents using XPath, it is better to get familiar with its nodes and syntax. Let us continue the previous example of extracting the names of mutual funds with an XPath selector. For this purpose, here is some useful XPath syntax:
- //: Selects nodes in the document from the current node that match the selection, no matter where they are.
- *: Matches any element node.
- []: Predicates, which are used to find a specific node or a node that contains a specific value; they are always embedded in square brackets.
- @: Selects attributes.
- /: Selects from the root node.
So, for selecting the mutual funds’ names, we should use the following command:
response.xpath('//*[@class="table table-striped tablesorter"]//tr/td[2]/a/text()').extract()
We want to get the text inside the <a> tag, which is the child node of a <td> tag. Also, its ancestor is a <tr> tag, which is the child of a table having the class table table-striped tablesorter. The result will be something like the following image:
Creating a New Scrapy project with custom Spider
As mentioned before, the Scrapy shell is mostly used for testing purposes. In this project we need to compare different mutual funds; therefore, we should write a spider to crawl their websites and store the extracted data in a CSV file.
For creating a new Scrapy project, go to the desired folder and use the following command in the command line (mutual_fund is the name of the project):
scrapy startproject mutual_fund
It will create a directory with the following contents:
mutual_fund/
    scrapy.cfg            # deploy configuration file
    mutual_fund/          # project's Python module, you'll import your code from here
        __init__.py       # initialization file
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py   # initialization file
Now, we can create a custom spider using scrapy genspider <name> <domain>, for example:
scrapy genspider mutual_fund_data fipiran.com
This will create a new template spider, named mutual_fund_data
, under the spiders/
directory of the project. The basic template will be the same as the below:
The template has some attributes and methods:
- name: Identifies the spider. It must be unique within a project.
- allowed_domains (optional): A list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled. Since each mutual fund has its own website with a different domain, we do not use this attribute.
- start_urls: A list of URLs where the spider will begin to crawl when no particular URLs are specified. So, the first pages downloaded will be those listed here. Subsequent Requests will be generated successively from data contained in the start URLs.
- parse(response): This method is in charge of processing the response and returning scraped data and/or more URLs to follow.
Let us get back to our project! First, we want to extract each mutual fund's webpage. As we saw in the previous section, at the starting URL each row of the table contains some information about one mutual fund, such as its name, type, guarantee liquidity, manager, and website. You can access the equivalent HTML code by right-clicking on the row and selecting the Inspect option:
We need the first link (FundDetails?regno=10765 in the above example) in the selected region for each mutual fund. Initially, in the parse method, we select each row of the table:
response.xpath('//*[@class="table table-striped tablesorter"]//tr')
Then, for each row, we get the URL from the <a> tag inside the second <td> tag:
mutual_fund_url = row.xpath('td[2]//a/@href').extract_first()
In the next step, we can issue a Request for the extracted link in the callback. Since the link in the href attribute is a relative link, we have to prepend the base URL ("http://www.fipiran.com") to it (if it is not None).
if mutual_fund_url is not None:
mutual_fund_url = self.base_url + mutual_fund_url
Each item also has a specific and unique regno in its href link, which will be extracted for use in the next step. We can use the parse_qs function from the urllib.parse module to get components of the URL, such as the query and parameters. The regno is stored as the first element in the query component of the URL.
import urllib.parse as urlparse

parsed = urlparse.urlparse(mutual_fund_url)
fund_number = urlparse.parse_qs(parsed.query)['regno'][0]
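For example, running those lines on one of the fund links (with the regno value from the screenshot above):

```python
import urllib.parse as urlparse

# a fund-details URL as extracted from the table
url = 'http://www.fipiran.com/Fund/FundDetails?regno=10765'

parsed = urlparse.urlparse(url)
# parse_qs maps each query parameter to a list of its values
fund_number = urlparse.parse_qs(parsed.query)['regno'][0]
print(fund_number)  # prints 10765
```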
After extracting the URL for each specific mutual fund, the next step will be to make a callback to the corresponding website. The Scrapy Request object will help us to crawl the following links.
Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
scrapy.Request(mutual_fund_url, callback=self.parse_mutual_fund, meta={"item":fund_number})
The Request object has some parameters:
- url: The URL of the link which we want to crawl in the next step.
- callback: The function that will be called with the response of this request (once it is downloaded) as its first parameter. If a Request doesn't specify a callback, the spider's parse() method will be used.
- method: The HTTP method of this request ("GET" by default).
- meta: A dictionary that contains arbitrary metadata for this request.
In our project, the fund_number (regno) will be sent as the meta of the request, and the response of the request will be processed by the parse_mutual_fund method. Therefore, the parse function should be modified like below:
- In contrast to the return statement, the yield keyword returns a generator. A generator function is defined like a normal function, but it generates values only when they are needed. A function with a yield statement continues execution immediately after the last yield run, while a return statement causes the function to exit and also terminates the loop. yield is used when we want to iterate over a sequence but don't want to store the entire sequence in memory.
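A minimal illustration of the difference:

```python
def first_two_squares():
    # execution pauses at each yield and resumes right after it
    yield 1 * 1
    yield 2 * 2

gen = first_two_squares()  # nothing runs yet; we just get a generator
print(next(gen))  # 1
print(next(gen))  # 4
```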
After getting the specific link of each mutual fund, the corresponding webpage should now be crawled (in the callback method). FIPIRAN provides a specific page for each mutual fund, which shows its information like below (in this image, the original page was translated to English for better understanding):
We are willing to gather all of the selected information.
1 — In the first part (the purple rectangle), we have to extract data from a Highchart, an interactive JavaScript chart for web pages. The NAV (Net Asset Value) and the number of Units (y-axis) for different dates (x-axis) are shown with the blue and red diagrams, respectively. Therefore, each point of the chart has two values per date (NAV, Unit).
The relevant script in the page source contains the chart data. The date is in timestamp format, and the NAV and Unit values are provided in the series attribute. series is a list with two elements (one for the NAV and the other for the Unit), both of dict type. Each element has three attributes: data, name, and color. data is a list of points, each holding the x and y values; name and color specify the label and color of the NAV or Unit series.
In order to get the timestamps, NAVs, and Units, a regular expression is needed. We search for everything that matches the pattern data:(.*?) }\]; it will find everything between data: and }\] in every occurrence of the pattern. Python's re module, which provides regular expression matching operations, is used for this purpose.
info_series = re.findall(r'data:(.*?) }\]', response.text)
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.
It will find two matches: a list of x, y pairs for the NAV and another for the Unit. The type of each match is str, a string representation of a list. Here is an example of info_series[0][0:100]:
'[{ x: 1395862200000, y: 1048079 }, { x: 1395948600000, y: 1048662 }, { x: 1396035000000, y: 1049084'
However, we need to convert this string to a valid list of dictionaries. Do not worry! Python's ast (Abstract Syntax Tree) module provides the literal_eval function to safely evaluate an expression node or a string containing a Python literal or container display.
Before the conversion step, we should prepare our string to have valid Python syntax for a list of dictionaries. Since the keys of the dictionaries are of type str, we should put them between single or double quotes.
The regex pattern also consumed the trailing }] characters of each match, so it is necessary to add them back before the next step.
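Putting those two preparation steps and the conversion together on one (made-up, truncated) match:

```python
import ast

# a captured group as returned by re.findall: the trailing ' }]' was
# consumed by the pattern, and the keys x and y are unquoted
match = '[{ x: 1395862200000, y: 1048079 }, { x: 1395948600000, y: 1048662'

# 1. restore the stripped ' }]'   2. quote the bare keys
prepared = (match + ' }]').replace('x:', "'x':").replace('y:', "'y':")

# safely evaluate the string into a real list of dicts
NAV_series = ast.literal_eval(prepared)
print(NAV_series[0])  # {'x': 1395862200000, 'y': 1048079}
```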
There is some redundant data here: for each timestamp there are a NAV value and a number of Units, and the timestamp is stored twice, as the x variable in both the NAV and Unit lists. It is a good idea to combine this information into aggregated data such as below:
{timestamp:{'NAV' : NAV_Value, 'Unit' : Unit_Value}}
- First, combine NAV_series and Unit_series together such that both elements of each tuple have the same timestamp.
((timestamp, NAV_Value), (timestamp, Unit_Value))
We make use of Python's zip function.
zip: Make an iterator that aggregates elements from each of the iterables. Returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables.
NAV_Unit = list(zip(NAV_series,UNIT_series))
- Second, remove one of the timestamps (x) from them. For this purpose, the timestamp will be kept as the key of the dictionary. A Python dictionary comprehension is useful here:
info = {nav_unit[0]['x']//1000:{'NAV': nav_unit[0]['y'], 'UNIT':nav_unit[1]['y']} for nav_unit in NAV_Unit}
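On a tiny made-up pair of series with a single point each, the two steps give:

```python
# one (timestamp, value) point per series; values invented for illustration
NAV_series = [{'x': 1395862200000, 'y': 1048079}]
UNIT_series = [{'x': 1395862200000, 'y': 311}]

NAV_Unit = list(zip(NAV_series, UNIT_series))
# the timestamp (// 1000 converts milliseconds to seconds) becomes the key
info = {nav_unit[0]['x'] // 1000: {'NAV': nav_unit[0]['y'], 'UNIT': nav_unit[1]['y']}
        for nav_unit in NAV_Unit}
print(info)  # {1395862200: {'NAV': 1048079, 'UNIT': 311}}
```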
Then, we should make a callback to another link, which will be explained in the next section, to gather the fund specification.
All in all, the final parse_mutual_fund method will be as below:
2 — For the second part (the red rectangle), the fund specification is obtained by sending a GET request to another link, which contains the fund_number (regno). Here is the relevant script in the page source:
- That is the reason for making a Request in the last line of the parse_mutual_fund function. All other processed data related to each mutual fund is sent along as the meta argument of the Request object.
We have to send a GET request to the following link, with the fund_number (regno) added to the end of it:
http://www.fipiran.com/Fund/MFwithRegNo?regno=
The response for the Kiyan Stocks Fund with regno=11477 will be:
As represented with the blue rectangles, we need to inspect this HTML response to gather the required information. This requires parsing the HTML source of the response, which will be done with the Beautiful Soup library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
We should import BeautifulSoup from the bs4 library, and create a new BeautifulSoup object using Python's html.parser. Then, we tell it to find all elements with the div tag and class=col-md-4 col-xs-12 col-sm-4 col-ms-12.
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
divs = soup.findAll("div", {"class": "col-md-4 col-xs-12 col-sm-4 col-ms-12"})
Now, a dict is needed to store the extracted data as key-value pairs.
We will also perform some post-processing to remove \u200c (the zero-width non-joiner, which is attached to some Persian characters) after adding this information (the reason is explained in this link):
We take advantage of Python's dictionary comprehension feature to build it:
fund_info = {div.contents[0].replace('\u200c', '').replace(" : ",""):div.contents[1].text.strip() for div in divs}
Finally, all of the crawled data will be stored in a .csv file in the following format for each mutual fund:
{"fund_number":response.meta["fund_number"], **fund_info, "nav_unit":response.meta["nav_unit"]}
**: This is generally considered a trick in Python where a single expression is used to merge two dictionaries and store the result in a third dictionary. See the following example:
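For instance (the values here are made up for illustration):

```python
ids = {'fund_number': '11477'}
details = {'Manager': 'Example Co'}

# ** unpacks both dictionaries into a new, third dictionary
merged = {**ids, **details}
print(merged)  # {'fund_number': '11477', 'Manager': 'Example Co'}
```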
According to PEP 448, in Python, * is the iterable unpacking operator and ** is the dictionary unpacking operator; they allow unpacking in more positions, an arbitrary number of times, and in additional circumstances: in function calls, in comprehensions and generator expressions, and in displays.
To sum up, the parse_mutual_fund_specification method will be as below:
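Here is a sketch that combines the snippets of this section (the real implementation is in the linked GitLab repository; the method is written as a standalone function here so the parsing logic can be read in isolation):

```python
from bs4 import BeautifulSoup


# in the spider this is a method; self is kept to mirror that signature
def parse_mutual_fund_specification(self, response):
    # parse the HTML of the specification response with Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')
    divs = soup.findAll("div", {"class": "col-md-4 col-xs-12 col-sm-4 col-ms-12"})
    # label text as key (with ' : ' and zero-width non-joiners removed),
    # value text as value
    fund_info = {div.contents[0].replace('\u200c', '').replace(" : ", ""):
                 div.contents[1].text.strip() for div in divs}
    yield {"fund_number": response.meta["fund_number"],
           **fund_info,
           "nav_unit": response.meta["nav_unit"]}
```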
Storing data
For further analysis of the crawled data, we need to export it to a database or as a CSV or JSON file.
- To save data in CSV format, open the settings.py file in the main directory of the project and add the following lines to the end of it:
FEED_FORMAT = "csv"
FEED_URI="mutual_fund.csv"
- To save data in JSON format, add the following lines at the end of the same file:
FEED_FORMAT = "json"
FEED_URI="mutual_fund.json"
- Note: Whenever you run the spider, it will append to the end of the file; it will not create a new one.
- FEED_FORMAT: The format to be used for storing the crawled data. Possible values: JSON, JSON lines, CSV, and XML.
- FEED_URI: The location of the output file. You can store the file on your local file system or on an FTP server as well.
It will add a file (with specified format) in the given URI.
Final Point
Scraping data is not a one-size-fits-all task, and you cannot write one spider to get information from all websites! The reason is that every website has its own structure. Keep in mind that while crawling the data can be a lot of work, the extracted data still has not been cleaned enough to perform analysis on it.
This was the first part of my "Finding the best mutual fund to invest your money" project, in which I tried to gather the required data about different mutual funds using the Scrapy library in Python. I will publish further parts in the near future to find out which mutual fund is the better place to invest our money. The source code can be found in my GitLab repository.
Create your own dataset
This was my first scraping experience! I wanted to start my own project, but I could not find any dataset of existing Iranian mutual funds to compare. Therefore, I decided to crawl the required data myself, and it was an interesting experience.
“The question you should be asking isn’t, “What do I want?” or “What are my goals?” but “What would excite me?”– Tim Ferriss