Hissing the Python Way: Scraping ishares.com to get index constituents (=w00t!=)

Franklin Schram
10 min read · Jul 2, 2023


Hey psss pssssss! Yo! You was looking for free MSCI or S&P Equity Index constituents innit? Updated daily? I source me stash from the best my friend, cooked up to high standards. So huh… you interested? Read on.

-> Can’t be bothered reading? Snatch the Excel file here: .XLSX <-
-> Want to see the code? Follow this link <-
-> Want to try it out for yourself? Get the Jupyter Notebook here <-

Where do I start with this one… My mate Charlie (← check him out if you’re looking for a quant with strong maths, R and 🍖BBQ skills🍖) wanted to generate equity trading signals with his Bayesian sentiment algorithm so… I thought about building an automated pipeline to help him out. Quite naively I told him: “Charlie, let me handle that stuff for you. I just need to make a few calls.” (API calls I mean… Not phone calls… I don’t really know anyone in the industry anymore).

Designing and testing his algorithm took him a whole dissertation; extending his framework to many other instruments was going to prove yet another difficult endeavour. Charlie needed to acquire more data to test all of this stuff, he needed… AN INVESTMENT UNIVERSE. “How about… THE S&P 500?” I offered. The data should be quite easy to get, no? Most of that stuff should also have decent Open Source Intelligence we can mine. Little did I know what I was getting myself into…

You see, the S&P folks and the other financial data vendors wanted MY MONEY to give me index constituents. WHO DO THEY THINK CHARLIE AND I ARE? I told Charlie: “Charlie! Charlie, they want OUR MONEY!” Do you guys know what Charlie answered? Well… he didn’t answer anything, because unlike me he has a social life and doesn’t spend his weekends on Discord, so I left him a message. I said: “Charlie, we’re gonna hack our way through this.” And as I looked at the white blinking cursor of my Terminal I started wondering… WHERE CAN I GET INDEX DATA FOR FREE?

THE CALL OF THE TERMINAL — With its open prompt, like an invitation (and the neofetch ASCII Unix pr0n to show off… #shiver_me_timbers)

1. ishares.com “=w00t!=”

Aren’t the ETF guys supposed to replicate indexes? There are various ways of achieving that of course, but for plain vanilla stocks (i.e. fairly well traded ones) my guess is that physical replication has to be the default process. So with a bit of nosing around I ended up -> HERE <- (the ishares website). They have a handy screener that enabled me to restrict the number of ETFs with filters (i.e. removing fixed income, commodities and other alts). I also constrained the search to their Irish umbrella company because… I like Guinness (and also because UCITS regulation limits the use of derivatives, hence pushing ETF providers down the route of physical replication, thanks BrusselZ).

S&P 500 “CORE” accumulating USD that’s the one I want…

I quickly found what I was looking for, yet one question remained: DO THEY GIVE OUT THE HOLDINGS? There, in the upper right-hand corner of the page, something looked… Promising.

THE X SYMBOL… X always marks the spot.

Should you choose to click on the X, your favourite tabular data editor will open up. The data team has been kind enough to include various fields in there, including the handy “Asset Class” one, which will help me strip out the various non-equity bits (like cash) when I process the data.

Oh man — I can’t believe this is available for FREE

Me Mom always told me: “Franklin, when someone offers you a gift you need to say thank you”. So huh… Thanks Blackrock for giving me free equity index membership with weights, sectors, tickers and stock names, updated daily. Thanks also for giving away a free data schema I can use to pump all that stuff into a database, and huh… thanks for sending money to all those poor data vendors; I know they obsess over subscriptions and/or pay-as-you-go schemes, but everyone has got the right to make an honest buck to feed the kids innit?

My investigation could have stopped there, yet something lingered annoyingly in the depths of my mind: flashbacks of my time as an analyst, clicking through FACTSET menus, launching BLOOMBERG commands and feeling frustrated because…

  • MSCI DATA WAS UNAVAILABLE
  • JPM BOND DATA WAS MISSING
  • BARCAP DATA WAS... YOU KNOW.

SUBSCRIPTIONS. Why does everyone want me to subscribe to their channels, to their ideas, to their data streams, TO THEIR VIEWS; WHY DO I NEED TO HIT THE FOLLOW OR SUBSCRIBE BUTTON?

Want to check something else? Have a look at this:

MSCI Equity indexes constituents — Yes, you can get them for free! LAGGED BY TWO MONTHS AFTER THE LAST QUARTERLY INDEX REVIEW

AND THIS:

TWO MONTHS LAG — AND THEY DON’T EVEN GIVE YOU A TICKER, CUSIP, FIGI OR ISIN CODE. THANKS MSCI

2. requestsSSSSSSsssssss (hissing)

So… HOW DO WE GET INDEX CONSTITUENTS FOR FREE?

This is the plan:

  • First - we scrape the ishares screener page and get links to the ishares product pages (the pages where details of a single ETF are available)
  • Second - we scrape the product pages to get all sorts of data, INCLUDING THE LINK TO THE HOLDINGS SPREADSHEET.
  • Third — WE DOWNLOAD ALL OF THEM HOLDINGS SPREADSHEETS.

Let’s get started! When I saw the ishares “screener” I feared the web page might be bloated with Javascript, so quite hesitantly I clicked a few more buttons to see THE PAGE SOURCE CODE.

<!-- COMPONENT: Product Screener/ProductScreenerLinkiShares --><noscript> ; Cheers bruv’

Oh my! What do we have here?! Why does using the screener return the data characteristics of ALL ishares ETFs inside the webpage source code? If you just pass the URL with no parameters, you get a very different result WITH NO DATA. I ain’t no Javascript guy so don’t quote me on the following, okay? I am guessing that upon landing on the page the “ETF screener script” loads up, and an Ajax query (or the script itself) pumps the whole contents of a database request into the page so the user can interact dynamically with the screener’s table without having to reload the site. So… I guess this could be a design choice? Maybe?

The data is a bit of a mess but seems to follow the loose configuration below. (Believe it or not, some fields are missing and do not fall inside the data schema at all — data for US listed ETFs seems slightly different to European listed ones. A different data schema because of different teams?)

Basic Data Schema

<tr>
<th class="header-line">Ticker</th>
<th class="header-line">Fund Name</th>
<th class="header-line">Share Class Currency</th>
<th class="header-line">Share<br/>Class</th>
<th class="header-line">Distribution<br/>Type</th>
<th class="header-line">TER (%)</th>
<th class="header-line">AUM<br/>(M)</th>
<th class="header-line">As Of</th>
<th class="header-line">Domicile</th>
<th class="header-line">Distribution<br/>Yield</th>
</tr>
Example of how the data is presented within the page.

<tr>
<td class="links"><a href="/uk/professional/en/products/280503/ishares-sp-500-energy-sector-ucits-etf">IUES</a></td>
<td class="links"><a href="/uk/professional/en/products/280503/ishares-sp-500-energy-sector-ucits-etf">iShares S&P 500 Energy Sector UCITS ETF</a></td>
<td class="column-left-line">USD</td>
<td class="column-left-line">- </td>
<td class="column-left-line">Accumulating</td>
<td class="column-left-line">0.15</td>
<td class="column-left-line">1,320.92</td>
<td class="column-left-line">Jun 02, 2023</td>
<td class="column-left-line">Ireland</td>
<td class="column-left-line">-</td>
</tr>

So… where did I put my toolbox? Must have left it in me van. Can you guys see it? Oh! Just right there -> time to get the scraping kit out. Ladies and gentlemen, may I direct your attention to the following URL (which points to the ishares screener):

-- I have rearranged the URL a bit; normally all of this would be on one line

https://www.ishares.com/uk/professional/en/products/etf-investments#/?
productView=all
&fac=43535%7C43580%7C43581%7C43584%7C43585%7C43615
&pageNumber=1
&sortColumn=totalFundSizeInMillions
&sortDirection=desc
&dataView=keyFacts
&keyFacts=all

The URL does look like an internal API call which is hidden from you (the end user); note that the fac filter values are separated by %7C, which is just a URL-encoded pipe character “|”. So we’ll need to do it the dirty way and grab THE WHOLE PAGE.

import requests # <- This is where the joke comes from (requestsssss!)

# Specifying URL
url = 'https://www.ishares.com/uk/professional/en/products/etf-investments'

# Headers - Make believe we're a web browser
headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}

# Passing URL parameters just like for a normal API to fool the webpage
params = {
'switchLocale' : 'y',
'siteEntryPassthrough' : 'true',
'productView': 'all',
'domicile' : 'Ireland',
'sortColumn' : 'totalFundSizeInMillions',
'sortDirection' : 'desc',
'dataView' : 'keyFacts',
'keyFacts' : 'totalFundSizeInMillions',
'showAll' : 'true'
}

response = requests.get(url, headers=headers, params=params)

# That should get you the same output as on the animated GIF.
print(response.text)

Why are we doing this? We’re interested in the web links to the product pages; that’s where we will find the spreadsheets with the holdings.

3. Fixing meself a beautiful (hissing) Ssssssssssoup

So after a little bit of HTML parsing with Beautiful Soup (one of the Python libraries for parsing and extracting data from HTML tags, classes and more) we get this -> file <- (I’ll spare you the full code; if you want to check it out, you can have a look at this).
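For the curious, here is a minimal sketch of that parsing step, reusing the response object from the snippet above. It assumes the <tr>/<td class="links"> layout shown earlier; the output file name and column labels are mine, not the real notebook’s.

# Minimal sketch of the Beautiful Soup step (illustrative, not the notebook's code)
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser') # <- 'response' from the snippet above

rows = []
for tr in soup.find_all('tr'):
    link = tr.find('a', href=True)
    if link is None:
        continue # header rows carry no product link
    rows.append({
        'ticker': link.get_text(strip=True),
        'product_link': 'https://www.ishares.com' + link['href'],
    })

# Dump the links so the later steps can pick them up
with open('screener_links.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['ticker', 'product_link'])
    writer.writeheader()
    writer.writerows(rows)

print(f"Found {len(rows)} product links")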

Scraping the ishares screener — A meagre 1,200 products/share classes… I’m sure we can expand on all those fields.

So what do we do now? Well… we just ITERATE through the 1,200 items and scrape the associated ETF product page (IF THE ETF IS FLAGGED FOR FULL PHYSICAL REPLICATION AND LISTED IN IRELAND -> we grab the holdings).

Now… if I was really considering a career in leeching, I would JUST PROBABLY keep all of this quiet, JUST PROBABLY set meself up a scheduled cron job or systemd service to scrape all that stuff every day (at end-of-day market close, with some checks on changes in index membership), and JUST PROBABLY run a “TOR IP rotation or a proxy chain” in between each “pull” to anonymise my scraping. Me? I’m just here to fetch data for my mate Charlie.

Unlike the screener, where the HTML is a bit of a mess (because of the Javascript), the product page is… Heavenly structured. I mean, have a look at that beauty.

Best thing? ALL OF THIS IS NEATLY REFERENCED IN THE CODE! Thanks guys.
Brilliant, quite simply amazing, love it, terrific code.

So we need to amend our data schema a little and scrape a few additional bits; we also want to make sure the ETF we select bears the “PHYSICAL REPLICATION FLAG”, so we can limit nasty derivatives and ensure our holdings data match the benchmark as closely as possible. Here are the fields we’re after (a sketch of the product-page scrape follows right after this schema):

net_asset_share_class
net_asset_fund
share_class_launch_date
fund_launch_date
share_class_currency
base_currency
asset_class
ter
use_of_income
domicile
product_structure
rebalance_frequency
methodology
ucits
fund_manager
fund_house
custodian
bloomberg_ticker
isin
benchmark_index
benchmark_ticker
product_link
holdings_link

# @Blackrock - Thx for the data schema, will put it to good use.
# Huh... If you guys could let me know why there are index funds in the ETFs
# screener I'd be keen to know alright. It messed up all me HTML tagging...
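By the way, here is a hedged sketch of how one of those product pages might be scraped for its holdings link. The ‘fileType=csv’ marker I match on is an assumption from eyeballing the download URLs, not a documented API; check the real markup before trusting it.

# Hedged sketch: pull the holdings download link out of one product page
# The 'fileType=csv' test below is an assumption, not a documented API
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
product_url = ('https://www.ishares.com/uk/professional/en/products/'
               '280503/ishares-sp-500-energy-sector-ucits-etf')

page = requests.get(product_url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

holdings_link = None
for a in soup.find_all('a', href=True):
    if 'fileType=csv' in a['href']: # assumed marker of the holdings file
        holdings_link = 'https://www.ishares.com' + a['href']
        break

print(holdings_link)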

4. Snatching the loot

Do you guys need a break from the Soup? Yes? You… huh… wanna see some code? So how about we get the loot and snatch all of them SHINY TICKERS TAGGED WIV’ BENCHMARK MEMBERSHIP… Oh yeah! First grab this -> csv file (we did all that scraping for the data inside). Then run the snippet below…

Dum dee dum dee dum — That’s a GIF; okay, it’s slow, but it shows how gentle we are with the server
#1. Modules############################################################
import polars as pl # <- polars dataframes, like pandas but cooler (i.e. icy)
import requests # <- interacting with web stuff
import os # <- interacting with directories
import time # <- be gentle with the server
import random # <- random sleeping time (never do that with your own sleep)
#######################################################################

#2. Setting up folders...
current_directory = os.getcwd()
subdirectory_name = "holdings_download"
subdirectory_path = os.path.join(current_directory, subdirectory_name)
os.makedirs(subdirectory_path, exist_ok=True)

#3. Import the csv into a polars dataframe
df = pl.read_csv('df_scrap.csv')

#4. Labels flagging physically replicated ETFs (polars' is_in wants a list)
to_include = ['Replicated', 'Physical Replication']

#5. Filtering the data frame
# (the scrape wrote missing benchmarks as the literal string 'null')
df_filtered = df.filter((pl.col('domicile') == 'Ireland') &
                        (pl.col('product_structure') == 'Physical') &
                        (pl.col('benchmark_index') != 'null') &
                        (pl.col('methodology').is_in(to_include))
                        )

#6. Pretending we're a web browser...
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"}

#7. Passing holdings_link and file_name to a list of dictionaries
filtered_list = df_filtered.select(['holdings_link', 'file_name']).to_dicts()

#8. Snatching the loot...
for dictionary in filtered_list:

    response = requests.get(dictionary['holdings_link'], headers=headers)
    file_path = os.path.join(subdirectory_path, dictionary['file_name'])

    if response.status_code == 200:
        with open(file_path, 'wb') as file:
            file.write(response.content)
        print(f"File downloaded successfully: {file_path}")
    else:
        print(f"Failed to download file to: {file_path}")

    sleep_duration = random.randint(5, 10)
    print("Sleeping for", sleep_duration, "seconds...")
    time.sleep(sleep_duration)

Haaaaa… All of them MSCI and S&P indexes? You get them holdings for FREE. I need to let Charlie know: “CHARLIE, I GOT US THE GEAR!” I just… need to process the loot now. Huh… the usual stuff: programmatically extract data from the spreadsheets and pump it all into MySQL before making some API calls with AlphaVantage (<- free) to get stock price data. Maybe that could be my next HISSING LIKE A PYTHON article? What do you say? Can you think of a proper SQL schema for me database?
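If you fancy a head start on that processing step, here is a rough sketch, assuming the downloaded holdings CSVs look like the one we peeked at earlier: a few metadata lines on top, then a header row with columns along the lines of “Ticker”, “Name”, “Weight (%)” and “Asset Class”. The exact column names, file name and skip_rows value are my guesses; inspect one file before running this.

# Rough sketch of processing one downloaded holdings file with polars
# Column names, file name and skip_rows are guesses; check a real file first
import polars as pl

holdings = pl.read_csv(
    'holdings_download/IUES_holdings.csv', # hypothetical file name
    skip_rows=2, # metadata lines before the real header (assumption)
)

# Keep only the equity rows, dropping cash and other non-equity bits
# via the handy "Asset Class" field mentioned earlier
equities = holdings.filter(pl.col('Asset Class') == 'Equity')
print(equities.select(['Ticker', 'Name', 'Weight (%)']))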
