Concealed APIs — Scrape fast, scrape easy, scrape well: Morningstar and Vanguard Examples.

Franklin Schram
11 min read · Jul 28, 2023


When I started the ‘Penniless Investor’ series, I wanted to showcase (among other things) how it is possible to find free, high-quality financial data online to make informed investment choices. This quest has reached a new milestone with a technique that will quite simply leave you “overflowing” with data. From fund ratings and benchmark holdings to tickers and performance data — I will show you a fast and easy way to get going and populate your own tools with quality stuff. This article is part of a series -> Start Here! <-

I wonder what the Goblin just said to the guy with the suit… I mean the bloke looks clearly annoyed innit?

In this article we will teach you how to:

  • Use your web browser to locate and sniff out concealed APIs
  • Put this into practice by scraping the holdings of the Russell 3000 Index via a Vanguard ETF.
  • See another example by scraping Morningstar, sniffing out their extensive network of concealed APIs.
  • Fool the Morningstar API with its own bearer token mechanism, using the Playwright library to get access to additional functionality.
  • Use Morningstar’s own Python library with a “custom made token generator”, granting us access equivalent to a “Direct” subscription.

Want to see some code? Follow the article with the Colab notebooks as companions.

-> Vanguard code
-> Morningstar code

In a previous article, I illustrated how modern data exchange protocols employ JSON (JavaScript Object Notation) and APIs (Application Programming Interfaces) to make data available to an end user over the internet. To showcase the principles of both JSON and APIs we used the polygon.io API to get random tickers; I have made another example available using the Nasdaq Data Link API and the associated Python library to get corporate index spreads.

In the above, APIs are used as an “external service”, i.e. to provide data to a fee- or non-fee-paying end user. APIs are also used as part of an “internal” data pipeline to populate content on webpages — these APIs tend to be “hidden” from the end user, whose concerns often rest with visiting a website, browsing information and then leaving the page.

Let’s say I visit the Vanguard ETF webpage because I am looking for… an ETF. How does the website populate all of this data? Well… WITH CONCEALED API CALLS. By understanding and parsing the requests made by your browser to a server, it is sometimes possible to “sniff out” the API framework and use it in your own data pipelines.

Why is this important? In yet another post, I went through the cheeky task of scraping ishares.com (in a rather brutish fashion…) by parsing and lifting rendered HTML code. A quick glimpse at the associated Jupyter notebook will probably leave you… perplexed about the usefulness of that code. What if I told you there is a faster way? Note: we will use Chrome as our web browser.

1. Understanding how to use your web browser to sniff out APIs

Let’s pick another unwilling volunteer: Vanguard — I have been quite a big fan of the way the data is structured on their website, which implies that the API powering all those fields in the background has the capability to pump out lots of good stuff. Let’s say you’re a wealth advisor and one of your clients hates pooled vehicles (ETFs, funds, you name it; I had the case back in my BofA days…). The client wants you to pick stocks based on a set of “qualitative values”. How are you going to start? Use the Bloomberg stock screener? Do you have Bloomberg? I DON’T. How about we start with a LARGE index like… the Russell 3000 and have a look at what’s in there as a basis for discussion. Let’s head to the Vanguard ETF screener page.

The Vanguard ETF screener — I like it, BUT I DON’T LIKE ALL THAT UGLY JAVASCRIPT GUI…

Let’s do a first test — right click any white space on the page and select “Inspect”.

In the newly opened panel:

  • Select the Network tab
  • Tick “Disable cache” and select the Fetch/XHR filter
  • Reload the page
  • In the pane called Name (which lists all the requests sent to the server), select “funddetail”
  • Finally, click on the Headers tab (you could also view the JSON response by clicking the Response tab)

This is it! There is a video as well, for those who like to see sequences of pictures at 30 frames per second rather than staring at a single picture for 30 seconds.

So what is important here?

We can see which URLs the browser connected to on various servers to “grab components” in order to construct the webpage. In this particular case we isolated the “Fetch” requests (those that go and grab data from elsewhere) and looked at where they point. Of particular importance:

  • Request headers — These are sent by your web browser during GET or POST queries to fetch the data you see on your web page. We can use headers to “forge an identity” when connecting to the API (understand: pretending we’re something we’re not — like… a nice guy who’s got the right to get all of this data)
  • Payload — Especially useful in the case of POST requests (like the OAuth authentication mechanisms used by trading platforms — i.e. Coinbase, Alpaca Markets and others)
  • Response object — Useful to see which request holds the data we need.

Try it for yourself — click this link to one of Vanguard’s APIs, which will take you to the actual JSON object that the browser uses to populate different elements of the webpage.

The JSON object returned — Everything in the ETF screener is in there. Why do we care? Because Vanguard uses internal product codes (like most fund providers) to point their product pages to the right resources (like… holdings).

So… with a Python dictionary we can actually use all that mumbo jumbo — I have set up a Colab notebook → set yourself up on Colab and try it for yourself, this is no obscure wizardry!
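Here is a minimal sketch of that first call, assuming the “funddetail” endpoint we spotted in the Network tab still lives at the URL below (if Vanguard moves it, just copy the fresh one from your own browser session):

import requests

# Endpoint spotted under "funddetail" in the Network tab (assumption:
# lift the exact URL from your own browser session if it has moved)
URL = "https://investor.vanguard.com/investment-products/list/funddetail"

# A browser-like User-Agent; some servers reject the default python-requests one
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(URL, headers=headers)
response.raise_for_status()

# The whole screener payload becomes a plain Python dictionary
fund_data = response.json()
print(fund_data["size"])  # number of ETFs in the response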

2. Fetching product IDs and gathering holdings.

Using the requests library, we dump all of this remote Vanguard JSON into a Python data structure (a dictionary of nested dictionaries).

  • When parsing the dictionary, the first key is called size and it returns the number of objects in the JSON response (i.e. 380 ETFs in this case, since we haven’t filtered any of them initially)
  • The second key is called self — it holds the actual API endpoint
  • The third key is called data and contains a list of dictionaries.

HAVE A LOOK AT THAT BEAUTY! Vanguard, top quality work guys, love it. Blackrock’s JSON is all about ALADDIN.

With that data in hand we can:

  • Navigate to the data key of the JSON response.
  • Flatten all of it into a list of dictionaries.
  • Pass that list to a dataframe (pandas or polars) to make the data easy to filter and use.
  • Programmatically extract Vanguard’s product codes (I guarantee you will need these!).
  • Programmatically load Vanguard’s product pages.
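A sketch of those steps, assuming the fund_data dictionary from the previous snippet and pandas as the dataframe library (the exact field names, like ticker and productId, are assumptions to be checked against your own response or the Colab notebook):

import pandas as pd

# fund_data is the dictionary fetched earlier; "data" holds one dict per ETF
etf_records = fund_data["data"]

# json_normalize flattens any nested dictionaries into columns
etf_df = pd.json_normalize(etf_records)

# Field names are assumptions; check the actual response for the exact keys
tickers = etf_df["ticker"].tolist()
product_ids = etf_df["productId"].tolist()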

Let’s work through the data we received from Vanguard and isolate a few fields in the well-structured and extensive response they’ve given us. I won’t show you their cool data structure here — just look at the Colab notebook to see how we can filter and isolate tickers and product IDs. Let’s head to the product page, which follows the logic below:

https://investor.vanguard.com/investment-products/etfs/profile/<ticker>

# The Russell 3000 ticker for the Vanguard ETF is VTHR
# Don't get fooled by the institutional product, which points to a fund.
# So based on the above logic we get:

https://investor.vanguard.com/investment-products/etfs/profile/VTHR

And after a little scrolling what do we see?

Vanguard likes to give you holdings, but they don’t really expect you to browse 3000 pages to check them out.

ARGG Man… How are we going to do this? We just rinse and repeat.

See!? I told you we needed to scrape the product code — it gets sent to the API, which in exchange kindly returns all of the holdings in that 3000-page table…
2982 stocks in the response — I guess we’re missing 18, but at this stage I’m not gonna complain. Looks like they give it end of month — I’m not gonna fiddle with that, okay?
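To replicate the call, the shape is the same as before: GET a JSON endpoint, this time parameterised with the product code. The URL and product code below are placeholders; lift the real ones from the Fetch/XHR list on the VTHR product page and from the screener payload respectively:

import requests

# Placeholder values: take the real endpoint from the Network tab on the
# product page and the real product code from the screener payload
product_id = "3345"  # hypothetical Vanguard product code
holdings_url = f"https://investor.vanguard.com/some-holdings-endpoint/{product_id}"

response = requests.get(holdings_url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()

# The useful part of the response is the list of holding dictionaries
# following the schema shown below
holdings = response.json()
print(len(holdings))  # ~2982 stocks for this Russell 3000 ETF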

So what does the data structure look like? Well… it is a list of dictionaries which follows the below schema:

{'type': 'portfolioHolding',
'asOfDate': '2023-06-30T00:00:00-04:00',
'longName': 'United Bankshares Inc./WV',
'shortName': 'UNITED BANKSHS',
'sharesHeld': '8303',
'marketValue': '246350',
'ticker': 'UBSI',
'isin': 'US9099071071',
'percentWeight': '0.01',
'notionalValue': '0',
'secMainType': '',
'secSubType': '',
'holdingType': '',
'cusip': '909907107',
'sedol': '2905794'
}

Pretty cool stuff — neatly organised and ready to be pumped into your quant or qualitative pipelines. Kudos to the boyz and gurlz in Vanguard’s data teams for the great data structures.
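One practical note before moving on: in the schema above the numeric fields (sharesHeld, marketValue, percentWeight…) arrive as strings, so a couple of casts make the list pipeline-ready. A quick sketch with pandas:

import pandas as pd

# holdings is the list of dictionaries shown above
holdings_df = pd.DataFrame(holdings)

# Numeric fields arrive as strings; cast them so they sort and filter properly
for col in ["sharesHeld", "marketValue", "percentWeight", "notionalValue"]:
    holdings_df[col] = pd.to_numeric(holdings_df[col], errors="coerce")

# e.g. the ten largest weights in the index
print(holdings_df.sort_values("percentWeight", ascending=False)
      .head(10)[["ticker", "longName", "percentWeight"]])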

3. Saying good morning to Morningstar — Case study of a more complex, fully customizable hidden API call.

And so… as I was looking at the Morningstar website (listening to Funky Stars — a 1996 cracktro) I told myself: Franklin, there is lots of good stuff on there, I wish I could get some of that on my own server for huh… for fun. I got the scraping kit out and applied the now tried-and-tested methodology of:

  • Landing on the ETF screener page
  • Scraping data from the screener
  • Isolating product codes
  • Navigating to product pages

The Morningstar ETF Screener — it looks… austere. Don’t be fooled, that stuff is crazy good (behind the ugly GUI, that is)

And so, as I was playing with the network section of the browser trying to isolate Fetch requests, I ended up on this little bad boy.

Load the screener page -> Inspect -> Network tab -> reload the page -> select the Fetch request right there.

Something was DIFFERENT, something was better, something was AMAZING.

Security data points? This… can’t be real!

See those fields like SecId | Name, separated by | ? These are ACTUAL data items INSIDE the Morningstar Direct database (which you can query via their paid portal or their Windows desktop application) — you know… the one you need to pay for? Which can only mean one thing:

  • The website is linked in some way to the actual database they usually charge you for (if you can believe that).
  • By crafting a query a certain way AND knowing the fields, we can, in theory, get access to all data displayed on the website and use it in our own data pipelines AS WE SEE FIT. Understand: WE CAN SPECIFY WHAT WE WANT TO GET FROM THE MORNINGSTAR API.

Where do you get the fields? Well… In the developer documentation of course! And the Python package documentation… So HANDY.
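To make this concrete, here is a hedged sketch of what such a crafted query can look like. Everything below — the endpoint (including its key segment), the parameter names and the field list — should be copied or adapted from the request you intercepted in the Network tab and from the documentation; treat it as a template, not gospel:

import requests

# Placeholder endpoint: copy the real one (key segment included) straight
# from the intercepted Fetch request; this one will not work as-is
screener_url = "https://tools.morningstar.co.uk/api/rest.svc/<your-key>/security/screener"

params = {
    "outputType": "json",
    "page": 1,
    "pageSize": 50,
    # The pipe-separated data points spotted in the intercepted request;
    # swap in any fields from the developer documentation
    "securityDataPoints": "SecId|Name|PriceCurrency|LegalName",
}

response = requests.get(screener_url, params=params,
                        headers={"User-Agent": "Mozilla/5.0"})
print(response.json())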

4. Morningstar’s “Free bearer token”

At this stage I need to show you something else. How can we lift all that content from the network tab in Chrome? You know, the stuff with the headers, the payload, etc… Well, we can make a cURL export. cURL is a command-line utility (a staple on Linux) that pretty much lets you do anything web-related from the CLI. So we will:

  • Generate a cURL command in Chrome
  • Paste it into httpie
  • Format it into a Python requests snippet with httpie
  • Dump that into the Colab notebook.

Navigate to the Fetch request, right click on it and copy it as a cURL (bash) command. Then head to httpie.
Once on httpie.io, paste it in the bar.
httpie imports the Fetch request, formats it and gives you the option to export it as Python code, which you can dump into your Python script.

Why is this important? BECAUSE YOU NEED SOMETHING SPECIFIC IN THE HEADERS IF YOU WANT TO GET THE MORNINGSTAR DATA FOR FREE. If you try passing a query by just dumping the URL, you get the below:

WHAT DO YOU MEAN UNAUTHORIZED?

Quite naturally, most APIs expect some sort of authentication; in the case of Morningstar this is called a “bearer token”, which is generated automatically and placed in the headers of your web browser to allow you to get “non-premium data” when you visit their website. What does a bearer token look like? Well, see below.

BEARER TOKEN — You’ll need that to get data programmatically using their APIs.
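With the token in hand, the fix for that 401 is simply to send it back in the Authorization header. A minimal sketch (the URL is a stand-in for whichever www.us-api.morningstar.com endpoint you intercepted):

import requests

# Paste the token lifted from your browser (or generated with Playwright,
# as shown in the next section)
bearer_token = "eyJhbGciOi..."  # truncated placeholder

headers = {
    "Authorization": f"Bearer {bearer_token}",
    "User-Agent": "Mozilla/5.0",
}

# Stand-in URL: use the actual endpoint you intercepted in the Network tab
url = "https://www.us-api.morningstar.com/your-intercepted-endpoint"

response = requests.get(url, headers=headers)
print(response.status_code)  # 200 instead of 401 once the token is in place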

Now… I know what you’re thinking: FRANKLIN, ALL THAT BROWSER CLICKING IS WEARING ME DOWN. Okay, okay… just chill, I have another way to do this with no clicking, so we can just huh… AUTOMATE ALL OF IT.

5. Meet Playwright

What is Playwright? It is an open-source library developed by Microsoft that allows you to ‘test web apps’ (understand: a website and its associated functionality). Alongside other tools like Selenium, Playwright is called a ‘web driver’: it offers a way to command a web browser programmatically. Think of it as a robot sitting at your desk and clicking buttons on a webpage instead of you (while you work from home trying to dodge your computer’s log-off timer). While you could talk to the robot and say: “Hey robot bro, can you click those 3000 pages while I go off to the gym?”, the best way is to command it with code. This is what web drivers allow you to do: open a “headless browser” (understand: there is no window to look at, per se) that can programmatically interact with anything that sits on the web.

So… if we want to scrape the “non-premium” data from Morningstar, we need a way to generate a token automatically. We can do this by instructing a headless browser to follow the steps we’ve outlined AND return the “Bearer token” to us programmatically. This is where Playwright comes in: we can now create our own “Morningstar token generator”.

# Importing playwright (make sure to follow the installs in the Colab notebook)
from playwright.sync_api import sync_playwright

# Function to isolate Fetch requests to the US API and dump them in a list of
# dictionaries
def intercept_requests(request, intercepted_requests):
    if "www.us-api.morningstar.com" in request.url:
        request_headers = request.headers
        intercepted_requests.append({"url": request.url, "headers": request_headers})

# Connect playwright to the target
def get_matching_requests():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # List to store the intercepted requests with their URLs and headers
        intercepted_requests = []

        # Intercept requests and store them in the list if they match the specified API
        page.on("request", lambda request: intercept_requests(request, intercepted_requests))

        # Navigate to the URL
        page.goto("https://www.morningstar.co.uk/uk/funds/snapshot/snapshot.aspx?id=F0GBR04F90")

        # Wait for the page to load (you can adjust the wait time as needed)
        page.wait_for_load_state('load')

        # Close the browser
        browser.close()

        return intercepted_requests

# Launch the function
matching_requests = get_matching_requests()

# We isolate the first instance in the list [0] in case we have more than one.
# By default the value associated with the Authorization key is "Bearer <key>"
print(matching_requests[0]['headers']['authorization'])

# Note: to use the token with the Python library you need to strip the
# "Bearer " prefix, i.e. use the line below instead
# print(matching_requests[0]['headers']['authorization'][7:])
Oh YEAH!

6. Taking it further

As I was finishing dumping my first “Free Bearer Token” I thought… “WAIT! WOULD THIS WORK WITH MORNINGSTAR’S OWN PYTHON LIBRARY?” Morningstar wants you to get in touch with them to get a key if you want to use their Python library, you know… they want to sell you a key. Well…

Okay this is what they want us to do…
Hmm… I am definitely getting something
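For the curious, here is the glue I mean, as a sketch: it assumes Morningstar’s morningstar_data package reads its key from the MD_AUTH_TOKEN environment variable (check their Python package documentation), and it reuses the matching_requests list produced by our Playwright token generator above:

import os

# Strip the "Bearer " prefix (7 characters) and hand the raw token to the
# library via the environment variable it expects (assumption: MD_AUTH_TOKEN,
# per the package documentation)
os.environ["MD_AUTH_TOKEN"] = matching_requests[0]["headers"]["authorization"][7:]

# Import AFTER setting the variable so the library picks it up
import morningstar_data as md

# From here on, calls behave as if we had a "Direct" session, e.g.:
# md.direct.get_investment_data(investments=[...], data_points=[...])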

I have only scratched the surface with both the API and the Python library here. I think I need to investigate further what I can get from all this!

So how about you have a go yourself? If you need inspiration, I can point you to a great resource called VettaFi. These guys have done a tremendous job collating data on ETF issuers and ETF products, and they have a plethora of links that can bounce you off to great data sources; their website can serve as a good “headquarters” for your huh… “research needs”.
