Mathematica and Jeff Bezos

Jeel Shah
4 min read · Sep 17, 2021

In which we learn how to use Mathematica’s web extraction capabilities to download all of Jeff Bezos’s Letters to Shareholders.

I was reading “Working Backwards”, which often quotes Jeff Bezos’s shareholder letters, and I got to thinking: “why don’t I read all of his letters and see what I can learn?” So, like any good problem solver, I set out to find his Letters to Shareholders and, lo and behold, I found them.

Then, I heard George say in my head “Care to make this interesting?” Why not? The night is young. So, I opened Mathematica, found the page that lists all of the Shareholder Letters and got cracking.

Things you will need

  1. Mathematica (> 12.0)

The problem to solve was: how can I download all of the shareholder letters to a folder on my computer, ordered from oldest to newest?

I did some cursory exploration of the page and noticed that each of the links was an a tag with the classes .module_link and .module_link-shareholders. So naturally the next question was: “how might I extract these links and then programmatically download them?”

Enter Mathematica.

Mathematica allows you to create a web session in which you control a browser programmatically. This is especially useful when you are scraping webpages that generate their content with JavaScript. So, we want to:

  1. Start a web browser programmatically so we can manipulate it
  2. Extract a tags with classes .module_link .module_link-shareholders
  3. Extract the links from the href attributes of the a tags
  4. Download the letters to a folder

Spinning up a browser in Mathematica

url = "https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/default.aspx" (* (1) *)
session = StartWebSession["Firefox"] (* (2) *)
WebExecute[session, {"OpenPage" -> url}] (* (3) *)

Reading the code from top to bottom, we store the url (1), ask Mathematica to start a web browser using Firefox (2), and then load the url in that browser (3).

Extracting tags from an HTML page opened in a web session

allAs = WebExecute[session, "LocateElements" -> "CSSSelector" -> ".module_link.module_link-shareholders"]

WebExecute lets you find elements on a page which fit a pattern. There are several ways to locate elements on a page, for instance by HTML tag or with an XPath expression. What’s relevant for us is CSSSelector. This allows us to find elements with a CSS selector, such as a class or an id. In our case, we want to find all a tags which have both of the classes .module_link and .module_link-shareholders. The above code will produce a list of WebElementObject expressions, Mathematica’s way of representing an HTML element in the browser.
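
As a rough sketch of one of those alternatives, the same anchors could probably be located with an XPath expression instead of a CSS selector. It assumes the session from above is still open. The name allAsViaXPath is made up for illustration, and the count check simply reflects the 24 letters (1997 to 2020) that this article expects to find.

```wolfram
(* Assumes `session` is the web session started earlier *)
allAsViaXPath = WebExecute[session,
   "LocateElements" -> "XPath" ->
    "//a[contains(@class, 'module_link-shareholders')]"];

(* Sanity check: the article expects one link per year, 1997 through 2020 *)
Length[allAsViaXPath] == Length[Range[1997, 2020]]
```

If the comparison does not come back True, the page layout has likely changed and the selector needs another look before going further.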

Extracting attributes from a WebElementObject

Now, from these elements we’ll want to extract the links so we can download them. The Mathematica docs aren’t clear on how to do this. However, Murta (on Mathematica Stack Exchange) has written a wonderful piece of code which helps us out.

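(* Convenience form: fall back to the current web session *)
getAttribute[element_WebElementObject, attribute_String] :=
  getAttribute[$CurrentWebSession, element, attribute]

(* Note: WebUnit`Private`attribute lives in a Private context, so it is an undocumented internal that may change between versions *)
getAttribute[session_WebSessionObject, element_WebElementObject, attribute_String] :=
  With[{sessionInfo = session /@ {"SessionID", "Browser", "URL"}},
    WebUnit`Private`attribute[sessionInfo, element["ElementId"], attribute]]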

Now, we can use getAttribute to get the href and we’re nearly there.

allLinks = Reverse[getAttribute[#, "href"] & /@ allAs];
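
If relying on an internal function feels fragile, one possible alternative is to let the page collect the links itself through WebExecute’s "JavascriptExecute" command. This is only a sketch: it assumes the script’s return value comes back to Mathematica as a list of strings, which is worth checking on your version, and allLinksViaJS is a hypothetical name.

```wolfram
(* JavaScript run inside the page: gather the href of every matching anchor *)
js = "return Array.from(document.querySelectorAll('a.module_link.module_link-shareholders')).map(function (a) { return a.href; });";

(* Assumes `session` is still open and that the result arrives as a list of URL strings *)
allLinksViaJS = Reverse[WebExecute[session, "JavascriptExecute" -> js]]
```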

We reverse the list because the links on the page are in descending order (newest first) and we want them in ascending order.

Downloading files to a folder

Before we jump into downloading the letters, we have to do a little bit of housekeeping. We want the name of each downloaded letter to correspond to the year in which it was written. In other words, we want something like “1997-Letter to Shareholders.pdf” and so on until 2020.

namesAndLinks = 
Table[{StringJoin[ToString[n], "-Letter to Shareholders.pdf"],
allLinks[[n - 1996]]}, {n, 1997, 2020}]

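We have the links on one hand and what we want are the corresponding names. We loop over the years and pair each name, e.g. “1998-Letter to Shareholders.pdf”, with its link. Since allLinks is in ascending order, the index n - 1996 maps 1997 to the first link, 1998 to the second, and so on, and we’re all set.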
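
A slightly more defensive sketch derives the years from the number of links instead of hard-coding 1996. The sampleLinks list and its example.com URLs are placeholders standing in for allLinks, purely so that the snippet runs on its own.

```wolfram
(* Placeholder for allLinks: oldest letter first (hypothetical URLs) *)
sampleLinks = {
   "https://example.com/letter-a.pdf",
   "https://example.com/letter-b.pdf",
   "https://example.com/letter-c.pdf"};

startYear = 1997;
years = Range[startYear, startYear + Length[sampleLinks] - 1];

(* One file name per year, in the same order as the links *)
fileNames = (ToString[#] <> "-Letter to Shareholders.pdf") & /@ years;

(* Pair each name with its link: {{name1, link1}, {name2, link2}, ...} *)
namesAndLinks = Transpose[{fileNames, sampleLinks}]
```

With three placeholder links this yields pairs named for 1997, 1998 and 1999. Swapping sampleLinks for the real allLinks should give the same structure as the Table version.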

We can use URLDownloadSubmit to save a file to our computer given a link. We pass it the link and the file name, plus a HandlerFunctions option that says what should happen while the download is in progress or once it is done. You might be thinking: how can I change where the letters are downloaded to? Easy. We can call SetDirectory beforehand to set the download folder.

URLDownloadSubmit[#[[2]], #[[1]], 
HandlerFunctions -> <|"TaskFinished" -> Print|>] & /@ namesAndLinks
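
If you would rather not change the working directory, a variation is to build an absolute path for every file. In this sketch the folder name bezos-letters is an arbitrary choice of mine, and namesAndLinks is assumed to be the list of {file name, link} pairs built above.

```wolfram
(* Hypothetical target folder inside the home directory; change to taste *)
targetDir = FileNameJoin[{$HomeDirectory, "bezos-letters"}];
If[! DirectoryQ[targetDir], CreateDirectory[targetDir]];

(* Assumes namesAndLinks holds the {fileName, url} pairs from the previous step *)
tasks = URLDownloadSubmit[
     #[[2]],
     FileNameJoin[{targetDir, #[[1]]}],
     HandlerFunctions -> <|"TaskFinished" -> Print|>
     ] & /@ namesAndLinks;
```

Keeping the returned task objects in tasks makes it possible to inspect the downloads afterwards.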

And with that little snippet, we will have downloaded all of Bezos’s letters to shareholders, in a folder of our choosing, named so that they sort in ascending order. 🎉
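
One last bit of tidying, offered as a small sketch: once the links have been collected, the browser window is no longer needed, and the session can be closed with DeleteObject. It assumes session is the object returned by StartWebSession earlier.

```wolfram
(* Close the browser controlled by the web session started earlier *)
DeleteObject[session]
```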

Jeel Shah

Jeel is a product manager working on enterprise apps. Reach out via email: jeel.medium@fastmail.com