How Much is a Billion Dollars?

How to “put things in perspective” with NLP and probabilistic programming.

Jacopo Tagliabue

Published in

Analytics Vidhya

8 min readJul 24, 2019

Putting big numbers in perspective

“Billions and billions and billions.” — D. Trump

One standard day in Silicon Valley is filled with numbers dreams are made of: acquisitions, tech capacity, earnings — things seem to happen on a scale that is just too hard for us humans to understand.

This is not just a tech quirk, as all interventions with global ambition have an impact (in time/money/natural resources, etc.) that can be expressed only in gigantic units of measure: just to take a (in)famous example, the border wall between U.S. and Mexico costs a whopping 15 billion dollars.

Undoubtably, some among us are better suited to reason about big numbers:

[ the Italian reader hopefully didn’t forget the memorable Ing. Cane: “Mille!” ]

For those who aren’t so acquainted with billions, it is increasingly hard to put things in perspective: as a consequence, our ability to make meaningful comparisons and judgments is severely limited. In a “data-driven, online world”, it should be clear that “understanding numerical measurements” is not a purely academic problem. In the eloquent words of Riederer et al.:

Inadequate reasoning about magnitudes can have negative impacts on individuals’ reasoning about finances, medical care, sustainability, and the ability to differentiate between honest reporting and “fake news”.

For instance, Greece as a country has repaid 41 billion euros of its debt in four years (2015–2019), which seems a big pile of cash until you compare it to Softbank’s latest 100B fund [yes, it looks like a VC fund has enough money to be potentially a factor in the economy of a European country — the reader is invited to draw her own conclusions from this fact].

A smart, funny and delightful paper by Chaganty and Liang proposes to exploit language compositionality to automatically “put things in perspective”.

Their NLP system works by composing paraphrases: when you say that “Cristiano Ronaldo was acquired for $131 million”, what you’re really saying is that buying Ronaldo is the equivalent of employing everyone in Texas over a lunch period; when you say that “water is flowing into lake XYZ at a rate of 150 cubic meters per second”, what you mean is that XYZ every second gets the same amount of water that would flow from a tap left on for a week.

In this brief post, we are going to tackle the challenge by leveraging powerful ideas on generative models and probabilistic programs; in particular, we will write code to:

build a small but effective base of “atomic” facts leveraging DBPedia;
build a “perspective generator”, that will translate an expression from humongous (e.g. “billion of dollars”) scale to everyday life (e.g. “X times the median income of U.S.”) scale.

We will go together from zero to a working app in less than 10 minutes (or, as we may say, the time it takes to cook Barilla Spaghetti al dente):

The code for this tutorial is freely available on GitHub, with some interactive code available here.

Getting our facts straight

“…and a fact is the most stubborn thing in the world” — M. Bulgakov

While Chaganty and Liang manually (and carefully) constructed their knowledge base from the United States Bureau of Statistics, we are going to take the lazy path and automate as much as we can to bootstrap our project: the bulk of our knowledge will come from the awesome free project known as DBPedia, which is basically a structured, machine-friendly version of Wikipedia.

In a few lines of Python we can parse DBPedia for numerical properties and use the triplestore to get an idea of the most frequent ones. As an illustration of the kind of facts we collected, some properties (with units) and examples are listed below:

runtime (seconds) -> Jurassic Park = 7620.0populationDensity (person/km^2) -> Durham (England) = 257tuition ($) -> University of Toronto School = 20875.0 careerPrizeMoney ($) -> Justine Henin = 2.0863335E7areaTotal (m^2) -> Normandin (Quebec) = 2.162E8

[ Our property selection is obviously just a stub: the non-lazy reader is encouraged to modify the pre-processing script to get even more interesting facts. ]

Once the knowledge base is completed and stored in a friendly format, it is time for our probabilistic magic: how much is indeed one billion?

Building the A.I. language generator

“Language is the source of (mis)understandings” — A. de Saint-Exupéry

Recall that the final goal of our model is to explain a target numerical expression using “more basic” facts to help us “put things in perspective”.

For example, it is interesting to know that a world-class athlete like Justin Henin would need almost 50 careers to reach 1 billion dollars (billionaires are indeed not like the rest of us!):

1 billion dollars ~= 47 careers of Justin Henin

or that:

Harvard Endowment ~= 40 times Vanuatu's GDP

How can we leverage our DBPedia knowledge base to automatically generate these “translations”?

We will solve this problem exploiting the compositionality of language, i.e. the fact that we can build complex expressions out of simpler ones to express the same quantity in different ways (the faithful reader may recall our musings about language and grammars from previous posts). Our probabilistic model needs therefore two components:

a probabilistic grammar, which generates candidate expressions;
a goodness-of-fit measure, to rank expressions based on how good they are in approximating the target value.

With the help of our beloved WebPPL, a sample script implementing the model may look like the following (please see the full repo for more details and a bigger knowledge base):

Some nerdy comments for the non-lazy readers are in order (you can safely skip to the results below!):

the language model is inherently biased towards simplicity, so that simpler expressions are, a priori, more likely to be generated; for priors, we use PageRank over entities in DBPedia as our guide and prefer smaller multipliers to bigger ones;
inference is performed by enumeration: to simplify algebraic manipulations, units of measure are not entirely arbitrary (e.g. you can’t get seconds X seconds);
the distance between candidate and target values is modeled as a normal distribution (exercise for the non-lazy reader: parametrize sigmadepending on target);
we report the top K candidates to get a more diverse sample of expressions.

So, what happens when you run the model? How much is indeed one billion? Some sample results are rephrased and presented below; changing the target changes the candidates, obviously, and highlights the model capabilities of combining units properly:

1 billion dollars = 50 times what Derek Money would make working for the entire history of Norway.1000000 dollars = paying the tuition for Christian Brothers College High School for the entire history of Sri Lanka.10 dollars = what Steve Penny would make working for half an episode of "How I Met Your Mother".10M people = total people in Italy if Italy had the same population density (inhabitants/km2) of United States.

Finally, since this model is too awesome to be kept just inside GitHub, we used our lambda-based architecture for WebPPL models in the cloud to share and visualize our efforts: in minutes, we have a fully functional web app to interact with our knowledge base through our probabilistic model:

WebPPL model in a client-server settings (video cut for display purposes, original here).

What’s next?

“If you don’t know where you are going, you’ll end up someplace else” — Y. Berra

We promised to get you from-zero-to-web-app in the time it takes to prepare spaghetti al dente (a.k.a. ~10 minutes) — we did it, but this also means we just scratched the surface of “perspective generation”. Some pretty obvious improvements come to mind immediately, such as:

leverage DBPedia better: on the one hand, our retrieval of properties and facts is just a simple stub — a lot more can be done to design a better experience for the user; on the other, if you feel adventurous, there is a lot of room to get more from DBPedia than what is explicitly encoded: properties such as gdp may be present but incomplete in crucial respects: a full NLP/ontology pipeline can be built to drastically improve the pool of available facts;
improve the Bayesian machinery: using PageRank was a convenient and not-entirely-arbitrary choice to perform cut-off and establish some priors; partially because of our laziness, results are far from optimal though; on a different note, we could take the hint from Chaganty and Liang and introduce some “pragmatic principles” when it comes to rank candidate expressions: cost of an employee × the population of Texas × the time taken for lunch is more informative then cost of property in the Bay Area × area of a city block. It would be lovely if the system could handle these nuances as well! Moreover, there are other interesting studies from the psychological literature that could be examined for more inspiring design principles, such as this and this;
make it more entertaining: there is a lot of cool visualizations that may effectively represent the relative scale of two numbers; to build a funnier experience for the user, an easy extension would be to sample a fact from the knowledge base instead of just specifying a target number: how much is Tyra Banks worth? Finally, deduplication can (and should) be performed: since Italy population X US density is the same as US density X Italy population, it would make sense to collapse the two strings into one candidate expression.

Last but not least, this playful post is indeed the tip of a giant iceberg: the way in which humans process numerical information is key to their decision processes.

While improving numerical literacy has been a long-standing challenge, relatively little has been achieved. As noted by Barrio et al., the problem is so pervasive that the New York Times editor issued a statement calling for writers to “put large numbers in context”. Since it is well established that perspectives greatly improve people inferential abilities and overcome their biases, A.I.-based tools are the only hope we have to keep up with the amount of information produced today. Moreover, in the age of Big Data and automated insights (e.g. Google AutoML and the like), it is of crucial importance that the results of A.I. data analysis can be understood by humans in human-friendly terms.

If the “UX” of data consumption is as important as the accuracy of the prediction itself, being able to “put things in perspective” — as we did here — adds tremendous value to any decision chain involving numerical information and humans.

See you, space cowboys

If you like our approach, please share your feedback/question/A.I. story with jacopo.tagliabue@tooso.ai.

Don’t forget to get the latest from Tooso on Medium, Linkedin, Twitter and Instagram.

Acknowledgments

While very different for implementation details and ambition, this exploratory work was heavily inspired by How Much is 131 Million Dollars?. Thanks to Andrea Polonioli for very helpful comments on a previous draft of this work.