The precise ambiguity of talking about numbers

Žygimantas Medelis
Published in Accelerated Text
Jul 7, 2020


We talk numbers a lot. Numerical information is used to describe facts, identify opportunities, measure performance, and forecast the future. As with all things language, we barely pay attention to the feats we pull off when doing so.

Photo by Markus Spiske on Unsplash

Often, when we communicate a numerical fact, we do not report a precise number. We say things like “almost all” instead of 98.3%, “over a hundred” instead of 103, “under an hour” instead of 56 minutes. What is striking is that we deploy this imprecision with remarkable accuracy.

My goal here is to unpack this unnoticed complexity in the way we communicate with numbers, and to show what it takes for a natural language generation system, such as the Accelerated Text open-source data-to-text platform we are building, to express numeric information adequately.

When we talk numbers, we generalize and prune out the information that is not relevant. We make numbers seemingly ambiguous, and by introducing this numeric ambiguity we signal to our readers what is important. Take two examples:

  1. The PAP is a slick political machine, its share of the popular vote has never dipped below 60%. Last time around, in 2015, it won almost 70% of the popular vote.
  2. Florida reported 122,960 Covid-19 cases on Friday, up 7.8% from a day earlier, compared with an average increase of 4.1% in the previous seven days.

In the first example, the reporter is rightly signaling that it is irrelevant whether the PAP won 69.5% or 67.14% of the votes. In this context, it truly does not matter. Had the reporter used the precise number, we would think it important and try to figure out why, shifting our attention from the vital fact, that the PAP is a slick political machine consistently winning elections by a wide margin, to some irrelevant factoid. The second example uses precise numbers because there every number counts. The pandemic is followed closely, and there is a big difference between “almost 8%” and “7.8%”. However, news articles would rarely use higher-precision numbers like 7.822%; those belong in an article published in a scientific journal.

Thus we deploy this numeric vagueness with high precision and context sensitivity.

Let’s unpack what happens when we say “almost 70%” while the actual number coming from the vote count is 67.9%. Successful communication has to inform, not mislead or confuse. Paul Grice formulated a series of maxims defining what it takes for our utterances to be meaningful. One of the maxims says that we should provide only as much information as is needed; any excess information risks our message not being understood as intended.

Therefore, when we say “almost 70%,” we are following a Gricean maxim. The precise percentage of votes won is information we do not need. Saying 70%, with “almost” added for good measure, does not depart from the truth. It could even be argued that “almost” is not needed, because round percentages, be it 10% or 70%, are perceived as approximate anyway. In our particular example of landslide election results, we can apply even coarser approximations, talking in thirds and halves, which allows us to truthfully state that over two-thirds of the electorate voted for the PAP.

Generalization is another tool we employ to make ourselves as informative as possible. It helps us group facts into representative clusters: instead of dealing with individual data points, we deal with a few distinct groups of them. This way, someone at the receiving end of our message does not have to do the labor of analyzing and grouping the provided information. Thus, when reporting on the weather outside, I could say “hot” no matter the actual temperature, as long as it is within the fuzzy bounds of roughly 29°C to 35°C. The same goes for the election results example. It does not matter what the precise number in the vicinity of 70% is; what matters is that a result of over two-thirds of the vote represents a landslide victory.
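To make the idea concrete, here is a minimal Python sketch of such a bucketing step. The temperature bounds and category names are illustrative assumptions, not part of any particular system.

```python
def categorize_temperature(celsius: float) -> str:
    """Map a precise temperature reading to a coarse verbal category."""
    # Bucket boundaries are illustrative assumptions; a real system
    # would tune them per domain and per audience.
    if celsius >= 36:
        return "scorching"
    if celsius >= 29:  # the fuzzy "hot" band from the example above
        return "hot"
    if celsius >= 20:
        return "warm"
    if celsius >= 10:
        return "cool"
    return "cold"

print(categorize_temperature(31.4))  # -> "hot", same as for 29.3 or 34.8
```

The reader gets the cluster label directly and is spared the work of deciding whether 31.4°C counts as hot.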

Since we have the numbers, it might seem that we would be in a better position to provide exact values and let the other party use them as they see fit. But that is not how we convey numerical information, at least not if we do not seek to confuse and misinform the other party. The two elements discussed above, the Gricean “no less, no more” rule and the generalization principle, allow us to convey numerical information by trading precision for increased clarity of our message. We also do not force extra labor on our audience by making them figure out which cluster of relevant values a given number belongs to.

When constructing such numeric messages, one does a couple of things. First, given the actual number, a data point, one chooses which value to provide in the message: a conversion from an actual value to a given value has to be performed. The crux of the decision here is the granularity, or scale, of generalization. Are we counting in increments of 10, thus ending up with a given value of 70%? Or is generalization at the level of thirds good enough, so that we can go with two-thirds? Once this generalization step is done, one has to hedge that number. Hedges are words describing the relation between the actual and the given value: our “under,” “over,” “around” words.
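These two steps are easy to sketch in code. Here is a minimal Python illustration; the increment-of-10 scale, the tolerance, and the particular hedge words are assumptions chosen for the example.

```python
def snap_to_scale(actual: float, step: float = 10.0) -> float:
    """Step 1: generalize the actual value to a given value on the scale."""
    return round(actual / step) * step

def hedge(actual: float, given: float, tolerance: float = 0.5) -> str:
    """Step 2: pick a word describing how the actual value relates to the given one."""
    if abs(actual - given) <= tolerance:
        return "exactly"
    return "almost" if actual < given else "over"

actual = 67.9
given = snap_to_scale(actual)                  # -> 70.0
print(f"{hedge(actual, given)} {given:.0f}%")  # -> "almost 70%"
```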

In addition to those two steps, we sometimes apply a third maneuver: converting some numbers to their non-numeric expressions, the favorite words “third,” “half,” “all,” “dozen.” We perform all these steps effortlessly, embedding numbers into our message with a high sensitivity to the context and the desired meaning. When this process of producing numeric approximations needs to be automated in a natural language generation system, what comes effortlessly to us proves to be a task of high complexity for the machine.
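The favorite-word conversion amounts to a lookup. A minimal sketch, with an assumed (and far from complete) table of favorite numbers:

```python
from typing import Optional

# A hypothetical lookup table of favorite numbers and their names.
FAVORITE_NUMBERS = {
    0.25: "a quarter",
    1 / 3: "a third",
    0.5: "a half",
    2 / 3: "two-thirds",
    1.0: "all",
    12.0: "a dozen",
}

def favorite_name(value: float, tolerance: float = 0.02) -> Optional[str]:
    """Return a common-language name if the value sits close to a favorite number."""
    for number, name in FAVORITE_NUMBERS.items():
        if abs(value - number) <= tolerance:
            return name
    return None

print(favorite_name(0.679))  # -> "two-thirds" (within 0.02 of 2/3)
```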

Photo by REVOLT on Unsplash

A system automatically generating numerical approximations would need to work with the following concepts:

Actual Value — the number that needs to be approximated, like the precise vote count result of 68.83%.

Given Value — the number that actually gets reported, e.g. 70%.

Generalization scale — the scale to be used for approximation. It splits the range of numbers used in a given domain into groups, each represented by a single value. These generalization scales will differ from domain to domain. For example:

  • When we deal with percentages, a generalization scale might be steps of 10%, so that whatever the actual percentage is, we report around 10, 20, …, 100 percent.
  • When dealing with money, rounding to available denominations could be the way to go: $1, $2, $5, $10, …, $100.
  • In other cases, we might want to adopt a “one quarter” generalization step and snap our actual value to “a quarter,” “a half,” “three-quarters,” and “all.”

Hedge — a common-use word describing the relation between the actual and given values. An Actual Value of 9.5 is below the Given Value of 10; an Actual Value of 101 is over the Given Value of 100.

Favorite Number — a common language name for certain numbers. 0.25 is a favorite number in that it has a name: a quarter.
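Putting these concepts together, a toy end-to-end approximation could look like the sketch below. To be clear, this is not the Accelerated Text implementation; the function name, thresholds, and favorite-number table are assumptions made for illustration.

```python
FAVORITES = {1 / 4: "a quarter", 1 / 3: "a third", 1 / 2: "a half",
             2 / 3: "two-thirds", 3 / 4: "three-quarters", 1.0: "all"}

def approximate(actual: float, scale_step: float) -> str:
    """Turn an Actual Value (a fraction) into a hedged approximation string."""
    # Generalization scale: snap the Actual Value to the nearest step,
    # which yields the Given Value.
    given = round(actual / scale_step) * scale_step

    # Hedge: describe the relation between the Actual and Given Values.
    if abs(actual - given) < 0.005:
        hedge = ""
    elif actual < given:
        hedge = "almost "
    else:
        hedge = "over "

    # Favorite Number: prefer a common name for the Given Value if one exists.
    for number, name in FAVORITES.items():
        if abs(given - number) < 0.01:
            return hedge + name
    return hedge + f"{given:.0%}"

print(approximate(0.6883, 1 / 3))  # -> "over two-thirds"
print(approximate(0.679, 0.10))    # -> "almost 70%"
```

Note how the choice of scale step alone changes the output from “almost 70%” to “over two-thirds” for roughly the same actual value.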

The above concepts can be easily addressed with a bit of software engineering. The challenge, though, lies in the choices. Which generalization scale should we use? Is it good enough to report a 7.8% increase in disease cases, or should it be 7.81%? Maybe “close to 8%” will do a better job of conveying the message. And why the hedge word “close”? “Around” might be just as good.

Those decisions are on an altogether different plane of complexity compared to constructing the necessary building blocks of numerical approximations. We have built and open-sourced a library to help construct messages containing numerical information. The context-sensitive choice of generalization scale and hedge word remains an open challenge.


Žygimantas Medelis is CEO @ TokenMill, a Natural Language Generation and Processing company.