Formatting numbers for machines and mortals

I was deeply proud the other day when my five year old son encountered this screen in one of his favorite games, and exclaimed: “Why didn’t they put a comma so that I see it’s twenty thousand and not two thousand?!”

A genuine data nerd in the making :)

His comment nicely illustrates that number formatting is an important usability issue. It matters not only for humans, who need to read the numbers presented to them at a glance, but also for computers, which have to make sense of the data files they are asked to interpret.

Decimal marks and thousands separators

In most cases the two formatting symbols we care the most about are the decimal mark and the thousands separator. The decimal mark separates the integer part from the fractional part of numbers written in decimal form, and the thousands separator groups the digits of the integer part of large numbers into groups of three so that it is easy to see at a glance the order of magnitude of the number in thousands, millions, billions, etc.

The biggest trap you will be caught in here is that while most of the English-speaking world uses a period (.) as a decimal mark and a comma (,) as a thousands separator, many other countries (including most of Europe and South America and large parts of Africa - full list here) do the exact opposite: Their decimal mark is a comma and their thousands separator is a period.

Throughout history, people have experimented with various other symbols as both decimal marks ( ¯ , superscript, | , · ) and thousands separators ( ‘ , _ ). In Asia some locales sometimes even use different digit groupings than the thousands grouping of three digits that we are accustomed to (one million might be written as 10’00’000), but we won’t dive into that here.

The period and the comma are confusing enough. When you come across numbers like:

20,350
20.350

…there is no way to tell which is “twenty thousand three hundred and fifty” and which is “twenty point three five zero” without further context. In fact, the line with the comma might just as well be a comma-separated list of the integers “twenty” and “three hundred and fifty”.
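To make the ambiguity concrete, here is a minimal Python sketch that reads both strings under each convention. The helper `parse_number` is a hypothetical name of my own, not a library function, and it assumes the caller states the convention explicitly:

```python
from decimal import Decimal


def parse_number(text: str, decimal_mark: str, thousands_sep: str) -> Decimal:
    """Parse `text` under an explicitly stated convention (hypothetical helper)."""
    # Drop the grouping symbol first, then normalise the decimal mark to ".".
    cleaned = text.replace(thousands_sep, "").replace(decimal_mark, ".")
    return Decimal(cleaned)


for raw in ("20,350", "20.350"):
    english = parse_number(raw, decimal_mark=".", thousands_sep=",")
    continental = parse_number(raw, decimal_mark=",", thousands_sep=".")
    as_list = raw.split(",")  # third reading: a comma-separated list
    print(f"{raw!r}: period-decimal = {english}, "
          f"comma-decimal = {continental}, list = {as_list}")
```

Every one of these readings is valid, which is exactly why the string alone cannot settle the question.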

Be kind — format your numbers properly

Before we dive into methods you might use to resolve such ambiguities, let’s talk about what you can do to format your numbers for the least amount of ambiguity.

The ground rules for this are somewhat different based on whether your intended “audience” is a human or a computer.

If you are displaying numbers to a human, you will want to:

  • Use the appropriate decimal mark for your user’s locale.
  • Use a non-breaking space as a thousands separator. It’s non-ambiguous. Some fonts even offer an especially thin non-breaking space which is ideal for this purpose.
  • An exception to the thousands separator guideline is that for numbers from 1000 to 9999 you may skip the thousands separator to avoid breaking a single digit off. 4500 is more legible than 4 500. However, in a table column where some of the numbers are within this range and others are larger, you will want to stick with the thousands separator, for consistency.
  • Avoid exactly three digits after a decimal mark when the use case allows. Standardizing on that exact number of decimal places rarely makes sense and is a source of confusion, because the three trailing digits can be mistaken for a thousands group.
  • For numbers between 0 and 1, a zero should precede the decimal mark. In other words, write 0.354, rather than just .354
  • Use a monospace font (a font where each digit takes up the same amount of horizontal space) if possible, and make sure that numbers in a column are aligned on the decimal point or right-aligned in the case of integers.
  • When dealing with very large numbers you may want to use a unit multiplier such as “(millions)” as a label on a chart axis or in a column header rather than writing the numbers out in full. In some cases using metric prefixes such as (k = kilo, M = mega, G = giga, …) may be helpful. But be careful. While scientifically correct, nobody* talks about “giga-dollars”! In fact they will expect the postfix B = billions in that case. But not even the meaning of “billion” is non-ambiguous (see Long and short numeric scales)! Ergo: Use of such post- and prefixes depends on the domain, locale and unit of the data.
  • Avoid commas, spaces or periods to separate values in a list. Table columns or tab characters are a good choice (remember the alignment rules above). If you must write numbers in a dense list, use semicolons, vertical bars or tab characters.

Here is a table of well formatted numbers for human consumption (in an English-speaking locale):

36.42 | 33.64 | 7006 | 0.33 | 10 938 272 | 24.64
34.58 | 30.20 | 878 | 0.45 | 6 093 679 | 27.20
29.57 | 37.62 | 3719 | 0.94 | 16 594 896 | 28.1
27.12 | 37.66 | 1478 | 0.93 | 12 145 826 | 27.66
27.42 | 44.42 | 2185 | 0.38 | 14 433 001 | 29
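Several of the display guidelines above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `format_for_humans` and its parameters are hypothetical names, and I picked U+202F (NARROW NO-BREAK SPACE) as the thin non-breaking space, which your font may or may not render well.

```python
from decimal import Decimal

THIN_NBSP = "\u202f"  # NARROW NO-BREAK SPACE (assumed choice of separator)


def format_for_humans(value, decimal_mark=".", always_group=False):
    """Format a number for display (hypothetical helper, not a library API).

    - groups the integer part in threes with a thin non-breaking space
    - leaves 1000..9999 ungrouped unless `always_group` is set (table columns)
    - always writes a leading zero, so ".354" becomes "0.354"
    """
    text = format(Decimal(str(value)), "f")  # plain notation, no exponent
    sign = "-" if text.startswith("-") else ""
    integer_part, _, fraction = text.lstrip("-").partition(".")

    if len(integer_part) > 4 or (always_group and len(integer_part) > 3):
        # Let Python group with commas, then swap in the chosen separator.
        integer_part = f"{int(integer_part):,}".replace(",", THIN_NBSP)

    result = sign + integer_part
    if fraction:
        result += decimal_mark + fraction
    return result


for sample in ("10938272", "7006", ".354", "36.42"):
    print(format_for_humans(sample))
```

Pass `always_group=True` for a column that mixes four-digit numbers with larger ones, and pass the decimal mark that matches your reader's locale. In production you would more likely lean on a proper localization library than on a hand-rolled helper like this one.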

On the other hand, if you are storing numbers in a file for computers to read, the rules are slightly different:

  • Include the locale you use for your file format as meta-data in the file if possible.
  • Use the appropriate decimal mark for that locale.
  • Don’t use thousands separators at all.
  • Avoid three significant digits after a decimal mark when the use-case allows.
  • For numbers between 0 and 1, a zero should precede the decimal mark.
  • Don’t use unit multipliers.
  • Use a semicolon, tab or vertical bar to separate values in a list.

The values from the human readable list above could then look something like this in computer readable format (in a German-Austrian locale):

Locale: de-at
36,42;33,64;7006;0,33;10938272;24,64
34,58;30,20;878;0,45;6093679;27,20
29,57;37,62;3719;0,94;16594896;28,1
27,12;37,66;1478;0,93;12145826;27,66
27,42;44,42;2185;0,38;14433001;29
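As a rough sketch of how such a file could be written and read back, here is a small Python round trip. The `Locale:` header line follows the example above, but the `DECIMAL_MARKS` lookup table and the `dump` and `load` functions are hypothetical names I made up for illustration, not an established file format:

```python
from decimal import Decimal

# Hypothetical lookup table: locale tag -> decimal mark.
DECIMAL_MARKS = {"de-at": ",", "en-us": "."}


def dump(rows, locale_tag):
    """Serialise rows of Decimals: locale header, no thousands separators."""
    mark = DECIMAL_MARKS[locale_tag]
    lines = [f"Locale: {locale_tag}"]
    for row in rows:
        lines.append(";".join(format(v, "f").replace(".", mark) for v in row))
    return "\n".join(lines)


def load(text):
    """Read the header first, then parse every cell with that locale's mark."""
    header, *body = text.splitlines()
    locale_tag = header.split(":", 1)[1].strip()
    mark = DECIMAL_MARKS[locale_tag]
    return [[Decimal(cell.replace(mark, ".")) for cell in line.split(";")]
            for line in body]


rows = [
    [Decimal(s) for s in ("29.57", "37.62", "3719", "0.94", "16594896", "28.1")],
    [Decimal(s) for s in ("27.42", "44.42", "2185", "0.38", "14433001", "29")],
]
text = dump(rows, "de-at")
print(text)
assert load(text) == rows  # the round trip preserves every value
```

Because the locale travels with the data, the reader never has to guess which symbol is the decimal mark.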

But, I’m a computer!

Now that we’ve made sure you minimize the ambiguity of your data for humans and computers alike, let’s get back to the issue of determining how to read numbers we encounter when we don’t know the locale for sure.

In short, the only way to do so is from their context, such as:

  • Language of the text surrounding the number. Use the country list as guidance.
  • Column headers that have hints or tell-tale signs of which it is: “Price” is unlikely to be stored with three significant digits after the decimal point; “Count” is probably an integer number; “Centimeters” is unlikely to be in the tens or hundreds of thousands.
  • Other separators in the same data file: In the line below, the comma after the apparent date gives away that the comma is a value separator here and by extension the period is a decimal mark.
2009-03-15,27.120,37.664,64.782,0.000,37.073,10.442,27.516
  • Other numbers in the same data file: Only numbers with a single separator followed by exactly three digits (roughly 1,000 to 999,999 when read as integers) have this ambiguity, so other numbers in the same context that fall outside that pattern can help determine the format. Here “Limit” gives away that the period is being used as a thousands separator:
Id    Value    Limit
345 20.350 1.000.000
  • …also, if there is a decimal number in the surrounding context you may be able to infer that the comma is a thousands separator if the period is obviously a decimal mark (or vice versa):
Id    Price    Thrust
246 20.35 23,400
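The last two hints, the ones that only look at other numbers in the same file, are the easiest to turn into code. Here is a minimal Python sketch; `infer_decimal_mark` is a hypothetical helper, and it simply trusts the first conclusive value it sees, so it assumes the file uses one convention throughout:

```python
def infer_decimal_mark(values):
    """Guess the decimal mark from number strings; None if inconclusive.

    Hypothetical heuristic: returns on the first conclusive value.
    """
    for text in values:
        text = text.strip().lstrip("+-")
        marks = [ch for ch in text if ch in ".,"]
        if not marks:
            continue  # plain integer, no evidence either way
        last = marks[-1]
        other = "," if last == "." else "."
        if other in marks:
            return last   # both symbols present: the last one is the decimal mark
        if len(marks) > 1:
            return other  # a repeated symbol can only be a thousands separator
        head, tail = text.split(last)
        if len(tail) != 3 or len(head) > 3 or not head or head.startswith("0"):
            return last   # cannot be a thousands group, so it is a decimal mark
        # One symbol followed by exactly three digits: still ambiguous.
    return None


print(infer_decimal_mark(["20.350", "1.000.000"]))  # the "Limit" example
print(infer_decimal_mark(["20.35", "23,400"]))      # the "Price" example
print(infer_decimal_mark(["20,350", "20.350"]))     # nothing conclusive
```

When the function returns `None` you are back to the softer hints such as language and column headers, or to asking whoever produced the file.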

The problem with most of these contextual rules is that while they may be relatively easy for humans to utilize on a case-by-case basis, the underlying heuristics can be pretty hard to code. You don’t want to include hundreds of language lexicons in your software just to determine which number formatting is being used, and rules based on column headers are both fragile and language specific. The arsenal you have to automatically detect the format may therefore be smaller.

And when using other people’s software, be careful. Many software solutions, even professional data software, do a terrible job at this and can screw up data imports in ways you wouldn’t believe.

I have personally filed bug reports with more than one international statistics agency where the values in their data systems are obviously off by a factor of a thousand, or a decimal point has been accidentally dropped in a value, or in an entire column.

Key take-away: Always sanity check imported values before you start working with them.
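One cheap sanity check is to compare every value in a column against the column's median and flag anything that is wildly out of scale. The sketch below is only one possible approach: `flag_suspect_values` and its `max_ratio` threshold of 100 are assumptions of mine, and columns that legitimately span many orders of magnitude (populations, for instance) will trigger false alarms.

```python
from statistics import median


def flag_suspect_values(values, max_ratio=100.0):
    """Return (index, value) pairs that are far out of scale with the median.

    Hypothetical check: a misread separator typically shifts a value by a
    factor of about a thousand, so a ratio threshold of 100 catches it.
    """
    magnitudes = [abs(v) for v in values if v != 0]
    if not magnitudes:
        return []
    typical = median(magnitudes)
    suspects = []
    for index, value in enumerate(values):
        if value == 0:
            continue  # zero carries no scale information
        ratio = abs(value) / typical
        if ratio > max_ratio or ratio < 1 / max_ratio:
            suspects.append((index, value))
    return suspects


# 27.120 was imported as 27120 because the period was read as a thousands separator.
column = [27.12, 27.42, 27120.0, 29.57, 34.58]
print(flag_suspect_values(column))
```

A flagged value is not proof of an import error, just a prompt to go back and look at the source file.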

Why didn’t they include…

Proud as I am, when I’ve properly educated my five year old, he should say: “Why didn’t they put a non-breaking space so that I see it’s twenty thousand and not two thousand, as recommended by the 22nd General Conference on Weights and Measures in 2003 and adopted in ISO 31-0?!”. Or not…

In any case: There you go Subway Surfers, I fixed it for you:


* Christos Karras (@ckarras) pointed out on Twitter that in Canada the use of “gigadollars” is quite common; it’s even defined in a terminology and linguistic database provided by the Canadian Government. I wonder if this use of the term has to do with Canada’s official bilingualism?