Sven VC
Concerning Pharo
Published in
7 min readDec 15, 2020

--

Understanding Base58 Encoding

It is all about integers

Photo by Dan-Cristian Pădureț on Unsplash

Some time ago something popped up in my inbox that piqued my interest: a draft RFC about Base58 Encoding, The Base58 Encoding Scheme.

I already knew about Base58 encoding: it is technique to encode arbitrary binary data into a textual representation that is easier to handle, much like Base64.

Base58 encoding was designed to encode Bitcoin addresses. It has the following characteristics:

  • its alphabet avoids similar looking letters
  • it does not use non-alphanumeric characters

These features help humans to make less errors and allows direct inclusion in filenames and URLs.

What did however confuse me a lot were the encoding and decoding algorithms in the RFC. They were describing a particular implementation instead of the mathematical principles.

So I started exploring to try understand what was going. Here is what I learnt. All code was written in Pharo, a pure object-oriented programming language and powerful environment, focused on simplicity and immediate feedback.

Decimal numbers

The most common way to notate integers is using decimal notation, or more precisely using the base-ten positional numeral system.

For a ten-based system, we need 10 digits, 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. Each of these has a numerical value associated to them, again 0 to 9. So the base is 10, the alphabet is ‘0123456789’ and each digit’s value is its zero-based index in the alphabet.

The position in the digit string determines its meaning. From right to left, each increasing position corresponds to another, higher power of the base.

For example, 123 is seen as (1 * 100) + (2 * 10) + 3. Using powers of the base this becomes (1 * 10²) + (2 * 10¹) + (3 * 10⁰). We can rewrite this to avoid the power operation (((1 * 10) + 2) * 10) + 3.

Let’s try this in code:

input := '1234'.
digits := '0123456789'.
n := 0.
input do: [ :each |
n := n * 10 + (digits indexOf: each) — 1 ].
n.

What we did above is parse a string, inputcontaining a decimal number into an internal machine number, n. Note that Pharo uses one-based indexes.

What about the other direction ? How do we produce the decimal number string given an internal machine integer ?

We can take the integer apart using the quotient (integer division) and modulo (remainder after integer division) with the base.

In Pharo, the operators are // for quotient and \\ for remainder. For example,

10 // 3 = 3.
10 \\ 3 = 1.
(3 * 3) + 1 = 10.
n = (b * (n // b)) + (n \\ b).123 \\ 10 = 3.
123 // 10 = 12.

The fourth line above is the invariant that holds for quotient and remainder. The last two lines above show how we take the lowest (rightmost, least significant) digit out of an integer and shift the integer to the right.

Now, let’s try putting all of this together:

n := 1234.
digits := '0123456789'.
(String streamContents: [ :out |
[ n > 0 ] whileTrue: [
out nextPut: (digits at: (n \\ 10) + 1).
n := n // 10 ] ]) reverse.

What we did was implement a printer for positive integers that produces a decimal number string given an internal machine integer.

Since we’re working from the right (least significant) to left (most significant) digit, we have to reverse the result.

Any base, any alphabet

The code that we wrote in the previous section seems specific to decimal numbers, but it is actually quite general.

By using the digits ‘0123456789ABCDEF’ and 16 instead of 10 as base, we get a parser and printer for hexadecimal numbers.

Similarly, ‘01234567’ and 8 for octal numbers and even ‘01’ and 2 for binary numbers.

We can construct strange new variants by using odd bases and odd alphabets. For example, using ‘ABCDEFGHIJKL’ with base 12 would work just fine. In this particular system, 1234 would be written as ‘IGK’ — a number without actual digits.

Bytes are integers too

A sequence of 8-bit bytes, a ByteArray, can also be seen as one large integer, by considering each byte as a digit that is its own value, using base 256.

Here is code for the ByteArray to Integer conversion operation:

bytes := #[ 10 20 30 40 50 60 ].n := 0.
bytes do: [ :each |
n := n * 256 + each ].
n.

This is functionally equivalent to what the builtin Pharo method ByteArray>>#asIntegerdoes.

This bytes to integer conversion is also called the network byte order, big endian, most to least significant digit form. It is how 16-bit, 32-bit, 64-bit integers are stored in memory in such systems. This is in contrast to little endian systems where the order is reversed.

The reverse operation converts an Integer into a ByteArray, just like the builtin Integer>>#asByteArraymethod.

n := 11081521574460.(ByteArray streamContents: [ :out |
[ n > 0 ] whileTrue: [
out nextPut: n \\ 256.
n := n // 256 ] ]) reverse.

Note how similar the code in this section is compared to the code in the first section. The only difference is that we do not need to convert a character from the alphabet into a numerical value or vice versa, since each element is its own value.

BTW, Integers in Pharo can be of arbitrary size, which is necessary to make this possible.

The encoding principle

Using the techniques described above, we can now describe Base58 encoding as follows:

  • convert a ByteArray to an Integer
  • print that Integer using base 58 and a special alphabet

Decoding then becomes:

  • parse the representation using base 58 and a special alphabet
  • convert that Integer to a ByteArray

The 58 digit long alphabet to use is equal to

'123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

Note that this is a strange alphabet, since the decimal digits do not stand for themselves (1 is zero for example) and some letters are not used.

Here is a test vector from the RFC:

'Hello World!' asByteArray.

would be Base58 encoded to

'2NEpo7TZRRrLZSi2U'.

Leading zeros

We skipped over one significant detail though. Just like 1234, 01234 or 001234 are the same (i.e. the leading zero is insignificant), ByteArrays with leading zero bytes are encoded to the same integer.

For example,

#[ 1 2 3 ] asInteger.
#[ 0 1 2 3 ] asInteger.
#[ 0 0 1 2 3 ] asInteger.

are all equal to 66051. That is not what we want. In this case we want to preserve those leading zero bytes.

Luckily there is an elegant solution to handle this situation: add as many leading zeroes (or at least the digit representing zero) to the final encoding as there are leading zero bytes.

For the 3 examples above, that would result in ‘66051’, ‘066051’ and ‘0066051’ respectively. Of course, we need no additional zero after step one, but after step two.

In Base58 encoding, leading zeros are actual 1s. So the number of leading 1s is equal to the number of leading zero bytes, to be added in front of the ByteArray result after step two. Other than this, the leading 1s can be ignored, just like any leading zero.

For example, the RFC states that the test vector

ByteArray readHexFrom: '000000287fb4cd'.#[0 0 0 40 127 180 205]

should be Base58 encoded to

'111233QC4'

Putting it all together

We are now ready to show the complete solution. First the decoding:

alphabet := '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'.
input := '2NEpo7TZRRrLZSi2U'.
ByteArray streamContents: [ :out | | n |
n := 0.
input do: [ :each | | b |
b := (alphabet
indexOf: each
ifAbsent: [ self error: 'Base58 digit expected' ]) — 1.
(n = 0 and: [ b = 0 ])
ifTrue: [ out nextPut: 0 ].
n := n * 58 + b ].
out nextPutAll: n asByteArray ].

Using the indexOf:ifAbsent:message instead of indexOf:takes care of an error case. Zero digits are output as zero bytes while the number itself is still zero (i.e. only for leading zeros). Next we decode using base 58 and the alphabet. Conversion to a ByteArray happens in the last line.

Secondly we implement the encoding:

alphabet := '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'.
input := 'Hello World!' asByteArray.
(String streamContents: [ :out | | r n |
n := input asInteger.
[ n > 0 ] whileTrue: [
out nextPut: (alphabet at: (n \\ 58) + 1).
n := n // 58 ].
r := input readStream.
[ r peekFor: 0 ] whileTrue: [
out nextPut: alphabet first ] ]) reverse.

We start by converting the ByteArray to an Integer. Next we encode it using base 58 and the alphabet. Then we scan over the input bytes and output 1s as long as we see zero bytes. Finally we reverse the result.

All in all, I find Base58 encoding quite elegant. I hope that my explanation is easier to understand by being closer to the actual mathematical principles used.

PS: Esteban A. Maringolo has a pharo-base58 project which contains additional features, including the ‘check’ algorithm. If you are looking for more, it is worth checking out.

--

--

Sven VC
Concerning Pharo

loves high-level, dynamic and interactive object-oriented systems