Base 62 text encoding/decoding

Abhishek Jha
Jun 13, 2020

An idea to encode text to any base

What does it even mean?

Given a string of UTF-8 characters, we want to convert it to a string that uses only characters from a fixed set of 62 characters.

What is this set of characters? And why the size 62?

Usually the set contains all lowercase letters (a..z), uppercase letters (A..Z) and digits (0..9). Hence the size of the set is 26 + 26 + 10 = 62. Text made up of only these characters is also known as alphanumeric text.

Ok so we start with a string and end up with another string, why would anyone do that?

  1. You have a string with special characters and want to convert it to an alphanumeric string for safety reasons.
  2. Just for the fun of it.

Honestly, it does not have much practical use. But the idea discussed here can be used to convert text to any base. I hope that's good enough for motivation.

So where do we start?

The concept of a base does not have much meaning for text. Bases are mostly used for numbers. For example:
If we represent numbers using only 0..9, they are in base 10. Most of the numbers we encounter in day-to-day life are represented in base 10. (Why?)
If we use only 0 and 1, they are in base 2, also known as binary.
Using 0..9 and a..f, they are in base 16, also known as hexadecimal.

So these are different numeral systems. The representation of the same value changes depending on which system you choose.
For example:

1100 in binary = 12 in decimal = c in hex
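
As a quick check of the equivalence above, here is a small sketch that uses Go's standard strconv package to print the same value in the three bases. The helper name representations is made up for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// representations returns the binary, decimal and hexadecimal spellings of n.
func representations(n int64) (bin, dec, hex string) {
	return strconv.FormatInt(n, 2), strconv.FormatInt(n, 10), strconv.FormatInt(n, 16)
}

func main() {
	bin, dec, hex := representations(12)
	fmt.Printf("%s in binary = %s in decimal = %s in hex\n", bin, dec, hex)
}
```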

So a number can be represented in any base you like, including base 62. The other interesting fact is that the process of converting a decimal number to any other base is the same.

Here is a rough idea:

  1. Keep dividing the decimal number by the base until it reaches 0, noting the remainder of each division.
  2. The remainders, read in reverse order of occurrence (last remainder first) and mapped to the character set, form the representation in that base.

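Here is some Go code to illustrate the concept described above.

const (
	base         uint64 = 62
	characterSet        = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
)

// toBase62 converts num to its base 62 representation.
func toBase62(num uint64) string {
	if num == 0 {
		// The loop below never runs for 0, so handle it explicitly.
		return string(characterSet[0])
	}
	encoded := ""
	for num > 0 {
		r := num % base
		num /= base
		// Remainders come out least significant first, so prepend each digit.
		encoded = string(characterSet[r]) + encoded
	}
	return encoded
}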
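
To make the order of the remainders visible, here is a small sketch that collects them as they are produced. The helper name remainders is hypothetical and only serves this illustration. The remainders come out least significant digit first, which is why the function above prepends each new character.

```go
package main

import "fmt"

// remainders returns the remainders produced by repeatedly dividing num by
// base, in the order they are produced (least significant digit first).
func remainders(num, base uint64) []uint64 {
	var rs []uint64
	for num > 0 {
		rs = append(rs, num%base)
		num /= base
	}
	return rs
}

func main() {
	// 12 in base 2: the remainders read in reverse give 1100.
	fmt.Println(remainders(12, 2))
	// 65535 in base 62: the remainders read in reverse are the digit values 17, 3, 1.
	fmt.Println(remainders(65535, 62))
}
```

With the character set used in this article, the digit values 17, 3 and 1 map to the characters H, 3 and 1.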

So now we know how to convert any decimal number to base 62. What next?

What if we figure out how to convert any string to a decimal number? Then we can use the above idea to convert that number to base 62.

It might sound weird to convert a string to a decimal number, but it is actually very straightforward because of the way strings are handled in a computer.

Every string can be represented as a sequence of bytes. Every byte is a number between 0 and 255.
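
A minimal sketch of this in Go, where converting a string to []byte exposes its underlying bytes. The helper name byteValues is an assumption made for this example.

```go
package main

import "fmt"

// byteValues returns the bytes of s as plain integers for easy printing.
func byteValues(s string) []int {
	vals := make([]int, 0, len(s))
	for _, b := range []byte(s) {
		vals = append(vals, int(b))
	}
	return vals
}

func main() {
	// ASCII characters take one byte each.
	fmt.Println(byteValues("abc"))
	// Non-ASCII characters take several bytes each in UTF-8.
	fmt.Println(len(byteValues("日本語")))
}
```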

So our algorithm would be

  1. First represent the string in bytes.
  2. Encode the bytes from above in hex.
  3. Convert the hex encoded value into decimal.
  4. Convert the decimal number into base 62.

For example, let's try to encode the string “SIMPLE”.

.--------.----.----.----.----.----.----.
| string | S  | I  | M  | P  | L  | E  |
:--------+----+----+----+----+----+----:
| bytes  | 83 | 73 | 77 | 80 | 76 | 69 |
:--------+----+----+----+----+----+----:
| hex    | 53 | 49 | 4d | 50 | 4c | 45 |
'--------'----'----'----'----'----'----'

53494d504c45 in hex = 91574294826053 in decimal
91574294826053 in decimal = Q0DRQksv in base 62

In the bytes row, each letter is represented by its ASCII value.
Each byte is then converted to its two-digit hex value, and the hex values are concatenated to form a single hex string. The decimal equivalent of this hex string is calculated. In the end, this decimal value is converted to base 62 using the idea discussed earlier.

So the base 62 encoding of the word “SIMPLE” is “Q0DRQksv”.
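
The four steps can be sketched end to end for a short string. This is only an illustration under the assumption that the input is at most 8 bytes long, so that the intermediate number fits in a uint64. The helper name encodeShort is hypothetical.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strconv"
)

const characterSet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// toBase62 mirrors the conversion function shown earlier in the article.
func toBase62(num uint64) string {
	if num == 0 {
		return "0"
	}
	encoded := ""
	for num > 0 {
		encoded = string(characterSet[num%62]) + encoded
		num /= 62
	}
	return encoded
}

// encodeShort runs the four steps on a string of at most 8 bytes.
func encodeShort(s string) (string, error) {
	hexStr := hex.EncodeToString([]byte(s))       // steps 1 and 2: bytes, then hex
	num, err := strconv.ParseUint(hexStr, 16, 64) // step 3: hex to a number
	if err != nil {
		return "", err
	}
	return toBase62(num), nil // step 4: number to base 62
}

func main() {
	out, err := encodeShort("SIMPLE")
	fmt.Println(out, err)
}
```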

Why hex encoding?

Theoretically it could be any encoding. I chose hex because libraries for hex encoding and decoding are readily available in most languages, and it's very easy to convert from hex to decimal.

What about the size of decimal value?

The decimal number generated in step 3 can be very large. For longer strings it will not fit into a 64-bit integer (anything longer than 8 bytes overflows a uint64). To take care of this, you can encode N bytes at a time and concatenate all the encodings to form the final result.
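
To see the overflow in practice, here is a small sketch that checks whether the hex form of a string can still be parsed into a uint64. The helper name fitsInUint64 is made up for this example.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strconv"
)

// fitsInUint64 reports whether the hex form of s can be parsed into a uint64.
func fitsInUint64(s string) bool {
	_, err := strconv.ParseUint(hex.EncodeToString([]byte(s)), 16, 64)
	return err == nil
}

func main() {
	fmt.Println(fitsInUint64("SIMPLE"))      // 6 bytes
	fmt.Println(fitsInUint64("SIMPLE TEXT")) // 11 bytes, too large for 64 bits
}
```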

Awesome, what about decoding the encoded string?

Simple. Just execute the steps used while encoding in reverse order.

  1. Convert the base 62 number to decimal.
  2. Calculate the hex equivalent of the decimal number.
  3. Decode the hex in bytes.
  4. Get the original string from the bytes.

To get the decimal number from a base 62 string, multiply the decimal value of each character by the base raised to the power of that character's position, counted from the right and starting at 0, and add up the results.

Here is some Go code to convert a base 62 encoded string to a number.

import (
	"errors"
	"strings"
)

const (
	base         uint64 = 62
	characterSet        = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
)

// fromBase62 converts a base 62 string back to the number it represents.
func fromBase62(encoded string) (uint64, error) {
	var val uint64
	for _, char := range encoded {
		pos := strings.IndexRune(characterSet, char)
		if pos == -1 {
			return 0, errors.New("invalid character: " + string(char))
		}
		// Multiplying the running total by base at every step gives each digit
		// the weight base^(position from the right), and keeps the arithmetic
		// in integers instead of going through floating-point math.Pow.
		val = val*base + uint64(pos)
	}
	return val, nil
}
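
The four decoding steps can also be sketched end to end for a short input. As before, this assumes the decoded number fits in a uint64, and the helper name decodeShort is hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

const characterSet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// fromBase62 turns a base 62 string into a number, digit by digit.
func fromBase62(encoded string) (uint64, error) {
	var val uint64
	for _, char := range encoded {
		pos := strings.IndexRune(characterSet, char)
		if pos == -1 {
			return 0, errors.New("invalid character: " + string(char))
		}
		val = val*62 + uint64(pos)
	}
	return val, nil
}

// decodeShort reverses the four encoding steps for a short input.
func decodeShort(encoded string) (string, error) {
	num, err := fromBase62(encoded) // step 1: base 62 to a number
	if err != nil {
		return "", err
	}
	hexStr := strconv.FormatUint(num, 16) // step 2: number to hex
	if len(hexStr)%2 == 1 {
		hexStr = "0" + hexStr // hex.DecodeString needs an even number of digits
	}
	raw, err := hex.DecodeString(hexStr) // step 3: hex to bytes
	if err != nil {
		return "", err
	}
	return string(raw), nil // step 4: bytes to string
}

func main() {
	out, err := decodeShort("Q0DRQksv")
	fmt.Println(out, err)
}
```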

While encoding we encoded N characters at a time to avoid overflow, what happens while decoding?

How many characters to decode at a time (say D) depends on the value of N.
Let’s say you selected N = 2 while encoding, which means you encode 2 bytes in each iteration. As per our algorithm, 2 bytes can have a maximum value of

ffff in hex = 65535 in decimal = H31 in base 62

So to represent max value of 2 bytes we need 3 chars in base 62. Hence while decoding we should select 3 characters at a time. i.e.
D = 3 for N = 2

Also, while encoding we might end up with a number that can be represented using just 1 or 2 characters in base 62. This would cause problems while decoding, because the chunk boundaries would no longer line up.
To solve it, we can left-pad these numbers with 0 (zero) so that on every iteration the encoded base 62 string has length 3.
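
Here is a small sketch of why the padding matters. It splits a concatenated encoding into pieces of 3 characters, once without padding and once with it. The helper name splitEvery is an assumption for this example. The value 69 (a single byte “E”) is “17” in base 62, and 65535 is “H31”.

```go
package main

import "fmt"

// splitEvery cuts s into pieces of n characters (the last one may be shorter).
func splitEvery(s string, n int) []string {
	var parts []string
	for i := 0; i < len(s); i += n {
		end := i + n
		if end > len(s) {
			end = len(s)
		}
		parts = append(parts, s[i:end])
	}
	return parts
}

func main() {
	// Without padding the decoder cuts in the wrong place: "17H" and "31".
	fmt.Println(splitEvery("17"+"H31", 3))
	// With padding every chunk is recovered intact: "017" and "H31".
	fmt.Println(splitEvery("017"+"H31", 3))
}
```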

To summarise, D depends directly on N. It can be calculated with this expression for the number of characters to decode:

D = ceil( log(16^(2N) - 1) / log(62) )

In Go code:

// N is the number of bytes encoded per chunk (an untyped constant or a float64).
var D = int(math.Ceil(math.Log(math.Pow(16, 2*N)-1) / math.Log(62)))

/*
Some sample values
for N = 2, D = 3
for N = 5, D = 7
for N = 8, D = 11
*/
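
The expression can be wrapped in a small function to check the sample values. The name decodeChunkSize is an assumption for this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// decodeChunkSize returns the number of base 62 characters (D) needed to
// represent the largest value that fits in n bytes.
func decodeChunkSize(n int) int {
	maxValue := math.Pow(16, float64(2*n)) - 1
	return int(math.Ceil(math.Log(maxValue) / math.Log(62)))
}

func main() {
	for _, n := range []int{2, 5, 8} {
		fmt.Printf("N = %d, D = %d\n", n, decodeChunkSize(n))
	}
}
```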

Awesome, will this idea work only for strings with ASCII characters?

Since we are converting the string to bytes, it should work for any character encoding. I have tested it with the Japanese word “日本語” (nihongo) and it worked fine.

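Here is the code for the encode and decode functions in Go. It uses toBase62 and fromBase62 from the snippets above.

import (
	"encoding/hex"
	"strconv"
	"strings"

	"github.com/pkg/errors" // provides errors.Wrap
)

// number of bytes encoded per chunk (N)
const encodingChunkSize = 2

// number of base 62 characters required to represent encodingChunkSize bytes (D)
// given by formula :: int(math.Ceil(math.Log(math.Pow(16, 2*encodingChunkSize)-1) / math.Log(62)))
const decodingChunkSize = 3

// Encode converts str to base 62, encodingChunkSize bytes at a time.
func Encode(str string) string {
	var encoded strings.Builder

	inBytes := []byte(str)
	byteLength := len(inBytes)

	for i := 0; i < byteLength; i += encodingChunkSize {
		chunk := inBytes[i:minOf(i+encodingChunkSize, byteLength)]
		s := hex.EncodeToString(chunk)
		// s holds at most 2*encodingChunkSize hex digits, so it always fits in a uint64.
		val, _ := strconv.ParseUint(s, 16, 64)
		w := padLeft(toBase62(val), "0", decodingChunkSize)
		encoded.WriteString(w)
	}
	return encoded.String()
}

// Decode reverses Encode, decodingChunkSize characters at a time.
func Decode(encoded string) (string, error) {
	decodedBytes := []byte{}
	for i := 0; i < len(encoded); i += decodingChunkSize {
		chunk := encoded[i:minOf(i+decodingChunkSize, len(encoded))]
		val, err := fromBase62(chunk)
		if err != nil {
			return "", err
		}
		chunkHex := strconv.FormatUint(val, 16)
		// FormatUint drops leading zeros, so restore them: full chunks get
		// 2*encodingChunkSize hex digits, and the last chunk (which may hold
		// fewer bytes) is padded to an even number of digits. A final full
		// chunk that starts with a zero byte still cannot be told apart from
		// a shorter one.
		width := 2 * encodingChunkSize
		if i+decodingChunkSize >= len(encoded) {
			width = len(chunkHex) + len(chunkHex)%2
		}
		chunkHex = padLeft(chunkHex, "0", width)
		dst := make([]byte, hex.DecodedLen(len(chunkHex)))
		_, err = hex.Decode(dst, []byte(chunkHex))
		if err != nil {
			return "", errors.Wrap(err, "malformed input")
		}
		decodedBytes = append(decodedBytes, dst...)
	}
	return string(decodedBytes), nil
}

// minOf returns the smaller of a and b.
func minOf(a int, b int) int {
	if a < b {
		return a
	}
	return b
}

// padLeft prepends pad to str until it is at least length characters long.
func padLeft(str, pad string, length int) string {
	for len(str) < length {
		str = pad + str
	}
	return str
}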

The complete code is available on GitHub. Enjoy!
