How to build a custom url shortener

harpermaddox
4 min readFeb 1, 2016

--

When we were getting our social media tools off the ground at EdgeTheory, we needed to track the performance of our messages at an atomic level. Rather than paying bit.ly to produce millions of links, we decided to build our own.

When building a shortener, you’ve got two dimensions to optimize on: 1) domain and 2) shortened hash. First, choose a short domain. I suggest using http://domai.nr to check for valid domains.

Best practices for domains:

  • Avoid using a .com; Use a two letter top level domain (TLD) like .io or .ly
  • Take your company or product name and remove the vowels
  • Every character matters! Focus on a short domain (8 or less with dot and TLD)

The shortened hash is what comes after the domain. E.g., “http://goo.gl/abcdef”. The most popular shortened urls use a series of letters and numbers, rather than just numbers. Why? Using a number is really inefficient, since encoding has exponential growth:

  • 4 digits with numbers = 1⁰⁴ — 1 = 9,999 entries
  • 4 digits with letters = 2⁶⁴ — 1 = 456,975 entries
  • 4 digits with uppercase and lowercase letters = 5²⁴ — 1 = 7,311,615 entries
  • 4 digits with numbers, uppercase, and lowercase letters = 6²⁴ — 1 = 14,776,336 entries

Base62 uses almost half the space of using numbers, so it is incredibly valuable. Some shorteners use the Base64 standard which usually add “+” and “/” or “-” and “_” to the character set. You could conceivably use the whole 255 entry ASCII charset, but when going past 62 characters you can run into issues with what characters can exist in a url. You’re immediately prohibited from using control characters like ?, #, %, :, /, .

In our example, we choose Base62. As a reminder, when creating your database table, the encoding must be case insensitive. Otherwise, you won’t be able to differentiate “ABC” from “ABc”.

Generating the shortened hash using Base 62

To generate a shortened hash, we need to be able to convert back and forth from Base 10 (decimal) to Base62. Its a standard base conversion, just like converting decimal to hexadecimal. If you want to get fancy, call it a Bijective function, since there is a one to one mapping between base 10 and base 62.

Base 62 encoding (in Ruby)

BASE_62_VALUES = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"def self.to_base(num, base = 62)
return BASE_62_VALUES[0] if num == 0
result = ""
while num > 0
r = num % base
result.prepend(BASE_62_VALUES[r])
num = (num / base).floor
end
result
end

Let’s walk through how this works. First we enumerate our entire character set in a constant called BASE_62_VALUES. Once we do this, we can do a quick array lookup (0 to 61) to find our encoded character. If you want to have a longer set than base 62, just add more characters. Also, the set begins with numbers, so the first 10 digits matches the decimal system. It’s also how hexidecimal (“0123456789ABCDEF”) works.

If the decimal number is 0, then we immediately break and return the first character in our charset. If we didn’t do this, we’d return an empty string. Since we are working off a different numeric system than base 10, we have to use strings. We begin with an empty string then work backwards on the base 10 number, hacking off one digit at a time. We use the modulus function to get the remainder when dividing the number by 62 (our base). Then we push the character value at that index onto the string. We then remove one base 62 digit from the base 10 number, by dividing and flooring it. Then we loop until we reduce the base 10 number to zero.

Base 62 decoding

def self.from_base(value, base = 62)
base_10_num = 0
digit = 0
value.reverse.each_char do |char|
num = BASE_62_VALUES.index(char)
base_10_num += num * (base ** digit)
digit += 1
end
base_10_num
end

Decoding is a bit simpler, since we use a string as input. Iterate through it in reverse and get the index of each character. Then multiply that by 62 ^ digit, increasing digit after processing each character. We simply total all of these together as a base 10 integer.

Testing your encode and decode functions

I wrote a few explicit unit tests using Minitest, but got a lot more value from doing 1000 Monte-Carlo simulations for a single test.

def test_convert_b10_to_b62_zero
assert_equal(‘0’, LinkShortener::ConvertBase.to_base(0, base=62))
end
def test_convert_b10_to_b62_two_char_b62
assert_equal(‘1A’, LinkShortener::ConvertBase.to_base(98, base=62))
end
def test_convert_b62_to_b10_zero
assert_equal(0, LinkShortener::ConvertBase.from_base(‘0’, base=62))
end
def test_convert_b62_to_b10_two_char_b62
assert_equal(98, LinkShortener::ConvertBase.from_base(‘1A’, base=62))
end
# Reversibility Sampler
def test_monte_carlo_b10_to_b62_to_b10
1000.times do |i|
seed = rand(1_073_741_823) # Max 32 bit Fixnum
b62 = LinkShortener::ConvertBase.to_base(seed, base=62)
assert_equal(seed, LinkShortener::ConvertBase.from_base(b62, base=62))
end
end

The Monte-Carlo sampler rapidly tests that chaining your encode and decode function produces the original input value. We use the Kernel#rand function to generate a random number between 0 and a very large number, which might be the largest 32 bit int in Ruby. It runs extremely fast and gives us confidence that we hit enough test cases to detect bugs.

Now build some database models, drop this library into your CRUD app, and go!

--

--

harpermaddox

I love building things. CTO of EdgeTheory. Former TI-83 calculator hacker, Crewed a yacht that went halfway around the world.