Building a Scalable Tinyurl Application
With Python, Docker Compose, and Kubernetes
Let’s say you have an application that works well on your laptop or on some server with a few users. How would you scale it to millions of users?
Scalability is critical to user experience: how can we sustain a reasonable user experience regardless of the number of users on the system? These are the questions we will address in this blog, step by step, with a concrete example.
After reading this blog, you should be able to:
- Organize your application functions for scalability
- Use Docker Compose to develop and iterate rapidly through a distributed application on your laptop
- Take that application and deploy it to the cloud in a few minutes via Kubernetes for a larger audience!
Furthermore, Python is a great choice for developer productivity.
We’ll discuss the design of the well-known Tinyurl application, which we’ll build from scratch.
Tinyurl, whose function is to convert a long url into a shorter version, is also a popular system design topic. Although this application appears simple in terms of functionality, we can still run into all the scalability pitfalls found in more complex applications. This gives us the opportunity to work on design topics without being bogged down in functionality details.
We will deploy the app in Docker Compose for local testing and later in Kubernetes for scalable deployment in a public cloud.
Specifically, we will create a REST service that will provide APIs to create a tiny url and retrieve the original url.
We will start with the high level architecture as well as the software stack and planned deployment options. You’ll be introduced to Docker and to container management software, specifically Docker Compose and Kubernetes, which constitute a helpful toolset for building scalable, distributed apps.
We will break the application into multiple autonomous services, which facilitates independent scaling of parts of the application. Kubernetes is a great enabler of scalability for such microservice-based applications. Further, a simple and easily understood design goes a long way in building software whose scalability can be improved over time. Our core logic is probably around 50 lines of Python!
Meanwhile, Docker Compose makes local development simple and fun.
Let’s start with the APIs we need:
- Given a url, return the short url
- Given a short url, return the original url
bit.ly is a popular online url shortening service, which you may experiment with to get a feel for what we are trying to achieve here.
Before we dive into individual services, let’s list the performance and scalability requirements for our APIs.
Requirements and API
API: url shortening
- Response time should be less than 1 second
- 100 urls shortened per second
- “Massive” number of urls actively managed
API: return original url
- Response time should be less than 100 ms
- 10000 urls returned per second
Performance expectations for the API that retrieves the original url are higher because it will be requested frequently and hence is critical for the end user experience. Also, services should be “elastic” and scale automatically by bringing more nodes into service as traffic increases.
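To get a feel for what these targets imply, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that the stated per-second rates are sustained around the clock; real traffic would be burstier, so treat the daily totals as upper-bound estimates.

```python
# Back-of-the-envelope numbers derived from the stated targets.
# Assumption: the per-second rates are sustained for a full day.
WRITES_PER_SECOND = 100       # urls shortened per second (from the requirements)
READS_PER_SECOND = 10_000     # original urls returned per second (from the requirements)
SECONDS_PER_DAY = 24 * 60 * 60

read_write_ratio = READS_PER_SECOND // WRITES_PER_SECOND
new_urls_per_day = WRITES_PER_SECOND * SECONDS_PER_DAY
lookups_per_day = READS_PER_SECOND * SECONDS_PER_DAY

print(f"read/write ratio: {read_write_ratio}:1")
print(f"new urls per day: {new_urls_per_day:,}")
print(f"lookups per day:  {lookups_per_day:,}")
```

A read-to-write ratio of 100:1 is the main reason a read cache is attractive for this workload.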
The performance and throughput numbers that we hope to achieve are a function of system design and hardware. A fully scalable system can be scaled by adding more hardware, which makes arbitrary throughput numbers such as those listed above possible.
In reality however, a sub-system in the application can become a bottleneck as an attempt is made to scale the system. We shall iteratively locate and eliminate such bottlenecks going forward.
Now, let’s start with a high level design where we break the application into independent services and choose a suitable software stack for them.
We have a frontend server implementing REST APIs, which needs a database for storing and fetching urls. The frontend server is stateless and can be started on multiple machines to handle increased load. In contrast, the Postgres database we chose is stateful and can soon become a bottleneck. Therefore we add a Redis cache to protect Postgres.
The simplicity of the architecture, illustrated in the picture above, should help us easily locate bottlenecks as they occur going forward.
The load balancer distributes incoming requests across multiple frontend instances in a cloud environment.
- Postgres database: persistent store for urls. There are a number of other options, but we will start with a familiar RDBMS to keep things simple for the time being.
- Redis cache: a cache pays off here because of the high ratio of reads (get original url) to writes (create tiny url)
- API server/’frontend’: the API server will orchestrate the Postgres and Redis services to implement the REST endpoints. We choose Django, a production-grade web framework, to host the REST endpoints. There are a number of other options such as Node with Express/Hapi or Java Spring Boot. We like Python/Django because it leads to scalable yet easy to understand code, which is our intent here. Besides, Django comes with a lot of essential elements such as user management and templating built in. From here, you can easily transition to a production system.
We will containerize our services and, during local development, deploy the app in Docker Compose, which allows us to start and stop all the services with a single command. This also lets us develop and iterate through code more quickly.
Finally, we will deploy our application in the cloud on Kubernetes without a code change.
Now that we have a general plan in place, here’s a short introduction to Docker, Docker Compose and Kubernetes.
Docker is similar to virtual machine technologies such as VMware and VirtualBox, except that it is far more efficient and lightweight because Docker containers run directly on the underlying host OS kernel. VMware and VirtualBox, on the other hand, add a guest operating system on top of the host operating system (for more on the differences see here).
Docker image and container
A Docker image is a declaration of an operating system image with layers of software on top that you want to have for a specific purpose, e.g. a Node/Express web server. What you get when you run a Docker image is a Docker container.
A Dockerfile is a text file containing instructions that describe how to build a Docker image that can be run as a container (e.g. an Ubuntu image with Python 3, Django and your application code). We will create a Dockerfile for each of our services.
Docker provides tools to build and run applications, declared via a Dockerfile, as container instances.
Docker Compose helps deploy and run multi-container applications that are declaratively configured in a YAML text file. The YAML in turn references the individual Dockerfiles required for the application.
Now let’s get to the code, service by service. It would be a good idea to first install all the required software and run the app after cloning it from git. This will help follow along as we review various parts of the application.
- Install docker: Mac or Windows
- Install Docker Compose (typically Docker Compose comes with the Docker installation in the previous step; perform this step only if you cannot run the command ‘docker-compose --version’ in your terminal)
- Install git
- In a suitable folder, say <tinyroot>, clone the tiny url git repo https://github.com/irnlogic/tiny and run the app.
<tinyroot> git clone https://github.com/irnlogic/tiny.git
<tinyroot> cd tiny/dockercompose/
<tinyroot> docker-compose up
Visit http://localhost:3000; the tinyurl app should be running. Hit Ctrl-C on the command line to stop the app.
The ‘docker-compose up’ command builds the required Docker images and starts the container instances as configured in docker-compose.yaml. You can visit http://localhost:3000 in the browser to interact with the end points.
The console log should display a few performance numbers as well, which give you some insights into ranges of response times involved with caches like Redis and RDBMS like Postgres.
Code and implementation
See the folder structure below. Each service gets its own sub-directory under the dockercompose folder. The kubernetes folder contains the descriptors needed to deploy our application in Kubernetes.
<tinyroot> - dockercompose # Docker-compose and Dockerfiles
-- db # Dockerfile Postgres
-- redis # Dockerfile Redis
-- django # Dockerfile and source for Django
- kubernetes # Deployment, service descriptors
We simply use a Dockerfile based on the Postgres image on Docker Hub. You may search for available images on Docker Hub at https://hub.docker.com to locate other images and versions.
Here the FROM command sets up a base image, which in this case comes with Postgres installed. We’re not adding more layers on top of the base image, so we might as well have used the base image directly!
This Dockerfile uses the ‘postgres’ image, version ‘11.1-alpine’. A container instantiated from this image will have a running Postgres listening on port 5432. Leaving out the version pulls the latest version of the image. We specify an explicit version to avoid potential incompatibilities when a new version of the image is published.
Congratulations, you have a basic Postgres server image ready!
We take the standard Redis image, and CMD starts the Redis server when the container starts up. Once again, we are not doing much with the Dockerfile yet.
Django web server
The Django web server implements the REST API end points and interacts with the Postgres and Redis services.
Let’s review the Django source code under tiny/dockercompose/django/tinyapp, which is organized into the folder structure illustrated below.
│ ├── __init__.py
│ ├── settings.py
│ ├── urls.py
│ └── wsgi.py
│ ├── lib/tiny.py
│ ├── migrations
│ ├── views.py
│ └── urls.py
Many folders and files above are part of Django “plumbing”, which you can understand by reviewing this tutorial. For now it is enough to concentrate on items shown in bold under folder ‘tinyurl’, which contains relevant code. This folder serves as a self-contained Django ‘application’ with routes, views, migrations and core application logic.
- migrations/models.py — declares “Url model”, which also translates to Postgres table structure for storing urls
- lib/tiny.py — module containing core logic for reading/writing urls using Postgres/Redis
- views.py — simple views for rendering url end points, uses lib/tiny.py
- urls.py — routes pointing to view above
When an API is called, urls.py triggers a specific ‘view’ in views.py, which calls the relevant functions in tiny.py; their output is merged with a template to render the response. You should be able to trace this easily, even without a deep understanding of the Django framework.
Let’s start with database schema for urls.
Our model is declared in models.py as a Python class:
from django.db import models

class Url(models.Model):
    shorturl = models.CharField(max_length=10, primary_key=True)
    originalurl = models.CharField(max_length=300)
A model Url is declared with two attributes, which will result in a simple relational table for storing urls consisting of two columns.
- shorturl: the short code of the url generated by our application, which is marked as the primary key. The primary key is backed by an index, which makes lookup of the original url fast.
- originalurl: the original url
The commands below:
- generate the migrations, which describe how to move from one version of the database schema to another or vice versa
- create the Postgres table based on those migrations. These are included
python tinyapp/manage.py makemigrations
python tinyapp/manage.py migrate
Here is the generated migration in our case:
OK, our high level algorithm is as follows:
Generating tiny url: for a given url, a hash-based short code is generated, and the resulting tuple of short code and original url is saved in the Url table.
Retrieving the original url: given a short url, the original url can be obtained by a simple query on the Url table with the short url in the WHERE clause; the result is then cached. Subsequent requests will be answered from the Redis cache.
Moving on to tiny.py…
import redis
g_redis = redis.Redis(host='redis', port=6379, db=0, decode_responses=True)
Note: See the use of the host name ‘redis’ for connecting to the Redis service. Each service container joins the default network set up by Docker Compose and is reachable by other containers at a hostname identical to the service name. Please refer to our docker-compose.yml file, in which the service name is declared as ‘redis’.
Next the following straightforward helper functions wrap around Redis’s set (key/value) and get(key):
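As a minimal sketch of such wrappers, consider the functions below. The names cache_set and cache_get, and the InMemoryCache stand-in used to exercise them without a running Redis, are hypothetical and not taken from the repo. The helpers only rely on the client having set(key, value) and get(key), which a redis.Redis instance provides.

```python
class InMemoryCache:
    """Hypothetical stand-in with the same get/set calls used on a Redis client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)


def cache_set(client, key, value):
    """Store value under key; report failure instead of raising."""
    try:
        return bool(client.set(key, value))
    except Exception:  # real code would catch the client's specific error type
        return False


def cache_get(client, key):
    """Return the cached value, or None on a miss or when the cache is unreachable."""
    try:
        return client.get(key)
    except Exception:
        return None


cache = InMemoryCache()
cache_set(cache, "abc123", "https://example.com/some/long/path")
print(cache_get(cache, "abc123"))
print(cache_get(cache, "nothere"))
```

Treating a cache failure as a miss is a design choice in this sketch: the cache is an optimization, so a request can still fall back to the database when the cache is down.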
Here is the core logic to generate and persist the short url: get_tinyurl, where the real work happens in _get_or_create_in_db:
- A 32-character md5 hash is generated using the hashlib module, and its last 6 characters are chosen as the url short code. This puts a cap on the number of urls we can generate: 6 hexadecimal characters allow at most 16^6 = 16,777,216 distinct codes. If we took the entire hash, our url would not be “tiny” anymore. The while loop below checks whether the chosen hash segment is already assigned to a different url; if so, we slide left over the md5 hash and pick a new 6-character window as a candidate tinyurl code. We arbitrarily make a maximum of 10 attempts to resolve a hash collision, although we should rarely reach that limit. One good thing is that the hash is deterministic: identical urls always produce the same hash.
- The resulting short url and the original url are then saved to the Postgres database using the Django ORM interfaces.
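The window-sliding scheme described above can be sketched as follows. This is an illustration under assumptions, not the repo's code: a plain dict stands in for the Url table, the function names are hypothetical, and the window is assumed to slide one character to the left per attempt.

```python
import hashlib

CODE_LENGTH = 6      # characters kept from the md5 hex digest
MAX_ATTEMPTS = 10    # collision-resolution attempts, as in the text


def candidate_codes(original_url, length=CODE_LENGTH, max_attempts=MAX_ATTEMPTS):
    """Yield short-code candidates: the last `length` hex chars, then windows slid left."""
    digest = hashlib.md5(original_url.encode("utf-8")).hexdigest()  # 32 hex chars
    for shift in range(max_attempts):
        end = len(digest) - shift
        yield digest[end - length:end]


def get_or_create(original_url, table):
    """Return the short code for original_url, storing it in `table` if new.

    `table` maps short code -> original url (a stand-in for the Url table).
    """
    for code in candidate_codes(original_url):
        existing = table.get(code)
        if existing is None:
            table[code] = original_url   # free slot: claim it
            return code
        if existing == original_url:
            return code                  # same url seen before: reuse its code
        # otherwise the code belongs to a different url: try the next window
    raise RuntimeError("could not resolve hash collision")


table = {}
first = get_or_create("https://example.com/a/very/long/path", table)
again = get_or_create("https://example.com/a/very/long/path", table)
print(first, first == again)
print("distinct 6-char hex codes:", 16 ** CODE_LENGTH)
```

In a real database the check and the insert would need to happen atomically (for example by relying on the primary key constraint), which this in-memory sketch does not model.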
The final piece in the puzzle is get_originalurl, which retrieves the original url given the tiny url. First an attempt is made to fetch the original url from the Redis cache. If it has not been cached then we fetch the original Url from Postgres, cache it and return the original Url.
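This read path is the classic cache-aside pattern. A minimal sketch, using plain dicts as stand-ins for Redis and the Url table (the function name lookup_original and the returned source label are hypothetical, added only to make the behavior visible):

```python
def lookup_original(short_code, cache, table):
    """Return (original_url, source) for a short code using cache-aside.

    `cache` and `table` are dict stand-ins for Redis and the Url table.
    """
    original = cache.get(short_code)
    if original is not None:
        return original, "cache"          # cache hit: no database access
    original = table.get(short_code)
    if original is None:
        return None, "missing"            # unknown short code
    cache[short_code] = original          # populate the cache for later reads
    return original, "db"


table = {"abc123": "https://example.com/some/long/path"}
cache = {}
print(lookup_original("abc123", cache, table))   # first read goes to the table
print(lookup_original("abc123", cache, table))   # second read is served from the cache
```

Because a short code never changes the url it points to in this design, cached entries do not go stale, which keeps the pattern simple.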
The Django Dockerfile generates an image containing the Python/Django code discussed above.
FROM python:3
RUN mkdir /code
WORKDIR /code
ADD requirements.txt /code/
RUN pip install -r requirements.txt
ADD src/ /code/
ADD start_django.sh /code/
CMD ./start_django.sh
- The first line, ‘FROM python:3’, takes the standard Python image from Docker Hub, over which we install software and our code.
- A RUN command runs a command at build time, adding a layer on top of the base image. ‘RUN mkdir /code’ creates the directory ‘/code’.
- ‘WORKDIR /code’ sets ‘/code’ as the working directory for the subsequent Dockerfile commands.
- ‘ADD requirements.txt /code/’ copies requirements.txt from the directory containing the Dockerfile to the ‘/code/’ directory. requirements.txt lists the Python modules needed by our application, e.g. psycopg2 (Postgres client) and redis (Redis client).
- Subsequent ADD commands copy the ‘src’ folder and ‘start_django.sh’ to the ‘/code’ folder.
- Finally, ‘CMD ./start_django.sh’ executes the commands in ‘start_django.sh’ at container run time, i.e. each time the container starts up (in contrast, RUN commands are run once, when the image is built!). The shell script ‘start_django.sh’ allows us to run multiple commands, such as creating and running migrations and starting up the Django web server.
It’s quite a hassle to start a number of docker containers and set them up to talk to each other. This is where Docker Compose comes in — it allows you to define all your services in a single configuration and start all of them using a single command.
In our case tiny/dockercompose/docker-compose.yaml defines all of the services that make up our Tinyurl application:
At the top of the file, ‘version: 3’ declares the version of the Docker Compose file format we are using.
Under the services in the yaml file, you can notice four services: redis, postgres, adminer and frontend. You can ignore ‘adminer’ for now.
The name of the service is ‘redis’, and the build specification ‘build: ./redis’ tells Docker Compose which image to build when this service is started, in this case from the Dockerfile under the directory ./redis. Alternatively, an image on Docker Hub could have been referenced using an ‘image’ key, which we chose not to do here.
The ‘ports’ entry “6379:6379” maps a port on the container (the number to the right of the colon) to a port on the host computer (localhost here, the number to the left); both are identical in this case.
The service name in Docker Compose acts as an end point for accessing Redis service. For example, another service running in Docker Compose network can refer to Redis using host name ‘redis’ and 6379 as port.
Likewise the Postgres service is linked to Dockerfile under ./db directory. The ‘environment’ section is used to declare the Postgres username and password, which then becomes accessible within the resulting Postgres container. The Postgres image in question recognizes these environment variables to configure itself.
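On the Django side, the same values can be picked up from the environment. The sketch below is one possible way to do this and is not taken from the repo's settings.py: POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB are variables the official Postgres image recognizes, while POSTGRES_HOST, POSTGRES_PORT and the helper name build_databases are hypothetical names introduced here. The default host ‘postgres’ assumes the Compose service name listed above.

```python
import os


def build_databases(env):
    """Build a Django DATABASES dict from an environment-like mapping."""
    user = env.get("POSTGRES_USER", "postgres")
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            # The Postgres image names the default database after the user
            # when POSTGRES_DB is not set.
            "NAME": env.get("POSTGRES_DB", user),
            "USER": user,
            "PASSWORD": env.get("POSTGRES_PASSWORD", ""),
            # Hypothetical variables; the default host is the Compose service name.
            "HOST": env.get("POSTGRES_HOST", "postgres"),
            "PORT": env.get("POSTGRES_PORT", "5432"),
        }
    }


# In a settings module this would be the value Django reads.
DATABASES = build_databases(os.environ)
print(DATABASES["default"]["HOST"])
```

Passing the same variables to both the Postgres and the frontend service in the Compose file keeps the credentials in one place.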
This is the REST server in Django and is set up similarly. depends_on declares a dependency and causes the Redis and Postgres containers to be started before the Django service.
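Note that depends_on only controls start order: it does not wait until Postgres or Redis is actually ready to accept connections. A common workaround is to poll the dependency's port before starting the web server. The helper below is a hypothetical sketch of that idea (wait_for_port is not part of the repo), using only the standard library.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    `interval` is used both as the per-attempt connect timeout and as the
    pause between attempts.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Inside the Compose network the hosts would be the service names,
    # e.g. wait_for_port("postgres", 5432) and wait_for_port("redis", 6379).
    print(wait_for_port("localhost", 5432, timeout=2.0))
```

An open port is only an approximation of readiness; a stricter check would run a trivial query against the database.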
For a more formal documentation of Docker Compose yaml file, please see here.
Deployment in Kubernetes
For now you may follow the step by step instructions here to deploy our Tinyurl app in Google Cloud Kubernetes. Kubernetes deserves a more detailed treatment than can be accommodated in this blog. Nonetheless, you should be able to correlate it with your use of Docker Compose so far. The service definitions in Kubernetes are similar to those found in Docker Compose, except that Kubernetes also offers ‘Deployments’, which provide fine control over scaling aspects such as the number of computing units, memory, CPU etc. A Kubernetes ‘Service’ essentially offers stable end points for other services to communicate with Deployments.
In short, Kubernetes will do in a cloud environment what Docker Compose did for us on our development machines: manage containers.
Conclusion and next steps
In the next installment of this topic, we hope to cover Kubernetes in some detail.
Meanwhile, a few limitations in current version of the app are worth mentioning:
- No volumes are mounted for Postgres, so urls can be lost after a restart
- HTTP GET is used for implementing the REST end points, which makes testing in a browser easy; however, functionality can break with some urls. This can be easily remedied by using HTTP POST instead
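To see how a GET-based end point can break, consider a long url that itself contains ‘?’, ‘&’ or ‘#’. The sketch below uses a hypothetical end point (http://localhost:3000/shorten?url=...) and the standard library's query parsing as a stand-in for what a server would do; the actual routes in the repo may differ.

```python
from urllib.parse import parse_qs, quote, urlsplit

original = "https://example.com/search?q=a&page=2#top"
base = "http://localhost:3000/shorten?url="   # hypothetical end point


def received_url(request_url):
    """Return the value of the 'url' query parameter as a server would parse it."""
    query = urlsplit(request_url).query
    return parse_qs(query)["url"][0]


# Pasting the long url in as-is: '&' starts a new parameter and '#' starts a fragment.
naive = received_url(base + original)

# Percent-encoding the long url first keeps it intact.
encoded = received_url(base + quote(original, safe=""))

print(naive)
print(encoded)
```

Percent-encoding is a client-side workaround; sending the url in a POST body, as suggested above, avoids putting it in the query string at all.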
Also, performance testing is needed to see how our application scales and explore ways to scale up further.
At this point:
- You have used Docker and Docker Compose to build and run a distributed Tinyurl application on your laptop
- You have deployed it on Kubernetes in a public cloud for potential access over the internet by a larger audience