Automatic database sharding with Alibaba Cloud Table Store

Published in

Xebia

10 min readMay 6, 2019

At some point in your application’s lifecycle, there might come a time when you need to start scaling your data storage. If you are storing media files or other blobs that have no relations between them, you can easily add storage capacity to solve the problem. For (semi-)structured data in a database however, scaling is a whole different story. Simply adding database instances is not enough. You will need to reconsider the usage patterns and decide what solution solves the problem you have. If your database is hitting resource limits because it is accessed very frequently, adding an asynchronous read replica might be the way to go. If the size of the data is the issue and lookups become very slow, you might consider sharding your database.

In this post, we will look at how Table Store can help us shard our data automatically. Table Store is a NoSQL database service in Alibaba Cloud. It is available in all regions of Alibaba Cloud. Because of its architecture, Table Store can store an unlimited amount of data and still keep read/write throughput at a very high level. It also offers secondary indices, search options and data streaming capabilities. For now, we will focus on basic operations to learn how data is stored.

Sharding in a nutshell

When your data size is becoming too large to efficiently query, you will often end up sharding your data. A shard is a partition of a full table that contains only a subset of the records. If you have two shards for a single table, each shard will contain half of the rows. Every record gets assigned to a shard when it is created. The target shard is determined by a hash of the primary key of the record. When the record needs to be retrieved from the table, the same hash is calculated. You then only have to traverse the target shard where the record lives and can skip all other shards.

Database sharding analogy: every drawer contains a subset of the data

A real-life analogy would be a closet of small drawers with patient information in it. When you need to store a patient’s information, you determine which drawer (or shard) the information card goes. Your sharding strategy is likely based on the first letter of the last name. Then you look up the drawer that contains cards with that letter and put it in. When someone requests a patient’s information, you look up the drawer again for his or her last name. Then you flick all cards and find the correct one.

Choosing the right sharding strategy

As you might see now, database sharding is very similar. The only real difference is the sharding strategy. In the real world, you would shard the data by the first letter of the last name. In a database however, you calculate a hash from one of the fields in the data and let the computer decide which shard it should fall in. The field that you use to shard the data is often called the partition key (or sometimes hash key). It is part of the primary key of a record.

But what do you do if your colleague John has the bad habit of putting the cards into a drawer randomly? You might have to flick all of the cards in a certain drawer to find the correct one. It might be smart to ask your colleague to sort the cards in the drawer. That will help you look up the right one even faster. A good choice would be to sort the cards by last name. You can split the stack of cards in half, see if the card you picked is lower or higher in alphabetical order than the card you need, split the remaining stack again and keep going until you find the right one. This is a binary search algorithm and makes it really efficient to look up a single value. This method only works because the cards are sorted, so you will have to make sure John adheres to the strategy.

In a digital database, the values in a single shard are also stored in a certain order. As long as the records are sorted, the computer can perform binary search. You have to decide on the sort order yourself, because the database can not decide what usage patterns you have. To make a decision you need to consider the usage patterns of your application. For key-value access patterns, where you always want to read a single value with a known primary key, you will probably sort by the partition key that was also used for sharding. If you need more complex queries, like range queries or filtering, you should add the fields that support those queries to the primary key. For example, if you need to query records by a date range, it would be most efficient to have the records sorted by that date. A final note is that every record should be uniquely identifiable in the end, which means that the combination of fields for the primary key have to be unique.

Looking at Table Store

Let’s see how Table Store helps us solve the intricacies of database sharding described above. I assume you have created an account and are logged in to the Alibaba Cloud Management Console. Open the ‘Products’ drawer and selected ‘Table Store’. Switch to the region closest to you before proceeding.

Before we can start creating sharded tables, we need to create a Table Store instance. An instance is a container around one or more tables. It is similar to a database in traditional database systems. Click the blue ‘Create Instance’ button in the top-right corner to create an instance. Fill in a name, choose a type (Capacity or High-performance) and fill in a description. The difference between the two types is not important here, but if you want you can read more about it here. Click ‘OK’ to create the instance.

When the instance is created, click on its name to go to the detail page. You will see some access URLs and a list of tables. To start creating your first table, click ‘Create Table’ in the top-right corner.

First you should decide on a name for the table. If you want to store user profiles, Profiles might be a good pick. Leave the advanced settings disabled for now and start building up the primary key. As you can see, the first field in the primary key will become the partition key. You might recall that this is the data point that is hashed to determine which partition the record is put in. If you want to store user profiles, the userId might be a good choice. That will allow us to access the user profiles as a key-value store. Fill in userId in the field and selected String as the data type. Using a UUID for user profiles is often a good idea, so that will be a string. Click ‘OK’ to create the table.

That is it. We’ve created a sharded table in Table Store that can contain an extreme amount of data. We don’t have to worry about scaling compute resources, database instances or read replicas. All of that is handled behind the scenes so we can focus on building our application.

To insert some data, click on the table name after you’ve created the table. Then click ‘Data Editor’ in the left pane. A newly created table is empty of course, so let’s input some data. Click the blue ‘Insert’ button to insert a record. We are asked to fill in the primary key fields, which for us is only userId , so generate a UUID here. Copy the UUID and put it in the userId column. Because Table Store is a NoSQL database, you can store any other fields you want in a record without specifying a schema. Records in a NoSQL database are sometimes also referred to as documents.

For our user profiles, we want to store the user’s first name, last name and age. To add the fields, click the ‘Add Column’ button. Fill in firstName, choose the String type and fill in your first name. Do the same for the lastName and age fields. Make sure the agefield is of type Integer. If everything looks like the image below, click ‘Insert’.

Inserting a record through the management console

You’ve just created your first record in a Table Store table! After the record is inserted, you will go back to the data editor to see what is in the table. Try to insert another record with the same fields. Or remove one of the records. You can also try adding a record with different fields to see what happens. Then delete that record again. The data editor will always reflect the records as a table, with rows and columns, but it is important to remember that you are working with a NoSQL database. That means that there is no overarching schema for all the records in a table, only the primary key is defined up front.

Managing data programmatically

In your application, it is clearly not feasible to manage all the data by hand with the data editor. Luckily, Table Store has SDKs for several languages so you can access your data in code. Among the supported languages is Python. I will show you how you can create and update records in Python easily using the SDK. I have set up a repository with a simple Flask application that should make it easy to get started.

To get started in your Python project, you need to install the SDK:

pip install tablestore

Then, you can instantiate the OTSClient that allows you to interface with all the Table Store APIs:

You have to import the `OTSClient` from the tablestore package. Then you can instantiate it with an endpoint, access key ID, access key secret and the instance name. The endpoint and instance name can be found in the Table Store dashboard of the instance you created. Make sure to use the internet endpoint if you want to talk to Table Store from your local machine. An access key can be created on the Security Management dashboard. It is used to programmatically access the resources in your Alibaba Cloud account.

In the example above, the Table Store information and credentials are loaded from the environment. You can replace it with the actual values or make sure to set them before running the application.

Now let’s create an endpoint for fetching a profile:

In the /profile/<user_id>route, we define the user_id as a primary key field and call client.get_row() to fetch a row from the Profile table. As we would like to return JSON from our Flask endpoint, we need to interpret the resulting row and transform it in to a Python dictionary. To do this, we add all primary keys and attribute columns (like firstName , lastName and age ) into the dictionary. That dictionary is then transformed into JSON and returned from the API.

The process of storing a record in the table is very similar:

Here we create a POST endpoint with the same URL as the endpoint used to fetch records. A Row object is then created from the primary key and the attribute columns. The primary key is just the user_id , while a list comprehension is used to get all other columns as a list of tuples. That is then set as the attribute columns for the Row . Finally, we call client.put_row() to store the record. Note that this is an idempotent store operation. If a record already exists with the same ID, the current contents are overwritten.

Table Store has many more options like versioning of records and fields, secondary indices that help you look up data with different access patterns and a full-fledged search index to search all the data. If you want to learn more about those, please refer to the documentation.

Wrapping it up

We have seen how the basics of database sharding work, how Table Store enables us to easily shard data and how you can use the Table Store SDKs to access your data programmatically. The code repository contains a working example of the Flask app that you can boot locally as follows.

Clone the repository
Create a .env file in the cloned directory with the following contents:

OTS_ENDPOINT=endpoint
OTS_INSTANCE=instance-name
ACCESS_KEY_ID=access-key
ACCESS_KEY_SECRET=access-key-secret

Replace the values with the Table Store endpoint, instance name and access key information relevant to you.

Use Pipenv to create a virtual environment and install the dependencies:

pipenv install

Run the application:

FLASK_DEBUG=1 FLASK_APP=src/app.py pipenv run flask run

This uses the Pipenv virtual environment to start up the Flask app. It will expose the /profile/<user_id> GET and POST endpoints to fetch and store data. Make sure your Table Store instance contains a Profile table with the corresponding primary key (a single userId field that is a String ).

Lastly, if you are familiar with Terraform, you can use the Terraform template in the repository to provision a Table Store instance and table. It also specifies the key structure of the table so you should be good to go after the template is deployed.

Happy sharding!