Data Explosion: Scaling a Data Warehouse

Seth Goldberg
Corsair's Business
4 min readJul 21, 2017

--

So you’re lucky enough to work at a growing company with good strategists, and now you’re company is seeing serious growth. What could possibly go wrong??

The answer is: A LOT, provided an organization does not appropriately manage growth. Doing this with technology infrastructure is usually pretty easy; simply buy a few more servers or beef up the existing ones. However, when it comes to databases, managing growth isn’t as straightforward. Adding more horsepower isn’t enough, if you don’t scale your data warehouse.

Two Methods of Scaling

There are essentially two ways to scale a database (and technology in general): 1) make your existing code as efficient as possible, and 2) throw more hardware at it. Making your code as efficient as possible is the usually the most effective way to scale, while adding additional horsepower should only be used as a last resort.

This article explores scaling relational data warehouses in the most broad sense. Please keep in mind that not every database will support the features listed below. If you are unsure what your options are, contact your data warehouse vendor.

Query Optimization Strategies

One of the first things I check is the quality of the queries. Bad queries will bog down a database and resist nearly all forms of optimization, unless fixed. When assessing your queries, here are some things to look for:

  1. Is the query filtering on a column that has a function applied to it? If so, rewrite it so that the source column does not have a function applied to it. This forces the database to apply the function to every single row on the table first, allowing it to perform two operations, rather than one.
  2. Is every table joined as much as possible? Joining as many tables as possible will give the query optimizer more options to select the best plan possible
  3. Are all the tables filtered as much as possible? Filtering data means less data that needs to be processed.
  4. Are tables joined from smallest to largest? Following this order, will make the most efficient use of resources.

Table DDL Strategies

The manner that you construct your tables plays a critical role in making the most of your resources. In order to construct your tables in the most efficient way, I suggest you follow these strategies:

  1. Optimize your database for data warehousing. Making your block sizes larger and favoring full table scans over nested loops will increase your database performance.
  2. Partition larger tables? Partitioning can help performance in two ways: 1) allowing the database to skip past unneeded data more efficiently, and 2)making joins perform better when done in parallel.
  3. Index wide tables on the frequently used columns, which will allow the database to skip scanning the entire table and scan the smaller index.
  4. Store data in a columnar format. Similarly to indexes, columnar stores allow the database to skip unneeded columns. However, they do this in a more efficient manner. Be warned that if you need to perform any transactional processing, this will severely degrade performance.
  5. Partition data across multiple disks, which will allow you to take advantage of the read performance of multiple drives, rather than just one.
  6. Ensure you follow the best data modeling strategy. Dimensional models are much better suited to analytical queries.
  7. Compress the data in the table. Table compression shrinks the amount of data, and speeds up query execution time. Although some people worry that decompression will bog down the CPU, the gains in I/O performance greatly overshadow the additional CPU overhead.

Examine Database Hardware Options

If you have gotten to the step of having to visit your database hardware, you have tried everything above.

The first option is to scale vertically, which is the simplest, but most expensive option. It requires your company to buy a bigger server with more CPU/RAM/SSD, etc. It should be noted that once you get into the “big” box territory, your costs will quickly spiral out of control. The link below gives you an idea of how this can occur:

The second option is to scale horizontally. With shared-nothing databases, this adds more horsepower that typically scales linearly. When dealing with shared-everything databases, the situation becomes more complex. You have to ensure all of the nodes in your cluster can run the normal queries, since the individual node processes the entire query. In addition, you have to contend with nodes sharing the storage infrastructure, which can quickly become a bottleneck.

Ending Notes

Although this is not an exhaustive description of scaling a data warehouse, it is enough to give you a good idea of where to begin. In this day and age, it may seem easier to click a button and instantly scale up your database resources, but please try to avoid this impulse. Instead, I would advise that you make your data warehouse as efficient as possible, as this will lead to less maintenance.

--

--