Basics of Building a Data Warehouse: Part 1

Seth Goldberg
Published in Charting Ahead · Jul 15, 2017

“Hey Number One Employee, I want you to help transform us into a data-driven company.” You may be an architect or lead engineer who has received this request from a C-level executive. It raises the question: what does this transformation require?

Your first response may be to turn to your old friend Google for guidance. However, you will quickly notice that there is a shortage of information about how to actually build analytics infrastructure. The first piece of this puzzle is building a data warehouse.

What is a Data Warehouse?

A data warehouse is a central repository of an organization’s important data, optimized for the large-scale aggregations that reporting and analytics require. It is used by a relatively small number of analysts trained in SQL (the language of relational databases) to answer broad questions. It is NOT a transactional database for applications that frequently pull small amounts of data.

Should I build a Data Warehouse?

Deciding whether or not your company would benefit from a data warehouse is best done by examining the questions people have about your company’s data. Does the sales team want to know how sales have been doing over the years? Does an insurance team want to know what types of claims they’ve been getting? If these are the sorts of questions being posed, then the answer is yes, you do need a data warehouse!

If the questions center on specific customer data, such as “What did customer 1234 buy?”, then the answer quickly becomes a resounding no.
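To make the distinction concrete, here is a minimal sketch of the two query styles in Postgres/Redshift-flavored SQL, assuming a hypothetical sales table with customer_id, sale_date, and amount columns:

```sql
-- (Hypothetical schema, for illustration only.)
-- Warehouse-style question: "How have sales been doing over the years?"
-- Scans and aggregates the whole table, exactly what a warehouse is built for.
SELECT EXTRACT(YEAR FROM sale_date) AS sale_year,
       SUM(amount)                  AS total_sales
FROM sales
GROUP BY 1
ORDER BY 1;

-- Transactional question: "What did customer 1234 buy?"
-- Touches a handful of rows; a job for your application database, not the warehouse.
SELECT sale_date, amount
FROM sales
WHERE customer_id = 1234;
```

If most of your questions look like the first query, a warehouse will pay for itself; if they look like the second, it won’t.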

Database Selection

Selecting a database is (arguably) the most important aspect of building a data warehouse. It is also the most daunting part of the journey, because of all the questions you must answer to adequately assess your company’s unique needs. Are you in the cloud? Do you have your own data center? Do you need to perform complex processing requiring a procedural language? What kind of scale do you need?

“Shared Everything” vs “Shared Nothing” Databases

The first decision point when selecting a database is whether you need a “shared everything” or a “shared nothing” database. In short, a “shared everything” database runs on a single server (or a cluster sharing common storage), while a “shared nothing” database splits the data across many nodes, each with its own processors, memory, and disks. More detailed definitions are available via this link: https://www.quora.com/What-are-the-differences-between-shared-nothing-shared-memory-and-shared-storage-architectures-in-the-context-of-scalable-computing-analytics .

Cases where “shared everything” is a better fit:

  1. Require SOME transactional queries;
  2. Have staff with in-house experience in databases like SQL Server/Oracle/Postgres;
  3. Include a Database Administrator for tuning/configuration;
  4. Possess existing licenses for Oracle or SQL Server; and
  5. Are constrained by a limited budget.

Cases where “shared nothing” is a better fit:

  1. Involve large amounts of data;
  2. Do not have a lot of Database Administrator resources to assist with tuning/configuration;
  3. Have strict service level agreements requiring rapid data processing; and
  4. Require the ability to quickly scale out.

[Figure: shared-disk (“shared everything”) architecture. Source: https://upload.wikimedia.org/wikipedia/commons/thumb/0/07/Shared_Disk_Architecture.jpg/500px-Shared_Disk_Architecture.jpg]

“Shared Everything” Databases

As you can see from the list above, “shared everything” databases require a great deal of tuning, tend to be cheaper, and can serve some transactional processing duties. Although a “shared everything” database can perform transactional processing, it is imperative to limit that activity to avoid overwhelming the database. “Shared everything” databases are a viable and cost-effective approach for small (gigabytes) to medium (tens of terabytes) warehouses.
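One way to squeeze more out of this kind of warehouse is to precompute expensive aggregations so analysts are not repeatedly scanning raw tables. A minimal sketch in PostgreSQL, reusing the same hypothetical sales table as above (Oracle has materialized views as well; SQL Server offers indexed views as a rough equivalent):

```sql
-- Precompute a yearly rollup once, rather than re-aggregating raw data on every query.
CREATE MATERIALIZED VIEW yearly_sales AS
SELECT EXTRACT(YEAR FROM sale_date) AS sale_year,
       SUM(amount)                  AS total_sales
FROM sales
GROUP BY 1;

-- Re-run after each nightly load to keep the rollup current.
REFRESH MATERIALIZED VIEW yearly_sales;
```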

One of the biggest limits of a “shared everything” database is scaling. Once you have run out of options to tune your queries, you will have to buy a bigger server (i.e., more processors, more RAM, faster storage) to support the higher volume. Unsurprisingly, this gets dramatically more expensive as you move from medium to large servers, such as Oracle Exadata appliances. Some “shared everything” loyalists may tell you that you can buy more servers and cluster them, but this offers only limited scaling.

“Shared Nothing” Databases

[Figure: shared-nothing architecture. Source: https://upload.wikimedia.org/wikipedia/commons/thumb/1/1c/Shared_Nothing_Architecture.jpg/500px-Shared_Nothing_Architecture.jpg]

“Shared nothing” databases offer a better solution for data warehousing, but at an increased cost.

Another downside of “shared nothing” databases is the lack of features that OLTP databases provide. Indexes, physically enforced primary and foreign key constraints, and explicit partitioning can be missing from some vendors’ offerings.

In addition, you will notice some latency on every query, regardless of how much data is being processed. This is due to the fixed overhead of coordinating each query across the cluster.

An advantage of “shared nothing” warehouses is that each node processes its chunk of data independently, resulting in easier and faster scaling.
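For example, instead of creating indexes, a Redshift-style “shared nothing” engine has you declare how rows are spread across nodes and ordered on disk. A sketch of that DDL, with the table and key choices being hypothetical:

```sql
-- Redshift-style DDL: no indexes; you declare distribution and sort order instead.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)  -- rows with the same customer_id land on the same node
SORTKEY (sale_date);   -- data stays ordered on disk, so date-range scans skip blocks
```

Because each node only touches its own slice of the table, adding nodes adds capacity roughly linearly.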

Normally, more established vendors offer more analytics-focused features that you can take advantage of, such as Teradata’s Hadoop and R integrations.

“Shared nothing” warehouses are a good fit for medium to large warehouses. Overall, this architecture provides the best analytic solution and will scale with your data.

The Change from the Cloud

As many of you know, the cloud has fundamentally changed the way software functions, and analytics is no exception. It has driven down the cost of data warehousing and lowered the barrier to entry. You can now get a database for cents per hour that does not require a team of database administrators to manage!

One of the greatest examples of this is Amazon Web Services’ Redshift. With Redshift, you can spin up a cluster with a few clicks for very little money. Before Redshift, you would have had to buy millions of dollars’ worth of appliances and manage them yourself. If you want to host in the cloud (which I highly recommend you consider), check out Redshift.

One caveat to this recommendation is that Redshift is highly optimized for warehousing. As a result, it will stutter or crash if you attempt to run transactional workloads on it. In addition, it does not enforce primary and foreign key constraints. This means that if you load duplicate data, you are pretty much screwed unless you build some checks yourself.
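A minimal sketch of such a check, again assuming the hypothetical sales table where sale_id should be unique:

```sql
-- Redshift accepts (but does not enforce) key constraints,
-- so verify uniqueness yourself after each load.
SELECT sale_id, COUNT(*) AS copies
FROM sales
GROUP BY sale_id
HAVING COUNT(*) > 1;  -- any rows returned are duplicates to investigate
```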

Non-Relational Databases

What are your thoughts about Hadoop, NoSQL, or some other non-relational database? I get this question A LOT. As a result, I want to make one thing clear: DO NOT USE ANY NON-RELATIONAL DATABASE FOR YOUR DATA WAREHOUSE.

Non-relational databases are far from mature and should only be used for specific use cases. Hadoop has many SQL solutions; however, these are bolted on after the fact, meaning they do not behave like a normal database and have limitations that make warehousing challenging. Hadoop (for now) should be relegated to data processing and data archival purposes. Although I don’t recommend Hadoop at this time, I believe it will become a viable solution as it matures and evolves.

Thank you for reading, and stay tuned for Part 2!
