The Data Warehouse Definition

The other day I was listening to a software engineering podcast.

Don’t remember the exact title, but at some point the host asks the guest: “can you explain the difference between an operational database and a data warehouse?”. And the guest goes: “well, the operational database is the regular database on which your application operates, a data warehouse is a database where data, instead of being arranged in subsequent rows, it’s arranged in columns…

That’s not right is it? So, what is the difference between an operational database and a data warehouse?

The as the guest of the podcast rightfully pointed out, an operational database is the primary database on which your application operates. It generally represents a normalized version of your data model.

Let’s take this oversimplified example. You have a Customer table and an Inventory table and an Orders table and everything is neatly laid out and they’re all related somehow.

Organizing data this way is very efficient to implement business logic on top of: you add order to the Orders table, add new inventory to the Inventory table, and so on.

But what about reporting, let’s say you want a report that for each order in the last month display the inventory product name and unit price, the customer’s zip code, the order amount and date. This will require an expensive join query. Then imagine you have an army of executives and marketing people looking at these reports all the time to figure out how to improve the business. This sort of query can be quite slow and in addition, because it runs on the same database, it may significantly slow down your production system. So what to do?

It would be good if we had a second database, with data derived from the operational database, but organized so that the analytic queries ran by executives and marketing people return a lot faster. Typically for your report discussed above, you will set up a table with all the required columns so you can serve up your report in one query with no joins. This is generally called a denormalized data model.

You could feed this new database in various ways, either in real time as your business goes on, or in batches at the end of day or whenever the load on your production system is low so that user experience on the production system is not impacted.

Well, this second database is your data warehouse. A possible definition for it could be: A data warehouse is a secondary database holding one or multiple denormalized copies of your data, optimized to speed up functionalities such as reporting, charting, statistical calculations, data mining or machine learning without affecting the performance of the production system.

The form of storage should depend on usage.

  • For tabular reports, a relational database or something like Cassandra holding a denormalized version of your data model would be appropriate
  • For statistical calculations over a metric or for charting a time series variable, a columnar database would be more appropriate