Evolution of the Modern Data Warehouse

Paige Roberts
11 min read · Jul 24, 2020
[Image: A woman with an umbrella on a beach as a giant tsunami wave is about to engulf her, symbolic of big data overwhelming data warehouses]

There are a lot of definitions of the data warehouse. I grabbed a random definition off the web. It fits the general understanding in the data management industry of what a data warehouse is, and what it isn’t.

It’s also wrong.

“Data warehousing is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence.”

If you’re looking at that definition and thinking, “That looks right to me,” then read on. Once upon a time, I probably would have agreed with this definition as well. But times have changed.

The processes and technologies of data warehousing have changed a lot in the last ten years, but as industry professionals, we often still think about data warehousing the same way. In our minds, we’re still using the same definition of a data warehouse that we used a decade ago.

That definition.

So, in what ways is that definition wrong?

Rise of Big Data

In the last decade, the data management field has been radically transformed. Many people assume that none of that change touched the data warehouse. That is a misconception.

Doug Laney’s classic three V’s hit the data management industry with a tsunami of data, and with it, a sea change in the types of analytics we could do with that data. Old-school analytical databases couldn’t handle that much data affordably without valuable data falling through the cracks. The business demanded analysis of types of data we’d never even tried to deal with in the past: semi-structured data like JSON and Avro, log files from sensors and components, geospatial data, clickstream data, and on and on. Data came at us too fast for the old technology to take it in, scrub it, combine it with other data sets, and provide it to the business in a useful way.

But that was only part of the problem.

Having all this data meant we could do things we’d never done before. Machine learning, data science, and artificial intelligence were all fields of study back in the ’90s when I was in college, but we didn’t have the capacity to put them to much useful work back then.

The problem was also the promise.

Having tons of data in all these formats enabled new, exciting, and potentially industry-disrupting analyses. Early experiments showed machine learning could deliver impressive improvements to existing systems, whole new systems for making organizations more successful, and even whole new industries and business models.

The challenge was figuring out a way to store and process all that data to get it into a good form for all those cool new types of advanced data analytics.

[Image: A woman with an umbrella in a dark, rainy alley; her reflection in a puddle shows a bright, happy woman in sunshine]

Hadoop to the Rescue … or Not.

Along came a cute yellow elephant with a life boat promising to store all that data affordably, and process it for us, plus give us a great platform to do fancy big data analytics like machine learning. It seemed like exactly what we needed. This was the new hotness.

Throw away your old and busted data warehouse that you’ve been running essential business intelligence on for decades.

What could go wrong?

Obviously, a lot of things. Putting all their important data on a giant cluster of reasonably priced servers alongside a bunch of other data didn’t work out so well for a lot of companies. That was never the intention of the people who invented the data lake concept, but that was largely what it was used for: a dumping ground for data. It was a great place to archive years of data that wouldn’t fit in transactional systems and didn’t seem valuable enough to put in the data warehouse itself. Great. It was all stored. Then what?

Let’s not even talk about business intelligence. Hadoop vendors claimed data warehouse-like capabilities, but a query engine on top of a big pile of data does not a data warehouse make. It lacked concurrency, so only a few people could use it. It lacked security. It lacked governance. Above all, it lacked the reliability and performance that were the hallmarks of the data warehouse.

Data scientists were supposed to be the ones who could do amazing things on Hadoop. They sifted through giant heaps of data, spending almost all their time just trying to turn that data into something they could use, so they could finally do the cool predictive data analytics they were hired to do.

They frequently just grabbed a few samples of the kinds of data they needed, and went off to do their work in isolation on sandbox environments, or their own laptops. They used technologies like R that did exactly what they needed, quickly and easily, but were never meant to scale to work on multi-server data lakes.

So, even when a data scientist managed to create something that could be hugely useful to the business, it wasn’t even close to being ready to put into production.

The Rise of the Data Engineer

This part feels a bit personal to an old data integration hand like myself, but finally, finally, everyone realized that production data pipelines mattered. Even in the old days of data warehousing, ETL was considered unsexy plumbing for the data warehouse. The database and the visualization tools were the cool kids. Now, suddenly people realized that without that plumbing done right, nothing works.

So, essentially, someone else had to take this cool model the data scientists built in isolation on a single machine and figure out how to recreate everything they did, but at massive, distributed, production scale. They might redo a lot of the data pipelining in technologies like Kafka for streaming data and Spark for distributed ETL and machine learning. Sometimes, by the time that months-long rebuilding project was done, it didn’t work anything like the original model the data scientist created. Oh, well. Back to the drawing board.
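As a heavily simplified illustration, here’s roughly what one of those rebuilt steps might look like as a distributed Spark job in Python. The paths, column names, and aggregation are invented for the sketch; the point is that logic prototyped at laptop scale has to be re-expressed in a framework that runs across a cluster.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical re-implementation of a feature-prep step that a data scientist
# originally prototyped on a laptop sample. All paths and names are made up.
spark = SparkSession.builder.appName("rebuilt_feature_pipeline").getOrCreate()

# Read the raw JSON events from distributed storage.
events = spark.read.json("s3://example-bucket/raw/clickstream/")

# Re-create the prototype's aggregation, but as a cluster-wide job.
features = (
    events
    .filter(F.col("user_id").isNotNull())
    .groupBy("user_id")
    .agg(
        F.count("*").alias("click_count"),
        F.avg("session_seconds").alias("avg_session_seconds"),
    )
)

# Write the feature set somewhere downstream jobs (or the warehouse) can read it.
features.write.mode("overwrite").parquet("s3://example-bucket/features/clickstream/")
```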

It was no wonder that estimates of the failure rate for data science projects ranged from 60 to 90 percent. That was not good for business. All these cool advanced data analytics are just a cost sink until they make it into production, where they can make a difference to the organization.

But What about the Data Warehouse?

So, what does all this have to do with the data warehouse? Well, analytical databases, the heart of data warehouse architectures, didn’t just rest on their laurels while all this was going on.

Analytical database vendors saw the advantage that Hadoop made so clear: scaling out indefinitely on commodity servers. In many cases, they re-designed their products to work on clusters of commodity servers. Massively parallel processing (MPP) isn’t just a thing Hadoop does. Most analytical databases do it, too. In most cases, they were already multi-threaded to distribute workloads across multiple cores, so it wasn’t that big a leap to distribute workloads across multiple computers.

The cloud has been promising to be THE way to do things for a long time. Now, it’s finally taking off. One reason is massive, scalable, affordable storage that you don’t have to provision and manage yourself.

Nice. A lot of the advantages of Hadoop, but you don’t have to buy the hardware or manage the software. Unsurprisingly, that combination is appealing to many companies.

So, most analytical databases added the ability to work in the cloud. Some were adopted by a cloud vendor: ParAccel, for example, branched off and became Amazon Redshift. Google and Azure have their own brands of analytical database. Some newcomers, like Snowflake, built their analytical databases to work only in the cloud. Vertica, where I work, added all three major clouds as supported platforms so customers can decide whether they want to run on-premises, in one cloud, hybrid, or across multiple clouds.

Aren’t data warehouses expensive to scale?

No more than any other software. And yes, I include open source software in that. Open source software may be free, but only if you don’t actually use it in a business. My boss, Joy King, the VP of Product at Vertica, says, “Open source software is free, like a puppy.” It’s a good analogy. The initial cost is zero, but upkeep, training, care and feeding, and so on all carry costs. In the end, total cost of ownership (TCO) is comparable to other types of software. The advantages of open source software are not about money. Open source is free like free speech, not like free beer.

There’s a lot of variation in cost, of course, between proprietary analytical databases, but some are well within the price range of a good open source solution with enterprise support.

Cloud database costs are partly based on compute usage, so TCO will vary from month to month. Even cloud-only database companies admit that can be a bit of a shock to your CFO if they’re not expecting it.

On-premises, TCO is more predictable, but it includes the cost of owning and managing the hardware, regardless of what software you use. The main point remains: yes, you can scale a modern data warehouse to whatever size it needs to be at a reasonable price. So, that’s the first V taken care of. Volume is no longer an issue.

But data warehouses only handle structured data, right?

This is another thing that has changed quite a bit over time. I’m not as familiar with how other databases handle this, but Vertica has some built-in intelligence that auto-parses JSON and makes best-guess choices for data types, compression algorithms, and such. That lets a data warehouse handle two V’s at once, since a lot of the data streaming in fast from something like Kafka is JSON, and Vertica has a direct Kafka connector.
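Here’s a minimal sketch of that fast-JSON-from-Kafka pattern in Python, using the kafka-python client and the vertica-python driver. The broker, topic, credentials, table, and column names are all invented, the row-at-a-time INSERT is only for illustration (a real pipeline would batch, use COPY, or use the database’s native Kafka loader), and the parameter placeholder style may vary with your driver version.

```python
import json
from kafka import KafkaConsumer   # kafka-python package
import vertica_python             # Vertica's Python client

# Consume JSON events from a (made-up) Kafka topic.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers=["broker:9092"],
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Connect to the warehouse; connection details here are placeholders.
conn = vertica_python.connect(
    host="warehouse.example.com", port=5433,
    user="dbadmin", password="secret", database="analytics",
)
cur = conn.cursor()

for message in consumer:
    event = message.value   # already parsed into a dict by the deserializer
    # Row-at-a-time inserts keep the sketch short; don't do this at scale.
    cur.execute(
        "INSERT INTO clicks (user_id, url, event_ts) VALUES (%s, %s, %s)",
        (event.get("user_id"), event.get("url"), event.get("event_ts")),
    )
    conn.commit()
```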

Vertica can also query Parquet and ORC data directly. Complex types in Parquet, like maps, structs, and arrays, can be queried in place, with no exploding of the data required, so the beauty of those data types isn’t compromised. Of course, internal database storage is better optimized for query speed, so it gets better performance. Vertica lets you import and export Parquet data if you want to pull high-value data into the database and push lower-value data out while keeping it accessible. Data life cycle management is relatively easy.
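As a sketch of what querying in place looks like, here is the general external-table pattern, written as SQL strings you might run from Python through whatever client you already use. The DDL below only approximates Vertica’s CREATE EXTERNAL TABLE ... AS COPY ... PARQUET form, so check your database’s documentation, and the path and column names are made up.

```python
# Rough shape of querying Parquet in place via an external table.
# Syntax approximates Vertica's external-table DDL; names and paths are invented.
CREATE_EXTERNAL = """
CREATE EXTERNAL TABLE web_events_ext (
    user_id INT,
    page    VARCHAR(512),
    tags    ARRAY[VARCHAR(64)]   -- complex type stays intact, no exploding
) AS COPY FROM 's3://example-bucket/events/*.parquet' PARQUET
"""

# Query the nested column directly where the files live.
QUERY = """
SELECT user_id, tags[0] AS first_tag
FROM web_events_ext
LIMIT 10
"""

# With any DB-API style connection (conn), running it is just:
#   cur = conn.cursor()
#   cur.execute(CREATE_EXTERNAL)
#   cur.execute(QUERY)
```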

There are other related capabilities in Vertica and other distributed analytical databases, such as built-in geospatial and time series analytics, but this gives you the general idea. Complex and semi-structured data are no longer an issue. Neither is streaming data: Vertica lets you stream data in constantly while querying in parallel.

And Vertica has competitors. As much as I might like it if my competitors took a nap and didn’t improve their products, they were awake for that whole big data hullabaloo, too.

As were the pipeline and ETL vendors. None of us were napping for the last ten years, folks.

Modern data warehouses do just fine with big data sets, highly variable data sets, and rapidly moving data sets. The three V’s are not a problem for a modern data warehouse.

What about data science?

The cool thing about having all this data available is being able to do new things with it: advanced analytics like machine learning, predictive modeling, and so on. Data warehouse technologies were designed for business intelligence, which is generally thought of as backward-looking: understanding the current state of the business or understanding past trends.

Vertica includes machine learning algorithms built to work on the optimized internal data format and to work across multiple nodes at once. No data engineering required. It has data preparation functions for advanced analytics, such as anomaly detection, missing value imputation, data set statistics, fast joins of disparate data sets, and so on. It also treats model evaluation and model management as first-class citizens, similar to how it handles tables.
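For a flavor of what in-database machine learning looks like, here is a hedged sketch. The function names approximate Vertica’s in-database ML functions (LINEAR_REG to train, PREDICT_LINEAR_REG to score), while the table, column, and model names are invented; other databases expose similar capabilities with different syntax.

```python
# A sketch of training and scoring a model without moving data out of the
# warehouse. Function names approximate Vertica's in-database ML API;
# table, column, and model names are hypothetical.
TRAIN = """
SELECT LINEAR_REG(
    'price_model',           -- model name, managed by the database like a table
    'house_sales_train',     -- training table
    'sale_price',            -- label column
    'sqft, bedrooms, age'    -- predictor columns
)
"""

SCORE = """
SELECT house_id,
       PREDICT_LINEAR_REG(sqft, bedrooms, age
                          USING PARAMETERS model_name = 'price_model') AS predicted_price
FROM house_sales_new
"""

# Run both statements through whatever SQL client you already use. The model
# trains where the data lives, in parallel across the cluster, with no extract step.
```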

Again, I know Vertica best, but other proprietary databases have ridden this wave, too. There’s a reason it’s called Google ML. Oracle Autonomous Database has R built in. Azure SQL has Machine Learning Services. Teradata has its Vantage ML Engine.

Database technology that was used in the past strictly for BI is now fully capable of doing in-database machine learning as well, either as part of the analytical database or as an add-on that works with the database.

So, How is That Data Warehouse Definition Wrong?

Here’s the definition again:

“Data warehousing is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence.”

Data warehousing is not a technology.

Data warehousing is a process and an architecture. It certainly has an analytical database at the core of that architecture, and yes, that’s technology. But the plumbing that brings data in from a wide range of sources, fast and slow, old and new, requires other technology: a streaming pipeline such as Kafka, Kinesis, or Pulsar, and an ETL engine such as Informatica, Apache Airflow, or Actian DataConnect. A data warehouse needs good visualization software on the front end, like Tableau or Looker. Beyond that, it requires people to use all that technology, and business processes that use the analyses the warehouse provides to drive decision-making, whether human or automated.
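To put it another way, here is a toy sketch of that stack in Python. The components listed are only the examples named above, not a recommendation; real architectures mix and match, and the people-and-process layer is the part no product supplies.

```python
# Toy illustration: a "data warehouse" is a whole stack plus people and process,
# not a single product. The entries are just the examples named in this article.
warehouse_stack = {
    "streaming ingest":    ["Kafka", "Kinesis", "Pulsar"],
    "ETL / integration":   ["Informatica", "Apache Airflow", "Actian DataConnect"],
    "analytical database": ["Vertica", "Amazon Redshift", "Snowflake"],
    "visualization / BI":  ["Tableau", "Looker"],
    "people and process":  ["analysts", "data engineers", "decision workflows"],
}

for layer, examples in warehouse_stack.items():
    print(f"{layer}: {', '.join(examples)}")
```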

Modern data warehouses aggregate structured data, semi-structured data, streaming data, etc.

Modern data warehouses now handle data way beyond the carefully structured data of transactional systems. Structured data is some of the most important data in a business, but it is no longer the only important data in a business. Good modern analytical databases, the heart of data warehouse architectures, can handle all of that data.

Modern data warehouses support both business intelligence and data science.

Any modern data warehouse worth its salt still does business intelligence, now on larger, more complex data sets, still at high performance, and still essential to business decisions. However, that’s no longer all it does. Machine learning, predictive analytics, advanced analytics, data science, whatever you want to call it, the modern data warehouse does that as well, also still for the purpose of supporting essential business decisions and processes. If you’re wondering why you would do machine learning in a database, Waqas Dhillon, ML Product Manager at Vertica, and I wrote a post about that over on the Vertica blog. The main point is that getting machine learning into production is the biggest hurdle, and a database is the same environment in development, in test, and in production.

So, What Is a Good Definition of a Modern Data Warehouse?

I’m a fan of Martyn Jones and his capacity for shining a light on the industry’s foibles and fictions. I don’t always agree with him, but I could never argue that he lacks valid reasoning behind his views. Here’s his definition of a modern data warehouse:

“A data warehouse is essentially a business-driven, enterprise-centric and technology-based solution. And it is used to provide continuous quality improvement in the sourcing, integration, packaging and delivery of data for strategic, tactical and operational modelling, reporting, visualisation, analytics, statistics and decision-making.”

The interesting thing about this definition to me is that it’s accurate now, and it was accurate ten years ago. At its heart, the modern data warehouse is still a data warehouse.

[Image: A woman in a small boat sailing on a serene sea under a bright sun, symbolic of modern data warehouses conquering big data challenges]

Paige Roberts

27 years in data management: engineer, trainer, PM, PMM, consultant. Co-author of O’Reilly’s “Accelerate Machine Learning” and “97 Things Every Data Engineer Should Know.”