If you are in the job market, and started your career 25 years ago, and your main expertise is data warehousing, a data warehouse engineer position is something that’s hard to find. Searching several jobs sites for the keywords “data warehouse engineer” will return tens of jobs postings. But search for “data engineer” or ”big data”, and you will get thousands. The word warehouse is omitted from the job definition.
Does this mean that data warehouse methodologies are not relevant any more? And what’s the difference between a data engineer and a data warehouse engineer? First, let’s define “data warehouse”. The most popular definitions came from Bill Inmon and Ralph Kimball who are both recognized as the fathers of the data warehouse concept.
Bill Inmon provided the most popular definition of data warehouse:
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.”
Ralph Kimball is known by his dimensional modeling and for many years his models were considered the standards in the area:
“A data warehouse is a copy of transaction data specifically structured for query and analysis.”
Even with this clear definition, much confusion continues to exist between a data engineer’s responsibilities and data warehousing.
If we look back a few years most companies once used the traditional data warehouse vendors like Oracle and Teradata. These provided expensive solutions that required long term commitments from companies. As a result of data explosion and data complexity, new startups found that the commitment to the big vendors did not make sense while new and cheaper solution started to pop up using cloud technologies. In fact, none of the startups today spend time and money to build traditional data warehouse’s and it feels like data warehouse methodologies have been abandoned.
Many junior data engineers don’t know what dimensional modeling is. Do they know what a slowly changing dimension is? Do they know the difference between a star schema and a snowflake schema? What is 1NF,2NF,3NF…? What is a data mart and whats are the benefits of building data marts? Is it really critical in today’s new technologies?
Data pipelines using Spark/Kafka, Hadoop, Presto, Redshift and Airflow are the main focus of today’s data engineer job description and building a scalable data pipeline is challenging. Using these tools is challenging by itself and requires lots of technical understanding but can it be done without the understanding of data warehouse building blocks? Will building just the pipelines ensure that business can get value from it?
Reviewing the definition of data warehouse that was done many years ago just makes it more clear that understanding of data warehouse principles is critical for the success of a mature organization but with the recent technologies and data growth the implementation of data warehouse needs some enhancements mainly for the technologies it used’s to consume and transform data. Data engineers need to know the building blocks of good data warehouse as well as being able to use the new technologies to consume data. Separating data engineers from data warehouse methodologies can cause lots of future challenges. The main reason for building data pipelines is to be able to provide value to the organization by helping with business decisions and serve as the “eyes of the organization”. Data engineers that build data pipelines without understanding the business value are similar to PG&E building a gas pipe without a way to connect the pipe to our homes. Before starting to build a data pipeline define the business benefit and business objective. What are the KPI’s you want to retrieve from that data flow? What will be the file structure, delimiters, format, etc…What transformation needs to happen and what is the cleansing strategy? How will the underlying reporting know about new data or that existing data needs to be reprocessed.
Let’s play with a typical use case. Our data engineering team asks for a data pipeline to bring visit information and impressions that have been exposed to end users. The information is in JSON format and we get files every minute into our S3 bucket. The volume of data is 2–3 TB every hour. Now, the data engineer needs to make several important decisions about the file structures, data type, data cleansing, file format and so on.
The consideration above is similar to dimensional modeling, but at the file level and having the correct data in files can save tons of time and resources in processing the data later. For example, adding a visit id to an impressions file can save a lot of time in reporting. Thinking about similar cases will save development time and reduce complexity of reporting and data warehouse.
The decisions a data engineer takes today are very similar to what traditional data warehouses used to do. The only difference is the technology and the amount of data. Avoiding mistakes at the data pipeline development level -can save lots of time later in processing, cleansing, backlogs, restates etc.
There is no question that data has grown and data collection and processing technologies have changed significantly, so what is the role of the data warehousing expert in this new world?
A data engineering job description must include having a deep understanding of classic data warehouse approach and methodologies. Ignoring the data warehouse component might cause lots of pain as the company grows and will make development of data marts and reporting complex and expensive. On the other hand a data warehouse engineer must have deep understanding how to build efficiently data pipelines using modern technologies.
Utilizing both, new approaches for data pipelines, and classic warehouse methodologies is the key for success, building data and reporting within data driven organization.