For the sake of the readers who are hearing the term ‘Data Warehouse’ for the first time, let me explain briefly. A data warehouse is an integrated, non-volatile, subject-oriented and time variant storage of data to reveal trends, patterns, and correlations that provide valuable information and insight into the business. The focus of this article is to show you the importance of a cloud data warehouse in a data-driven economy. We shall take a quick look at various data warehousing concepts and evaluate two scenarios — structured and unstructured data to design our first data warehouse.
The Justification for a Data Warehouse
SalesLoft is experiencing a data tsunami with exponentially greater data volume and complexity, with data distributed across various databases, applications, domains or at various sources likes websites, mobile phones, and in different structured or unstructured formats. As a data-driven company, monitoring end-user behavior and gauging customer adoption means converting this raw data into useful information by cleansing, transforming, and integrating the data to become a single source of truth for the organization. This cleansed, transformed, and integrated data needs a unique, cost-effective single storage location and that location is the Data Warehouse.
Second, to perform any trend analysis, we need to study data between various time combinations of year, quarter, month, week, day, weekday-weekend. In this case, to facilitate decision-making using our historical data, the most suitable option is designing a data warehouse.
Lastly, machine-generated data typically has a poor signal-to-noise ratio. It may contain valuable data but also a lot of “noise.” This means we must store all of it to find the useful bits. Is it possible to store this data cost-effectively and consume in a meaningful way through analytics? This makes the cloud — with its near-infinite scalability, external storage capability, and seamless integration — a perfect place to host the data warehouse.
Business analysts, data analysts, data scientists, data engineers, managers, as well as top executives benefit from data warehousing through self-service business intelligence (BI). It is everyone’s dream in an organization — all the information about the organization’s activities gathered in one place, one source of truth available at a glance on a dashboard, with just a click of a button. It may seem like a distant dream, but how do we make this dream a reality? First, we have to plan our data warehouse system. Modeling your data warehouse is the first step in this direction.
SalesLoft’s core platform has thousands of users across various surfaces. Due to unavailability of integrated data, Account Managers, SDRs, CSMs, ISMs are unable to have a 360-degree view of their customers while making crucial decisions. They would also like to explore historical data to identify interesting and valuable connections. Currently, it is a manual, time-consuming process and doesn’t allow for understanding historical trends. So, we are asked to design a system which can help them quickly in decision-making and provide a return on investment (ROI). To start designing the data warehouse, we need to follow a few steps.
Developing a Data Warehouse in Cloud
The properties of cloud technology — especially scalability, cost-effectiveness, unlimited resources, natural integration points — make it particularly well-suited for data warehousing. With exponentially increasing data volume (user activity and machine-generated data), we chose Amazon Redshift to build our data warehouse. Redshift provides the advantages of columnar storage, Redshift Spectrum to decouple expensive compute from storage, and ingestion and query of both semi-structured and structured data formats.
Evaluating an ETL tool is quite a challenge. With a wide variety of options in the market, it is difficult to know where to start. We shortlisted three different ETL tools based on the capabilities to perform full load, incremental load with change data capture (CDC), process chains, job scheduling and monitoring, job success/failure notification to name a few most important criteria. With a very intuitive development interface, Matillion ETL for Redshift came out as a winner.
Last but not least, we had a pre-existing web-based BI tool ‘Looker’ to connect to the data warehouse, run queries, build charts and dashboards. We found Looker easy to use for both data analysts and business users.
From here on, if done right, we knew our data warehouse would be up and running in less than 3 months. Therefore, most of the time from here on was spent gathering requirements, building the ETL pipeline, and configuring ‘Looker’ to extract insight from the data warehouse.
The phases of a data warehouse project listed below are similar to those of most database projects, starting with identifying business requirements and ending with executing the ETL pipeline to create the data warehouse:
- Document Business Requirements
- Design the conceptual, logical and physical (dimensional) model
- Build ETL pipeline to create and populate the dimension and fact tables
- Visualize through reports and dashboards
In order to identify and collect requirements, we need to interview the key subject matter experts and stakeholders to know — what factors define the success in the business? What are the most important business questions which need to be answered by this new system? How does management want to visualize and analyze their data? Interviewing the experts in various departments provides realistic business requirements that are necessary for scoping the project.
Let us first identify and document the business requirements from key stakeholders to examine two interesting cases.
SalesLoft business executives are interested in knowing the Daily Active Usage of their platform. Our customers are scattered across several continents. They would like to examine the 2 cases listed below to gain visibility of active users across surfaces. The suspicion is that a large segment of users who do not interact with the core platform directly on a daily basis may be engaging with the extension regularly.
Case 1 to consume structured data from database sources
CSMs and ISMs need to visualize with historical timeline Daily Active Usage for the core Platform across various dimensions — Teams, Users, Account/Sales Owner, Salesforce Accounts. We would like to know the Number of Platform Sessions a User had during a given time period to understand how frequently they use the Platform portion of the application. This data is available at source in a structured format.
Case 2 to consume unstructured/semi-structured data from external S3 storage (decoupling compute from storage)
CSMs, ISMs need to visualize with historical timeline Daily Active Usage for our extensions Gmail & Salesforce (SFDC) across various dimensions — Teams, Users, Account/Sales Owner, Salesforce Accounts. We would like to know the Number of Gmail Sessions and Number of SFDC Sessions a particular user had during a given time period to understand how frequently they use the extensions. We use Segment to send users extension sessions related logs (raw data) to S3 destination in gzipped, newline-separated JSON format, binned by the day in the S3 bucket. Spectrum allows for consumption of external data without the need to increase the size of the Redshift cluster. Redshift Spectrum also allows us to write queries against data stored in an S3 data lake without having to load data into Redshift compute, thereby reducing the time to consume and storage cost.
Design the Conceptual, Logical, and Physical (Dimensional) Model
The conceptual design phase performs a transformation of the semi-formal business requirements specification into a formalized conceptual multidimensional schema. Logical Model identifies different Entities and Attributes and Relations between them.
Data modeling is the process of documenting a complex software system design as an easily understood diagram, using text and symbols to represent the way data needs to flow. The diagram can be used as a blueprint for the construction of the dimensional model. We used Vertabelo data modeling tool to design and store artifacts of our models.
Dimensional modeling creates a schema which is optimized for high performance. It means, fewer joins and helps with minimized data redundancy. This is why dimensional modeling is used in a data warehouse. We need to design Dimensional Model to suit business requirements which must be easily extensible according to future needs and also provide “instantaneous” query results for analysts.
Let us quickly take a look at a few essential terms used in the data warehouse and then we shall derive them for our requirement.
Dimensions (tables) contain textual descriptions about the subjects of the business. For example, Customer is a dimension with attributes such as first and last name, birth date, gender, etc. The primary functions of dimensions are to provide filtering, grouping, and labeling on your data. Out of many dimension types, slowly changing dimensions (SCD) are most widely used to capture changes over time and preserve the historical information. Every dimension must have a surrogate primary key (internally generated numerical value column) which finds further use as a foreign key in the dimensional model. This surrogate primary key behaves as a Dimension Key while loading the fact table. It is a best practice to name Surrogate Key column with suffix _SK in the data warehouse.
Conformed dimensions are built only once in the data warehouse. They deliver both analytic consistency and reduced future development costs because the wheel is not repeatedly re-created. They can be reused in multiple models across fact tables. A fact table contains measurable facts and foreign keys to the dimension tables. Fact tables are explained further down in this article.
To provide reporting consistency across subject areas and reduce the development costs, we elected to build conformed dimensions wherever appropriate. Date dimension is a perfect example of a conformed dimension with slicing and dicing of data across various reusable time slices such as year, quarter, month, week, day, weekday-weekend etc.
An Audit dimension is a dimension with a different purpose in any model. Audit dimensions are not updated as part of master data. Instead, they are the last dimension tables to be updated during the ETL process. They help with selective deletion of data from fact tables without much hassle.
Let us identify dimensions related to our case:
Users, Teams, Account/Sales Owner, Salesforce Accounts, Date, Audit
A measure is also commonly called a fact. Measures are data elements that can be summed, averaged, or mathematically manipulated. They contain quantifiable data, usually numeric, that can be aggregated. Mostly, measures exist as a column in fact table. A measure can also be derived by using one or more other measures.
Let us identify the measures or facts related to the above cases.
Number of Users is the total count of users populated with a value of 1 for every user.
Number of Unique Sessions indicates whether a User had a session in ANY of the entry points. For example, on any given day, if a user had logged in twice using Gmail and thrice using Salesforce and once using Platform, the number of unique sessions is considered only 1.
Daily Active Usage Percent in a time-slice is a metric derived by using the formula
(Number of Unique Sessions / Number of Users) * 100
A fact table consists of facts of a particular business process. Data in fact table are called measures. Fact tables have logical foreign key relationships with the Dimension tables. Fact table provides statistics for subject areas broken down by Users, Teams, Sales Owner, Account Owner, and Salesforce Account dimensions.
One important concept to understand here is the Granularity. It is a measure of the degree of detail in a fact table. The greater the granularity, the deeper the level of detail. This needs to be discussed with key stakeholders before designing the dimensional model. In our case, granularity at the Day level provides the required depth for business reporting. Fact tables can be separated by granularity and subject area. Fact tables have foreign key relationships with their respective Conformed Dimensions.
Let us identify what attributes should be there in our Fact Tables.
Model your Data
After the groundwork of designing the dimensions and measures is complete, we now have to use appropriate schema to relate these dimension and Fact tables. Popular schemas used to develop dimensional model are Star Schema, Snow Flake Schema, etc. However, which schema suits which business case is a design question for another article. The simplest and most widely used is a Star schema to develop data marts. The star schema resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables.
Please refer to the Star Schema figure below:
Coding & Development of the Data Warehouse:
Now that we have the list of dimensions and facts (tables) and their attributes (columns), we are ready to start developing the code to build our physical model step by step. It is a standard naming practice to prefix a dimension table with DIM_ and a fact table with FACT_ in a data warehouse.
Let’s follow the below steps to run the code in Redshift instance.
Create a database for your Data Warehouse by issuing the below command.
Create Date Dimension table which will create and populate Date data divided on various levels.
You may refer for detailed steps to article http://elliot.land/post/building-a-date-dimension-table-in-redshift
Download the script and run it in this database for creating and filling of date dimension with values.
Create dimensions — Users, Teams, Account/Sales Owner, Salesforce Accounts — as slowly changing dimensions. For details on various types of slowly changing dimensions, you may refer to https://www.datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.html
Below is an example code to create a slowly changing type 4 dimension for Users. In this method, historical data is tracked using Begin_Date and End_Date for every record. Every attribute of this table is time-dependent and all historical changes are captured for trend analysis.
This step is specifically for accomplishing Case 2 to consume unstructured/semi-structured data from external S3 storage. We need to consume decoupled User sessions logs to derive whether a User had a session on any given day through Gmail or Salesforce. These logs are gzipped, newline-separated JSON format, binned by the day in S3 bucket. You may refer to Amazon Redshift Spectrum guideline IAM Policies for roles and authorizations required to create the external schema.
All external tables must be created in an external schema, which you create using the below statement.
Amazon Redshift Spectrum uses external tables to query data that is stored in Amazon S3. To consume data from the external table, we write same SQL syntax. External tables are read-only. We create an external table in an external schema. To create external tables, you must be the owner of the external schema or a superuser.
Run the below command to create external with partition.
The table is created with partitions on
receivedat. When we partition our data, we restrict the amount of data that Redshift Spectrum scans by filtering on the partition key. Since we query the external table data by the Day, the best partition is on
Now, the external data from S3 is ready for consumption through standard SQL in the ETL pipeline.
Create Fact tables to hold all your transactional entries with appropriate foreign key columns which refer to the primary key (surrogate key) column of your dimensions. In a data warehouse, foreign key referential integrity is preferred to be logical. Having active physical foreign keys on tables improves data quality but hurts the performance of insert, update and delete operations. This is one reason why we give up on foreign keys in data warehouses and analytical databases that do not process data in a transactional manner (one row at a time) but in bulk. Of course, performance is a crucial factor in data warehousing and business intelligence.
You have to take care while populating your fact table to refer to primary key values of appropriate dimensions. If no reference to the dimension can be obtained, you may populate with a dummy key such as ‘all 9s’.
SalesLoft has multiple users from different teams logging in either directly through Platform or Gmail and Salesforce extensions. However, let’s assume that on 1st June 2018, User A belonging to Team X had 4 Platform sessions, 2 Gmail sessions, and 3 Salesforce sessions. We want to determine the Daily Active Usage.
Before filling your fact table, you have to identify and perform a lookup for dimension key column values from respective dimensions, as in the example above and fill in foreign key columns of the fact table with the appropriate dimension key values. Now let’s define how we shall refer to the dimension key values from each dimension.
Populate your fact table with historical transaction values of any historical day with proper derived values of Dimension key values for each User for each Day.
After execution of the above INSERT script for each User, the fact table is filled with transactions for each day as displayed below.
In this real-life scenario, we design the ETL pipeline with transformations and orchestrations which are scheduled to refresh nightly all dimensions and fact tables, so that the data warehouse is ready for analysis and reporting. Now the data warehouse is ready for OLAP Business Intelligence through Looker.
As you can see, a data warehouse is not an easy undertaking, and it offers many challenges. Moreover, if there is a change in the business process or the organization’s policy, the data warehouse may have to reorganize. If a data warehouse is to be successful, it is a collective and continuous effort to vet the data. The biggest gains from data warehousing occur when it integrates into how people perform their jobs and the running of the organization.
Sonny is a Data Warehouse Engineer at SalesLoft. He has a Master’s Degree from Southern Illinois University. He lives in Atlanta with his wife and two children.