Things to consider before moving to a data warehouse

Allen Chen
DataBlade Blog
Apr 25, 2016

The holy grail for every company is a single, continuous view of the customer lifecycle at every point of engagement. Traditionally, achieving this required building costly and cumbersome ETL pipelines to centralize customer data into a data warehouse. However, by combining newer tools like RJMetrics Pipeline and Segment Sources with affordable cloud data solutions like Amazon Redshift, it has become easier than ever to consolidate your data.

Unfortunately, migrating your organization’s analytics workflow to one centered around a data warehouse does come with a few caveats. Here are a few things to consider before taking the leap.

Raw data is dumb

One advantage of moving to a data warehouse is getting access to raw, event-level data. Raw data allows you to do far more sophisticated and accurate analysis, especially compared to accessing that same data via an API. The challenge, though, is that raw data is dumb.

One particularly useful dimension supported by the Google Analytics API is Social Source Referral. This field simply indicates whether or not a session came from a social network source. However, if you look at the schema of Google’s raw data export, you won’t find a similar field. This is because the categorization of referrals is not a fundamental attribute of sessions. It is a derived attribute, created for the convenience of analysts, applied after the fact by Google. When you move to accessing raw data, you will have to rebuild this type of functionality for yourself.

So while a 3rd party’s API may often feel restrictive, it does provide a highly context-aware mechanism to access your data. When you move to raw data, all that context is lost, and the burden of re-applying those semantics is placed on the analyst.
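As a concrete sketch of re-applying that lost context, here is what rebuilding a "social source" flag over raw session data might look like. The table name, columns, and host list are hypothetical (the real Google Analytics export schema is far richer), and SQLite stands in for a warehouse like Redshift:

```python
import sqlite3

# Hosts the analyst must now maintain themselves -- the warehouse
# provides no built-in "Social Source Referral" categorization.
SOCIAL_HOSTS = ["facebook.com", "twitter.com", "linkedin.com", "pinterest.com"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (session_id TEXT, referrer_host TEXT)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("s1", "facebook.com"), ("s2", "google.com"), ("s3", "twitter.com")],
)

# Derive the attribute after the fact, just as Google does upstream.
placeholders = ",".join("?" for _ in SOCIAL_HOSTS)
rows = conn.execute(
    f"""SELECT session_id,
               CASE WHEN referrer_host IN ({placeholders})
                    THEN 'social' ELSE 'other' END AS source_type
        FROM sessions ORDER BY session_id""",
    SOCIAL_HOSTS,
).fetchall()
print(rows)  # [('s1', 'social'), ('s2', 'other'), ('s3', 'social')]
```

The point is not the CASE expression itself but the ownership: once you leave the API, keeping that host list accurate becomes your team's job.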

Joins are still hard

The first step to getting a more holistic view of your customer journey is getting your data into a single place. However, many overlook that there is a crucial (and arguably even more difficult) second step: having robust, common keys to join your data on.

Even if you get all of your data into a single data warehouse, without common keys you will have no way to join data across tables. What makes things even more challenging is that common keys must be implemented at the source.

Do you know if your site is properly identifying your users to your analytics service? If so, have you considered using a stable, internal identifier as opposed to an email address? Remember, email addresses can change! Are all your email links properly tagged? These are all things you need to consider before data ever makes it into your data warehouse.
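The email-address pitfall above can be sketched in a few lines. The tables and columns here are hypothetical, with SQLite standing in for the warehouse; the same person has changed their email address between two systems, so an email join silently drops them while an internal-ID join still matches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_users    (user_id INTEGER, email TEXT);
CREATE TABLE crm_contacts (user_id INTEGER, email TEXT, plan TEXT);
""")
# Same user, different email in each source system.
conn.execute("INSERT INTO app_users    VALUES (42, 'old@example.com')")
conn.execute("INSERT INTO crm_contacts VALUES (42, 'new@example.com', 'pro')")

# Joining on email loses the user entirely...
by_email = conn.execute(
    """SELECT a.user_id FROM app_users a
       JOIN crm_contacts c ON a.email = c.email"""
).fetchall()

# ...while the stable internal identifier still links the records.
by_id = conn.execute(
    """SELECT a.user_id, c.plan FROM app_users a
       JOIN crm_contacts c ON a.user_id = c.user_id"""
).fetchall()
print(by_email, by_id)  # [] [(42, 'pro')]
```

Note that no amount of SQL cleverness in the warehouse can fix this after the fact: the stable identifier has to be captured at the source.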

How SQL-savvy is your organization?

Question: How do you turn a clickstream into user sessions using SQL?

Answer: It’s not pretty.
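To make "not pretty" concrete, here is one common approach: gap-based sessionization with window functions, where any pause longer than 30 minutes starts a new session. The schema is hypothetical, SQLite (3.25+ for window functions) stands in for the warehouse, and this is a sketch rather than production code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")  # ts = epoch seconds
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", 0), ("u1", 600), ("u1", 4000), ("u1", 4300)],
)

rows = conn.execute("""
WITH gaps AS (
    SELECT user_id, ts,
           -- Flag an event as a session start if it is the user's first
           -- event, or follows a gap of more than 1800 seconds.
           CASE WHEN LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
                  OR ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800
                THEN 1 ELSE 0 END AS is_new_session
    FROM events
)
SELECT user_id, ts,
       -- Running sum of session starts numbers each session per user.
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_num
FROM events_numbered
""").fetchall() if False else conn.execute("""
WITH gaps AS (
    SELECT user_id, ts,
           CASE WHEN LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
                  OR ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800
                THEN 1 ELSE 0 END AS is_new_session
    FROM events
)
SELECT user_id, ts,
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_num
FROM gaps ORDER BY user_id, ts
""").fetchall()
print(rows)
# [('u1', 0, 1), ('u1', 600, 1), ('u1', 4000, 2), ('u1', 4300, 2)]
```

Two layered window functions just to recover something Google Analytics gives you for free is exactly the kind of upfront cost the rest of this section is about.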

Accessing data in a data warehouse requires at minimum a basic understanding of SQL, but doing anything particularly interesting requires a much deeper one. Most services like Google Analytics and Salesforce provide convenient drag-and-drop report builders with deep functionality, and while it might be a hassle to clean and join all that information in Excel by hand, they offer a powerful workflow with a relatively low technical barrier to entry.

Reconfiguring your workflow around SQL can be a significant undertaking. While it may prove a worthwhile endeavor for your organization in the long term, it’s worth assessing the SQL skill level of the analysts on your team to better understand the upfront cost of such a transition.

DataBlade is building a new way for organizations to easily access, aggregate, and automate their disparate data sources. Sign up for a free demo!
