Getting Your Data Warehouse Right: 10 Best Practices
So, your company’s data is a mess and you can’t answer simple questions like:
- How many customers do I really have?
- How effective are my marketing channels?
Someone high up in the food chain has come up with the brilliant idea to “build a data warehouse” and lo and behold — you’re the lucky one chosen to build it!
Simple, right? After all, everyone these days uses data warehouses. You do a bit of Googling… and holy smokes, you end up with 1001 technologies, architectures, platforms and services to choose from. Where do you start? You are even more confused than before.
This article will provide experience-born best practices for anyone creating reporting and analytics-focused warehouses. This is not a final blueprint and may or may not be 100% applicable to your business. Rather, it’s a simple framework to avoid some costly mistakes. So, let’s dig in!
The Basics: Stick to the Cloud
This is so obvious it doesn’t even take space in my top 10. Unless you are the CIA and need your own private infrastructure, you have no business owning your own metal. The cloud isn’t the future but the present and the norm at this point.
1. Start With “Why?”
Why do you really need a warehouse?
- Is it to create a bunch of reports for monthly board presentations?
- Is it to give your users a tool for their own analysis without writing SQL?
- Is it to train and deploy machine learning models?
- Who’s your audience? Is it internal only, or do customers and vendors have access?
- Is it simply because you have so much information and it’s time that someone did something useful with that data?
- Is it to maintain a golden copy of your data for operational purposes? For example, to consolidate inventory levels and prices from multiple ERP systems and expose them in real-time to your customers?
By the way — that last item on the list is technically not a warehousing solution, but more a data orchestration exercise that needs to be solved slightly differently with different technologies.
These are the kinds of questions you should ask all your stakeholders right up front. You need to be crystal clear about your objectives, which will have huge implications for your final architecture.
2. Avoid One-Stop Solutions
This is controversial and might ruffle some feathers. But it’s very important!
There are many one-stop solutions, like CDPs (Customer Data Platforms), that have popped up in the last 2 years and claim to be the be-all and end-all for managing your golden copies and reporting. 😏 …we’ve heard these claims before.
The reality is that, because they are generic solutions (with perhaps a few options to customize), no off-the-shelf product will suit all your needs. Sorry! The promise of a “quick start” will be enticing, but you will have to compromise from day 1 and pick the product that comes closest to your budget and needs. As time goes by, your needs will change, and these solutions will be too inflexible to adapt. When that happens, your internal teams will start saving data in spreadsheets and on napkins. Very soon you will end up with the same mess, plus an expensive new tool.
But there is an even bigger problem with these one-stop-solutions: data ownership. Which brings me to my next point:
3. Whatever You Do, “Own” Your Data
I am a strong believer that you should own your data.
If you observe the history of technology over the last 40 years, you see that every few years there is a new kid on the block claiming to be the true Messiah. But the reality is that in 3–5 years you need to be ready to move on to more advanced technology. Treat it like you would an ex-boyfriend: break up, take all your belongings (in this case, the data) and leave cleanly.
Any solution that doesn’t allow you to access all your data directly should be a no-go. If they say things like…
- “you can set up reports to extract the data that you want using our reporting tool,” or
- “you can easily export any data that is 3 months old. If you want older data, please let us know and we can get it to you very quickly.”
…RUN.
Don’t let anyone hold your data hostage.
4. Don’t Overcomplicate the Architecture
At its core, data warehousing is quite simple. While there are hundreds of choices and thousands of tools available, any near-real-time data warehousing system has only the following 3 layers (the databases themselves are not considered a layer in this context):
- Extract and Load — extract and load data from source systems into a data lake (a loosely structured or unstructured data store)
- Transformation — choose, transform and structure selected data from the data lake into the data warehouse
- Presentation — set up the data for reporting, business intelligence and analytics
How hard can it be? From 30,000 feet, data simply flows from your source systems into the lake, then into the warehouse, and out through the presentation layer.
Of course, things can get more complicated with additional requirements. But, this is a good starting point where you can add advanced requirements as needed.
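To make the three layers concrete, here is a minimal, illustrative sketch in Python. It uses SQLite as a stand-in for both the lake and the warehouse, and every table and column name is invented for the example; a real system would use separate managed services per layer.

```python
import sqlite3
import json

# --- Extract & Load: dump raw source records into a "lake" table as-is ---
source_records = [  # pretend this came from an API or a production database
    {"id": 1, "name": "Acme ", "country": "us"},
    {"id": 2, "name": "Globex", "country": "de"},
]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lake_customers_raw (payload TEXT)")
db.executemany("INSERT INTO lake_customers_raw VALUES (?)",
               [(json.dumps(r),) for r in source_records])

# --- Transformation: select, clean and structure data into warehouse tables ---
db.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
for (payload,) in db.execute("SELECT payload FROM lake_customers_raw"):
    r = json.loads(payload)
    db.execute("INSERT INTO dim_customer VALUES (?, ?, ?)",
               (r["id"], r["name"].strip(), r["country"].upper()))

# --- Presentation: a reporting view that a BI tool would query ---
db.execute("CREATE VIEW customers_by_country AS "
           "SELECT country, COUNT(*) AS n FROM dim_customer GROUP BY country")
print(db.execute("SELECT * FROM customers_by_country ORDER BY country").fetchall())
# -> [('DE', 1), ('US', 1)]
```

The point is not the code itself but the separation: each layer only reads from the one before it, which is what lets you swap the tool in any single layer later.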
One such common complication is the need for “real-time” data. I come across this with my clients all the time. Before over-complicating your life, please ask stakeholders and users this question multiple times:
“Do you really need real-time or would a short lag (for example, 15 minutes) be good enough?”
95% of the time the answer is “15 minutes is awesome,” because most clients are still struggling with data that lags by one to two days. Even if the answer is “No, I need real-time data,” I would suggest setting up a near-real-time environment first. Then see how many use cases are left unsatisfied.
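Near-real-time usually means incremental micro-batches: every 15 minutes, copy only the rows that changed since the last run. Here is a hedged sketch of that watermark pattern in Python with SQLite standing in for the source and the warehouse; the `orders` schema and timestamps are invented for illustration.

```python
import sqlite3

# Stand-ins for the source system and the warehouse (separate systems in reality).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, total REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 9.99, "2024-01-01T10:00:00"),
    (2, 5.00, "2024-01-01T10:12:00"),
])

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE orders (id INTEGER, total REAL, updated_at TEXT)")
watermark = ""  # highest updated_at already loaded into the warehouse

def run_batch():
    """Copy only rows changed since the last batch (an incremental load)."""
    global watermark
    rows = src.execute(
        "SELECT id, total, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at", (watermark,)).fetchall()
    if rows:
        wh.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        watermark = rows[-1][2]  # advance the watermark
    return len(rows)

# A scheduler (cron, Airflow, etc.) would call run_batch() every 15 minutes.
print(run_batch())  # first batch loads both existing rows -> 2
src.execute("INSERT INTO orders VALUES (3, 12.50, '2024-01-01T10:20:00')")
print(run_batch())  # the next batch picks up only the new row -> 1
```

A 15-minute cadence like this is far cheaper to build and operate than true streaming, which is exactly why it satisfies most “real-time” requests.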
5. Avoid Coding. Use Tools…
One of my clients (a startup in NYC) had a really smart data team who wrote excellent code. Over the course of about 3 years, they wrote tons of great ETL code in Python. But unfortunately, over the course of 3 months, the company lost its entire data team. The business had evolved by then, and the code no longer worked as well as it used to. The lesson: smart engineers don’t always equate to the best solution. And you don’t need to write code from scratch for everything. Nowadays, there are plenty of third-party SaaS tools (hosted solutions) available for a small fee that can replace the need for coding and eliminate a lot of future headaches!
For example, extract-and-load tools (aka data pipelines) are so good these days that you have the pick of the litter, from free all the way to tens of thousands of dollars a month (e.g., Stitch, Alooma, Fivetran). You can easily find a solution that fits your budget constraints, performance needs, and support expectations.
Avoid coding to avoid the maintenance headache down the line.
6. …But Use Tools Only for Their Intended Purpose
There are many legitimate fears (investing in new technology that becomes outdated in a few months, for instance) in choosing the right tool. Especially when there are so many different SaaS solutions with clever marketing teams behind them. But if you think of selecting tools based on the 3-layer architecture I mentioned above (1. Extract & Load, 2. Transformation, and 3. Presentation) and deploy at least one tool per layer, you should be able to minimize the pain when you have to part ways down the road.
For example, some BI tools claim to have “easy” ways to transform the data. Say you use these transformation capabilities to create several derived tables (a Customer Master or a Product Master). Six months down the line, if you have to part ways with your BI vendor for some reason, you will have to find a tool that can handle not only the BI part but also the transformation. On the other hand, if you had used your BI vendor only for its intended purpose (the presentation layer), you would have plenty of choices to swap in, with much easier execution. See? Problem solved!
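To make that concrete, here is an illustrative sketch of building a derived Customer Master in the transformation layer itself, using Python and SQLite (the schema and names are invented). Because the derived table lives in the warehouse, any BI tool can read it, and swapping BI vendors later does not mean rebuilding this logic inside the new tool.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Warehouse tables produced by the Extract & Load layer (illustrative schema).
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])

# The derived "Customer Master" is built IN the transformation layer, not in
# the BI tool, so the logic survives a BI vendor swap.
db.execute("""
    CREATE TABLE customer_master AS
    SELECT c.id, c.name,
           COUNT(o.customer_id) AS order_count,
           COALESCE(SUM(o.total), 0) AS lifetime_value
    FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
""")
print(db.execute("SELECT * FROM customer_master ORDER BY id").fetchall())
# -> [(1, 'Acme', 2, 15.0), (2, 'Globex', 1, 7.5)]
```

In practice you would run this kind of SQL through a dedicated transformation tool, but the principle is the same: derived tables belong to layer 2, not layer 3.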
7. Democratize Data Consumption, but Regulate Data Creation
The data democratization revolution of the 2010–2012 era got things terribly wrong, IMHO. The concept of all business users having access to any and all data went too far (but made some BI providers multi-billionaires). Yes, business users should have access to consume data. However, they should not be allowed to create their own data willy-nilly. They should not be allowed to bring in a list of tags (e.g., product features) via Excel, create a report, and publish it to the entire company. This will set you up for failure.
Data governance is critical and needs to be a centralized function. Determine whether it’s worthwhile to invest in a data librarian team, which would be responsible for vetting and regulating all data points and KPIs.
Once the data governance is in place and you’ve regulated data creation, you can liberate the business users by setting up a BI tool where they can do their own analysis by simply dragging and dropping variables.
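One lightweight way to regulate creation while liberating consumption is a central, vetted metric registry: the data librarian team owns the KPI definitions, and everyone else can only run them by name. This is a minimal, illustrative sketch; the metric names, SQL, and `orders` table are all assumptions for the example.

```python
import sqlite3

# The data librarian team owns and vets these definitions; business users
# may only *consume* metrics by name, never define their own here.
VETTED_METRICS = {
    "active_customers": "SELECT COUNT(DISTINCT customer_id) FROM orders",
    "gross_revenue": "SELECT COALESCE(SUM(total), 0) FROM orders",
}

def run_metric(db, name):
    """Run a governed KPI. Anything not in the registry is rejected."""
    if name not in VETTED_METRICS:
        raise KeyError(f"'{name}' is not a vetted metric; ask the data librarians")
    return db.execute(VETTED_METRICS[name]).fetchone()[0]

# Illustrative warehouse data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])
print(run_metric(db, "gross_revenue"))  # -> 22.5
```

Modern BI tools offer the same idea under names like semantic layers or governed data models; the mechanism matters less than having a single vetted definition per KPI.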
Be “Stalinist” with data creation but liberal with data consumption.
Which brings us to my next point…
8. Choose Your BI/Reporting Tools Wisely
Most data consumers in a typical company are non-technical business folks who will never write a line of SQL in their lives, nor care about the underlying database tables. A good, user-friendly interface is key to winning this demographic. Based on your use case, it could be a DIY business intelligence platform like Power BI or Looker, or a simple scheduled PDF reporting solution. Getting this part right is critical to the success of your data warehouse.
Choosing the correct BI/reporting tool can be overwhelming, especially with the plethora of choices on the market. This is a topic for another time, but start by understanding the following aspects better:
- Audience (data creators vs manipulators vs consumers)
- Use cases (are they more PDF-based reports or company-wide dashboards?)
- Budget
- Internal support (do you have a dedicated insights team?)
- Flexibility
9. Budget Plenty of Time and Resources for Change Management
Business users hate change. Even if you build the perfect data warehouse and the perfect BI/ reporting system, you will have someone on your team who will refuse to use it, because the company logo now appears on the right side, and not the left. That is not an exaggeration! I’ve seen worse.
Allow ample time to make the transition easy for everyone. Here are a few tips:
- Make it a big deal — get the C-suite to talk at town halls, create competitions for the best 5 dashboards, etc
- Create a great landing page with the most relevant stuff, like company dashboards, right upfront
- DOCUMENT, DOCUMENT, DOCUMENT
- Explain KPIs and company-wide definitions in detail, with examples
- Have training sessions — record them and have them available on the home page
- Create starter kits. For example: “start here if you want to analyze refunds…”
- Send weekly emails and hold office hours
- Create dedicated Slack channels and ticketing systems
Another important aspect of change management is breaking the bad (legacy) habits of your internal data and tech team(s). If you are thinking of building your data infrastructure, whatever you’ve got now probably isn’t working, and the current team may not have the right expertise. Therefore, budget plenty of time for hiring the right people, training existing staff, and putting guardrails in place to prevent bad legacy practices from creeping back.
10. Construct the Right Team
Development and maintenance are two different things.
During the initial development phase (3–6 months in a typical case) you will need a seasoned data/system architect and at least a few heavy-duty data engineers.
The architect will be responsible for assessing your existing environment, understanding your future business goals, and coming up with a detailed plan on how to execute the perfect data ecosystem tailored for your needs. Someone who just received their AWS system architect certification last month and read a couple of articles, unfortunately, won’t cut it! The ideal architect is someone with a decade of hands-on experience who has seen the data world evolve over time. The data engineers you need at the beginning should also have some experience doing this.
“But data engineers and super senior data architects are expensive,” you say. And oh, by the way, they get bored and leave for more exciting projects.
This is why, for most companies, it makes perfect sense to contract an external team with experience during this initial setup phase rather than hiring full-time staff. Overall it will be cheaper, more efficient and faster.
Once the basics of the warehouse are set up, the maintenance of the warehouse and the support for the BI/ reporting tools become more straightforward and less technical. Hiring data analysts with business domain knowledge coupled with a junior/mid-level data engineer to attend to the minor technical changes is a good starting point.
In Conclusion
Centralizing, understanding and owning your own data is a good thing. Creating a data warehouse that facilitates this is a no-brainer and a strategic necessity in today’s market. But getting it right requires careful planning and cutting through the noise. Take a deep breath, and use the framework outlined in this article to simplify some of these complexities.
If you managed to make it all the way through, leave a comment, ask questions, and let me know what you think!