Building a data warehouse. Supafast.

Ben Chino
Published in makipeople
5 min read · Sep 27, 2022

Aloha! Given our very data-driven nature (data is, after all, at the heart of psychometrics), we decided to invest in a solid data stack quite early on. Some of you might remember that in our previous venture, Everoad, we built a data warehouse in six months and described it here. The Maki world being a bit different, we decided to build this one in six days instead. Here’s how we did it.

Start with a clean and robust data model

The foundation of a strong data warehouse lies in its underlying data model. Building a reliable, performant warehouse on top of a shaky model makes little sense, as you would then spend your time patching for changes at the root, across several levels.

Imagine if Paul Bocuse were to cook three-star dishes in a clumsy kitchen

Luckily for us, our back-end team did a fantastic job creating a data model that eliminated redundancy, prevented inconsistencies, ensured integrity and, most importantly, scaled over time. And it was no trivial thing considering the layers of data we manipulate on a daily basis:

  • Layer 1 (structural): our tests consist of thousands of items and, for each item, multiple answers.
  • Layer 2 (contextual): our assessments are made of contextualized tests containing a designated set of items and answers to ensure fairness across that specific assessment cohort.
  • Layer 3 (individual): each candidate is shown a particular set of items and answers drawn from the contextual pool to prevent cheating.

As you can imagine, those three levels of data made it a real challenge for business teams to retrieve data points and work with them. Hence the need for a smart data warehouse.
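
To make this a bit more concrete, here is a minimal sketch of how these three layers could translate into tables. The names and columns below are illustrative placeholders, not our actual schema.

    -- Layer 1 (structural): the global bank of items and their possible answers.
    CREATE TABLE items (
      item_id  BIGINT PRIMARY KEY,
      test_id  BIGINT NOT NULL,
      prompt   TEXT   NOT NULL
    );

    CREATE TABLE answers (
      answer_id  BIGINT  PRIMARY KEY,
      item_id    BIGINT  NOT NULL REFERENCES items (item_id),
      label      TEXT    NOT NULL,
      is_correct BOOLEAN NOT NULL
    );

    -- Layer 2 (contextual): the pool of items selected for one assessment,
    -- so every candidate in that cohort is evaluated on comparable content.
    CREATE TABLE assessment_items (
      assessment_id BIGINT NOT NULL,
      item_id       BIGINT NOT NULL REFERENCES items (item_id),
      PRIMARY KEY (assessment_id, item_id)
    );

    -- Layer 3 (individual): the items actually shown to one candidate,
    -- drawn from the contextual pool to prevent cheating.
    CREATE TABLE candidate_items (
      candidate_id  BIGINT NOT NULL,
      assessment_id BIGINT NOT NULL,
      item_id       BIGINT NOT NULL REFERENCES items (item_id),
      shown_at      TIMESTAMP NOT NULL,
      PRIMARY KEY (candidate_id, assessment_id, item_id)
    );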

Set up your data environment

The data platform market is very hot right now, and some of the most highly valued companies out there, like Databricks or Snowflake, offer exactly that. But given we’re still a tiny startup trying to spend money carefully, we decided to keep leveraging our core infrastructure (GCP) and rely on BigQuery for warehousing and Data Studio for visualization. So for those of you considering setting up the fanciest data stack early on: follow the ‘ramen-entrepreneur’ philosophy and start scrappy first.

The big question was therefore how to get from a read-replica Postgres database to a sophisticated data warehouse. We initially thought about going the old way, i.e. setting up Apache Airflow to manage data engineering pipelines; but with no in-house data engineer and the need to focus on our core offering, we started investigating ways to automate pipeline management instead. Enter Hevo.

Hevo is an end-to-end data pipeline platform that enables you to easily pull data from all your sources to the warehouse, run transformations for analytics, and deliver operational intelligence to business tools.

Create your raw replica using pipelines

The really cool thing about Hevo is that you can easily set up data pipelines between sources and destinations through a simple UI. No code needed!

Let’s take an example: imagine you aim to create a raw replica of your database in BigQuery:

  1. Simply configure your source, that is, your initial database (preferably a read replica), by granting access to Hevo; a sketch of what that grant can look like on Postgres follows this list.
  2. Then configure your destination, that is, the warehouse you want to send the data to, in our case BigQuery.
  3. Map the tables and fields you want to replicate, or trust Hevo to auto-map and ingest automatically.
  4. Define the frequency or time(s) at which the Pipeline must ingest the data from the Source.
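
For step 1, here is a minimal sketch of what granting access could look like on a Postgres read replica. The user name, database and schema are placeholders, and the exact privileges Hevo needs depend on the ingestion mode you choose, so double-check its documentation for your setup.

    -- Hypothetical read-only user for the pipeline tool (all names are placeholders).
    CREATE USER hevo_reader WITH PASSWORD '<strong-password>';
    GRANT CONNECT ON DATABASE app_db TO hevo_reader;
    GRANT USAGE ON SCHEMA public TO hevo_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO hevo_reader;
    -- Make sure tables created later are readable as well.
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO hevo_reader;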

And let the magic happen: pipelines will ingest the data automatically, and you’ll be able to track progress on each table from the Hevo console. As a bonus, logs will help you understand what might have gone wrong in the process. In the example below, some fields were null where they shouldn’t have been.

Transform data to power operational needs

But the biggest value of Hevo lies in its ability to transform the initially loaded data into a form a BI tool can readily use for analytics and reporting. To do so, you simply write SQL queries over one or several tables to build the required data model (dimensions and facts) and schedule runs. No Python required anymore!

Managing pipelines using SQL models

At runtime, Hevo executes the query against the data warehouse and exports the results to another table in the warehouse. So, starting from a warehouse consisting of a first set of replicated raw tables, you can easily create your fact and dimension tables with SQL queries and give business teams all the data they need to perform their tasks in the easiest possible way. And this is exactly what we did at Maki, powering millions of complex data sets that our psychometric, business and support teams leverage on a daily basis.
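
As an illustration, here is the kind of model query you could schedule: Hevo runs the SELECT on the warehouse and writes the result into the destination table you configure. The table and column names below are made up for the example, not our actual warehouse.

    -- Hypothetical fact model: one row per candidate answer, enriched with
    -- whether the chosen answer was correct.
    SELECT
      ca.candidate_id,
      ca.assessment_id,
      ca.item_id,
      ans.is_correct,
      ca.answered_at
    FROM raw.candidate_answers AS ca
    JOIN raw.answers AS ans
      ON ans.answer_id = ca.answer_id;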

Et voilà. You now have the recipe to create your very own data warehouse, supafast. Don’t hesitate to let us know if you have any questions and, until then, happy warehousing!


Ben Chino
makipeople

Co-founder & CPO @ Maki. Former VP Product @ sennder & Uber alumni. Fan of black t-shirts, red wine & rock music.