Four building blocks for scaling insights — Part 2: The evolution of our insight infrastructure

Anders Asheim Hennum · Oda Product & Tech
Jun 19, 2020 · 10 min read

A modular approach helped us scale our insight infrastructure as we grew rapidly and went from start-up to scale-up. Here is the story of how it evolved from basic manual reporting to a modern self-served insight platform. This is part two in our blog series “Four building blocks for scaling insights”. You can read the first part “The embedded model” here.

Early start-up phase

When I started at Kolonial.no in 2016, we didn't have a data warehouse. Most of the reporting and data were provided through our Django monolith. As a developer, it was simple and joyful to do everything in one system. Web shop, warehouse management, fulfillment, distribution, CRM, newsletters, campaigns, supply chain management, you name it. The monolith contained software for everything an online grocery store needed to operate. I'll probably never feel as productive as I did back then, working in a small team on a monolith.

Any ad-hoc data questions had to be handled by developers. As usual in a start-up, things were often burning and the faster we could deliver the better. For a long time our monthly financial reporting was a Jupyter notebook I had on my laptop that read directly from a production database. Typically, you were working on a critical feature or fix when someone said “Hey, can I have those numbers by tomorrow?”. The solution wasn’t pretty, but it was a simple and fast way to deliver numbers. It freed up time we could spend on more critical stuff.

After a year we had grown a lot, both in terms of customers and employees. But access to data still went through a developer, unless the question fit one of our pre-made reports. People were constantly asking for data, and we were nowhere near able to respond to all the requests.

Developers were the bottleneck for providing data.

Self service

We needed a solution where people could explore data and ask questions themselves, without needing to write code. Spending developer time on serving people with data was obviously not productive in the long run. After some research I found an open source tool called Metabase that seemed to solve the problem we had. It supported ad-hoc analysis without writing code and had decent dashboards and visualization capabilities. And the way it worked was simple: put your data in a database and point Metabase to it. All it did was execute queries directly against the database.

I couldn't wait to make some of our own data available in Metabase, so I started on a prototype over a weekend. As a starting point, I wrote a few export jobs in our monolith that transformed and exported data to CSV files. The files were loaded into Postgres with simple COPY statements, and finally Metabase was connected on top.
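The load step really was nothing fancier than that. Here is a rough sketch of what it looked like, with hypothetical table and file names rather than our actual schema:

```sql
-- A minimal sketch of the prototype's load step (hypothetical names).
-- Each export job in the monolith wrote a CSV; Postgres ingested it with COPY.
CREATE TABLE IF NOT EXISTS orders (
    order_id    integer,
    customer_id integer,
    created_at  timestamptz,
    order_value numeric(10, 2)
);

-- Reload the table on each export run so it mirrors the latest export.
TRUNCATE orders;
COPY orders (order_id, customer_id, created_at, order_value)
FROM '/exports/orders.csv'
WITH (FORMAT csv, HEADER true);
```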

And with that, our own data was suddenly easily accessible. Revenue graphs broken down however you wanted, with just a few clicks. Ad-hoc queries at your fingertips, no code required. The potential this would unlock felt enormous at the time. I was so stoked to show this off at work.

The first version of our insight infrastructure.

And people loved it. Finally they could explore data and ask questions themselves. But it didn't take long before they wanted more. "Can you add this field?". Sure we could. But first we had to write the code, then deploy the monolith, and finally re-export and re-load the affected data. A cumbersome and time-consuming process. It wasn't easy to add data or make changes in this setup, and it quickly became the next bottleneck for people to get the data they were seeking.

Leveling up transformations

As one typically does when having no clue how to solve something, I started searching the internet for a solution to our problem. Big, heavy, enterprise solutions showed up. But we didn’t want to change everything. Postgres and Metabase worked just fine at the moment. Our problem was the time it took to make changes in the transformations and add new data. Not the database or the user interface itself.

That's when I found dbt (data build tool). It took some time to understand what it was, but once I grasped the idea behind it, it immediately felt like what we needed. With this open source tool we could simply import a cleaned dump of our whole production database into Postgres, and then do all the analytical transformations inside the database. By keeping most of the raw data inside the database, there was no need to re-export and re-load data whenever changes to existing transformations had to be made. The process of making changes was reduced to editing the SQL for just the transformation part of the pipeline. This workflow could of course have been implemented without dbt, but dbt significantly reduced the work required. It also included an awesome set of features that made things like testing easy.
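To give a feel for what that looks like: a dbt model is just a SQL SELECT statement in a file, and dbt resolves the dependencies between models via ref(). The model below is a made-up illustration, not one of our actual models:

```sql
-- models/order_facts.sql: a hypothetical dbt model, for illustration only.
-- dbt materializes the SELECT as a table in the warehouse.
{{ config(materialized='table') }}

select
    o.order_id,
    o.customer_id,
    o.created_at::date as order_date,
    sum(l.quantity * l.unit_price) as order_value
from {{ ref('stg_orders') }} as o
join {{ ref('stg_order_lines') }} as l on l.order_id = o.order_id
group by 1, 2, 3
```

Changing a transformation then comes down to editing that SQL and re-running the model (with something like dbt run --models order_facts), without touching the extract or load steps.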

Our infra engineers already had a solution for cleaning, dumping and restoring the database, so that part was taken care of. To replace the existing pipeline, the main job was to migrate our analytical transformations from Python to SQL. I was skeptical at first: I was by no means an expert in SQL. But the more I got into it, the more I liked it. For our transformations, the code turned out to be clearer and easier to understand. As Metabase only depended on the schema of the final analytical tables, we could swap out the source without anyone noticing. For the end user, things worked just as before. The real power and value was that we could now make changes in a matter of minutes. No need to extract or load data; we just had to re-run the affected model, which simply meant executing its SQL. Making changes was no longer a bottleneck.

Real-time data

We were still growing fast: lots of new customers generating data and more employees asking for data. We had also hired data scientists and analysts who soon became the main contributors to creating transformations and content. Not surprisingly, the next bottleneck to show up was the time it took to load a full database dump and run the transformations. Our morning reports were often delayed, and if something interrupted the process, data wasn't available until late in the evening. In addition, the demand for real-time data had increased.

Was there a way to replicate data in real time between two Postgres masters? Pglogical is an extension to Postgres that lets you do exactly that. You set up the source database as a provider and the target database as a subscriber, connect them together, and almost magically, you have data replicated with sub-second delay. By replacing our batch pipeline (dump and restore) with pglogical, we could provide data not just more reliably, but also in real time.
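Roughly, the setup boils down to a handful of SQL calls against the pglogical extension. The sketch below uses placeholder node names and connection strings, not our actual configuration:

```sql
-- On the provider (source) database: placeholder DSNs, for illustration only.
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'provider',
    dsn       := 'host=source-db dbname=shop user=replicator'
);

-- On the subscriber (analytical) database: create its node and subscribe
-- to the provider; replication starts once the subscription is up.
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'subscriber',
    dsn       := 'host=analytics-db dbname=warehouse user=replicator'
);
SELECT pglogical.create_subscription(
    subscription_name := 'analytics_sub',
    provider_dsn      := 'host=source-db dbname=shop user=replicator'
);
```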

We decided to set up a proxy database that served as a layer between the production database and the analytical database. This would also simplify things if we later decided to switch data warehouses and use an off-the-shelf tool to load data, without connecting it directly to our production database.

A new challenge we faced with this switch was schema changes. Previously, they had been handled by recreating tables every night. Now that we were streaming data, we had to handle schema changes in-flight. Also, having to manually execute SQL statements to add new tables for replication was not a smooth process. To fix that, we wrote a small Python application that does two things: it keeps the schema in sync by automatically applying the schema changes that occur in production, and it controls which tables and columns to replicate by watching a config file.
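Under the hood, the application issues the same kind of statements we used to run by hand. Something along these lines, with hypothetical table and column names:

```sql
-- Statements of the kind the sync application automates (hypothetical names).

-- 1. Mirror a schema change from production on the subscriber, so incoming
--    rows with the new column can be applied.
ALTER TABLE public.orders ADD COLUMN delivery_window text;

-- 2. Add a table that was just enabled in the config file to the provider's
--    replication set, backfilling the existing rows.
SELECT pglogical.replication_set_add_table(
    set_name         := 'default',
    relation         := 'public.delivery_slots',
    synchronize_data := true
);
```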

The "third" version of our data infrastructure.

Migrating to Looker

At this point we had more than 100 users in Metabase. There wasn't any strict central governance and things had just grown organically. People were themselves responsible for creating useful content and naming things properly. And it was total chaos. We also had performance issues with Metabase (the application itself), and people who spent a lot of time in it got more and more annoyed. We had grown out of Metabase. (That said, I can recommend Metabase, especially to start-ups.)

After some research and testing, we decided to go with Looker. With a metadata layer between the analytical data and the end users, Looker provides a way to curate the data. It requires more work by the data team, but reduces the complexity for end users. Errors are less likely to occur as the data team can more easily control how the data is used. For us, it seemed like the right tool based on our needs and previous experiences.

Like most modern business intelligence tools, Looker doesn't store any data. We could point it to our Postgres data warehouse and start creating content right away. We wanted to have all our business logic in one place, so we kept everything in dbt and primarily used the metadata layer in Looker to curate and organize how the data was exposed to end users.

Migrating to Snowflake

For some time, though, we had been experiencing more and more performance issues with Postgres. Our data volumes had increased significantly, and it took more and more time to make sure Postgres performed reasonably well. Even though it was solvable, spending time on tuning Postgres itself didn't create any value for the business. If we could avoid that, we would free up resources that could be spent elsewhere.

As a replacement, we narrowed the alternatives down to Snowflake or BigQuery, both fully managed cloud solutions supported by dbt. After some initial testing, we found that Snowflake performed better on our most typical queries. It would also require fewer changes to our SQL code, as it was more similar to Postgres than BigQuery. Thus, we decided to go with Snowflake.

With Snowflake, we needed a new way to load data into the warehouse. We tested a few alternatives and found that Fivetran had the features we needed and performed most reliably with our data volumes. By using an off-the-shelf solution, the data engineering required was reduced to a minimum. We didn't want a third party connected directly to our production database, so we connected Fivetran to the proxy database described earlier. This also served as a layer where we could filter out tables and columns we didn't want to expose to a third party, which simplified the privacy assessment.

With the raw data loaded into Snowflake, the job for our data team was to port all our existing SQL transformations from Postgres SQL to Snowflake SQL. After running a Perl regex script, about 150 of the 300 models ran out of the box on Snowflake. The rest were mostly straightforward changes, except for a handful of complex models that relied on specific PostgreSQL features not available in Snowflake. We also took the time to improve the overall code quality and made sure we followed our naming conventions. All in all it took us three weeks to switch Looker from Postgres to Snowflake. As Looker only depended on the schema of the underlying data, we could create content in Looker and do the switch simultaneously, with nothing for end users to notice other than performance improvements. The large job of creating useful content in Looker and training employees to actually become self-served was still ahead of us (topic for a coming blog post, stay tuned!). But at this point, we felt confident that our stack would both scale and be flexible enough to adapt to our future business needs.
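To give a flavour of the porting work, here is a hypothetical example (not one of our actual models) of a Postgres-specific construct, DISTINCT ON, and how the same logic can be expressed in Snowflake:

```sql
-- Postgres: latest order per customer, using DISTINCT ON
-- (not available in Snowflake).
select distinct on (customer_id)
    customer_id,
    order_id,
    created_at
from orders
order by customer_id, created_at desc;

-- Snowflake: the same result, expressed with a window function and QUALIFY.
select
    customer_id,
    order_id,
    created_at
from orders
qualify row_number() over (
    partition by customer_id
    order by created_at desc
) = 1;
```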

The current version of our insight infrastructure.

A modular infrastructure

Most businesses, especially start-ups, operate in a very dynamic and rapidly changing environment. The data stack should reflect that. A modular approach to our data infrastructure enabled us to tackle the bottlenecks we were facing without having to replace the whole infrastructure. We could do it step by step as we grew and our business needs and priorities changed. Also, the data landscape is changing fast, and it's difficult to keep an overview of all the solutions and technologies you could possibly invest in. I think the UNIX philosophy applies here (as in many other places): choose solutions that do one thing well. It makes it easier to replace solutions later and reduces vendor lock-in.

And be careful where you put your business logic. The way I see it, we have mostly invested in expressing our transformations in SQL. Changing that would be time consuming, but I consider it a safe bet given the history of SQL and its current popularity. The database, the way we move data in and out, and the tools we connect on top of it can all be changed as we see fit. I consider that a huge strength of our data infrastructure.

Coming up: Best development practices

An infrastructure alone is not enough to achieve self-served insight. To be able to make changes effectively, you need a good process as well. In our next post, we will go into detail about the development practices that help our data team work effectively.

We are hiring!

Please check our open positions if you would like to join an awesome team on a mission to create freedom and value for our customers. Read more about what we are doing here and feel free to reach out if you have any questions!

Thanks

This post was made possible with help and feedback from Nina Walberg, Andreas Sola Fischer and the Data & Insight team at Kolonial.no. A special thanks goes to Fishtown Analytics, who created and open sourced dbt. Their work and the dbt community have inspired much of how we work with analytical engineering.
