A Brief History of Liv Up’s Data Platform

Luiz Arakaki
Liv Up — Inside the Kitchen
Jun 2, 2020 · 9 min read

This is the journey of building a data platform, from 2017 to 2020. During these three years, our business changed a lot: our team grew from 35 to 500+ people, we expanded from two cities to thirty, we went from processing a few megabytes to terabytes of data every day, and we raised three funding rounds. As a result of this continuous change, our platform had to evolve accordingly.

Part 1: The (Not So) Good Ol’ Days

In the beginning, there was only an inexperienced intern and a nice challenge: leverage Liv Up’s small data to support business decisions through analysis, automate reporting, and start building the foundations of a data team (we called it Business Intelligence).

With great support from Tristan Handy’s newsletter and his Guide to Analytics (in 2020 I also recommend this post from Locally Optimistic), we built the very first version of the platform.

By that time, our main database was MongoDB, and we had other data sources such as Zendesk, Facebook Ads, Google Analytics, Google Spreadsheets, Mailchimp, etc. We decided to build an ELT architecture, using Stitch to extract data from all sources (including MongoDB) and load it into Postgres, which acted as our data warehouse.

Then we transformed data with Pentaho Data Integration (PDI), using its drag-and-drop graphical user interface to build our core tables, and created visualizations in Metabase. In some cases PDI’s built-in transformations weren’t enough, so we wrote Python and SQL code and used PDI as a scheduler.

The very first architecture

This was a simple, amateur, non-scalable platform. We didn’t even use git, and most tools were no-code or low-code. Postgres and Metabase were hosted on Heroku (which doesn’t make much sense: it is very expensive and you can do the same just as easily on AWS or GCP). Pentaho ran locally on a Windows laptop (yeah, not such good old days).

It was an effective platform, though. We built dashboards to monitor the main business metrics, integrated several data sources into a single source of truth, and developed simple analyses to support business decisions. We got the basics: automated reports and metrics. It was also cost-effective: after we moved from Heroku to AWS, all costs together summed to less than $200/month.

This architecture worked for around one year. As we grew from 35 to 100 people, this version didn’t scale. It was a nice first version, but it needed to evolve.

Transformation flow (DAG) in Pentaho Data Integration

The main problems our platform faced:

  • Postgres wasn’t scaling anymore: with queries and transformations scanning gigabytes, we had constant locks.
  • Transforming data in PDI was a mess. With huge transformations spanning many steps, it could take days to change a single field. And it simply didn’t run when the Windows machine shut down 😂.
  • Metabase was doing great, but we started having problems with governance. As anyone could define their own metrics, we started seeing two different numbers for the same measure.
  • PDI isn’t suitable for running data science jobs. As we developed more models, we needed a better tool that could handle larger loads.
  • Google Analytics is fine for simple monitoring, but it is very limited for granular analyses or event processing.

Part 2: Adopting a Modern Stack

By the end of 2018, we could invest more heavily in our data platform. We hired a data scientist and a data engineer to help us build the next phase of the data team.

Data Warehouse

The first decision was to move from Postgres to BigQuery (BQ). After many talks with Redshift, Snowflake, and BQ users, we decided to go with BQ and adopt Google Cloud Platform as our main cloud provider. The decision criteria were:

  • Serverless. We had a small team and didn’t want to manage servers.
  • Cost. It couldn’t be extremely expensive or have high fixed costs.
  • Scalability. We didn’t need extreme scalability, but we didn’t want to think about it again in the foreseeable future.

Other nice-to-have BigQuery features for us:

  • JSON compatibility. It helps with a MongoDB source (see the sketch after this list).
  • Google Sheets integration. Read and write using federated queries.
  • Documentation. As with most GCP products, docs are awesome.
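
As a rough illustration of the JSON point, this is the kind of query we can run over a MongoDB-sourced table where a nested document landed as a raw JSON string. Project, dataset, and field names below are made up.

```python
# Illustrative only: extracting a nested field from a MongoDB-sourced
# column stored as a JSON string (all names here are hypothetical).
from google.cloud import bigquery

client = bigquery.Client()  # uses the environment's default credentials

query = """
SELECT
  _id,
  JSON_EXTRACT_SCALAR(address, '$.city') AS city  -- reach into the raw document
FROM `my-project.raw_mongodb.users`
"""

for row in client.query(query).result():
    print(row["_id"], row["city"])
```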

Transforming Data

We followed the trend of transforming data inside the data warehouse. We write only SQL, leveraging the BigQuery engine to transform data and simplify the process. Our analysts can contribute without deep data engineering knowledge: all they need is SQL and a tool to manage everything else. There are two main options, Dataform and dbt. We decided to go with Dataform, as it has a nice GUI, advanced testing features, and great support.

Transformations in Dataform, every box is a SQL transformation

Today we have 500+ transformations in Dataform; every new transformation is code reviewed and requires tests for key uniqueness, value ranges, or completeness (Dataform has an assertion feature that makes it easy to test transformations).

Automated tests verify key uniqueness; self-documentation is sent back to BigQuery as metadata
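
The real checks are written as Dataform assertions, but conceptually each one boils down to a query that must return zero rows. Here is a minimal sketch of the key-uniqueness case, run through the BigQuery Python client with made-up table and column names.

```python
# Conceptual sketch of a key-uniqueness check; the real ones are Dataform
# assertions. Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT order_id, COUNT(*) AS n
FROM `my-project.analytics.orders`
GROUP BY order_id
HAVING COUNT(*) > 1  -- any row returned means the key is not unique
"""

duplicates = list(client.query(query).result())
assert not duplicates, f"orders has {len(duplicates)} duplicated keys"
```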

Data Visualization

Looker was a natural evolution from Metabase. Its modeling layer guarantees governance while keeping the same principle of using the data warehouse as the query engine. LookML guarantees consistency of metrics across the company and lets the data model evolve constantly. We can easily build and share dashboards.

Metabase has limitations connecting to BigQuery. The native connector can query only one dataset (schema) at a time, and it doesn’t handle array or struct unnesting.

Orchestration

Although SQL is awesome for transforming data, we needed Python to build some transformations, integrate with APIs, and run data science jobs.

Airflow is top of mind when the topic is orchestration, but we were aware of scalability issues and problems with operators. After reading this awesome post on why we should only use Kubernetes Operators, we decided to go with Cloud Composer.

Cloud Composer is a fully managed GCP workflow orchestration service built on Airflow. It abstracts most of the complexity of running a Kubernetes cluster and enables great scalability and environment separation. Every job is a Docker container running in a Kubernetes pod.
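
As a sketch of what this looks like in practice, a DAG following this pattern is mostly boilerplate around a KubernetesPodOperator. The job name, schedule, and image below are placeholders, and the import path depends on the Airflow version bundled with Composer (this one follows the 1.10.x contrib layout).

```python
# Hypothetical DAG: the job itself lives in a container image; Airflow only
# schedules it as a Kubernetes pod.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="daily_demand_forecast",     # hypothetical job
    schedule_interval="0 6 * * *",
    start_date=datetime(2020, 1, 1),
    catchup=False,
) as dag:
    forecast = KubernetesPodOperator(
        task_id="run-forecast",
        name="run-forecast",
        namespace="default",
        image="gcr.io/my-project/forecast:latest",  # image built and pushed by CI
        arguments=["--date", "{{ ds }}"],           # Airflow passes the execution date
    )
```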

We use CircleCI to automate deployment: building container images, pushing them to DockerHub, and deploying DAGs to Composer.

Event Data

We were using Google Analytics to track user behavior on our website, but it has limitations. Data is aggregated and anonymized, and it doesn’t support native mobile apps (there is a Firebase integration, but it isn’t centralized). GA 360 wasn’t an option due to its price.

We decided to implement Snowplow as our event data collection platform. Being open source is a major plus, and it handles web, app, and backend event data. Its architecture is loosely coupled, so we can adapt it to new use cases, and it processes data in near real time.

We capture tens of millions of events every month, and they are the basis of our product analytics and real-time event processing (for recommendation systems, fraud analysis, etc.).

The architecture with modern services

Part 3: Embracing Streaming

While these services still serve us well today, there were a few improvements we needed:

  • Stitch is a single point of failure. If there was an outage (rare, but it happened), we were in trouble, as many teams depend on the platform to operate.
  • Stitch is batch only, so we had ~1-hour latency in this pipeline. For some use cases, such as inventory management, we needed lower latency.
  • Event data analytics is hard in Looker. We needed a better tool to analyze event data.
  • Although Looker is designed to monitor metrics in dashboards, it isn’t flexible enough for exploratory analyses.

Lambda Architecture

Solving two problems at once, we developed a near real-time pipeline to ingest data from MongoDB.

This pipeline is based on the Lambda Architecture. Data is loaded through two channels: batch and real time. A deduplication mechanism in BigQuery always treats the batch layer as more trustworthy and completes it with data coming from the speed layer (data that hasn’t been processed in batch yet).

Our lambda architecture
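
A rough sketch of that deduplication, as the kind of view we can keep in BigQuery (project, dataset, and key names are illustrative, and a fuller version would also pick up newer updates from the speed layer):

```python
# Illustrative only: the shape of a deduplicating view in BigQuery.
# Batch rows (loaded by Stitch) are treated as the source of truth; the
# speed layer fills in documents the batch layer hasn't loaded yet.
from google.cloud import bigquery

DEDUP_VIEW_SQL = """
SELECT * FROM `my-project.batch.orders`
UNION ALL
SELECT s.*
FROM `my-project.speed.orders` AS s
LEFT JOIN `my-project.batch.orders` AS b
  ON s._id = b._id
WHERE b._id IS NULL  -- only documents not yet present in the batch layer
"""

client = bigquery.Client()
client.query(
    f"CREATE OR REPLACE VIEW `my-project.analytics.orders_deduped` AS {DEDUP_VIEW_SQL}"
).result()
```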

The speed layer uses MongoDB Change Streams combined with serverless Stitch Functions (these functions have no relationship with the Stitch data loader; they are a MongoDB Atlas feature).

These functions are triggered by any document creation or update and send a request to the publisher. A message is then published on a Pub/Sub topic and consumed by the loader, which writes to BigQuery using the Streaming Insert API.
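
The loader’s core is small. Here is a hedged sketch, assuming it runs as a Pub/Sub-triggered Cloud Function, that the message carries the changed document as JSON, and that the destination table below is a placeholder.

```python
# Hypothetical Pub/Sub-triggered Cloud Function: receives a change event
# published by the Stitch Function and streams it into BigQuery.
import base64
import json

from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.speed.orders"  # placeholder destination table


def loader(event, context):
    """Entry point: one invocation per Pub/Sub message (one MongoDB change)."""
    document = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    errors = client.insert_rows_json(TABLE, [document])  # BigQuery Streaming Insert API
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")
```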

This pipeline has two advantages: we have a near real-time MongoDB ingestion pipeline and redundancy on the Stitch data loader.

Product Analytics

While Looker is perfect for monitoring metrics and guaranteeing data governance, it is slow and costly for event analysis. It is hard to build funnels or cohorts, and doing so usually requires deep SQL and LookML knowledge.

To empower Product Managers, Analysts, and Tech Leads with event data, we built a pipeline to load data to a product analytics tool. We are using Indicative. It is recommended by Snowplow and has a very generous free plan.

Leveraging Snowplow’s loosely coupled architecture, we subscribed to its event processing pipeline and loaded data into the Indicative API. Snowplow calls this kind of service a Relay. The Snowplow team has developed a Relay that loads data into Indicative, but it works only on AWS, so we built our own.

Indicative pipeline

We created a new consumer on the Snowplow enriched events topic: a Dataflow job (using a Google template) that writes events to a text file in Cloud Storage every 5 minutes. A Cloud Function is triggered every time a new file is created in the bucket and loads its contents into Indicative.

We had to accumulate 5 minutes of events because the Indicative API would time out with one request per event, and Google Cloud Functions doesn’t have a feature to accumulate messages before processing (AWS offers this with Lambda and SQS).
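
Here is a hedged sketch of that Function, assuming each line in the file is already a JSON event in the shape Indicative expects; the endpoint and batch payload below are assumptions, so check Indicative’s Event API docs before reusing this.

```python
# Hypothetical GCS-triggered Cloud Function: reads the 5-minute batch file
# written by Dataflow and forwards its events to Indicative.
import json
import os

import requests
from google.cloud import storage

INDICATIVE_URL = "https://api.indicative.com/service/event"  # assumed endpoint
API_KEY = os.environ["INDICATIVE_API_KEY"]

storage_client = storage.Client()


def relay(event, context):
    """Entry point: one invocation per file created in the bucket."""
    blob = storage_client.bucket(event["bucket"]).blob(event["name"])
    lines = blob.download_as_string().decode("utf-8").splitlines()
    events = [json.loads(line) for line in lines if line]
    response = requests.post(
        INDICATIVE_URL,
        json={"apiKey": API_KEY, "events": events},  # assumed batch payload
        timeout=60,
    )
    response.raise_for_status()
```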

AI Notebooks

We added this capability to enable analysts to build more complex analyses. While our data scientists use a local JupyterLab daily to model and prototype algorithms, our analysts didn’t have an equally flexible tool.

Creating a local Python environment, managing git branches, setting up local credentials: there are many barriers for newcomers to notebooks. We simplified this process with Google Cloud AI Notebooks.

AI Notebooks is a managed JupyterLab service. With a few clicks, our analysts can deploy an instance running JupyterLab in a private environment, seamlessly integrated with other GCP products (such as BigQuery) and git.
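
A typical first cell in one of those notebooks looks roughly like the sketch below (the table name is a placeholder); the BigQuery client picks up the instance’s credentials, so nobody has to manage keys locally.

```python
# Typical notebook cell: pull a transformed table straight into pandas.
# The table name is illustrative; credentials come from the instance itself.
from google.cloud import bigquery

client = bigquery.Client()

df = client.query("""
    SELECT order_date, SUM(total) AS revenue
    FROM `my-project.analytics.orders`
    GROUP BY order_date
    ORDER BY order_date
""").to_dataframe()

df.tail()
```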

The current architecture

So we upgraded our architecture with two streaming, near real-time, serverless pipelines and a notebook service. The serverless philosophy is strong in the Liv Up data team: we built all of this with one data engineer. If we had to manage servers, we would be stuck in maintenance; instead, BigQuery, Cloud Functions, Pub/Sub, Stitch, Stitch Functions, and Composer let us focus on what matters: building.

Looking toward the future

That’s our story so far, and there are many challenges ahead.

  • We are refactoring many transformations from full rebuilds to incremental updates to reduce costs and increase processing frequency.
  • Our production systems are evolving towards a microservices architecture, which demands changes in our data loading services.
  • We have to deal with a new privacy law.
  • Our analytics team expanded from 5 to 15 people in a few months, so we need to ramp everybody up and guarantee everyone can contribute.
  • We are studying the creation of a new role, the analytics engineer, to scale our platform as complexity grows.
  • As we build our data science products, we are investing in machine learning engineering to improve productivity on machine learning projects.

If you want to build the food company of our time and join an awesome data team, let us know. Any feedback on the architecture is appreciated! You can find me on LinkedIn and find our open positions on our career page.
