Zego’s Data Journey: First, get the data

Rui Lopes · Zego · Oct 18, 2018

Welcome to our first post about Zego’s journey to become a data-driven company. At Zego, we believe that a structured and consistent approach to data collection and storage should lie at the heart of any data stack. With our foundations firmly rooted in that principle, we set about building out our data journey.

Many of us have been in situations where a specific data set is available, but almost impossible to access. Usually the first problem is finding out who can grant programmatic access to the data (just getting management sign-off can take a while!). After that, writing scripts to extract and cleanse the data consumes more time, depending on the skills available. And how is that data going to be kept up to date?

We have all been there and we wanted to make sure that Zegons could have a better data experience.

In the first part of our series on Zego’s Data Journey, we’ll cover the key components of our data stack, why we chose each technology and what our experience has been so far.

Data Storage

After considering BigQuery, Snowflake and Redshift, we opted for Redshift. The reasons behind this decision were primarily scalability, its Postgres-like SQL syntax, and its optimisation for batch loading and large read operations. Seamless integration with the rest of our AWS stack was an additional benefit. After the initial implementation, not to mention a fair amount of research and experimentation, we went through an optimisation period in which we improved the following (illustrated in the sketch after this list):

  • Table distribution strategy and keys, to minimise the partitions accessed when performing queries, especially those with joins
  • Sort keys, to speed up the most common read operations
  • Column compression, to allow for quicker reads (for instance byte-dictionary encoding) and more compact storage (for instance delta encoding for sequential values)
  • Automation of database maintenance operations such as vacuuming and statistics updates, to avoid performance deterioration and downtime
  • Workload Management rules, to split queues based on usage and manage cluster resource allocation across the different functions using the data warehouse
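As a rough illustration of the first three points, the sketch below shows what a tuned Redshift table and its automated maintenance might look like. The schema, table, column and connection names are hypothetical placeholders and the encodings are just examples of the trade-offs described above; this is a sketch, not our production DDL.

```python
# Rough sketch of the Redshift table tuning described above, executed via psycopg2.
# Schema, table, column and connection details are hypothetical placeholders.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS policies_raw.policy_events (
    policy_id   BIGINT,                          -- distribution key, left uncompressed
    event_type  VARCHAR(32)    ENCODE bytedict,  -- low-cardinality text: byte dictionary
    created_at  TIMESTAMP      ENCODE delta,     -- sequential values: delta encoding
    payload     VARCHAR(4096)  ENCODE lzo
)
DISTKEY (policy_id)   -- co-locate rows that are joined on policy_id
SORTKEY (created_at); -- speed up the common time-range scans
"""

# Routine maintenance that we automate rather than run by hand.
MAINTENANCE = [
    "VACUUM policies_raw.policy_events;",
    "ANALYZE policies_raw.policy_events;",
]


def run(statements, dsn="dbname=analytics host=redshift.example.internal user=etl"):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            for statement in statements:
                cur.execute(statement)
    finally:
        conn.close()


if __name__ == "__main__":
    run([DDL] + MAINTENANCE)
```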

In terms of data warehouse design, we aimed to keep it simple, flexible and documented (a sketch of these conventions follows the list):

  • One schema per data source, making data lineage clear and intuitive
  • Separation of raw and derived data
  • Raw data loaded as close to the source as possible
  • Permissions granted at data-source level only (using schemas)
  • Documentation within the database, ensuring every table and field has at least a short description in plain English
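A minimal sketch of what those conventions translate to in practice is below. The data source (Stripe), group and table names are hypothetical examples rather than our actual warehouse objects; the statements could be run with the same psycopg2 helper shown earlier.

```python
# Sketch of the warehouse conventions above as executable SQL statements.
# Source (Stripe), group and table names are illustrative only.
STATEMENTS = [
    # One schema per data source keeps lineage clear and intuitive
    "CREATE SCHEMA IF NOT EXISTS stripe_raw;",
    # Derived / modelled data lives apart from raw data
    "CREATE SCHEMA IF NOT EXISTS analytics;",
    # Permissions are granted at the data-source (schema) level only
    "GRANT USAGE ON SCHEMA stripe_raw TO GROUP analysts;",
    "GRANT SELECT ON ALL TABLES IN SCHEMA stripe_raw TO GROUP analysts;",
    # Documentation lives in the database itself, in plain English
    "COMMENT ON TABLE stripe_raw.charges IS "
    "'One row per Stripe charge, loaded as-is from the source API';",
    "COMMENT ON COLUMN stripe_raw.charges.amount IS "
    "'Charge amount in minor currency units (pence)';",
]

if __name__ == "__main__":
    for statement in STATEMENTS:
        print(statement)
```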

Data Pipelines

The second key component of any data stack is the technology that moves data between systems in a consistent and reliable way. Our initial requirements were:

  • We don’t have time or resources to reinvent the wheel so we should minimise bespoke development
  • It needs to scale and therefore support parallel processing
  • We don’t want to be tied to a vendor, nor do we want our database or infrastructure choices limited by the data pipelines technology we chose
  • Data pipelines should be built in code and version controlled
  • Our data pipelines should be flexible to changes in source systems
  • Support functionality like scheduling, alerts, logging and monitoring should be available off-the-shelf

After considering Luigi, Apache Airflow and AWS Glue, we decided that Airflow would meet the initial requirements. Additionally, Airflow already had a vibrant development community with 500+ contributors on GitHub and was in use at many top global companies. Here are some of the lessons we learned during the implementation (a minimal DAG sketch follows the list):

  • Scheduling and tasks are separated for good reasons, and sticking to that separation will make the code base easier to maintain and the scheduler more predictable
  • Airflow supports very complex data pipelines, but highly complex dependencies tend to result in less stability
  • Airflow has lots of advanced features, but it is possible to get a full ETL solution into production without using all of them; starting with the LocalExecutor and a Postgres metadata backend worked for us to get it off the ground
  • Idempotence is really important and will determine how easy it is to fix ETLs when things go wrong
  • Logging, monitoring and alerts should not be afterthoughts; it pays off from day 1 and with Airflow the overhead is minimal
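To make a couple of these lessons concrete, here is a minimal sketch of a daily, idempotent DAG with alerting on failure. The connection ID, table names, email address and load query are hypothetical placeholders, and the import paths assume an Airflow 1.x-era install rather than our actual codebase.

```python
# Minimal sketch: a small, idempotent daily DAG with alerting from day one.
# Connection IDs, table names, email addresses and the load query are
# hypothetical placeholders; import paths assume an Airflow 1.x-era install.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email": ["data-alerts@example.com"],
    "email_on_failure": True,  # alerting is not an afterthought
}

dag = DAG(
    dag_id="load_policy_events",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

# Idempotent load: delete the execution date's rows before re-inserting them,
# so re-running a failed day never duplicates data.
load_day = PostgresOperator(
    task_id="load_policy_events_for_day",
    postgres_conn_id="redshift_analytics",
    sql="""
        DELETE FROM policies_raw.policy_events WHERE created_at::date = '{{ ds }}';
        INSERT INTO policies_raw.policy_events
        SELECT * FROM staging.policy_events WHERE created_at::date = '{{ ds }}';
    """,
    dag=dag,
)
```

In our experience, getting something like this off the ground needed little more than setting executor = LocalExecutor and pointing sql_alchemy_conn at a Postgres database in airflow.cfg; no extra infrastructure was required to start.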

Business Intelligence (BI) Platform

In an enterprise environment, direct access to the data warehouse is usually limited to advanced users. The majority of users consume data through a third-party Business Intelligence platform, sometimes also referred to as the BI suite, BI layer or just BI. The BI market is highly fragmented, with platforms distinguishing themselves on the level of self-service analytics, the sophistication of their visualisations and the data sources they support.

Our initial choice of BI platform was Mode Analytics. Some areas of our company have taken to it more than others, but we will be asking every Zegon for their feedback in order to improve the overall experience. The use of plain SQL to interact with the data warehouse and of Python to extend visualisations has been extremely beneficial. However, Mode Analytics lacks a strong community, and there is little technical support to compensate for this.

Zego’s Analytics Data Warehouse has now been live for 8 months, covering 7 internal and third-party data sources (4 in the pipeline) and 1.3 million raw data points per day. The data pipelines are stable and we have firmly established a culture of data sharing.

As Zego grows, so do its data needs: Zegons want data in greater volumes, which is more up-to-date and easier to use. It is our intention to spend the next four months understanding the detailed needs of our customers and delivering the data experience that will enable us to progress in our Data Journey.


Data Engineer with experience in start-ups, medium and large enterprises. Passionate about delivering value from data. https://www.linkedin.com/in/rui-lopes-966