The Story of Building the Data Lake at Varo

Vinod Pazhettilkambarath
Engineering @Varo
5 min read · Mar 11, 2020

What is a Data Lake? What problem are we solving by building a Data Lake? Please note that throughout this blog, we will be using Varolake and Data Lake interchangeably.

In today’s world, applications are built to solve almost any problem, and their number is bound to grow at a rapid pace, day after day. The most fundamental entity these applications operate on is data. In a business like ours, applications are built to serve specific purposes geared toward driving key business decisions and initiatives. For example, you might want to send a small gift to a customer for making their 100th cash deposit of $100 or more.

The above diagram shows a customer posting a mobile check deposit. This transaction is facilitated by the banking transactional system, which stores the data in an OLTP database. The banking system’s primary job is to facilitate the customer’s daily banking needs (e.g., checking the balance, making a deposit, withdrawing money).

Rewarding a customer is non-transactional in nature and is based on customer history, so the organization will have another application for performing the rewards functionality. This rewards application needs to consume the transactional data periodically from the banking system in order to perform its daily tasks (such as identifying customers that hit specific rewards criteria and sending them discount coupons to major restaurants).
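To make this concrete, here is a minimal sketch of the kind of check a rewards job might run over transactional data. The record shape, field names, and threshold are all illustrative assumptions, not Varo's actual schema:

```python
# Hypothetical transaction records the rewards app might pull
# from the banking system; field names are illustrative only.
transactions = [
    {"customer_id": "c1", "type": "deposit", "amount": 150.0},
    {"customer_id": "c1", "type": "deposit", "amount": 90.0},
    {"customer_id": "c2", "type": "deposit", "amount": 200.0},
]

def qualifying_deposit_count(txns, customer_id, min_amount=100.0):
    """Count deposits of at least min_amount for one customer."""
    return sum(
        1
        for t in txns
        if t["customer_id"] == customer_id
        and t["type"] == "deposit"
        and t["amount"] >= min_amount
    )

# A rewards job could flag customers whose count hits a milestone
# (e.g., the 100th deposit of $100 or more mentioned earlier).
print(qualifying_deposit_count(transactions, "c1"))
```

The point is that this logic needs a *copy* of the banking system's data to run against, which is exactly where the redundancy problem described next comes from.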

The above architecture works well for a small organization, but there is one major problem with this approach if the organization plans to scale over time. As you can see in the diagram, the transactional data is copied over to the rewards DB periodically, which means redundant data is distributed across multiple systems in the organization. This creates the additional overhead of managing the same data in multiple systems. Although storage is cheap these days, the overhead of data governance, monitoring, security, and preserving data integrity is doubled. This overhead grows every time a new application is built within the organization. Eventually, data will be scattered all over the organization, as shown in the diagram below.

As you can see, there are four applications in the organization, each having a specific list of tasks to accomplish. These applications consume data from either internal data sources or from external data sources thereby creating multiple pools of data scattered all over the organization.

The main disadvantages of this approach are the following:

  1. A high chance of data being inconsistent between systems, which casts doubt on the validity of the data and creates confusion about which system to trust.
  2. Data governance overhead, such as tracking data lineage for the same set of data in multiple systems.
  3. The overhead of monitoring data quality for the same set of data in multiple systems.
  4. The overhead of developing extract, transform, and load (ETL) processes all over the organization in order to interchange data between individual systems.

To address these issues of having data distributed across the organization, we at Varo came up with a plan to centralize our data in a single location. The answer to this problem was building a Data Lake, and thus Varolake was born! Varolake is built to be the central repository of all our data (both internal and external) and is used as the primary data source for both internal and external applications.

As you can see in the above diagram, data (structured, semi-structured, and unstructured) from both external and internal systems is streamed or batched into the Data Lake. Applications 1 and 4 produce data that is ingested into the Data Lake. Application 2 consumes the data produced by applications 1 and 4; since that data is now available in the Data Lake, application 2 just has to consume what it needs from the Data Lake.

Following are some of the many benefits of having a Data Lake in an organization:

  1. Data consistency across the organization
  2. Centralized data governance
  3. ETL processes centralized in a single location, allowing application teams within the organization to focus on developing features and worry less about ETL, data security, and governance
  4. Centralized cleansed data available for consumption across various teams within the organization
  5. One central location for the whole company to run ad-hoc queries against data produced by one or many applications
  6. Central location for accessing RAW and Curated/Cleansed data

Varolake Architecture

Now that we know the benefits of building a Data Lake, let me dive into the architecture of Varolake in a bit more detail. Varolake comprises four main zones, or layers.

The first zone is referred to as the archive/landing zone, and the second is referred to as the raw zone. Data from both internal and external systems enters the lake through the landing zone, where the original file format and data structure are preserved. Data is then moved from the landing zone to the raw zone, where very minimal transformation is done, but the data is compressed and stored in Parquet format for better data access performance. Data from the raw zone is then cleansed, normalized, and made available in the curated zone for consumption by any entity/team within the organization. Teams can use tools such as Athena or QuickSight to perform ad-hoc queries against the data in the curated zone. For reporting applications, such as BI, data is rolled up, and this aggregated data is made available in the aggregation zone.
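The flow through the first three zones can be sketched in a few lines. This is a stdlib-only toy, not Varolake's actual pipeline: the directory layout and field names are made up, and gzip stands in for Parquet compression (real Parquet output would need a library such as pyarrow):

```python
import gzip
import json
import tempfile
from pathlib import Path

# Illustrative zone layout; real lake paths/buckets are assumptions.
LAKE = Path(tempfile.mkdtemp())

def land(filename: str, payload: bytes) -> Path:
    """Landing zone: preserve the original file format and structure."""
    dest = LAKE / "landing" / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return dest

def to_raw(landed: Path) -> Path:
    """Raw zone: minimal transformation, compressed storage.
    (Gzip stands in here for Parquet conversion.)"""
    dest = LAKE / "raw" / (landed.name + ".gz")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(gzip.compress(landed.read_bytes()))
    return dest

def to_curated(raw: Path) -> Path:
    """Curated zone: cleanse and normalize (here: lowercase the keys)."""
    records = json.loads(gzip.decompress(raw.read_bytes()))
    cleaned = [{k.lower(): v for k, v in r.items()} for r in records]
    dest = LAKE / "curated" / raw.name.replace(".gz", "")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(json.dumps(cleaned))
    return dest

# One record moving landing -> raw -> curated.
landed = land("deposits.json", json.dumps([{"Amount": 150, "Type": "deposit"}]).encode())
curated = to_curated(to_raw(landed))
```

Each zone is just a different promise about the data: the landing zone promises fidelity to the source, the raw zone promises compact storage, and the curated zone promises a clean, normalized shape.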

Having such clear, distinct roles for each zone makes it easy for internal teams to locate the data they are looking for in the Data Lake. For example, if someone wants data about total active customers for a given period (i.e., date, total_active_customers), they can just look for this dataset in the aggregation zone.

If you have read this far, then I hope you now have a clear understanding of what a Data Lake is and why it is such an important part of an organization. Our mission here at Varo is to build products that make our customers’ lives a lot easier. We are adopting the same mission internally to build a better system and make the lives of our internal teams a lot less stressful!

I hope you enjoyed reading my blog, until next time, Ciao!

Vinod



Vinod is a data enthusiast with a profound love for data. Apart from work, Vinod loves spending quality time with his family and playing cricket.