Iguazu — The Z-Tech Data Lake is born

Written by: Rodrigo Abdo, Rafael Akimoto and Rodolfo Shideki.

--

There is a variety of Data Lake architectures on the internet, but today we will explore Iguazu, the Data Lake developed by the Data Engineering team at Z-Tech, the Anheuser-Busch InBev (AB InBev) technology hub for small and medium-sized businesses. Further articles will cover the evolution of the Z-Tech Data Lake: architecture, infrastructure, governance, team organisation, and more. In this one, we'll go back in time and show the first steps of Iguazu, the Z-Tech Data Lake.

Z-Tech: AB InBev’s tech hub focused on SMBs

Briefly, our mission at Z-Tech (ztech.net) is to empower small and medium-sized businesses to change the world through technology.

We develop the best technology solutions to support the activities of retailers, bars, restaurants, groceries, and other related businesses. We help small and medium-sized entrepreneurs to tap into new revenue streams, lower costs, improve management of inventory, staff, and vendors, and facilitate better decision-making through data-driven insights.

Within Z-Tech’s umbrella, we work with and incubate startups that make a difference in the lives of these small and medium-sized businesses. Z-Tech operates in Latin American countries such as Brazil, Mexico, Colombia, Peru, Ecuador, and the Dominican Republic.

Learn more about the startups that have been part of our ecosystem:

· Donus — https://www.soudonus.com.br/

· SiHay — https://sihay.mx/

· GetIn — https://www.getinapp.com.br/

· Menu — http://menu.com.br

· MiMercado — https://www.mimercado.com/

· Modelo Power

Data Engineering Team

In the early days of Z-Tech, our Global Data Engineering team primarily focused on Donus, Z-Tech’s proprietary fintech company in Brazil. Our mission is to organize and provide data in an easy and secure way, enabling Z-Tech users to develop KPIs, generate insights, build dashboards and reports, and run data analyses, and, further down the road, to implement Data Science.

Starting in September 2019, we focused on the data lake development. In the beginning it was a one-person team: the first data engineer was responsible for deploying the entire data lake stack and starting the first batch ingestions. After some months, more data engineers were hired to provide more data for Donus and to start the data lake expansion.

Starting Small and Agile

When you think about agile development, the idea of building a spaceship from scratch is totally obsolete. Like any other startup in the design phase, we had limited resources for implementing a scalable and robust data lake, so we started small and worked agile by adopting the concept of an MVP (Minimum Viable Product).

https://medium.com/@awilkinson/skateboard-bike-car-6bec841ed96e

We started small with “our skateboard” (v1), with the plan to build up to a car.

During development, and as business users worked with each service or product, we evaluated whether to evolve it or discard it. If a product wasn’t good, we discarded it and moved on (always fast) to the next one; otherwise, we moved forward to the “next levels” (scooter, bike, motorcycle, car…), adding new features to enrich the service, product, or platform.

We focused on Donus, and after some months all the Donus data analysts were consuming the data lake, which became the single source of truth for data. It was then that we started the data lake expansion.

Z-Tech Data Lake — v1

In early 2020, the data lake (v1) was very simple. Using Airflow (we love Airflow :) ), we were able to orchestrate the data extraction. Deploying Airflow is straightforward: there are many materials showing how to do it, and it’s open source and frequently updated by the community; we always prefer to use open source tools.

Some team members had a deep understanding of Airflow, so we were able to deploy a dockerized Airflow on EC2 and keep it running smoothly. The first datasets from Donus came from different databases, such as MySQL, Postgres, and MongoDB.
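To give a flavour of what one of those batch-extraction tasks could look like, here is a minimal sketch of the kind of Python callable an Airflow PythonOperator might wrap. The function names, row shapes, and the gzipped-JSON output are illustrative assumptions, not Iguazu’s actual code:

```python
import gzip
import json
from datetime import date, datetime


def serialize_batch(rows):
    """Serialize a batch of DB rows as gzip-compressed JSON lines.

    `rows` is a list of dicts, e.g. fetched from MySQL/Postgres with a
    dict cursor. The result is a bytes payload ready to be uploaded to
    the data lake's S3 bucket.
    """
    def default(obj):
        # Dates/datetimes from the database are not JSON-serializable
        # by default, so render them as ISO-8601 strings.
        if isinstance(obj, (date, datetime)):
            return obj.isoformat()
        raise TypeError(f"Cannot serialize {type(obj)!r}")

    lines = "\n".join(json.dumps(row, default=default) for row in rows)
    return gzip.compress(lines.encode("utf-8"))


def extract_table(cursor, table, batch_size=10_000):
    """Yield gzipped JSON payloads for a whole table, batch by batch."""
    # Illustrative full scan; a real job would usually filter
    # incrementally (e.g. by an updated_at column) instead.
    cursor.execute(f"SELECT * FROM {table}")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield serialize_batch([dict(row) for row in batch])
```

In the real pipeline, each payload would then be uploaded to S3, with the Airflow DAG schedule deciding how often the extraction runs.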

To organize the tables’ metadata, Glue was used not only as a metastore catalog but also to crawl the table schemas. Then, as the final touch, Amazon Athena allowed the data analysts to run SQL queries on the data, and Power BI was used for creating reports and dashboards.
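To illustrate the style of query the analysts ran, here is an analogous aggregation executed against an in-memory SQLite table so the snippet is runnable anywhere. The schema and numbers are invented; Athena itself runs Presto SQL over the files in S3 via the Glue catalog, not SQLite:

```python
import sqlite3

# Stand-in for a Glue-cataloged table backed by S3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("bar_a", 120.0), ("bar_a", 80.0), ("bar_b", 50.0)],
)

# A typical exploratory aggregation, the kind of query a data
# analyst would run via Athena and then visualize in Power BI:
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer
    ORDER BY revenue DESC
    """
).fetchall()
print(rows)  # [('bar_a', 2, 200.0), ('bar_b', 1, 50.0)]
```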

We had excellent feedback from the users at that time: we had built a good platform for the first data analyses (AWS Athena + Power BI), and the data analysts’ SQL, Python, and Power BI skills helped a lot in the adoption and expansion of the Data-Driven Culture at Z-Tech.

Data Lake v1

Data Lake Expansion

As the Data Lake evolved, more teams started requesting data from the Data Engineering team, and we needed to bring more talent on board. With the data lake being heavily used by Donus, we also had to stay agile to keep providing data and supporting them.

After our success with Donus, we started the data lake expansion to the other Z-Tech ventures. We moved from the Donus structure to a global structure, now focusing on adding other Z-Tech ventures to the Data Lake platform and bringing value to them.

We created three work streams, one per Z-Tech venture, each with dedicated data engineers: Donus, plus MiMercado and Menu, two marketplace platforms based in Mexico and Brazil, respectively. The challenge was completely different, and we needed to evangelize the MiMercado and Menu users into this Data-Driven Culture.

After the first deliveries for MiMercado and Menu, the businesses learned the value of the Data Lake. Being Data-Driven means taking action based on data, not just on intuition or experience…

Data Driven

As our work progressed, we started gaining the trust of the data analysts, who at the time were relying on Excel sheets they had to update manually; they then started using SQL (Athena) to query data from S3. In the blink of an eye, we had almost 10 internal customers giving us feedback on which data would be valuable for the businesses and what could be better in the platform.

We then scheduled planning meetings, in which we aligned prioritisation with the data analysts. We created a Jira board for their requirements and started running development in two-week sprints.

Z-Tech Data Lake — v2

The Data Lake has been evolving ever since, but at that point we had to move to the next step: building “our bike” (v2). To achieve that, we made some changes in S3. We previously had a single layer, the raw layer, where every data load and query ran. Two more layers were added: the kitchen layer and the serving layer.

A few notes on the layers:

Raw Layer - raw data (json.gzip format by default). The Airflow jobs send the data to this layer, and only the Data Engineering Team has access to it.

Kitchen Layer - the layer the Data Analysts could access to run their queries and create their views using AWS Athena. As the “kitchen” name denotes, they could do whatever they wanted here (“cook” the data): it worked as a sandbox zone. They used the DBeaver IDE to run their SQL queries.

Serving Layer - we defined this layer as the “company zone”. Here we would load master data and tables certified by the business areas. We’d stay service-agnostic, free to use whatever fits: Apache Kylin, S3, databases, APIs, NoSQL, etc.
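The access rules above can be sketched as a simple lookup. The role names and the serving-layer audience are assumptions for illustration; in practice this would be enforced with AWS IAM and bucket policies rather than application code:

```python
# Who may read each layer, per the layer descriptions above.
# Role names and the serving-layer readers are illustrative.
LAYER_READERS = {
    "raw":     {"data_engineer"},                                   # DE team only
    "kitchen": {"data_engineer", "data_analyst"},                   # analyst sandbox
    "serving": {"data_engineer", "data_analyst", "business_user"},  # certified data
}


def can_read(role, layer):
    """Return True if `role` may read data in `layer`."""
    if layer not in LAYER_READERS:
        raise ValueError(f"unknown layer: {layer}")
    return role in LAYER_READERS[layer]
```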

To improve our platform, we added Presto (running on EMR) with the intention of replacing AWS Athena. But as you’ll see in the next steps, Presto ended up being used only by the Data Engineering Team to create tables in the kitchen and serving layers (using ORC); the Data Analysts never got the opportunity to query the Presto cluster.
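A typical way to materialize such a table from Presto is a CREATE TABLE AS SELECT with the Hive connector’s `format` property set to ORC. The schema, table, and column names below are invented for the example, and the snippet only builds the statement string, since executing it needs a live Presto cluster:

```python
def ctas_orc(target_schema, target_table, source_query):
    """Build a Presto CREATE TABLE AS SELECT statement writing ORC.

    Presto's Hive connector accepts a `format` table property; ORC
    gives columnar, compressed files that are far cheaper to scan
    than the raw layer's gzipped JSON.
    """
    return (
        f"CREATE TABLE {target_schema}.{target_table} "
        "WITH (format = 'ORC') AS "
        f"{source_query}"
    )


ddl = ctas_orc(
    "kitchen",
    "orders_clean",
    "SELECT order_id, customer_id, amount "
    "FROM raw.orders WHERE amount IS NOT NULL",
)
print(ddl)
```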

Data Lake v2

Baptism of the Z-Tech Data Lake

As our Data Lake grew bigger, we decided to name our platform. We made a list of waterfalls and picked a few for the final vote. The winner was Iguaçu Falls, in Paraná, Brazil; so we called our platform Iguazu.

Iguazu Logo
Iguazu Data Lake on v2: Donus, Menu and MiMercado

Next Steps

In the next article we’ll talk about the Iguazu migration (v3), in which we moved Iguazu from AWS to Microsoft Azure. We’ll show the improvements we’ve made to the architecture and infrastructure, and some services we’ve added too.

We now have a car (v3)… that is to say, we are fully on the cloud, integrated with data science, running Spark clusters for querying, and extracting data every day from different Z-Tech ventures and services.

We don’t see our data platform as a project anymore; Iguazu is now a product. Our goal is to keep helping the businesses run smoothly and make data-driven decisions.

--

Rodrigo Abdo
ztechbrasil

Head of Data & Analytics | Data {Engineering, Analytics, Governance, Science, Strategy}