XDATA — Challenges we faced building our data streaming pipeline at XP Inc.

XP Inc. recently decided to launch XP Card, an option for our customers to have a credit card with investback and other usage benefits, and a new challenge appeared… capturing all of this information in streaming pipelines to build internal products in real time and near real time.

Diogo Miyake
Comunidade XP
Published in
7 min read · Feb 9, 2021


Source: XP inc. — Xdata

Motivation

Every new product comes with its own dose of new challenges. In this article I would like to share some of the challenges our team went through and how we approached them. Our main focus here will be our data streaming platform, an agnostic solution built to process and deliver near-real-time analytics.

The team

At XP we have a unique way of approaching challenges, so a unique team composition is always needed.

Our team is composed of:

  • Internal Clients [Usually those in need of the data]
  • Tribes
  • Data Product Manager
  • Tech Leads
  • Data Engineers
  • Data Analysts

Other teams, like Processing Services, help with architecture and IaC. Although we have dedicated teams for each product/project, we help each other according to skills and availability.

Challenges

Photo by Lou Levit on Unsplash

We had a lot of challenges throughout this journey [and still have], but here are some to give context:

  1. Creating and maintaining the team's rhythm while working from home due to COVID-19;
  2. Deploying the first streaming process at XP;
  3. Having clear Definitions of Done for the deliveries;
  4. Prioritizing tasks for the deliveries;
  5. Defining and getting approval for a new data architecture;
  6. Making sure the solution is compliant with standards and laws such as GDPR/LGPD.

Questions to be Answered

After looking at the challenges, we had to answer the following questions so that we could move forward with our project.

  • How to start the data collection operation?
  • Which KPIs can we use?
  • Is this data already being used by other areas? And what about LGPD?
  • Do we need all historical data or just the latest snapshot?
  • Will our environment be scalable and resilient?
  • What are the priorities for the first deliveries of the Sprints?
  • How are we going to monitor the environment, pipelines, data volume?
  • Where will this data be read?
  • What value can this data deliver? (this is the most important one)

The Solution

Prioritizing Tasks and Using Agile

Our first step was to make sure we had all of our clients' pains and problems mapped out. After two weeks of intensive discovery and meetings, we came up with a prioritized backlog and some level of certainty that it would generate value.

Agile Methods Save Us — Source: memeGenerator

We use two-week sprints, organized as follows:

  1. First Day:
    1.1 — We take the backlog and meet for planning poker to estimate effort (using Fibonacci numbers) for the backlog items;
    1.2 — We divide items by skills and points using the Sprint's stories;
    1.3 — In some cases we do pair programming (this helped us a lot when solving complex problems);
    1.4 — Items larger than 13 points are broken down into smaller items;
    1.5 — We reserve some story points for task spikes;
  2. From the 2nd to the Last Day:
    2.1 — Every following day we meet for the 15-minute daily to understand what has been done, what needs to be done and the progress of the task in progress;
    2.2 — We use Azure DevOps to maintain our backlog and the cards that are to do, doing and done;
  3. Retrospective:
    3.1 — At the end of each sprint we get together for a retro to discuss what we learned, what we should do differently and what we should keep doing.

Adoption of Data Ops

To make our work easier we adopted DataOps to build our pipelines. We start with a robust local environment based on Docker, where we create our scripts and DAGs for Airflow (there are two flavors here: generic pipelines, such as the ingestion of a relational database, and custom ones, such as streaming or data modeling for analytics). From there we open a Pull Request; once it is approved, we test in the cloud DEV environment, and if everything goes well the pipeline moves up the belt and goes to production.

Processes

Today we have both Batch and Streaming processes.

This article explains the difference between the two:

Credits to Research Gate: https://www.researchgate.net/publication/315095908_Astrophysics_and_Big_Data_Challenges_Methods_and_Tools


Overview of Our Data Platform

Below is a very simplified image of our architecture. We use these tools because they are scalable, resilient and highly available, and of course because they have a huge community behind them. All of the processing clusters are deployed on Kubernetes.

Very Simplified Architecture

Basically we use these tools in our streaming pipeline:

Apache Kafka: Event streaming platform.

Apache Spark: Framework that lets us create ETL/ELT scripts in languages like Python, Java, Scala, SQL and R. We use PySpark Structured Streaming; the advantage is that we can write batch and streaming scripts that are portable (a short sketch after this list shows what that portability looks like).

Azure Data Lake Storage Gen2: Azure cloud storage that we use to manage and maintain our data lake.

Apache Airflow: Data pipeline orchestrator, created by Airbnb and used by lots of enterprises worldwide.

Azure DevOps: Provides developer services to support teams in planning work, collaborating on code development, and building and deploying applications. Developers can work in the cloud using Azure DevOps Services or on-premises using Azure DevOps Server.

Image source: https://kubernetes.io/images/kubernetes-horizontal-color.png

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.

All of the tools described above run on Kubernetes clusters, on AKS (Azure Kubernetes Service).
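As mentioned above for Spark, the same business logic can run in batch and in streaming mode. Below is a minimal, illustrative sketch of what that portability looks like; the broker, topic, paths and schema are examples for this article, not our production setup.

# Minimal sketch of batch/streaming portability with PySpark;
# broker, topic, paths and schema are illustrative examples.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("card-events-example").getOrCreate()

event_schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_type", StringType()),
])

def enrich(events: DataFrame) -> DataFrame:
    # The same transformation runs on batch and streaming DataFrames.
    return (events
            .filter(F.col("event_type") == "card_transaction")
            .withColumn("ingested_at", F.current_timestamp()))

# Batch: read a static snapshot already landed in the lake.
batch_df = enrich(spark.read.schema(event_schema).json("/lake/raw/card_events/"))

# Streaming: read the same events from Kafka with Structured Streaming.
raw_stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")  # illustrative broker
              .option("subscribe", "card-events")               # illustrative topic
              .load())
stream_df = enrich(
    raw_stream.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
              .select("e.*")
)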

Local Environment

To build and test our data pipelines, we have a local environment based on Docker containers, which reduces the chance of errors when we open a pull request.
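As a rough illustration of the kind of check that can run in such a local environment before a pull request, here is a minimal pytest sketch against a local SparkSession; the test data and names are made up for the example and are not our actual test suite.

# Illustrative local test: a small SparkSession is enough to validate
# transformation logic before the code is pushed in a pull request.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

@pytest.fixture(scope="session")
def spark():
    # Local master, no cluster needed.
    return SparkSession.builder.master("local[2]").appName("local-tests").getOrCreate()

def test_only_card_transactions_are_kept(spark):
    events = spark.createDataFrame(
        [("c1", 10.0, "card_transaction"), ("c2", 5.0, "login")],
        ["card_id", "amount", "event_type"],
    )
    result = events.filter(F.col("event_type") == "card_transaction")
    assert result.count() == 1
    assert result.first()["card_id"] == "c1"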

Data Ingestion

In the data ingestion layer, we have a YAML template that we use to build our ingestion pipelines (for transactional databases). Some of the configs in this YAML are databases, connections, tables, columns and/or queries.
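To give an idea of how such a template can drive a generic ingestion, here is an illustrative sketch; the YAML fields, connection values and paths below are examples, not our actual template, and in practice credentials would come from a secret store rather than from the file.

# Illustrative sketch: a YAML template driving a generic JDBC ingestion.
# Fields, connection values and paths are examples only.
import yaml
from pyspark.sql import SparkSession

TEMPLATE = """
connection:
  url: jdbc:sqlserver://db-host:1433;databaseName=cards
  driver: com.microsoft.sqlserver.jdbc.SQLServerDriver
tables:
  - name: dbo.transactions
    columns: [transaction_id, card_id, amount, created_at]
  - name: dbo.customers
    query: "SELECT customer_id, segment FROM dbo.customers WHERE active = 1"
"""

config = yaml.safe_load(TEMPLATE)
spark = SparkSession.builder.appName("generic-ingestion-example").getOrCreate()

for table in config["tables"]:
    # Use an explicit query when provided, otherwise select the listed columns.
    source = table.get("query") or f"SELECT {', '.join(table['columns'])} FROM {table['name']}"
    df = (spark.read.format("jdbc")
          .option("url", config["connection"]["url"])
          .option("driver", config["connection"]["driver"])
          .option("query", source)   # user/password options omitted in this sketch
          .load())
    df.write.mode("overwrite").parquet(f"/lake/raw/{table['name']}/")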

For custom pipelines we use a custom LivyOperator to create the ingestion task in the DAG. For example, we have some DAGs that run 24/7 to capture Kafka data using Spark Structured Streaming and send it to the data lake.
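As a hedged illustration of such a 24/7 job, here is a minimal Structured Streaming sketch that lands Kafka events in the lake with checkpointing; the broker, topic and paths are examples, not our real configuration.

# Illustrative long-running Structured Streaming job: Kafka -> data lake.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-lake-example").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # example broker
          .option("subscribe", "card-events")               # example topic
          .option("startingOffsets", "latest")
          .load())

query = (events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
         .writeStream
         .format("parquet")
         .option("path", "/lake/raw/card_events/")                 # example lake path
         .option("checkpointLocation", "/lake/checkpoints/card_events/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()  # the job keeps running until it is stopped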

Transformations/Data Modeling

Transformations are deployed in Spark using PySpark; we have a LivyOperator in Apache Airflow that sends REST requests to Livy to run the transformations in our data pipelines. Here the DataOps steps are the same as for data ingestion: Local Environment -> Tests -> Pull Request -> Dev -> Prod.
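For illustration, here is a minimal sketch of what such a DAG could look like using the community LivyOperator from the Airflow Livy provider; our operator is customized, and the schedule, file path and connection ID below are examples rather than our actual setup.

# Illustrative Airflow DAG submitting a PySpark transformation through Livy.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG(
    dag_id="card_events_transformations_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    transform_card_events = LivyOperator(
        task_id="transform_card_events",
        livy_conn_id="livy_default",   # connection pointing at the Livy REST endpoint
        file="abfss://code@account.dfs.core.windows.net/jobs/transform_card_events.py",
        args=["--run-date", "{{ ds }}"],
        polling_interval=30,           # poll the Livy batch status every 30 seconds
    )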

Data Catalog

For data governance operations we have a framework called ALMA. We modified Amundsen and built a frontend to solve our problems.

We attach all table information in ALMA so that everyone at XP can use it and discover what kind of data we have.

ALMA Data Catalog

What we hope for the future

For the future we hope not only to share more of our learnings, but also to make the entire company data-oriented. Soon we will bring more news about our initiatives, not only here on Medium but also on our social networks.

Thanks for reading this far.

Sign Up to XDATA careers

About us:

  • Big dream: We are eternal explorers, because our vision has no limits and our goals transform the impossible into reality.
  • Open mind: We are not owners of the truth; different points of view are what make us learn and grow.
  • Entrepreneurial Spirit: We assume our responsibilities and take action with the minimum necessary to generate value. We go there and do it.
If you have these qualities, sign up to XP Inc.

We have several opportunities.

Acknowledgements

Many thanks to the staff of the Service Line and the Data Engineering Chapter, and in particular to everyone in Nucleo 5; without these efforts the project would never have left the paper.

References

Below are some references, in case you are interested in the tools/frameworks that we use:


Diogo Miyake
Comunidade XP

Big Data Platform Security Engineer with architecture knowledge. Technology and software enthusiast, trying to help people live better.