Streaming pipelines using Dataflow and Apache Beam

Roberto Bonnet
Published in hurb.engineering
8 min read · Oct 18, 2021

How Apache Beam is helping Hurb’s Data Engineering team create robust and scalable data pipelines for streaming data processing.

The purpose of this article is to describe how our data engineering team uses Apache Beam for streaming data processing through Google Dataflow.

Hurb.com is one of the largest OTAs (Online Travel Agencies) in Latin America, and our mission is to optimize travel through technology. We focus our efforts on using technology to give our customers the best possible experience for the best price!

We cannot talk about optimization through technology without talking about the use of data. We use data in different ways, from providing dashboards to our employees to building complex ML models that optimize our marketing investments.

Hurb’s platform by itself generates a lot of data, and on top of that there is data from external partners. Dozens of systems are running, generating millions of records (sometimes billions) per month, and communication between internal systems happens through message exchange. Managing all of this data ingestion has its complexities; one of our main concerns is that data must arrive on time. Because of this, our Data Engineering team increasingly needs an infrastructure that lets us build robust, scalable, and fault-tolerant data pipelines.

Some records constitute strategic data for the company and therefore need to be available with low latency in our Data Warehouse in Google BigQuery. All our streaming data pipelines depend on events that we receive on Google Pub/Sub. These events are usually generated by an in-house Change Data Capture solution named Araponga (similar to Debezium), which publishes a message for each update/insert in mapped tables; the messages contain the IDs we need to request data from the APIs. We then extract the IDs, query the corresponding APIs to fetch the data, and perform the necessary processing steps. We fetch the data from the APIs instead of the databases because several business rules live in the APIs: if a business rule changes, it is the API’s responsibility to expose the change. The processing steps range from structuring the data and cleaning it when necessary to enriching it with information from other sources. To learn how we handle batch pipelines, see our article Complex tasks orchestration at Hurb with Apache Airflow. To learn more about how we structure our data platform within Hurb, be sure to read Data Platform Architecture at Hurb.com.
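To make this pattern concrete, the sketch below shows the general shape of such a pipeline: read CDC events from Pub/Sub, extract the ID, fetch the full record from the owning API, and write it to BigQuery. The topic, endpoint, and table names are placeholders, and our real pipelines contain more steps than this.

```python
import json

import apache_beam as beam
import requests
from apache_beam.options.pipeline_options import PipelineOptions


class FetchFromApi(beam.DoFn):
    """Calls the owning service's API for each ID extracted from a CDC event."""

    def process(self, record_id):
        # Hypothetical endpoint; the real APIs hold the business rules.
        response = requests.get(f"https://api.example.internal/orders/{record_id}")
        response.raise_for_status()
        yield response.json()


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCdcEvents" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/araponga-orders")
            | "ParseMessage" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "ExtractId" >> beam.Map(lambda event: event["id"])
            | "FetchFromApi" >> beam.ParDo(FetchFromApi())
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:dwh.orders",
                # The destination table is assumed to already exist.
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```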

Apache Beam

When we talk about Big Data, we come across several technologies such as Apache Flink, Apache Spark, Hadoop, Dataflow, etc. There are many framework options for distributed computing. In addition, we work with pipelines that can run in batch or streaming mode. Considering this, we looked for an existing tool that could abstract, as much as possible, the interaction with these frameworks. It was also important for Hurb to use an open-source tool. We believe it is a way for us to stay connected with many different companies: using open-source technology makes us part of a large and active community and allows us to contribute to the growth of the technology. When many people and good companies use open-source technologies, those technologies become more robust and more in line with the best practices in the market.

That’s how we found Apache Beam for pipelines that handle an intense volume of data. It is a library that simplifies the development of batch and streaming data processing pipelines, running the code on a distributed computing framework. As a result, we can efficiently parallelize extensive processes across several machines without needing to know how Apache Beam moves data around the cluster.

In addition to simplifying pipeline development, Apache Beam supports three programming languages: Python, Java, and Go. We decided to develop our pipelines in Python, as it is the “universal language” of our Data Engineering team.
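To give a feel for how concise the Python SDK is, here is a toy pipeline (not one of our production jobs) that parses a few order amounts and sums them:

```python
import apache_beam as beam

# A toy pipeline: parse a few order amounts and sum them.
with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create(["BRL 100", "BRL 250", "BRL 75"])
        | "ParseAmount" >> beam.Map(lambda line: float(line.split()[1]))
        | "SumAmounts" >> beam.CombineGlobally(sum)
        | "Print" >> beam.Map(print)
    )
```

Each step is an independent transform, which is exactly what allows the runner to parallelize the work.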

Once we decided to use Apache Beam, we needed to choose which distributed computing framework to run it on. Apache Beam integrates with several runners, which makes our code very portable. Within certain limitations, and with adjustments required by their different natures, we can run data pipelines in batch or streaming mode with the same Apache Beam code. The same abstraction applies to runners: we don’t have to worry about running the code on Google Dataflow today, Apache Flink tomorrow, or Apache Spark the day after. Apache Beam is responsible for translating the entire data pipeline to communicate with the runner.
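In practice, switching runners is mostly a matter of pipeline options. The sketch below uses placeholder project, region, and bucket values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local development: run the pipeline on the DirectRunner.
local_options = PipelineOptions(runner="DirectRunner")

# Production: hand the same pipeline code to Google Dataflow.
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)
```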

Google Dataflow

We chose Google Dataflow for several reasons. The main one was being able to create robust and scalable data pipelines without worrying about maintaining infrastructure (provisioning machines, installing applications, creating nodes, etc.). Google Dataflow is a fully serverless framework for distributed computing: from a set of configuration parameters, all the infrastructure needed to run the data pipeline, including autoscaling, is created in the Google Cloud environment. That way, we don’t have to worry about spinning up and keeping a cluster running. A serverless infrastructure helped us focus on the issues that matter to the company and sped up the pipeline development process.
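Those configuration parameters are also just pipeline options. The sketch below shows the kind of Dataflow settings involved; the values are placeholders, since the real limits depend on each pipeline:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Dataflow-specific options that control the worker fleet; placeholder values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    streaming=True,
    temp_location="gs://my-bucket/tmp",
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=20,
)
```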

In addition, it offers several types of monitoring that are very useful and give us the visualization we need to understand, in a simple way, what is happening with the pipeline.

Our hotels pipeline looks like this in Google Dataflow. The process starts by consuming the messages we receive, from which we extract the key needed for the API query. The results returned by the API are then distributed across several steps whose processing Dataflow parallelizes.

By clicking on the frames, it is possible to analyze how a particular step is behaving and to see the logs that the step is printing to the console. This lets us carry out an almost surgical analysis of where bottlenecks are occurring.

The challenge of creating a streaming pipeline

Our first candidate to run in streaming was our orders pipeline, because it is at the center of almost all of our data analytics. We can safely say that more than 80% of the analyses carried out within Hurb need some information from the orders pipeline. The order is the entity that consolidates all information about what was sold on our site, such as hotel, ticket, package, check-in and check-out, value, discount, marketing channel, etc. We receive dozens of order-event messages per second, from updates to new orders, and all of these messages are processed in streaming.

Before we built our current data architecture, there was (and still is) our analytics platform, where data was consumed through ETLs using ODBC connections to the databases. If new information was needed that was not yet mapped in our Data Warehouse, it was enough to query the database, model the tables, and create the data pipeline. Following strategic decisions at the company, we had to move to a model where all data comes through the systems’ own APIs. Each system holds all the business rules required to provide its data, and implementing this structure was neither easy nor something that could be centralized. Since we were looking to build a more robust and scalable data ecosystem, we chose to develop streaming pipelines that consume APIs for the data we need in real time, instead of retrieving data directly from the databases.

First problem: the APIs existed, but they were designed for moderate communication between microservices, not for bulk data consumption and processing. As soon as we turned the pipeline on, we ran into problems because the systems weren’t ready for that traffic, especially since we also needed to do a historical load back to day zero. We had to work closely with the Tech Leads to adapt the APIs to intense data consumption. After a few months of work together, we made the adjustments needed to keep the pipeline online without harming other systems.
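On the pipeline side, one way to be gentler with an API that was not built for bulk consumption is to back off and retry when it signals overload. The sketch below illustrates the idea with a hypothetical endpoint and retry policy; it is not the exact code we run in production:

```python
import time

import apache_beam as beam
import requests


class FetchWithBackoff(beam.DoFn):
    """Fetches a record from the API, backing off when it signals overload."""

    MAX_ATTEMPTS = 5

    def process(self, record_id):
        delay = 1.0
        for _ in range(self.MAX_ATTEMPTS):
            # Hypothetical endpoint used only for illustration.
            response = requests.get(f"https://api.example.internal/hotels/{record_id}")
            if response.status_code not in (429, 503):
                response.raise_for_status()
                yield response.json()
                return
            # The API is overloaded: wait and try again with a growing delay.
            time.sleep(delay)
            delay *= 2
        raise RuntimeError(f"API kept rejecting requests for id {record_id}")
```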

Second problem: the pipeline was already scaling automatically without disrupting other systems, but many business areas were being impacted by outdated information. We realized there was a data consistency issue: because of lost messages, our data did not reflect reality. We had to understand what was happening, correct the problem, and devise strategies to guarantee that the data represents reality. We included a step in the process to warn us about issues: if a piece of data is wrong, we have to be the first to know, and we correct it without human intervention. Note that we were in a learning process: the entire messaging system and the APIs already existed for other purposes, and we were adapting everything to our new needs.
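In Beam terms, a common way to implement this kind of “be the first to know” step is to route records that fail basic checks to a dead-letter output that feeds alerting. The sketch below is illustrative; the topic name and validation rules are placeholders, not our actual checks:

```python
import json

import apache_beam as beam


class ValidateOrder(beam.DoFn):
    """Routes records that fail basic checks to a dead-letter output."""

    def process(self, order):
        # Placeholder checks; real validation rules are domain-specific.
        if order.get("id") and order.get("amount") is not None:
            yield order
        else:
            yield beam.pvalue.TaggedOutput("dead_letter", order)


def attach_validation(orders):
    results = orders | "Validate" >> beam.ParDo(ValidateOrder()).with_outputs(
        "dead_letter", main="valid"
    )
    # Invalid records go to a separate topic that feeds alerting.
    _ = (
        results.dead_letter
        | "Encode" >> beam.Map(lambda o: json.dumps(o).encode("utf-8"))
        | "ToDeadLetterTopic" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/orders-dead-letter"
        )
    )
    return results.valid
```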

It took approximately six months of work until the architecture worked satisfactorily. The whole process was necessary to organize many things in-house, including the communication between the Data team and the IT team. Now we are focused on creating an entire architecture for data quality (we will have a post on the subject here shortly), with the objective of ensuring the quality of our data.

Conclusion

In this article, we explained our process of developing pipelines for streaming processing here at Hurb. First, we described how we use Apache Beam, along with Google Dataflow, to keep a low-latency pipeline running. Second, we talked about the problems we faced and our experience until the first streaming pipeline worked satisfactorily.

Join the crew if you like what we are doing at Hurb! We are continually seeking sharp analytical people for our Data & Analytics Tribe. We have offices in Rio de Janeiro, Porto, and soon in Montreal.
