Data Platform Architecture at Hurb.com

Lucas Rolim
hurb.engineering
Aug 2, 2021 · 11 min read

At Hurb.com, the Data Engineering team continually develops our Data Platform with two well-defined and straightforward principles in mind: establish a single source of truth and make data trustworthy. Despite the platform's size and complexity, these two principles guide every decision we make, from choosing tools to recruiting new members to work on the platform. Both support the team's general mission of democratizing information in the company.

We believe establishing a single source of truth is the essential path for every company aiming to scale a data-driven culture in parallel with rocket-like growth in business size and complexity. Every day at Hurb, new analysts and machine learning engineers need to squeeze data in different ways to extract insights, and they need to do it fast. Therefore, establishing a common place to go when data is needed is one of the most fundamental steps for data democratization and productivity.

Gathering data should be toaster-like: so simple you have to strive to do it wrong. Analysts shouldn't be valued for mastering a data visualization tool or a database, but for their mindset and sharp analytical insights. We believe people should spend a couple of hours learning a minimal set of tools and then do what they are best at, which hopefully is not searching for tricks and making fancy graphs in some tool like PowerBI. Moreover, adopting a single source of truth keeps the data platform as simple as possible. Infrastructure centralization enables fast responses to bugs and data incidents, ensuring everyone faces minimal idle time when doing their jobs.

The second principle, making data trustworthy, is crucial to reinforcing a data-based company culture during contentious discussions. Trustworthy data is the ammunition that grants everyone the power to hunt HiPPOs (Highest Paid Person's Opinions) and ensures the final decision will always be principled. Besides that, it also allows us to adopt a data-centric approach when developing models and reduces the complexity of the work of our fellow Data Scientists and ML Engineers.

Complementing these two key principles, we also established some general guidelines for designing new software solutions and use cases for our platform. These guidelines are nothing more than general rules to streamline decisions when choosing new tools or reaching consensus in contentious discussions. We have plenty of them, but the most important ones concern the “third-party vs. in-house” and “security vs. abundant data freedom” first-glance dichotomies.

We chose to host most of the solutions we use in our own cloud, primarily because technology is at the core of Hurb.com, flexibility is necessary, and security is a must. In this context, open-source solutions fit our needs like a glove. As Brazilians, the love for open source is in our DNA, going back to Santos Dumont's Demoiselle (the first “open-source airplane”!). We believe an active community is the best engine to keep technology at the edge of development and to absorb entropy from many different companies, making the system more robust. This motivated most of the technology choices mentioned further ahead in this article.

However, open source is not always an option, sometimes because it simply does not exist, and sometimes because of time or staffing constraints on implementing or learning a new solution. In these cases, we adopt serverless solutions in our cloud to advance faster and reduce the energy required from the team. Only as a last resort do we evaluate developing a new solution from scratch or fully delegating the software to a black-box third-party partner.

The last general rule worth noting for Hurb's Data Platform is an architecture based on affordances and restrictions. Inspired by Donald Norman's “The Design of Everyday Things”, we try to design software solutions whose use is direct and obvious. If there is something analysts shouldn't do, it must be invisible to them and impossible to execute. If analysts can see something in the UI, it's probably something they can explore by following just a small set of rules.

Finally, the following sections present the cornerstone technologies and frameworks of our Data Platform at Hurb.com. The article is divided into three sections:

  1. Data Pipelines — presents the problems we face when moving and transforming tons of data among systems;
  2. Data Quality and Observability — covers how we check whether there is “garbage in” and the processes and frameworks we have created;
  3. Data Discovery and Serving — overviews the technologies and challenges of providing data at scale to 700+ collaborators in an accessible and actionable fashion.

In this first article, we present the overview and general motivations of our Data Platform architecture. After that, the implementation details will be discussed in a monthly series of posts on our Medium page.

Data Pipelines Architecture

Pipelines (or DAGs, to be more specific) are the roads that lead to our single source of truth. Every day we need to move terabytes of data from hundreds of sources using dozens of different pipelines. More than just moving data, in most cases it's necessary to apply transformations and basic business rules and to check consistency before putting raw data into our Data Warehouse.

Furthermore, performing the data movement and transformations is not enough. Because of Hurb's scale, hundreds of processes and ETLs run in parallel and are almost impossible to track without efficient monitoring and logging.

A strong community and flexible technology were core factors in choosing our architecture. When problems happen and everything breaks (and it will, someday), it's essential to find help in other people's experiences and to have the flexibility to rebuild everything from scratch if needed. In addition, we need to build pipelines robust enough to provide data reliability without losing the capability of fast prototyping, all of this with a small team of Data Engineers (we are four today, and hiring!).

Most of the pipelines run in batches or micro-batches, and Apache Airflow, hosted in our Kubernetes cluster, covers most of these cases. We chose this stack because it's battle-proven by many other companies and makes it easy to turn code from different teams (e.g., data analysts and ML engineers) into scalable routines.
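
To give a flavor of what these routines look like, here is a minimal sketch of a daily batch DAG in Airflow 2.x. The DAG id, task names, and the extract/load callables are hypothetical placeholders, not our production code; in practice the heavy lifting runs inside containerized tasks on the Kubernetes cluster.

```python
# A minimal, illustrative Airflow 2.x DAG: one daily batch job that extracts
# data from a source system and loads it into the warehouse.
# All names and callables below are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    # Placeholder: pull the day's orders from a source API or database
    # and stage them (e.g., as files in a bucket).
    print(f"extracting orders for {context['ds']}")


def load_to_warehouse(**context):
    # Placeholder: load the staged files into the Data Warehouse.
    print(f"loading staged orders for {context['ds']}")


with DAG(
    dag_id="orders_daily_batch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["batch", "example"],
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)

    extract >> load
```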

As a complement, we adopted Apache Beam and Dataflow as the official tools for dealing with more intense volumes of data. With these tools we unified the code for batch and streaming processing and got a robust system for monitoring and auto-scaling jobs. The only difference is that for batch jobs the most common data source is Google Storage, while for streaming it is Google Pub/Sub. We also have the flexibility to switch to Spark or Flink at any time using the same Apache Beam code. Moreover, despite being a proprietary solution, there are public papers describing the core technologies behind the Google Dataflow runner (check FlumeJava and MillWheel).
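
The sketch below illustrates the unified-model idea: the same parsing transform is reused whether the source is a file on Google Storage (batch) or a Pub/Sub topic (streaming). The bucket, topic, and BigQuery table names are hypothetical, and the runner/temp-location options you would pass to Dataflow are omitted.

```python
# Illustrative Apache Beam pipeline reusing the same transform for batch
# (files on Google Storage) and streaming (Pub/Sub). Paths, the topic, and
# the BigQuery table are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(raw):
    # Shared business logic: decode a JSON event into the warehouse schema.
    event = json.loads(raw)
    return {"order_id": event["order_id"], "amount": float(event["amount"])}


def run(streaming: bool):
    options = PipelineOptions(streaming=streaming)
    with beam.Pipeline(options=options) as p:
        if streaming:
            raw = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/orders"
            )
        else:
            raw = p | "ReadGCS" >> beam.io.ReadFromText("gs://my-bucket/orders/*.json")

        (
            raw
            | "Parse" >> beam.Map(parse_event)
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:warehouse.orders",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    # In practice, submitted with --runner=DataflowRunner plus staging options.
    run(streaming=False)
```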

For monitoring, we use the Airflow and Dataflow tracking interfaces, combined with dashboards and metrics for our Kubernetes cluster built in Grafana and Metabase.

Last, we identified the need to prototype partner integrations (mostly from marketing) that require quick delivery to prove their value or become obsolete within days or weeks. For these cases we use Fivetran, a third-party data ingestion tool. If the data coming from Fivetran proves helpful after some time, we turn it into an in-house pipeline.

Figures: a Google Dataflow pipeline and an Apache Airflow pipeline

Data Quality and Observability

High confidence in data quality is mandatory for building data-centric models and achieving a data-driven culture. Delivering data with good standards and normalization in a fast-growing company, in the middle of many system migrations and partner integrations, is one of the biggest challenges in Data Engineering. People often want their data available as soon as possible, whatever the cost (and a false report can cost a lot!), and it is Data Engineering's responsibility to mitigate the ingestion of false information and be the principal advocate for data quality in the company.

At Hurb.com, we see Data Quality and Observability as a set of layers that aim to ensure more control and consensus about the meaning and shape of all data and metrics that go into our single source of truth. It's less about polishing data until it becomes perfect and much more about getting everyone on the same page about the data's signals and limitations. From this perspective, we invest time building both tools and processes to grow the quality and observability of our data.

One of the first initiatives to increase data quality was to create quality standards, since we cannot track our advances (or complain to our engineering stakeholders) if we can't even measure them. So, highly inspired by Airbnb's Minerva, we developed documents describing the three data quality standards we expect: gold, silver, and bronze. Bronze is the minimum level of quality required for integration into our Data Warehouse, and it requires things like code documentation, project documentation on JIRA, and alignment with data providers (internal or external). Gold is the standard designed for critical data, and it requires statistical tests, dedicated monitoring dashboards, and processes to handle incorrect data.

We also adopted Great Expectations as our central data quality tool. Data engineers, business analysts, and product teams work together, encoding business rules as sanity checks and adding more data validations. We also generate Data Docs from the Great Expectations code, creating an easy-to-consume overview of our data and its expected shape.
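
As a rough illustration of the kind of sanity checks we mean, the snippet below uses Great Expectations' classic Pandas-dataset API. The file name, column names, and thresholds are hypothetical placeholders; our production suites live alongside the pipelines rather than in an ad-hoc script.

```python
# Illustrative sanity checks with Great Expectations' classic Pandas API.
# File, column names, and thresholds are hypothetical placeholders.
import great_expectations as ge

# Load a sample of the data to be validated (e.g., an extract staged by a pipeline).
df = ge.read_csv("orders_sample.csv")

# Business rules expressed as expectations.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)
df.expect_column_values_to_be_in_set("currency", ["BRL", "USD", "EUR"])

# Validate the whole suite; a failing result can block the load and trigger an alert.
results = df.validate()
print(results["success"])
```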

Finally, we designed affordances and restrictions using Dataform for views and data accessibility management. All data visible outside the data teams must come directly from a pipeline or pass through peer-review validation. This is crucial when many analysts create and manipulate different views and calculate metrics in the database on their own. To avoid quality degradation in the view-creation process, we implemented Dataform to apply the best software development practices to database management.

View creation goes through a peer review between the data engineering and data analyst teams to ensure query optimizations and sanity checks on business rules. General users can only access views or specialized tables, which we put in segregated datasets in BigQuery. This ensures the available data meets our quality standards and allows us to grant all teams read access to those Data Warehouse segments. Thus, they can explore freely with little chance of feeding their analysis with garbage data.
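
The sketch below shows, with the official BigQuery Python client, the kind of dataset-level segregation and read grant described above. The project, dataset, and group names are hypothetical, and in practice this access management sits behind Dataform and our deployment tooling rather than ad-hoc scripts.

```python
# Illustrative dataset-level segregation in BigQuery: curated views live in
# their own dataset, and an analysts group gets read-only access to it.
# Project, dataset, and group names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# A dedicated dataset for peer-reviewed views exposed to general users.
dataset = bigquery.Dataset("my-project.curated_views")
dataset.location = "US"
dataset = client.create_dataset(dataset, exists_ok=True)

# Grant read-only access to an analysts group on this dataset only.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```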

Dataform main screen

Data Discovery and Serving

When serving data at scale, the key question is: how do we make information as available as possible without losing control of costs, while preventing end users from using garbage data or missing metrics in their analyses?

There are many ways to address the question above, but all of them start by making the consumption of huge volumes of data fast, stable, and multi-platform. We found in BigQuery a serverless, cost-effective, and almost plug-and-play solution for these matters. With a serverless Data Warehouse, we can sleep at night without fear of downtime or nightmares about pricing, which is commonly underestimated when starting a Data Platform.

The best open-source alternative to BigQuery is Presto, a more cost-effective option when dealing with a massive scale of data (petabytes) every day. We chose to go with BigQuery because:

  1. We estimate our volume of data will not grow that much in the next three years;
  2. BigQuery provides a better UX for analysts;
  3. It has better clients for consuming data from Python and other languages, and tighter integration with GCP (see the sketch after this list);
  4. Even though it is proprietary software, public papers describe how BigQuery works (check Dremel and Colossus).
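
To give a flavor of point 3, here is a minimal example of consuming the warehouse from Python with the official BigQuery client. The project, dataset, and table names are hypothetical placeholders.

```python
# Minimal example of consuming the Data Warehouse from Python with the
# official BigQuery client. Project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # credentials picked up from the environment

query = """
    SELECT order_date, COUNT(*) AS orders
    FROM `my-project.curated_views.orders`
    WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY order_date
    ORDER BY order_date
"""

# Iterate over rows directly...
for row in client.query(query).result():
    print(row.order_date, row.orders)

# ...or pull straight into a DataFrame for analysts working in notebooks.
df = client.query(query).to_dataframe()
```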

Once we have the data somewhere, a second challenge arises: how do we make it useful? Much has been written about data swamps and strategies to derive insights from data, but there is no clear path to follow.

Part of the Data & Analytics Tribe's mission is to democratize data access and promote data literacy. To achieve this, we need a data visualization tool where we don't have to worry about scaling the number of users. We have to face the reality that if adding a new user to the platform costs even a few dollars (say, $5/month), even the most progressive executives will, at some point, have second thoughts about putting hundreds or thousands of employees on the platform.

We chose Metabase as our primary data visualization solution, mainly because it's open source and its UX is focused on the non-expert user. Hosting Metabase in our Kubernetes cluster enabled us to scale to an arbitrary number of users at no per-seat cost, which allowed us to give everyone in the company the ability to create queries and dashboards. Today, more than 700 collaborators (including customer service) have access to data from all company areas. We believe a permissive approach is the best way to promote innovation and a holistic view among all of Hurb's employees (we have people from customer service and human resources questioning our product's metrics in real time).

The crown jewel of data serving is Data Discovery. We discovered that it is meaningless to make all data available and grant cross-area access permissions if nobody knows what the columns and tables from other areas mean. Furthermore, there is no genuine data democratization if end users are unaware of all the possibilities to squeeze data and information.

We chose Amundsen, an open-source solution developed by Lyft, as our data discovery tool. In general terms, we use it as a collaborative dictionary of data and metrics, where different teams document the data they are most familiar with. We also use it to publish information about table ownership and usage.

Conclusion

In this article, we gave an overview of the data architecture at Hurb.com, one of the major OTAs (Online Travel Agencies) in Latin America. It shows the general shape of an infrastructure that has proved suitable for a company with 700+ collaborators, hundreds of thousands of users each day, and hundreds of new orders per minute.

Join the crew if you like what we are doing at Hurb! We are continually seeking sharp analytical people for our Data & Analytics Tribe. We have offices in Rio de Janeiro, Porto, and soon in Montreal.

Lucas Rolim
Head of Machine Learning & Data Engineering at Hurb.com, geek, and passionate about entrepreneurship.