Data Modelling and Processing in Travel Super App

Joshua Hendinata
Traveloka Engineering Blog
9 min read · Sep 11, 2019

Authors: Rendy B. Junior (Data System Architect) & Joshua Hendinata (Data System & Tools Engineer)


A Problem of Data Silo

Traveloka provides a wide range of travel-related products and services that enrich your life and empower you to discover new experiences, accompanied by excellent post-sales services. Let's say you and your friends are planning a vacation abroad. You'll book a flight to get there, reserve accommodation for your stay, buy a roaming package to stay connected with relatives at home (or upload pictures to your favorite social media), and so on. These needs and many others are covered by Traveloka.

Traveloka provides many services in our app, and each of them has its own business process and transaction flow. To make each business unit more agile and able to get insightful data faster, we democratize data wrangling, which allows each business unit to create its own data model. Since the business processes and the nature of the products differ from one another, the data model of each product is also different. This difference makes data analysis across business units challenging, especially for cross-functional teams such as Marketing, User Engagement, and Finance, because they have to transform data from all business units into something uniform before it can be analyzed. For example, if I want to know the company-wide CLV (Customer Lifetime Value), how do I get this insight if the sales data from each business unit comes in a different schema?

Addressing Data Model Silo using Schema Inheritance Concept

The inheritance concept used here is borrowed from object-oriented programming, whereby a child inherits all the properties that its parent has. For example, we have `vehicle` as the parent with `fuel_capacity` as a property. `car` and `motorbike` are the children that inherit the `fuel_capacity` property.
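
In Python terms, the analogy looks roughly like the minimal sketch below (the fuel capacity values are made up for illustration).

```python
# Minimal sketch of the inheritance analogy: every child of Vehicle is
# guaranteed to carry the fuel_capacity property defined on the parent.
class Vehicle:
    def __init__(self, fuel_capacity: float):
        self.fuel_capacity = fuel_capacity  # defined once, on the parent


class Car(Vehicle):
    pass  # inherits fuel_capacity


class Motorbike(Vehicle):
    pass  # inherits fuel_capacity


car = Car(fuel_capacity=45.0)            # hypothetical value, in litres
motorbike = Motorbike(fuel_capacity=12.0)
print(car.fuel_capacity, motorbike.fuel_capacity)  # 45.0 12.0
```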

Likewise, we address the data silo problem with a design in which a standard data model across business units is enforced using the inheritance concept. Please refer to the example below for our generic sales schema: schema inheritance guarantees that all sales data from all business units will have the columns it defines. An alternative way to enforce the use of a schema is to apply a convention in which all data models are reviewed by a group of engineers appointed as a data model review board. However, we did not go down this route because manual review is error prone. With the schema inheritance concept, we only need to define the parent, or generic, schema, and all business units follow it through inheritance.

A central team defines the central schemas and maintains an inheritance tree, so it is clear which node of the tree a table should inherit from. In the example below, you can see that the Flight Sales Table inherits properties from the Sales All Product Schema, the User Behaviour Schema, and the All Event Schema. This means that the Flight Sales Table can be analyzed together with all the other tables that inherit properties from those tree nodes.

Enough about the concept, let's dive into the implementation. We call this implementation NeoDDL (derived from DDL, Data Description Language). NeoDDL schema inheritance is written as a set of YAML files. In each of these files, we define what is usually specified in a DDL, plus the schema's parent. As you can see in the example below, `search_flight.yaml` inherits from `user_event.yaml`, which in turn inherits from `generic_event.yaml`. The framework resolves the schema recursively up to the root of the tree.
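
To give a feel for how this could work, here is a minimal sketch of recursive schema resolution. The file names follow the example above, but the `parent` key, the field lists, and the in-memory representation are assumptions made for illustration; the real NeoDDL format may differ.

```python
# Minimal sketch of recursive NeoDDL-style schema resolution, assuming each
# YAML file declares an optional "parent" plus its own fields. The field
# lists are hypothetical; only the file names come from the example above.
import yaml

NEODDL_FILES = {
    "generic_event.yaml": """
parent: null
fields:
  - {name: event_id, type: STRING}
  - {name: event_time, type: TIMESTAMP}
""",
    "user_event.yaml": """
parent: generic_event.yaml
fields:
  - {name: user, type: RECORD}
  - {name: device, type: RECORD}
""",
    "search_flight.yaml": """
parent: user_event.yaml
fields:
  - {name: origin, type: STRING}
  - {name: destination, type: STRING}
""",
}

def resolve(file_name: str) -> list:
    """Walk up the inheritance tree and collect fields from root to leaf."""
    spec = yaml.safe_load(NEODDL_FILES[file_name])
    parent_fields = resolve(spec["parent"]) if spec["parent"] else []
    return parent_fields + spec["fields"]

print(resolve("search_flight.yaml"))
# -> event_id, event_time, user, device, origin, destination
```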

Now it is easy to answer the question we asked before: what is the CLV of customers across all business units? All we need to do is run the simple query below.

SELECT user.id, SUM(profit)
FROM fact_sales_*
GROUP BY 1

Data Cleansing Issue across Business Units

While doing the exercise of addressing the data model silo issue, we also found issues with the data cleansing being done in each business unit. Code duplication is an impediment to software maintainability, and we saw instances of duplicated effort with slightly different methods despite having the same goal. Most of the cleansing logic revolved around a few business rules, such as checking a string pattern, validating against a set of possible values, checking min/max values, and checking other simple rules. We then implemented these common-but-simple rules and integrated them into NeoDDL. This means NeoDDL is also used by our cleansing framework as the source of rules to execute, which we dub "an executable spec".

This executable spec concept enables collaboration across roles. The Analytics team can add business rules to the schema without touching any code. NeoDDL is also integrated with our Data Catalog, meaning all descriptions written in NeoDDL are immediately searchable and discoverable from the catalog. Every change to the rules and descriptions is made as a pull request and, once approved, is deployed to production by our CI/CD pipeline.

Please find below an example of descriptions and rules in NeoDDL. The highlighted parts are examples of NeoDDL rules that specify a string pattern to check, min/max values to set constraints on, and an enum to match against a set of possible values.
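
As a rough illustration of these rule types, here is a small Python sketch with hypothetical columns and constraints; the column names, values, and rule keys are made up and are not Traveloka's actual spec.

```python
# Hypothetical NeoDDL-style rules for three columns: a string pattern,
# a numeric min/max, and an enum. All names and values are illustrative.
import re

RULES = {
    "booking_id": {"type": "STRING",  "pattern": r"^BK-[0-9]{8}$"},
    "total_fare": {"type": "NUMERIC", "min_value": 0, "max_value": 1_000_000_000},
    "trip_type":  {"type": "STRING",  "enum": ["ONE_WAY", "ROUND_TRIP"]},
}

def violations(record: dict) -> list:
    """Return a list of human-readable rule violations for one record."""
    errors = []
    for column, rule in RULES.items():
        value = record.get(column)
        if value is None:
            errors.append(f"{column}: missing value")
            continue
        if "pattern" in rule and not re.match(rule["pattern"], str(value)):
            errors.append(f"{column}: '{value}' does not match {rule['pattern']}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{column}: '{value}' not in {rule['enum']}")
        if "min_value" in rule and value < rule["min_value"]:
            errors.append(f"{column}: {value} below minimum {rule['min_value']}")
        if "max_value" in rule and value > rule["max_value"]:
            errors.append(f"{column}: {value} above maximum {rule['max_value']}")
    return errors

print(violations({"booking_id": "BK-1234", "total_fare": -5, "trip_type": "MULTI_CITY"}))
```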

NeoDDL Implementation in Our Data Warehouse

Our current data warehouse uses a multilayered architecture combined with the data warehouse bus approach described by Ralph Kimball. A simplified architecture diagram can be seen below.

Traveloka Data Warehouse Architecture (Simplified)

In the first layer (Raw Data), event tracking data and product DB data are ingested and stored directly into BigQuery. The data in this layer are preserved to be as close as possible to how they were generated, with little to no transformation in between.

From the first to the second layer (Cleansed Raw), the data are validated using a set of business rules, which can be common company-wide or specific to an individual business unit. Personally Identifiable Information (PII) is also masked and encrypted in this layer.

The third and fourth layers are where the data are modelled according to dimensional modelling concepts (facts and dimensions). The only difference between these two layers is that the fourth layer contains aggregated versions of the third layer, whether aggregated over time (e.g. weekly, monthly, yearly) or across business processes (e.g. conversion rate).

NeoDDL for Table Creation

NeoDDL is used to create the tables in the second through fourth layers of our data warehouse. What is interesting about our data modelling approach in BigQuery is that, since BigQuery allows nested data structures in a schema, we naturally want to embed some dimensions into the fact table itself to reduce the number of JOINs in a query and get better performance.

However, just like the data model silo issue mentioned above, we need to keep these embedded dimensions consistent across business units. This is where the inheritance feature of NeoDDL comes in handy! We can define the embedded dimensions inside a "parent" schema and let the fact table schema inherit from it.

For example, the user_event schema in the diagram above contains the user and device dimensions. To embed these dimensions inside a fact table, simply specify user_event as the schema's "parent" and the two dimensions will automatically be embedded inside the fact table.
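
For illustration, here is a rough sketch of what the resolved fact table could look like using BigQuery's Python client. The nested field names and the table name are hypothetical; only the embedded user and device dimensions come from the example above.

```python
# Rough sketch of a fact table schema once the user and device dimensions
# inherited from a user_event-style parent are embedded as nested RECORDs.
# Field names and the table name are hypothetical.
from google.cloud import bigquery

EMBEDDED_DIMENSIONS = [  # would come from the parent schema via inheritance
    bigquery.SchemaField("user", "RECORD", mode="NULLABLE", fields=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("country", "STRING"),
    ]),
    bigquery.SchemaField("device", "RECORD", mode="NULLABLE", fields=[
        bigquery.SchemaField("platform", "STRING"),
        bigquery.SchemaField("os_version", "STRING"),
    ]),
]

FACT_SALES_FLIGHT_SCHEMA = EMBEDDED_DIMENSIONS + [  # fact-specific columns
    bigquery.SchemaField("booking_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("profit", "NUMERIC"),
]

table = bigquery.Table("my-project.dwh.fact_sales_flight",
                       schema=FACT_SALES_FLIGHT_SCHEMA)
# client = bigquery.Client(); client.create_table(table)  # would create the table
```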

With this approach, we gain the query performance benefits of nested data structures while still maintaining the consistency of embedded dimensions across business units.

NeoDDL for Data Validation

NeoDDL borrows some concepts from JSON Schema validation to define simple business rules for individual attributes/columns in a table. Currently we have:

1. String Attribute Validation

  • Max length
  • Min length
  • Enum
  • Pattern (regex)

2. Numeric Attribute Validation

  • Max value
  • Min value

3. Default value for REQUIRED attribute and/or data type mismatch

We currently use Dataflow (the managed Apache Beam service on Google Cloud Platform) to perform the validation. Each of the validation rules above is converted into an individual step in a Dataflow job.

NeoDDL configuration (left) is converted into sequential steps in a Dataflow job (right)
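
In the spirit of that figure, here is a simplified sketch of validation rules wired up as sequential steps with the Apache Beam Python SDK. The step names, the enum rule, and the tagging format are illustrative assumptions, not Traveloka's actual job.

```python
# Simplified sketch of validation rules as sequential Beam steps. The DoFn
# below only tags violations; it never drops records, mirroring the error
# tagging approach described later. Names and values are illustrative.
import apache_beam as beam

class ValidateEnum(beam.DoFn):
    def __init__(self, column, allowed):
        self.column, self.allowed = column, allowed

    def process(self, record):
        record = dict(record)  # avoid mutating the input element
        if record.get(self.column) not in self.allowed:
            record.setdefault("audit", []).append({
                "message": f"{self.column} not in {self.allowed}",
                "etl_name": "cleanse_fact_sales_flight",  # hypothetical ETL name
                "class": type(self).__name__,
            })
        yield record

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw records" >> beam.Create([{"trip_type": "MULTI_CITY"}])
        | "Validate trip_type enum" >> beam.ParDo(
            ValidateEnum("trip_type", ["ONE_WAY", "ROUND_TRIP"]))
        | "Print for demo" >> beam.Map(print)
    )
```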

Data validation is performed in the following order:

1. Data Type Casting and Validation

Since we ingest data from various sources, data type inconsistencies can and will occur in any given table. This is especially true when we ingest data from MongoDB, where we end up stringifying the records. A simple cast like NUMERIC to STRING, or vice versa, is straightforward, but for a more complex cast like STRING to TIMESTAMP or DATETIME, we need to apply conditional logic on a best-effort basis.

If the field content is castable at this step, it continues to be validated in the next step. Otherwise, the job tags the record (explained below) and provides a default value for that particular field.
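
As a sketch of what such best-effort casting could look like, assuming a small set of accepted timestamp formats and a fixed default value (both assumptions):

```python
# Simplified sketch of best-effort STRING -> TIMESTAMP casting with a default
# value on failure. The accepted formats and the default are assumptions.
from datetime import datetime, timezone

FORMATS = ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")
DEFAULT_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)

def cast_timestamp(value: str):
    """Return (timestamp, error_message). error_message is None on success."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt), None
        except ValueError:
            continue
    return DEFAULT_TS, f"could not cast '{value}' to TIMESTAMP; default applied"

print(cast_timestamp("2019-09-11 10:30:00"))
print(cast_timestamp("not-a-timestamp"))
```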

2. String, Numeric attribute and Enumeration validation

In this step, any violation results in error tagging; no data transformation is performed.

3. REQUIRED attribute validation

If the data is NULL in a REQUIRED field, the job assigns a default value to that field.

Error Tagging

According to Kimball, there are three actionable options when a data quality check is violated: halt the entire validation process, separate the offending record into a separate file, or merely tag the data and pass it through. We chose the third option because different data users in Traveloka have different error tolerances. There might even be a use case for analyzing only the dirty data. For that reason, the Dataflow job's responsibility is only error detection and NOT error correction/mitigation.

The error message resulting from a data quality check violation is stored in an audit dimension embedded inside the table itself. It contains three fields:

  1. message: stores the explanation for the violation and the original value in case of any transformation.
  2. etl_name: the name of the ETL job that caught the violation.
  3. class: the Python class responsible for validating and producing the error message.

Sample of error tagging. In this case, a data type inconsistency is detected
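
A hypothetical tagged record, consistent with the three fields above, might look roughly like this (field names and values are invented for illustration):

```python
# Hypothetical example of the audit dimension content for one record where a
# data type inconsistency was detected; all values are made up for illustration.
tagged_record = {
    "total_fare": 0,  # default value substituted after the failed cast
    "audit": [
        {
            "message": "total_fare: could not cast 'abc' to NUMERIC; original value 'abc'",
            "etl_name": "cleanse_fact_sales_flight",  # hypothetical ETL name
            "class": "DataTypeCastAndValidate",       # hypothetical Python class
        }
    ],
}
```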

Future Plan

With the continuing development and usage of NeoDDL at Traveloka, we are currently working on some new features:

  1. Adding more business metadata for our internal data catalog integration.
  2. Adding metadata on data model relationships, e.g. a foreign key from a fact table column to a dimension column.

Adding this metadata to NeoDDL will give us more context on the DDL and the table itself, and the data catalog integration will enable other data users (e.g. Data Analysts, Data Scientists) to search for and find the relevant tables based on their requirements.

Conclusion

In this article, we have presented some recurring problems in Traveloka's data environment: data model silos and repeated effort in cleaning dirty data. We came up with a new configuration framework called NeoDDL to alleviate both issues. We hope you find this article insightful, and we would love to see your comments and discussion below.

P.S. This article is part of a conference talk at Data Council Singapore 2019. Video: https://www.youtube.com/watch?v=28ifIWMcq_E
