Imre Nagi, Software Engineer, Data Team; Chairuni Aulia Nusapati, Software Engineer, Data Team
Having an advanced data ecosystem is the dream of any data team. Yet however advanced the system, what matters most is how much the business benefits from the data you store and process. As Traveloka Data’s software engineers, one of our most important jobs is delivering data to business use cases, so that the business can benefit from the data it generates.
In this blog, we’ll talk about one of our latest solutions for delivering data to the business: the Data Lake API. Its main function is to deliver large datasets (millions of records, gigabytes in size) from our data lake to production systems on request.
We built the Data Lake API because the existing method we offered other teams within Traveloka for accessing our Data Lake was problematic. To give you the picture, here is how we used to do it: when a team had a use case that needed data from our Data Lake, we simply gave them direct read access to the buckets and tables they needed, and they processed the data on their own.
This approach proved problematic. First, it forced a tight coupling to the underlying storage technology and format. Once we changed either the technology or the format, which we have done, the client systems had to change as well, and that is not easy. Second, because access was granted at the bucket level, we could not ensure that clients stayed away from columns they were not supposed to see. Finally, auditing was hard because we had no logs of what data had been downloaded or queried directly from the data source.
Based on those problems, we started an initiative to build a standardised way to serve our data, which later became the Data Lake API.
Here is the final architecture of our system.
The Data Source: Data Lake on BigQuery
Traveloka uses Google BigQuery as the underlying technology for its Data Lake. The BigQuery Data Lake stores an abundance of raw factual and historical data generated by our customers’ activities. In addition to the raw data, we also keep processed data in intermediate and final tables. Both the raw and the processed data are the main items served by the Data Lake API.
The Job Scheduler: Cloud Composer
The Data Lake API system uses Google Cloud Composer to schedule the many BigQuery jobs that transform raw data into summarized and redacted versions, i.e. the processed intermediate and final tables. The existence of a scheduler is integral to the whole system.
We used to run a self-managed Airflow scheduling system hosted on Google Compute Engine. However, we recently moved our schedules to Cloud Composer. Google Cloud released the GA version of Cloud Composer shortly before the time of writing, and we decided to adopt it to reduce the overhead of maintaining a self-managed Airflow cluster.
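As a sketch of what such a schedule looks like, the Airflow DAG below runs one daily BigQuery transformation into a processed table. The DAG id, project, dataset, table names, and SQL are all placeholders for illustration, not our actual pipelines:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_summary_tables",       # hypothetical DAG name
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    catchup=False,
) as dag:
    # Summarize/redact raw events into a processed final table.
    build_summary = BigQueryInsertJobOperator(
        task_id="build_booking_summary",
        configuration={
            "query": {
                "query": (
                    "SELECT route, COUNT(*) AS bookings "
                    "FROM `my-project.raw.events` GROUP BY route"
                ),
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "processed",
                    "tableId": "booking_summary",
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )
```

Cloud Composer picks up DAG files like this from its GCS bucket and runs them on the managed Airflow environment.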
The Query Result Temporary Storage: Cloud Storage
Since the datasets the Data Lake API serves are huge, we decided to store each result in temporary storage and let our clients retrieve it directly from there. This simplified result delivery considerably, because it is handled by the storage technology itself. We chose Google Cloud Storage for this. GCS makes it easy to hand over query results: we simply create a signed URL for every result, and we use a retention policy to delete the data after a period of time.
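In GCS, such a retention policy can be expressed as a bucket lifecycle rule; the 7-day age below is an illustrative value, not necessarily the one we use:

```json
{
  "lifecycle": {
    "rule": [
      {"action": {"type": "Delete"}, "condition": {"age": 7}}
    ]
  }
}
```

Applying it with `gsutil lifecycle set lifecycle.json gs://<results-bucket>` makes GCS delete result objects once they are older than the configured age, so no cleanup job is needed on our side.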
The Service Metadata Storage: Cloud SQL (MySQL)
We process requests asynchronously, meaning we do not return results immediately but queue requests until our querying resources are free. As a result, we need a metadata storage that tracks the state of each request, including a link to the result once processing has finished. We use Google Cloud SQL (MySQL) as the underlying technology for this.
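The actual storage is MySQL on Cloud SQL, but the kind of job-state table involved can be sketched with Python’s stdlib `sqlite3` for illustration; the table and column names here are assumptions, not our real schema:

```python
import sqlite3

# One row per submitted request; status moves QUEUED -> RUNNING -> DONE/FAILED.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id      TEXT PRIMARY KEY,
    query_spec  TEXT NOT NULL,                   -- the submitted JSON query
    status      TEXT NOT NULL DEFAULT 'QUEUED',
    result_url  TEXT,                            -- signed URL, filled in when DONE
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

def submit_job(conn, job_id, query_spec):
    conn.execute("INSERT INTO jobs (job_id, query_spec) VALUES (?, ?)",
                 (job_id, query_spec))

def mark_done(conn, job_id, result_url):
    conn.execute("UPDATE jobs SET status = 'DONE', result_url = ? WHERE job_id = ?",
                 (result_url, job_id))

def job_status(conn, job_id):
    row = conn.execute("SELECT status, result_url FROM jobs WHERE job_id = ?",
                       (job_id,)).fetchone()
    return {"status": row[0], "result_url": row[1]}
```

A status-check endpoint then only needs to read this table, regardless of whether the query is still running.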
The Deployment: Kubernetes Engine
The Data Lake API is hosted in Kubernetes clusters managed by Google Kubernetes Engine. The clusters communicate with Google Cloud Storage and Google Cloud SQL to store the query results and the job metadata for each user request. You will find more detail about this in the following sections.
How Does The Data Lake API Work?
Here’s how it works at a high level: (1) the client creates a request, (2) we ask BigQuery to process the corresponding query, (3) when processing has finished, BigQuery dumps the result into a folder in GCS, (4) a signed URL to that GCS folder is created and passed back to the client, and (5) the client retrieves the result from GCS.
But how exactly does the Data Lake API work underneath? To explain that, we need to introduce the two major components that make up the API: the Query Spec Builder and the Execution Pipeline.
The Query Spec Builder
At the very beginning, a user submits a job containing a JSON-formatted query that specifies the data they want to retrieve, along with any conditional rules. By accepting JSON, we hide the physical data lake layer from our users. The JSON specifies one or more data source definitions and their retrieval conditions, if any, as laid out in our data contract. For context, a data source definition describes what should be queried from one or more datasets or materialized views, and how.
The Query Spec Builder is an extensible query interpreter that translates the JSON query into our own Domain-Specific Language (DSL). From this DSL we then construct a concrete query for whichever database implementation we use, simply by extending an interface we defined beforehand. So even though we currently use Google BigQuery as our data lake, changing the physical database layer will not affect the queries our users write. We also check every query against our Access Control List (ACL) service inside the Query Spec Builder, so that data is protected at a very granular level (down to individual columns) and nothing is served to unauthorized users.
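Our data contract and DSL are richer than this, but the core idea can be sketched in a few lines of Python. The field names (`source`, `columns`, `filters`) and the toy ACL are assumptions for illustration only:

```python
import json

# Toy column-level ACL: which columns a client may read per table (illustrative).
ACL = {"flight_bookings": {"booking_id", "route", "booking_date"}}

def build_sql(spec_json: str, client_acl=ACL) -> str:
    """Interpret a JSON query spec into SQL, rejecting unauthorized columns."""
    spec = json.loads(spec_json)
    table = spec["source"]
    columns = spec["columns"]

    # Column-level ACL check before any SQL is generated.
    denied = [c for c in columns if c not in client_acl.get(table, set())]
    if denied:
        raise PermissionError(f"columns not permitted: {denied}")

    sql = f"SELECT {', '.join(columns)} FROM `{table}`"
    filters = spec.get("filters", [])
    if filters:
        # Emit parameterized predicates; actual values are bound at execution time.
        clauses = [f"{f['column']} {f['op']} @{f['column']}" for f in filters]
        sql += " WHERE " + " AND ".join(clauses)
    return sql
```

Because clients only ever see the JSON contract, swapping the engine behind `build_sql` only requires a new SQL-generation backend behind the same interface.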
If the job submission goes well, the server returns job metadata containing a job identifier, which the user can later use to check the job’s status and to obtain the URL for downloading the results.
The Execution Pipeline
Once the job metadata is saved in our metadata storage, the BigQuery SQL constructed from the user’s JSON query is executed on the execution pipeline. We spawn a worker that executes the job in the background. We currently use RxJava to manage the scheduler and the workers’ lifecycles. The RxJava thread helps the worker complete the following consecutive tasks.
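Our implementation manages this with RxJava, but the worker lifecycle can be illustrated with Python’s `concurrent.futures`; the three helper functions below are stand-ins for the real BigQuery and GCS calls, not actual client-library code:

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):          # stand-in: submit SQL to the query engine
    return f"tmp_table_for({sql})"

def extract_to_gcs(table):   # stand-in: export the result table to GCS
    return f"gs://results/{table}/*.json"

def sign_url(gcs_uri):       # stand-in: generate a time-limited signed URL
    return f"https://signed.example/{gcs_uri}"

def execute_job(job, sql):
    """The consecutive tasks a worker performs for one queued request."""
    table = run_query(sql)
    uri = extract_to_gcs(table)
    job["status"], job["result_url"] = "DONE", sign_url(uri)
    return job

# Background workers: the API thread returns immediately after submitting.
executor = ThreadPoolExecutor(max_workers=4)

job = {"status": "QUEUED"}
future = executor.submit(execute_job, job, "SELECT 1")
future.result()  # in the real system the client polls the metadata store instead
```

The key point is that the request-handling thread never blocks on BigQuery; it only records the job and lets a worker drive it to completion.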
Submitting the BigQuery SQL to BigQuery
Retrieving data from BigQuery is not a difficult task, because Google provides a Java API for it.
The simplest approach uses the standard configuration: once the job is finished, we get an iterator over the data, and to read it we keep iterating until all results have been consumed. However, we don’t really want to pull every row through our service; imagine the size of the data being extremely large. What we want instead is to keep the service lightweight and move all the heavy lifting to the query engine. So we went with another solution: writing the result into a permanent table.
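Writing to a permanent table amounts to setting a destination table on the query job. In BigQuery’s `jobs.insert` REST representation, the relevant configuration looks roughly like this (the SQL, project, dataset, and table ids are placeholders):

```json
{
  "configuration": {
    "query": {
      "query": "SELECT booking_id, route FROM `my-project.datalake.bookings`",
      "useLegacySql": false,
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "datalake_api_results",
        "tableId": "job_1234_result"
      },
      "writeDisposition": "WRITE_TRUNCATE"
    }
  }
}
```

With a destination table set, BigQuery materializes the full result itself, and our service never has to stream the rows.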
Extracting the BigQuery Table to Google Cloud Storage
Since we store the query result in a permanent table, we need to export the data from that table to Google Cloud Storage so that we can generate a time-limited signed URL later. However, we must be careful when running this operation. If the data is small enough (less than 1 GB), we can export it straight into a single blob object in our storage. If the result is bigger, we need to use a wildcard in the destination URI so that BigQuery can split the result into multiple files.
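That size check is just a branch on the destination URI. A minimal sketch, with an assumed bucket layout and file format:

```python
ONE_GB = 1 << 30  # BigQuery's per-file limit for extract jobs

def destination_uri(bucket: str, job_id: str, table_bytes: int) -> str:
    """Pick a GCS destination URI for a BigQuery extract job.

    Results under 1 GB fit in a single blob; anything larger must use a
    wildcard so BigQuery can shard the export across multiple files.
    """
    if table_bytes < ONE_GB:
        return f"gs://{bucket}/{job_id}/result.json"
    return f"gs://{bucket}/{job_id}/result-*.json"
```

The table size is known from the job metadata before the extract starts, so the right URI shape can always be chosen up front.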
Generating Signed URLs for Objects in GCS
Now that the result files are stored in GCS, the next step is to deliver them to our end users, so the right question to ask is: what is the best way to do so? Giving users direct GCS access would be easy but undesirable, because we would have to maintain a separate ACL for GCS users. Fortunately, GCS has a signed URL feature, which allows us to grant temporary access to objects for a configurable amount of time, even to users without a Google account. Moreover, pre-signed URLs are not tightly coupled to any particular cloud vendor.
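GCS’s real signing is done with the service account’s private key via the client library, but the concept can be illustrated with a toy stdlib sketch: the URL carries an expiry and an HMAC signature that the server can verify without any per-user ACL. This is not GCS’s actual algorithm, only the idea behind it:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # stands in for the signing key

def make_signed_url(object_path: str, expires_in: int, now=None) -> str:
    """Return a URL valid for `expires_in` seconds from `now` (toy scheme)."""
    expires = int(now if now is not None else time.time()) + expires_in
    payload = f"{object_path}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return (f"https://storage.example/{object_path}?"
            + urlencode({"Expires": expires, "Signature": sig}))

def verify(object_path, expires, signature, now=None) -> bool:
    """Accept the request only if the URL is unexpired and untampered."""
    if (now if now is not None else time.time()) > int(expires):
        return False
    payload = f"{object_path}:{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Because the signature covers both the object path and the expiry, a client can neither extend the deadline nor point the URL at another object, which is exactly the property that makes signed URLs safe to hand out.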
The Data Lake API has solved the problems that existed in our old method of big data delivery. We now have a clear API contract that standardises the way product teams access our Data Lake. Product teams no longer touch the physical layer; they use our API instead, which lets us audit how our data is used. We can also define access control flexibly, down to the column level, which ensures product teams use only the columns they need. Additionally, the API offers a standard yet flexible query definition, so we can restrict how product teams access our data while still allowing a wide variety of queries.
The API is already used by several business use cases, and more are going to be integrated with it.