Data Platform Transformation at Bukalapak

Hafiz Badrie Lubis
Published in Bukalapak Data · 15 min read · Nov 22, 2020

I have to say that I had an amazing experience with a data platform over the past one and a half years, one of those experiences that transformed my mindset and behavior quite significantly as a professional. It was the experience I had at Bukalapak from June 2019 to June 2020.

After joining Bukalapak in June 2019, for the first time I could see how people think, behave, and work in a very competitive e-commerce market. We moved really fast. Amazing! The organisation was structured so that no big team would slow us down. A team could come up with ideas, make decisions, and quickly push them live to experiment in production. But it was so fast that, on the engineering side, we sacrificed a lot in the long term for short-term speed.

From the engineering perspective, many decisions were made with only a short-term solution in mind, which cost us a lot in the longer term. System design, code design, and service quality were the three major areas that were sacrificed. Without realising it, our speed was degrading because of issues in both the infrastructure and the services. That was the case with the data platform I worked on from June 2019 to June 2020, which I'm about to explain in more detail.

What Happened to the Previous Data Platform

Figure 1. The Old Data Platform Architecture

In Figure 1, you can see how the components in the old architecture were interconnected. The system ran in an on-premise data center, so everything was built on common big data technologies. I won't explain the stack in detail; instead, I would like to jump to the stories and experiences that actually influenced our decision to re-architect the system.

Unreliable Services

Issues happened almost every day. The engineers were busy putting out fires in the system just to make sure data was still available and ready to consume. Some of the causes were:

  1. Our CDC kept failing almost every day due to schema mismatches and issues with its checkpoint mechanism. The version of Debezium we used at the time was too problematic in our system, and upgrading to a newer version would have required an even bigger collective effort from other teams.
  2. As a data lake, Apache HBase dropped data quite frequently. We spent a huge amount of time investigating to understand the underlying issue, but we never managed to find a long-term fix.
  3. Our engineering practices weren't very strong at that time. Any work could be deployed to production immediately without complete tests. It was like gambling for us. The process couldn't really make us accountable for what we shipped to the production environment.

Infrastructure

Given the number of engineers we had in the team, managing and maintaining the infrastructure in an on-premise data center while providing the services the business needed was too difficult. Some of the challenges we faced were:

  1. Hardware failures were something we should have expected, but we just didn't have enough energy to anticipate them. There was an incident in our Kafka cluster where it took us almost a month to recover all the data we had at the time.
  2. Scaling out was also really tough. Machine procurement took a long time before a machine was actually ready for us to use. We didn't have any elasticity in our infrastructure at all.
  3. Automation wasn't a standard in the data engineering infrastructure at that time; some parts had it, some didn't. We still had several manual operations that weren't efficient.

Slow Data Processing

Without any elasticity in our infrastructure, I don't think this was a surprise. We had more than 200 nodes powering data processing in the platform, but unfortunately they were running close to maximum capacity. We introduced some usage limitations to avoid even more severe performance degradation. As you can predict, that also degraded the user experience of the platform. We received more and more complaints about how hard it was for users to finish their work with our data platform.

Improper Use of Technology

In several cases, we used OLTP technology for analytical purposes. We used PostgreSQL as our data warehouse, and our Presto cluster also connected directly to the OLTP databases. The impact was clear: many queries took a very long time to finish, and the experience was really bad for our users.

Inadequate Monitoring Tool

We had quite comprehensive monitoring and alerting on infrastructure capacity and usage. However, we didn't have the same level of monitoring quality from the services perspective. There simply wasn't a clear standard for which metrics to monitor whenever we deployed something to production. It was therefore quite shocking for me to find out that several services had no metrics at all for throughput by response code, latency, CPU usage, memory usage, et cetera.

It was really an issue for these reasons:

  1. We couldn't really tell whether there was any problem with a service at all. That was one of the accountability issues we had.
  2. As the platform owner, it was important for us to understand everything happening in the platform. Without metrics to guide us, we couldn't provide good governance on how the platform should be used. As an example, our Presto cluster was the main tool for everybody to query all the data we had in the platform. Since we didn't have any monitoring of which jobs were actually consuming most of the compute resources in a given period, we couldn't do any analysis to figure out the right way to handle the situation, in either the short or the long term.

Operational Bottleneck

At Bukalapak, each tribe (business unit) had the ability to create their own ETL in our data platform. The expectation was that the business could move even faster by not relying on the engineering team for everything. However, with all the resource limitations we had, we were forced to introduce additional stages into our operations to prevent usage problems that would make things a lot worse. That negated the initial idea of data decentralisation. Some of the additional operations were:

  1. Before an ETL could be deployed to production, the data engineering team had to review the ETL code. The number of incoming requests was out of proportion to the number of engineers we had in the team, so some code took quite a long time to be deployed.
  2. Since our data warehouse was a PostgreSQL database, we also introduced a design review process before a table could be deployed to production. The situation was quite similar: we had many requests that we couldn't handle quickly.

A review process is in itself a good thing, as it ensures the output quality meets expectations. However, given our constraints, it became a burden on the overall process and slowed everything down.

Despite all the bad experiences above, I can't deny that this architecture had helped the organisation grow to that stage. I don't think this is about the right or wrong architecture to have; rather, it's a reminder that we need continuous improvement and evolution of what we already have.

What We Did to Improve That

We had many lessons to learn from the previous experience, and we agreed that it had been a horrible one. Therefore, it was important for us to move forward and transform it into something better, not only for us but also for our users and the business.

What did we actually do to transform and improve that?

Take a Step Back

With all the bad experiences we had, the last thing we wanted was to take actions that would later lead us back into the same situation rather than a better place. Therefore, it was important for us to take a step back and think about what we were trying to achieve and why.

As a team of data engineers, we spent a fair amount of our time maintaining infrastructure just to make sure everything was up and running. That was a good thing, because the business could keep running on data. On the flip side, however, we didn't have enough time to innovate and improve our platform so that it could bring more value to the organisation.

That was exactly the goal: do things that bring more value to the organisation.

At this stage, we agreed that as a team of data engineers we needed to focus more on things that could bring more value to the company. From that decision, we also agreed to deprioritise maintaining infrastructure ourselves and to look more into options involving managed services. With that, we could spend more time innovating and developing our services and data platform even further.

Migration to the Cloud

At the same time, there was a company-wide technology initiative to migrate our infrastructure to the cloud, specifically Google Cloud Platform. We used that as the vehicle to achieve the goal stated above.

Basically, what we did was rethink the way we designed our data pipeline and data platform. Several key considerations influenced the way we created the new architecture.

A. Managed services that offered elasticity

What we were looking for in managed services was simplicity in maintaining infrastructure as well as elasticity. In some cases, we didn't even need to think about infrastructure at all, for example with serverless services like BigQuery or Pub/Sub. With this, we could spend less time on infrastructure and shift that time to things that mattered more to the company, such as the reliability of the pipeline and the data platform.

Because of this mindset, we made many changes to our technology stack and introduced services like Google Cloud Pub/Sub, Google Cloud Storage, Google Cloud Functions, Google Cloud Dataflow, Qubole, BigQuery, and Looker.

B. Built an in-house Change Data Capture (CDC) to replace Debezium

Since Change Data Capture (CDC) was one of our biggest pain points, we paid extra attention to how we could solve it. The solution was to build a customised CDC that naturally fit what we needed.

What were the benefits that we got from building our own CDC?

  1. Full snapshots used to be very slow; in some cases it could take 2 or 3 days to finish a very large table. In our CDC, we added functionality that executes a Spark script to do the full snapshot (see the sketch after this list). As a result, a full snapshot of a very large table only takes 1 or 2 hours, depending on the cluster size.
  2. We built highly robust handling for schema changes. That way, we didn't face any problems whenever the schema of a MySQL table changed or a new attribute appeared in a MongoDB document.
  3. We had more confidence in our checkpoint mechanism, backed by many tests to make sure it was reliable. Right after we cut over to the new CDC, we no longer had any cases of data loss when we paused or resumed the stream.
  4. We wrote a lot of unit tests for the code, and we also ran quite extensive integration tests in the staging environment. That boosted our confidence in the reliability of the service.
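
To make the full-snapshot path a bit more concrete, here is a minimal PySpark sketch of the kind of job such a CDC could trigger. It is only an illustration under assumptions: the connection settings, table name, partition bounds, and output path are placeholders, not our production values.

```python
# Illustrative full-snapshot job: read an entire MySQL table over JDBC in
# parallel and land it in cloud storage as Parquet for downstream loading.
# All connection details, names, and paths below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-full-snapshot-orders")  # hypothetical job name
    .getOrCreate()
)

snapshot_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql.internal:3306/marketplace")  # placeholder host/db
    .option("dbtable", "orders")                                    # placeholder table
    .option("user", "snapshot_reader")
    .option("password", "********")
    # Split the read by primary key so many executors can pull the table at once;
    # this parallelism is what brings a multi-day snapshot down to hours.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "500000000")   # assumed max id, placeholder
    .option("numPartitions", "64")
    .load()
)

# Write the snapshot to the data lake; a later step loads it into the warehouse.
snapshot_df.write.mode("overwrite").parquet("gs://example-datalake/snapshots/orders/")
```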

C. Decentralised data platform

We structured our organisation after the Spotify model. To this day, we have more than 10 tribes that operate independently. As the data engineering team, we saw this as an opportunity to create a decentralised data platform, where each tribe would have their own. This would also solve the problem of our users fighting over resources. Not only that, each tribe would also have the independence to set their own standards for how the platform could be used.

With those considerations, we looked into what was possible from a technology standpoint and saw the potential that cloud services could give us. The simplicity of the infrastructure operations led to these ideas:

  1. Each data platform would have its own Airflow scheduler and webserver.
  2. These Airflow instances would run their tasks with KubernetesPodOperator (see the sketch after this list). That would give us scalability and elasticity without worrying about users fighting over resources.
  3. Each data platform would have its own storage, be it Google Cloud Storage or BigQuery.
  4. Other than KubernetesPodOperator, the platform would also provide the ability to use a Spark cluster powered by Qubole.
  5. Although the platform was decentralised, we would still provide centralised access to the data to avoid any confusion for our users.
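
As a rough illustration of idea number 2, here is a minimal Airflow DAG sketch using KubernetesPodOperator. The DAG id, namespace, image, and command are hypothetical placeholders, and the import path varies with the Airflow version; the real tribe DAGs are of course different.

```python
# Sketch of a tribe-owned ETL task running as an isolated Kubernetes pod.
# DAG id, schedule, namespace, image, and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="tribe_marketplace_daily_sales",   # hypothetical tribe DAG
    schedule_interval="0 2 * * *",            # daily at 02:00
    start_date=datetime(2020, 6, 1),
    catchup=False,
) as dag:
    build_daily_sales = KubernetesPodOperator(
        task_id="build_daily_sales",
        namespace="tribe-marketplace",                    # per-tribe namespace (assumed)
        image="gcr.io/example-project/etl-jobs:latest",   # placeholder image
        cmds=["python", "jobs/daily_sales.py"],
        get_logs=True,
        # Each task runs in its own pod on Kubernetes, so compute scales
        # elastically and tribes don't compete for a shared executor.
        is_delete_operator_pod=True,
    )
```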

We implemented those ideas and set a clear expectation that each data platform would be owned by its tribe, with data engineers there to support it from the technical standpoint.

D. Improve monitoring capability

Since monitoring was one of our pain points, we spent a fair amount of energy making sure we monitored the right metrics, especially actionable ones. We started by improving the service metrics. At the same time, a transformation was also happening in the rest of the technology division, and monitoring was part of it, so there was a clear standard for how monitoring of a particular service should be built. We simply followed that standard.

The other type of monitoring we improved was data platform usage monitoring. As stated previously, it used to be difficult for us to see which jobs were submitted to our platform. We were blessed by the level of support from the services we used, like Qubole, BigQuery, and Google Cloud Storage: they provide plenty of logs that we could use to build monitoring dashboards and reports. Since we started using those services, many of our analyses have turned into solutions that improve our users' experience on the platform.
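
As one example of the kind of usage analysis this enables, the sketch below pulls the heaviest queries of the last day from BigQuery's INFORMATION_SCHEMA jobs view. The project id and region are placeholders, and our actual dashboards and reports are built differently; this only illustrates the idea of turning service logs into usage insight.

```python
# Illustrative usage-monitoring query: yesterday's heaviest BigQuery jobs
# ranked by slot time. Project id and region below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-data-platform")  # placeholder project

query = """
SELECT
  user_email,
  job_id,
  total_slot_ms / 1000 / 60          AS slot_minutes,
  total_bytes_processed / POW(2, 30) AS gib_processed
FROM `region-asia-southeast2`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 20
"""

for row in client.query(query).result():
    print(f"{row.user_email}\t{row.slot_minutes:.1f} slot-min\t{row.gib_processed:.1f} GiB")
```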

Engineering Best Practice

To transform successfully, we couldn't rely on the architecture transformation alone; we also needed to transform the mindset of the engineers. The most important part was how we developed, tested, and delivered services to our users. With that in mind, there were several aspects we had to change.

A. Automated tests and refactoring

In my early days at Bukalapak, I stated clearly what I expected from the engineers, especially around writing automated tests and refactoring code. Since then, the change gradually started to happen. Less and less testable code was merged into the master branch without unit tests. Engineers became more proactive in providing suggestions during code review, including suggestions about refactoring. Specific refactoring initiatives slowly emerged and were put into sprint planning. I could see that the engineers really understood why we needed to do this, as well as the benefits for the services and for themselves.

B. Automate other development operations

To avoid unnecessary repetitive operations and human error, we needed to implement automation in many areas, one of which was development automation. By development automation I mean all the automation needed to move a change, once it is merged, through deployment to the staging environment and testing, all the way to deployment in the production environment.

With that in place, we could click a button to deploy our changes to the staging environment and have them tested with the automated integration tests we had already written, and then click another button to deploy the changes to the production environment.

C. Reliable staging environment

Unit tests were not enough; we still needed integration tests that mimicked the behavior of the production system, and therefore a staging environment that supported them. It's not that we didn't have a staging environment before, we just didn't use it. One of the expectations of the change was that we, the engineers, always made sure code was well tested in the staging environment before we deployed it to the production system.

D. No more manual infrastructure operations

Another aspect of automation we needed to improve was infrastructure operations. Things like provisioning a machine, performing a maintenance operation on a database, or decommissioning a machine should all be expressed in code and be executable. Once they are written as code, we can easily put them in a pipeline that we can trigger at will. This way, we avoid unnecessary human error when performing infrastructure operations. Not only that, we can also test them first in the staging environment, to make sure we are doing the right thing before actually executing them in the production environment.

Better Communication

Another non-technical aspect that played an important role in the success of this transformation was communication. We reset the quality of communication between the data engineers and our users. We held a regular sync-up to share updates and feedback, and we also built a clear communication channel so our users could easily report issues or deliver feedback to us. We added a clear expectation that the on-call engineer would help address the issues, feedback, and problems our users shared through that channel. That effort was very helpful in building trust between the data engineers and our users. Most importantly, it also helped us get buy-in from our users on why we needed to make this transformation.

How It Looks Today

Figure 2. The Current Data Platform Architecture

Fast forward to the present: Figure 2 shows the high-level architecture of our data platform today. It hasn't changed significantly from the component standpoint, because we still need those components, but the implementation has changed significantly. For CDC, we now use the custom-made CDC that replaced Debezium, and it runs a lot more reliably. We still use Apache Kafka as the message queue system, but in other parts of the system we also use Google Cloud Pub/Sub. As for streaming, we write the logic with Apache Beam and deploy it on Google Cloud Dataflow, which saves us a lot of time and energy on dealing with infrastructure.
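
To give a flavour of what those streaming jobs look like, here is a minimal Apache Beam sketch in Python that reads events from Pub/Sub and streams them into BigQuery via the Dataflow runner. The topic, table, project, and region names are placeholders, and the real pipelines carry more transformation logic.

```python
# Minimal streaming pipeline sketch: Pub/Sub -> parse JSON -> BigQuery.
# Project, region, topic, bucket, and table names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",             # run the job on Google Cloud Dataflow
    project="example-data-platform",     # placeholder project
    region="asia-southeast2",
    temp_location="gs://example-datalake/tmp/",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/example-data-platform/topics/cdc-orders"
        )
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="example-data-platform:realtime.orders",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```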

For the data lake, we use Google Cloud Storage and BigQuery. Data that our users want to query immediately is stored in BigQuery, while data that needs further processing is stored in Google Cloud Storage. For big data processing, we write the code with Apache Spark and run it on a cluster orchestrated by Qubole, which has proven to cure our infrastructure elasticity headaches. Once the data is processed, we store it in BigQuery so that it can be easily queried. For real-time data, we implemented a lambda architecture approach in BigQuery, and it works very reliably.
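
The lambda-style serving in BigQuery can be pictured, in very simplified form, as a view that reads older data from the batch tables and the most recent data from the streaming tables. The project, dataset, table, and column names below are placeholders; this is a sketch of the general idea, not our exact implementation.

```python
# Simplified lambda-style serving view: recent rows from the streaming table,
# older rows from the batch table. All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-data-platform")  # placeholder project

view_sql = """
CREATE OR REPLACE VIEW `example-data-platform.serving.orders` AS
SELECT * FROM `example-data-platform.batch.orders`
WHERE event_date < CURRENT_DATE()
UNION ALL
SELECT * FROM `example-data-platform.realtime.orders`
WHERE event_date >= CURRENT_DATE()
"""

client.query(view_sql).result()  # the view now serves combined batch + real-time data
```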

The new component in our data architecture is the tribe data platform. As I mentioned earlier, each tribe has their own data platform and can easily access the data we have at Bukalapak. Today they operate very independently, and we no longer hear about problems around compute resources.

As for data visualisation, we still use Redash to a certain degree, especially for exploratory data analysis. For regular reporting, however, we aim to use Looker as the default tool.

Business Impact

What does this transformation bring to the business?

With all this heavy transformation, I would say that is the first question that comes to the minds of business people. We invested a lot of energy, time, and effort, and we also sacrificed some of our usual business needs to make this change successful. So what are the benefits for the business?

Speed: More Efficient Decision Making Process

At the time of writing, more than 55,000 queries are submitted to our data platform every day, and 91% of them finish in around 1.5 minutes. That is a huge change compared to the old platform, where less than 50% of queries finished at that speed. This improvement helps our users move much faster; they can make far more decisions than they could before.

On the development side, the cycle is now shorter and faster. With so much automation in place, we can skip unnecessary manual operations that used to take a lot of time.

Reliability: More Effective Decisions

Prior to this transformation, we didn't have any SLA we could offer our users. The transformation helped us become more accountable, since we can now provide them with a 92% SLA on data readiness. What does that actually mean?

Since most of our data is processed daily, a 92% SLA means we guarantee that the correct data is available on roughly 28 days of a 30-day month (0.92 × 30 ≈ 27.6). I still see a lot of room for improvement there, but at this point we are able to provide a sense of certainty about what we deliver to our users.


Hafiz Badrie Lubis
Bukalapak Data

Head of Data at Doctor Anywhere, a big fan of Incubus and Chelsea FC, and a blessed man with a loving wife and cheerful kids.