User Action Service: from the Monolith to the Cloud

Souhaib Guitouni
BlaBlaCar
Jun 23, 2020

Introduction

Historically, BlaBlaCar’s backend was a big PHP monolith. It allowed us to iterate and build features quickly, but in the long run, it was not easy to understand, maintain and evolve. In the last few years, we have been breaking down this old monolith and moving toward a Service-Oriented Architecture.

This article is one of the stories of how we are managing this transition, with some details on the tools that helped us along the way. It will cover the technical stack choices, the general architecture and feedback on the tools we used.

I — The whys

As BlaBlaCar’s activity expands, having a monolithic architecture becomes more and more limiting, mainly because:

  • As monoliths grow, their components become more and more tightly coupled;
  • It is hard to organise teams working on separate backlogs within the same code base;
  • Monoliths are hard for newcomers to grasp;
  • Because of all the above, any new functionality becomes extremely costly.

This is why, in the last few years, we’ve been redesigning our platform around a Service-Oriented Architecture, looking for:

  • Functionally scoped services: each service manages a single functionality and is fully responsible for how it implements it.
  • Loosely coupled services that can evolve independently. Teams can work and deploy at their own pace.
  • Stronger code ownership: services have a well-defined, narrower scope, allowing teams to maintain and operate them more easily.

II — Scope and history of user actions

The latest part to be extracted into an independent service was user actions. User actions help our Community Relations Team diagnose and solve users’ problems, such as issuing refunds, detecting anomalies or tracing fraudulent behavior.

The volume of user actions, as you might expect, is colossal. Historically stored in a single SQL table, this part of the code has long had performance issues. We found solutions every time, but as we expanded, the problems kept getting bigger:

  • One first solution was to create indexes to make queries faster;
  • Another was to shard the database into multiple tables.

Still, these solutions reached their limits, and over time performance degraded again.

III — Data storage

1 — The road to NoSQL

The choice of the data store was mainly based on technical requirements:

  • Huge amounts of data to store
  • A large volume of inserts
  • Many reads per second, with a need for fresh data and good performance
  • The ability to filter within a reasonable time

We were clearly looking at NoSQL databases, as their distributed nature would allow us to store and manipulate large amounts of data and to scale when needed.

2 — Managed service

We also wanted to go with a managed service to reduce maintenance costs. Distributed databases like HBase and Cassandra are famous for the operational burden they put on database management teams. More importantly, as an application development team, we wanted to focus on the logic of handling user actions rather than spend our time fixing problems with the database infrastructure. It was also an opportunity to test a new product, since the scope of the application is not too wide. We wanted to give something new a shot!

3 — Final choice

The choice landed on Bigtable for the following reasons:

  • Volume: it was designed to handle huge amounts of data, unlike most SQL databases. A node can handle around 10,000 inserts per second, so we chose the minimal configuration of only three nodes (a short write sketch follows this list).
  • Scalability: it scales and replicates easily, and it also lets you manage data versions.
  • Management cost: it’s a managed service.
  • Performance: as a GCP service, it gives us excellent performance because it sits in the same infrastructure as our services deployed on GKE.
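
To make the keying pattern concrete, here is a minimal sketch of writing a user action with the Bigtable Java client. The table name, column family and userId#reversedTimestamp row key are assumptions for illustration, not our actual schema.

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class UserActionWriter {

    // Hypothetical names: adapt to your own project, instance and schema.
    private static final String TABLE_ID = "user_actions";
    private static final String COLUMN_FAMILY = "a";

    private final BigtableDataClient client;

    public UserActionWriter(BigtableDataClient client) {
        this.client = client;
    }

    public void write(String userId, long timestampMillis, String actionType, String payload) {
        // The row key groups a user's actions together and orders them from newest
        // to oldest by reversing the timestamp (a common Bigtable keying pattern).
        String rowKey = userId + "#" + (Long.MAX_VALUE - timestampMillis);

        RowMutation mutation = RowMutation.create(TABLE_ID, rowKey)
                .setCell(COLUMN_FAMILY, "type", actionType)
                .setCell(COLUMN_FAMILY, "payload", payload);

        client.mutateRow(mutation); // single-row upsert, idempotent for a given key
    }

    public static void main(String[] args) throws Exception {
        try (BigtableDataClient client = BigtableDataClient.create("my-gcp-project", "my-instance")) {
            new UserActionWriter(client).write("user-42", System.currentTimeMillis(), "REFUND_REQUEST", "{}");
        }
    }
}
```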

IV — Jumping from one moving train to another

The hard thing about moving from a monolith to a service is providing continuity of service from an external point of view while losing no data.

To achieve this, we decided to:

  • Run the new service and keep the old system working, writing new actions via REST to both of them (a rough sketch of this dual write follows);
  • Then move the historical data from the monolith to the new service. This operation can create duplicates, so the initial data load has to handle deduplication;
  • Then read from the new service instead of the old system;
  • Finally, shut down the monolith part and the legacy SQL database once everything works in the new service.
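
As a rough illustration of the dual-write step: the real monolith is in PHP, so this Java sketch with hypothetical class and endpoint names only shows the pattern, not our actual code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DualWriteUserActionRecorder {

    private final LegacyUserActionDao legacyDao;          // existing SQL-backed write path
    private final HttpClient httpClient = HttpClient.newHttpClient();
    private final URI newServiceUri = URI.create("https://user-actions.internal/actions"); // hypothetical endpoint

    public DualWriteUserActionRecorder(LegacyUserActionDao legacyDao) {
        this.legacyDao = legacyDao;
    }

    public void record(String userActionJson) throws Exception {
        // 1) The old system stays the source of truth for now.
        legacyDao.insert(userActionJson);

        // 2) The same action is also written to the new service over REST; the
        //    backfill job later deduplicates anything present on both sides.
        HttpRequest request = HttpRequest.newBuilder(newServiceUri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(userActionJson))
                .build();
        httpClient.send(request, HttpResponse.BodyHandlers.discarding());
    }

    /** Placeholder for the existing legacy data-access code. */
    public interface LegacyUserActionDao {
        void insert(String userActionJson);
    }
}
```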

V — Moving Data

1 — From MySQL to Bigtable

Having decided on the target datastore and how it would be deployed, we started thinking about ways to move data from the monolith to the new database. Such an operation is critical, and the migration job had to be:

  • Simple and easy to develop: the main complexity being the database connectors;
  • High-performing: as it reads huge amounts of data, we wanted the result as quickly as possible;
  • Reliable: we wanted no data loss;
  • Easy to restart: in case things went wrong, we wanted something we could simply restart without a lot of manual actions, shell scripts everywhere or a complex procedure;
  • Able to transform data: we needed to join data from multiple MySQL databases and tables to obtain the final Bigtable rows, which would be hard to script manually.

2 — Apache Beam and Dataflow

We chose Dataflow, a managed distributed processing service provided by Google Cloud that can move data in both batch and streaming modes.

Apache Beam, on the other hand, is a framework for writing distributed processing pipelines. You can then run them on many distributed engines like Dataflow, Spark, Flink… In other words, it’s an abstraction layer for defining transformations that can be understood by multiple engines.

Another cool thing about Apache Beam is that it ships with out-of-the-box connectors for multiple data sources, including both MySQL and Bigtable. So we wrote the data transformations: reading from the legacy MySQL tables, joining them and writing the result to Bigtable.
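
Below is a simplified sketch of what such a Beam pipeline can look like in Java: a JdbcIO read from the MySQL replica, a transform into Bigtable mutations, and a BigtableIO write. Connection details, queries, column names and the row-key format are assumptions, and the joins across several tables are omitted here.

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.extensions.protobuf.ByteStringCoder;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class UserActionsMigrationPipeline {

    public static void main(String[] args) {
        // Pass --runner=DataflowRunner (plus project/region options) to execute on Dataflow.
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // 1) Read legacy rows from a MySQL replica (simplified to one query).
            .apply("ReadFromMySQL", JdbcIO.<KV<String, String>>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.cj.jdbc.Driver", "jdbc:mysql://replica-host/legacy_db")
                        .withUsername("migration")
                        .withPassword("secret"))
                .withQuery("SELECT user_id, created_at, action_type FROM user_actions")
                .withRowMapper((JdbcIO.RowMapper<KV<String, String>>) rs -> KV.of(
                    rs.getString("user_id") + "#" + rs.getTimestamp("created_at").getTime(),
                    rs.getString("action_type")))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())))

            // 2) Map each legacy row to a Bigtable mutation keyed by the target row key.
            //    Writes are upserts, so re-running the job with the same keys deduplicates itself.
            .apply("ToBigtableMutations", ParDo.of(
                new DoFn<KV<String, String>, KV<ByteString, Iterable<Mutation>>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        Mutation setCell = Mutation.newBuilder()
                            .setSetCell(Mutation.SetCell.newBuilder()
                                .setFamilyName("a")
                                .setColumnQualifier(ByteString.copyFromUtf8("type"))
                                .setValue(ByteString.copyFromUtf8(c.element().getValue())))
                            .build();
                        c.output(KV.of(
                            ByteString.copyFromUtf8(c.element().getKey()),
                            Collections.singletonList(setCell)));
                    }
                }))
            .setCoder(KvCoder.of(ByteStringCoder.of(), IterableCoder.of(ProtoCoder.of(Mutation.class))))

            // 3) Write the mutations to the target Bigtable table.
            .apply("WriteToBigtable", BigtableIO.write()
                .withProjectId("my-gcp-project")
                .withInstanceId("my-bigtable-instance")
                .withTableId("user_actions"));

        pipeline.run().waitUntilFinish();
    }
}
```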

3 — How did it go?

We made a few observations while running the migration:

  • It was an exhausting task for MySQL: to avoid performance issues, you must by all means avoid running it against a critical database. We therefore ran it against a replica dedicated to such tasks. We followed the execution in our monitoring stack (Prometheus/Grafana), which showed how Dataflow’s parallel workers exhausted MySQL.
  • It was not easy for Bigtable either: our small instance of 3 nodes could take around 30,000 inserts per second, and running multiple Dataflow jobs pushes you well beyond that. So think carefully about how many workers you want in parallel, because maxing them out will just make you lose money.
  • The MySQL Beam connector does not stream data: it executes a SQL statement and waits to load the whole response before continuing. This causes RAM problems if you load a lot of data, so you need to test and define ranges of users to avoid ending up with an out-of-memory error (see the sketch after this list).
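
One way to work around the last point, sketched below, is to split the read into bounded user-id ranges and flatten them back into a single collection, so no single query result has to fit into a worker’s memory. Range bounds, connection details and column names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class RangedMySqlRead {

    /** Reads user actions in bounded user-id ranges and merges them into one PCollection. */
    public static PCollection<KV<String, String>> read(Pipeline pipeline, long maxUserId, long rangeSize) {
        JdbcIO.DataSourceConfiguration config = JdbcIO.DataSourceConfiguration
            .create("com.mysql.cj.jdbc.Driver", "jdbc:mysql://replica-host/legacy_db")
            .withUsername("migration")
            .withPassword("secret");

        List<PCollection<KV<String, String>>> ranges = new ArrayList<>();
        for (long start = 0; start < maxUserId; start += rangeSize) {
            long end = start + rangeSize;
            // Each bounded query stays small enough to be loaded by the connector.
            ranges.add(pipeline.apply("ReadUsers_" + start, JdbcIO.<KV<String, String>>read()
                .withDataSourceConfiguration(config)
                .withQuery("SELECT user_id, created_at, action_type FROM user_actions"
                    + " WHERE user_id >= " + start + " AND user_id < " + end)
                .withRowMapper((JdbcIO.RowMapper<KV<String, String>>) rs -> KV.of(
                    rs.getString("user_id") + "#" + rs.getTimestamp("created_at").getTime(),
                    rs.getString("action_type")))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))));
        }

        // Flatten merges the per-range reads back into a single collection
        // that can feed the same mutation and write steps as before.
        return PCollectionList.of(ranges).apply("MergeRanges", Flatten.pCollections());
    }
}
```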

VI — The Backend Service

Multiple architecture patterns were possible. We wanted an architecture that could evolve in the future and that we could build step by step, with live traffic from the first step.

We decided to build the service with Spring WebFlux, the reactive framework, to get non-blocking requests and therefore better performance, using functional endpoints. We used Reactor for the best integration with Spring.
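
As an illustration, a functional WebFlux endpoint with Reactor can look like the following minimal sketch. The routes, domain type and repository interface are hypothetical, not our actual API.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.reactive.function.server.RouterFunction;
import org.springframework.web.reactive.function.server.RouterFunctions;
import org.springframework.web.reactive.function.server.ServerResponse;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

@Configuration
public class UserActionRoutes {

    // Hypothetical domain type and reactive repository, for illustration only.
    // In practice the repository implementation would wrap the Bigtable client.
    public record UserAction(String userId, long timestamp, String type) {}

    public interface UserActionRepository {
        Flux<UserAction> findByUserId(String userId);
        Mono<Void> save(UserAction action);
    }

    // Functional endpoints: request handling stays non-blocking end to end.
    @Bean
    public RouterFunction<ServerResponse> routes(UserActionRepository repository) {
        return RouterFunctions.route()
                .GET("/users/{userId}/actions", request -> ServerResponse.ok()
                        .body(repository.findByUserId(request.pathVariable("userId")), UserAction.class))
                .POST("/users/{userId}/actions", request -> request.bodyToMono(UserAction.class)
                        .flatMap(repository::save)
                        .then(ServerResponse.noContent().build()))
                .build();
    }
}
```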

1 — Integration testing with Bigtable

Integration tests call the API with multiple payloads/headers and check whether the response is as expected. To do so, we needed a Bigtable instance behind our service, as we wanted real-life tests.

This instance needs to be cleaned on each run to start from the same state each time.

The solution was to use the Bigtable emulator that ships with the Bigtable client SDK. The SDK is very rich and has other features such as admin management and monitoring.
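
A minimal sketch of an integration test using the emulator through the client SDK’s JUnit rule might look like this, assuming hypothetical table and column family names.

```java
import static org.junit.Assert.assertEquals;

import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.BigtableTableAdminSettings;
import com.google.cloud.bigtable.admin.v2.models.CreateTableRequest;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.BigtableDataSettings;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowMutation;
import com.google.cloud.bigtable.emulator.v2.BigtableEmulatorRule;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;

public class UserActionStoreIT {

    // Starts a fresh in-process Bigtable emulator on a random port for each test,
    // which guarantees every run begins from the same clean state.
    @Rule
    public final BigtableEmulatorRule bigtableEmulator = BigtableEmulatorRule.create();

    private BigtableDataClient dataClient;

    @Before
    public void setUp() throws Exception {
        BigtableTableAdminSettings adminSettings = BigtableTableAdminSettings
                .newBuilderForEmulator(bigtableEmulator.getPort())
                .setProjectId("test-project")
                .setInstanceId("test-instance")
                .build();
        try (BigtableTableAdminClient adminClient = BigtableTableAdminClient.create(adminSettings)) {
            adminClient.createTable(CreateTableRequest.of("user_actions").addFamily("a"));
        }

        BigtableDataSettings dataSettings = BigtableDataSettings
                .newBuilderForEmulator(bigtableEmulator.getPort())
                .setProjectId("test-project")
                .setInstanceId("test-instance")
                .build();
        dataClient = BigtableDataClient.create(dataSettings);
    }

    @Test
    public void writesAndReadsBackAUserAction() {
        dataClient.mutateRow(RowMutation.create("user_actions", "user-42#0")
                .setCell("a", "type", "REFUND_REQUEST"));

        Row row = dataClient.readRow("user_actions", "user-42#0");
        assertEquals("REFUND_REQUEST", row.getCells("a", "type").get(0).getValue().toStringUtf8());
    }
}
```

In the actual service, the same emulator would sit behind the API under test rather than being called directly, but the setup and clean-state guarantee are the same.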

2 — Deployment and infrastructure

Our GCP infrastructure runs on top of Google Kubernetes Engine and Istio. We manage the whole thing with Helm and Flux. Other articles going into more detail about our infrastructure are coming to our blog!

Of course, our service has the classic monitoring, logging and alerting endpoints, based on Prometheus/Grafana and ELK; without them we wouldn’t have been able to go live.

Final thoughts

To manage the large amounts of data, we went with Bigtable for its ability to scale. We faced multiple challenges, such as defining the data schema around Bigtable’s keying pattern, moving large amounts of data from an on-premise SQL database to a managed NoSQL one, and testing and monitoring the whole thing.

At first we deployed the service in our on-premise datacenter; by moving it to GCP we divided latency by three. Our Community Relations Teams now have quicker response times and are thrilled.
