Splitting the monolith with PHP and Kafka

Andrii Popov
Apr 17, 2024 · 10 min read

In this article, I will describe the experience our team had when we split our monolith into separate, independent services.

Intro

Let me start with a few words about our application. It has both web and mobile interfaces. The web interface serves as the initial gateway for a new user. After making a purchase, the user is expected to install the app on their mobile device to use our product.

From the backend perspective, our application was a single monolith written in PHP with the Symfony framework. The rest of the stack is more or less traditional — a database, a cache layer, and some cloud infrastructure.

After several years of development, our software team expanded significantly. Product managers were generating new features rapidly, causing our repository to become bloated with pull requests. This resulted in frequent git conflicts and delivery delays.

There should be a reason to split the system, and probably not a 100% technical one, but rather an operational one.

Image by AI illustrating a large dev team that works in one repository

How to split

So the decision was made: we needed to split the monolith into chunks so that our teams could be more flexible and independent. After some investigation, the final solution was ready. We would split it into large chunks, not by product features, but by the way users interact with our product: through the web interface or the mobile interface. So, in a way, it was a “macro”-service architecture, because we split the monolith into two large parts.

Splitting the system must solve a business problem. Therefore, the development team must take that into account when deciding how to split it and what the resulting services will be.

Technically, we needed a message broker to manage business events and subscribers (the parts of our system). While RabbitMQ might have been the popular choice, we required our messages to be processed in a particular order within the context of each user. For instance, an update of a user's data must be processed only after the creation of that user, a rebill order must be processed only after the initial subscription order is created, and so on. Additionally, we wanted other parties to be able to consume the business events as well. For example, the analytics team would be interested in processing this data.

This led us to a solution involving a Kafka cluster. The problem was that we weren't satisfied with the existing Kafka client implementations in PHP, because we wanted maximum control. So we decided to implement our own Kafka client on top of PHP and the Rd-Kafka extension.
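
For context, here is a minimal sketch of what publishing a message looks like at the level of the raw Rd-Kafka extension that such a client has to wrap. The broker address, topic name, and payload are placeholders, not our actual setup.

```php
<?php
// Minimal sketch of publishing a message with the Rd-Kafka (php-rdkafka) extension.
// Broker address, topic name and payload are placeholders.

$conf = new RdKafka\Conf();
$conf->set('bootstrap.servers', 'kafka:9092');

$producer = new RdKafka\Producer($conf);
$topic    = $producer->newTopic('user-events');

// Key the message by user UUID so all events of one user land in the same
// partition and therefore keep their order.
$payload = json_encode(['type' => 'user-created', 'userId' => 'a1b2-...']);
$topic->produce(RD_KAFKA_PARTITION_UA, 0, $payload, 'a1b2-...');

// poll() serves delivery callbacks; flush() waits until the local queue is drained.
$producer->poll(0);
if ($producer->flush(10000) !== RD_KAFKA_RESP_ERR_NO_ERROR) {
    throw new \RuntimeException('Message was not delivered to Kafka');
}
```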

Implementation details

Overall, the implementation of the Kafka client went fine — it is always fun to write code. The client consisted of two parts: the library itself, which is agnostic of any PHP framework (and of the underlying Kafka driver), and a Symfony bundle for it.
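
To illustrate the idea (the names below are hypothetical, not our actual library code), the framework-agnostic part boils down to a small producer abstraction, with the driver hidden behind one implementation. The Symfony bundle then only wires an implementation into the container.

```php
<?php
// Illustrative sketch of a driver-agnostic abstraction (hypothetical names).
// Business code depends on the interface alone; the bundle binds an implementation.

interface EventProducer
{
    /**
     * @param string $topic   target Kafka topic
     * @param string $key     partition key (user UUID in our case)
     * @param array  $payload serializable event body
     */
    public function publish(string $topic, string $key, array $payload): void;
}

final class RdKafkaEventProducer implements EventProducer
{
    public function __construct(private RdKafka\Producer $producer) {}

    public function publish(string $topic, string $key, array $payload): void
    {
        $this->producer->newTopic($topic)
            ->produce(RD_KAFKA_PARTITION_UA, 0, json_encode($payload), $key);
        $this->producer->poll(0);
    }
}
```

Keeping the abstraction this thin is what later made it cheap to swap the Kafka-backed producer for a database-backed one.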

Investing time in developing in-house solutions can be a good thing. Spend one more month and get a fully controlled, well-understood tool.

Problems arose when we started to test it in practice. We faced performance and infrastructural issues with the Rd-Kafka driver when publishing events. For example, the Kafka cluster became unavailable for a certain amount of time while it was being updated, so we couldn't guarantee that database changes would stay consistent with the events in the broker. We also didn't find many examples of this library being used to publish and consume events in production with PHP, so it was hard to come up with a straightforward solution.

Transactional Outbox

To address these problems, we used the Transactional Outbox pattern. Instead of publishing events directly to the broker, we replaced the Kafka producer implementation with a database one (in our case, a Doctrine implementation). As a result, each serialized event is inserted into a special outbox table whose columns hold the topic metadata.

This approach was much more comfortable for us for several reasons. Firstly, we could control the transaction in the database context, just like any other write operation within our business logic. Secondly, we (meaning PHP) didn't rely on any third party (besides the traditional monolith stack) when publishing an event. And, of course, it was much easier for the developers. What could be simpler than an extra insert into a plain table, right?
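
As a sketch, reusing the hypothetical EventProducer interface from above, the outbox producer can be as small as a single insert executed through Doctrine DBAL; the table and column names here are illustrative.

```php
<?php
// Sketch of a database-backed producer for the Transactional Outbox
// (table and column names are illustrative). The insert runs in the same
// transaction as the business change, so the event and the data stay consistent.

use Doctrine\DBAL\Connection;

final class OutboxEventProducer implements EventProducer
{
    public function __construct(private Connection $connection) {}

    public function publish(string $topic, string $key, array $payload): void
    {
        $this->connection->insert('outbox', [
            'topic'         => $topic,
            'partition_key' => $key, // user UUID
            'payload'       => json_encode($payload),
            'created_at'    => (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
        ]);
    }
}

// Usage inside business logic: one transaction covers both the state change
// and the outbox record.
// $connection->transactional(function () use ($users, $producer, $user) {
//     $users->save($user);
//     $producer->publish('user-events', $user->uuid(), ['type' => 'user-created']);
// });
```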

The decision to publish to DB instead of Kafka.

Kafka Connect

The new problem, on the other hand, was how to fetch these events and push them into the Kafka topic. We used Kafka Connect for that — a tool from the Kafka ecosystem that connects to a data source, fetches data as soon as it changes, and publishes it to a Kafka topic. In our case, it was Debezium running on Kafka Connect.

To use this approach, we needed to create a “connector” that would read the outbox database table. Connectors are configurable entities that establish connections to data sources and pull the data. They are managed through the Kafka Connect REST API, which allows you to create, read, update, or delete a connector.
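
For illustration, creating such a connector might look roughly like the snippet below, here done from PHP against the Kafka Connect REST API. All names, hosts, and credentials are placeholders, and a real configuration has more keys.

```php
<?php
// Rough sketch of registering a Debezium MySQL connector through the Kafka Connect
// REST API. Values are placeholders; the exact set of required config keys depends
// on the Debezium version (newer releases also expect a topic prefix and
// schema-history settings), so check the Debezium documentation.

$connector = [
    'name'   => 'outbox-connector',
    'config' => [
        'connector.class'    => 'io.debezium.connector.mysql.MySqlConnector',
        'database.hostname'  => 'mysql',
        'database.port'      => '3306',
        'database.user'      => 'debezium',
        'database.password'  => 'secret',
        'database.server.id' => '184054',
        // Capture only the outbox table.
        'table.include.list' => 'app.outbox',
    ],
];

$ch = curl_init('http://kafka-connect:8083/connectors');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS     => json_encode($connector),
    CURLOPT_RETURNTRANSFER => true,
]);
echo curl_exec($ch), PHP_EOL;
curl_close($ch);
```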

There are connector implementations for various database platforms. In our case, we used the MySQL connector. It uses the MySQL binary log (binlog) to keep the events in the database in sync with the events that have been pushed to the Kafka topic.

Publishing events to Kafka topics through Kafka Connect.

Kafka Connect is a very “sensitive” tool. If anything happens to the binlog file, it can be very hard to recover, because Kafka Connect loses track of which events from the database still have to be pushed to the Kafka topic. Additionally, one has to pay attention to the configuration of the connector itself. For example, if anything in the database connection URL changes, you have to update the connector configuration too; otherwise, it will no longer be able to connect to the database. Furthermore, there are different strategies for creating so-called “snapshots” of the database table, which are crucial for data integrity and consistency. The configuration of Kafka Connect and its connectors is stored as messages in a special Kafka topic.

If you decide to use Kafka Connect, it is critical to study how exactly it works and how to fix its issues.

Failure handling

Of course, we needed to handle any failures that could happen within the communication process between services. There are two types of failures.

Failure to produce events from the service

In our case, that means the service couldn't perform an SQL insert into the outbox table. Why would that happen? Mostly, it was a violation of the JSON validation constraints that were applied to the events via the Schema Registry tool, according to our communication contracts. It can also mean infrastructural problems related to database availability.

If we are unable to publish an event to the outbox, we save it into a special table that holds failed outbox messages. It contains the raw JSON payload, the target topic, a timestamp, and other useful metadata. Dealing with this table is the responsibility of the developer team — manually. They investigate the alerts, logs, event contents, and whatever else is needed to understand what happened. We also developed a corresponding CLI — a set of Symfony Console commands that perform common actions on these failed messages. For example, a developer can try to republish a message to the outbox as is, or transform (and therefore fix) it first. This way we could track publishing failures and republish the messages once we had fixed the issue.
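
To give an idea of what such a command might look like (the command name and table structure below are hypothetical, not our actual CLI), here is a sketch of republishing a failed message back into the outbox:

```php
<?php
// Hypothetical sketch of a CLI command that moves a failed message back into the
// outbox table. Table and column names are illustrative.

use Doctrine\DBAL\Connection;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

#[AsCommand(name: 'outbox:failed:republish')]
final class RepublishFailedOutboxMessageCommand extends Command
{
    public function __construct(private Connection $connection)
    {
        parent::__construct();
    }

    protected function configure(): void
    {
        $this->addArgument('id', InputArgument::REQUIRED, 'ID of the failed message');
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $failed = $this->connection->fetchAssociative(
            'SELECT * FROM outbox_failed WHERE id = ?',
            [$input->getArgument('id')]
        );

        if ($failed === false) {
            $output->writeln('Failed message not found.');
            return Command::FAILURE;
        }

        // Re-insert the (possibly manually fixed) payload into the outbox and
        // remove it from the failed table, in one transaction.
        $this->connection->transactional(function () use ($failed): void {
            $this->connection->insert('outbox', [
                'topic'         => $failed['topic'],
                'partition_key' => $failed['partition_key'],
                'payload'       => $failed['payload'],
                'created_at'    => (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
            ]);
            $this->connection->delete('outbox_failed', ['id' => $failed['id']]);
        });

        $output->writeln('Message republished to outbox.');
        return Command::SUCCESS;
    }
}
```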

Failure to consume the event by the service

The consumer in our case is a Symfony Console command that runs the Rd-Kafka consumer on the specified topic. If it fails (for instance, a RuntimeException is raised inside the handler), then every subsequent event that falls into the same partition becomes stuck and can't be processed. This is a serious problem: if the consumer fails to process an event for a user whose state is somehow inconsistent with the app, then the following events, which are perfectly fine, cannot be processed either.

We decided that for each topic there would be a Dead Letters (DL) topic with a separate consumer group. If a consumer fails to process a message, we catch the exception and republish the original event to the DL topic, where it will be picked up by the DL consumer.
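
A simplified sketch of that consumer loop could look like this; the topic names, group id, $handler, and $dlProducer (an implementation of the EventProducer sketched earlier) are placeholders:

```php
<?php
// Simplified consumer-loop sketch: a handler failure does not block the partition,
// the original event is republished to the Dead Letters topic instead.

$conf = new RdKafka\Conf();
$conf->set('bootstrap.servers', 'kafka:9092');
$conf->set('group.id', 'user-events-consumer');
$conf->set('auto.offset.reset', 'earliest');

$consumer = new RdKafka\KafkaConsumer($conf);
$consumer->subscribe(['user-events']);

while (true) {
    $message = $consumer->consume(10000);

    if ($message->err === RD_KAFKA_RESP_ERR__TIMED_OUT
        || $message->err === RD_KAFKA_RESP_ERR__PARTITION_EOF) {
        continue; // nothing new, keep polling
    }
    if ($message->err !== RD_KAFKA_RESP_ERR_NO_ERROR) {
        throw new \RuntimeException($message->errstr());
    }

    $event = json_decode($message->payload, true);

    try {
        $handler->handle($event);
    } catch (\RuntimeException $e) {
        // Do not block the partition: redirect the event to the DL topic,
        // where a separate consumer group will pick it up.
        $dlProducer->publish('user-events-dl', $message->key, $event);
    }

    $consumer->commit($message);
}
```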

Republishing a failing event to the DL topic.

This approach lets us avoid blocking consumption in the original topic and “redirect” problems to the DL topics and consumers. Because the DL consumer runs the same logic, 99% of the time it will also fail, and then the developer team must decide how to fix things so that the DL consumer can process the event.

The issue with this approach is that we have to consume events in a particular order, and redirecting only one event to the DL topic would ruin that. So we also have to send to the DL topic all subsequent events that have the same partition key. In our case, that key is the user UUID, which makes sense — there is no point in trying to process, for example, a “user-update” if a previously published “user-create” failed. The “user-update” will just sit in the DL topic and wait until the “user-create” is processed.
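
A sketch of this rule, with hypothetical names, could look like the following; in practice the set of "poisoned" keys would have to be persisted rather than kept in memory:

```php
<?php
// Hypothetical sketch of the "follow the key" rule: once an event for a given user
// UUID has been redirected to the DL topic, every later event with the same key is
// redirected too, so the per-user ordering is preserved. The in-memory array is a
// simplification; real state would live in the database or another durable store.

final class DeadLetterRouter
{
    /** @var array<string, true> keys already redirected to the DL topic */
    private array $poisonedKeys = [];

    public function __construct(private EventProducer $dlProducer, private string $dlTopic) {}

    public function isPoisoned(string $key): bool
    {
        return isset($this->poisonedKeys[$key]);
    }

    public function redirect(string $key, array $payload): void
    {
        $this->poisonedKeys[$key] = true;
        $this->dlProducer->publish($this->dlTopic, $key, $payload);
    }
}

// In the consumer loop, before handling:
// if ($router->isPoisoned($message->key)) {
//     $router->redirect($message->key, json_decode($message->payload, true));
//     $consumer->commit($message);
//     continue;
// }
```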

A developer can change the behavior of the DL consumer just by injecting a new configuration into the container (thanks to our in-house library and bundle). So in the end, we could fix an issue not by changing the existing code, but by extending it — i.e. by injecting new processing logic into some consumer. As a developer, I was really happy with that approach.

This way we keep the primary topic healthy and don't block it with occasional failing events. However, if we encounter a situation where no event can be processed, we stop publishing them to the DL topic. For the dev team, it means that something is broken in the whole system and not just for a particular user. In this case, it is expected that the primary user topic will be blocked, and there is no point in trying to republish everything to the DL one. We achieve this by simply counting how many events we have published to the DL topic.

To sum up

It turned out that these two failure-handling flows were crucial for migrating to a microservice architecture. Around them, we developed a mini-framework that consists of monitoring for failures, CLI commands to deal with failed messages, and even step-by-step human instructions on how to fix issues. That saved our lives when we migrated the app to production. We had failures, but they were fixed quickly and easily, without any major impact on the product or our mental health.

Developing a robust failure-handling system is a must. It should be taken very seriously.

How to onboard the dev team

The team that did the research, development, and migration to the microservice architecture consisted of three developers, while the overall number of developers was five times larger. Of course, we needed to onboard the other developers, because they had been busy developing business features for the product and could not be fully involved in the migration.

Documentation

This is a must. And not just describing the system, but also providing realistic cases that are expected to occur in practice and explaining how to approach them. To do so, it is better to write the documentation while developing new mechanisms and adopting new tools. Things are forgotten really quickly.

Workshops

We also conducted a couple of presentations and workshops where all the critical information was delivered. These sessions were recorded so they could be used as a guide later. This was also valuable because the migration team received some fresh ideas on the architecture in return.

Of course, these are more or less common practices in a tech company — creating documentation and sharing knowledge. But in our case, each member of the dev team needed to be as confident as possible when dealing with the new items in the tech stack.

Invest time in onboarding all the dev team members to the new architecture.

Consequences

So, we successfully went live with our micro/macro-service architecture. The dev team was split in half, and each new team received a clone of the previously shared GitHub repository. It was really fun dropping the enormous amounts of code that each team no longer used. I don't remember any critical issues with the new architecture: every potential problem had been listed, researched, and mitigated beforehand. From time to time our consumers failed, but with our failure-handling framework, that was not a huge issue for us. Approximately 50% of the time, the issue was on the database side and could be fixed just by correcting the data, with some follow-up tickets to avoid the problem in the future.

It is fun to compare how stressed we were the first night after the migration with how we reacted to problems six months later. On the night we migrated, those who preferred to go to sleep late stayed up two hours longer, and those (like me) who preferred to wake up early woke up two hours earlier, just to check the metrics and look for potential errors (there were none, by the way). Half a year after the release, our largest problem was the occasional typical error, and the amount of stress over it was no different from any other error. We had successfully covered all the major “what could go really, really wrong” situations.

Image by AI illustrating two happy independent dev teams.

The greatest part came when we launched a new product: we already had an architecture developed to fit our dev teams. So yeah, we launched the new product with the “microservice” architecture already underneath, because it had become normal for us to work this way.

Psychologically, it can be tough for developers to make such a change, because they start dealing with new technologies in production. Eventually, however, it becomes as routine as working with, for example, MySQL or Redis.

I hope this article will be helpful to other teams that are considering splitting their monolith. Cheers!

Andrii Popov

Software dev, PhD and amateur cook from Kyiv, Ukraine