Wrong Ways of Defining Service Boundaries
In the last post I wrote that to beat complexity you need to split the monolith. In this post I’ll describe the wrong ways of doing this. I can go even further and call them anti-patterns of monolith splitting, or anti-patterns of defining service boundaries.
Wrong reuse
The developers of one project I’ve seen set the goal of making their services as reusable as possible. Some of the services were defined by all the nouns they could identify in their domain. These services’ interfaces were CRUD-ish: get-by-some-clause queries and updates. Logic corresponding to atomic, reusable operations was located in separate services that communicated with those “noun-services”, or “entity-services”. These “operation-services” were used in higher-level services, each corresponding to a business process. Those were also supposed to be reused, completing a developer’s paradise.
Here is what the air ticket booking business process looked like:
Now let’s take a look at the flight date changing process:
Two operation-services are “reused” in these business processes. Now let’s see what a static view of the system looks like when following this approach:
So what are the drawbacks here?
- Very tight coupling. If one service changes, you need to test the whole system.
- Such services are very fine-grained, so there is a lot of internal communication.
- Because they are fine-grained, there are a lot of services. The system gets hard to understand, and requests get harder to trace.
- Entity-services are poorly encapsulated: no business rules are checked there, because that logic lives in the operation-services. So any service can call any entity-service and update its data with the generic update query present in its interface. Such entity-services are often referred to as data-centric, as opposed to process-centric or behaviour-centric services.
- Communication between these services is synchronous by nature, so the chances are that the transport chosen would be HTTP. Hence all its drawbacks, which I’ll talk about a bit later.
The goal of such a split is reuse, but in practice, besides the aforementioned problems, this reuse just doesn’t work. As soon as service “A” serves one more client with its own requirements and expectations, the chances are that service “A” will require changes.
Take a reflective look at your code and think about how many problems striving for reuse has caused. Quite a few, actually, enough for code reuse to be called a fallacy. Service reuse is a fallacy as well.
I’m a bit alarmed about what Sam Newman wrote in his book Building Microservices: Designing Fine-Grained Systems:
One of the key promises of distributed systems and service-oriented architectures is that we open up opportunities for reuse of functionality. With microservices, we allow for our functionality to be consumed in different ways for different purposes.
I really hope that he didn’t mean the architecture I’ve just described. Or at least I hope people don’t read those lines in a way that results in such an architecture.
Blurred service responsibilities
Each service must have a specific responsibility that every developer is aware of and fully understands. Otherwise conceptual integrity is lost, which Frederick Brooks, author of The Mythical Man-Month, considers the main characteristic of any system.
But the responsibilities are not always neat and clear. I see a couple of reasons.
Before talking about the first one I should mention what business-architecture is. Here is what Wikipedia says:
Business architecture is defined as “a blueprint of the enterprise that provides a common understanding of the organization and is used to align strategic objectives and tactical demands.”
Put another way, it is what the enterprise does, how it achieves it, what the communication paths between organisational units are, and what its driving forces, business rules and business policies are.
So one of the reasons for an enterprise’s lost conceptual integrity is a blurry business architecture. It’s just hard to identify specific service responsibilities under such circumstances. As a result, a strictly technical approach is applied, which has little chance of meeting the real business requirements. So it inevitably leads to blurred service responsibilities.
The second reason is that the IT department might know nothing about the business architecture, provided it is actually defined, of course. This common problem is a consequence of poor communication between business guys and IT guys.
And finally, IT might be aware of the business’s long-term plans and strategies but nonetheless think that they just don’t matter, that they don’t affect the technical service boundaries.
All this contradicts the Holy Grail of any enterprise: business-IT alignment. What are the consequences?
Let’s take a look at an over-simplified image that any enterprise’s CEO can easily conjure up in their head. It represents the communication between the main parts of the enterprise. Actually there are plenty of ways to visualise business architecture; this is just one of them.
Now let’s take a look at IT-architecture expressed via technical services:
Apparently their boundaries don’t coincide. Business-architecture services and IT-services have different underlying data, different driving forces, different functionality and different communication paths. What happens when (when, not if) the CEO says that the functionality represented by “Business-functionality 1” is going to be outsourced? We’ll face the problem of splitting the technical service “Technical functionality 1”.
And what happens when we need to make some changes in the area marked in the image? Yes, right on the border of two technical services?
It will lead to tight coupling between “Technical functionality 2” and “Technical functionality 3” right in this area. At the end of the day our distributed system turns out to be a distributed monolith.
Services with synchronous communication
Communication is considered synchronous (or blocking) when the service sending the request waits for the receiving service to complete it. Why is it bad? There is a whole bunch of problems.
It’s not reliable
If just one service is down, the whole system is down.
One can even calculate how much less available the system gets depending on the number of services, if no additional measures are taken of course, e.g. network errors are not caught and failed requests are not retried.
Let’s say the probability of any one service being down, in a system containing N services, is P. Assuming the services fail independently, the probability that none of the services is down, and hence the system is up and running, is (1 − P)ᴺ. So the probability that at least one service is down is 1 − (1 − P)ᴺ. For example, when there are two services, i.e. N = 2, the ratio of the probability that a system containing 2 services fails to the probability that a system containing 1 service fails is
(1 − (1 − P)²) / P = 2 − P. Provided P is small, the system will be unavailable about twice as often as before the split. In general, the “failure ratio” as a function of the number of services N looks like f(N) = (1 − (1 − P)ᴺ) / P.
Here is its graph for P = 0.01%. As we can see, the dependency is almost linear for feasible N:
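To make the numbers tangible, here is a minimal Python sketch of the failure ratio f(N) defined above. The function name `failure_ratio` and the chosen values of N are my own, picked only for illustration; P is the 0.01% used for the graph.

```python
def failure_ratio(n: int, p: float) -> float:
    """f(N) = (1 - (1 - P)**N) / P: how many times more often a system of
    n services is unavailable compared with a single service, assuming
    independent failures and no retries."""
    return (1 - (1 - p) ** n) / p


P = 0.0001  # 0.01 %, expressed as a probability

for n in (1, 2, 5, 10, 50, 100):
    # For small P the ratio stays close to n, i.e. almost linear.
    print(n, round(failure_ratio(n, P), 3))
```

For N = 2 the value agrees with the closed form 2 − P derived above, and for N = 100 it is still only slightly below 100.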
The system might end up in inconsistent state
Well, there are two options here. Either you fight it or deal with it. In case you wanna fight you can use distributed transactions.
First let’s take a look at the classics, 2-phase commit transactions. Here are their drawbacks:
a. 2-phase commit transactions are inherently brittle. For example, one of the participants might fail, in either of the two phases. Or there can be network problems between the coordinator and one of the participants. As a result the distributed transaction will hang. You can of course solve this with timeouts, but how do you choose one? It shouldn’t be too short, as the transaction needs some time to be processed. It shouldn’t be too long either, as we don’t want the whole business operation to hang. So this is the fundamental problem of 2-phase commit: it’s impossible to distinguish a failure from a long processing time.
b. Availability decreases. If one of the sub-transactions is not successful then the whole transaction is not successful either. And, as a result, neither is the business operation.
Not to mention that the 2PC coordinator is a single point of failure.
c. It adds operational complexity. Ask your system administrators about that.
d. Network communication grows. Firstly, because communication with the participant databases goes through the coordinator. Secondly, there is by definition more than one database in a distributed transaction. And thirdly, any 2-phase commit transaction obviously has two phases. As a result total throughput decreases and latency increases, which creates scaling problems. Suppose you don’t change the way that concrete functionality operates (the Y-axis of scalability), that is, you cling to 2PC, and the issue has nothing to do with partitioning or splitting anything (the Z-axis of scalability). Then you’re left with only one axis: cloning, the X-axis. And it’s worthless here as well, because no matter what and how you duplicate, the latency is still there. It might go down if you buy some expensive servers, but you can’t solve this issue by X-scaling forever.
e. A resource lock is obtained during the first phase and released in the second. As a result the lock lifetime is several times as long as when you don’t use distributed transactions. Besides that, the 2PC coordinator might fail at any time. If it fails during the first phase, all the resources that were locked stay locked. If it fails during the second phase, all the resources not yet released because their transaction was not committed stay locked as well. And locks are one of the major scalability problems, because no other transaction can operate on locked resources. Imagine a command that locks some resource for 100 ms. Then no more than 10 such commands per second can be processed on that resource. Adding more machines beyond that just doesn’t make any sense.
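The drawbacks in points b and e can be seen in a toy model. What follows is only a sketch under my own assumptions, not a real 2PC implementation: `Participant`, `two_phase_commit` and the `coordinator_crashes_after_prepare` flag are hypothetical names, everything runs in one process, and a coordinator crash is simulated with an exception.

```python
class Participant:
    """Hypothetical in-memory participant of a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.locked = False
        self.committed = False

    def prepare(self) -> bool:
        # Phase 1: lock the resource and vote.
        self.locked = True
        return self.can_commit

    def commit(self):
        # Phase 2: apply the change and release the lock.
        self.committed = True
        self.locked = False

    def rollback(self):
        self.locked = False


def two_phase_commit(participants, coordinator_crashes_after_prepare=False) -> bool:
    votes = [p.prepare() for p in participants]
    if coordinator_crashes_after_prepare:
        # Nobody tells the participants what to do next: locks stay held.
        raise RuntimeError("coordinator crashed between the phases")
    if all(votes):
        for p in participants:
            p.commit()
        return True
    # A single "no" vote fails the whole transaction.
    for p in participants:
        p.rollback()
    return False


seats, payments = Participant("seats"), Participant("payments")
try:
    two_phase_commit([seats, payments], coordinator_crashes_after_prepare=True)
except RuntimeError:
    print(seats.locked, payments.locked)  # both resources are still locked
```

A real protocol would add timeouts and recovery here, which brings back the question from point a: how long is long enough?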
For the sake of fairness I should mention other protocols used in distributed transactions. Below I’ll list their advantages compared to 2PC.
If you use 3-phase commit, any participant can complete the distributed transaction. So we spare ourselves the main concern of 2PC: a single node failure blocking everyone. This is achieved through an additional step, the precommit phase. After it all the participants are aware of the outcome they’ve voted for. Thus the problem of forever-locked resources is solved. This is good for scalability, but latency increases as there are three steps, not two.
Problems might arise in case of a network partition, though. Consider the situation where all the participants who received the “prepare to commit” message end up in one partition and all the participants who didn’t receive it end up in another. Each of these partitions either commits the transaction or rolls it back. And as they don’t know of each other, the chances are that the transaction commits in the first partition and rolls back in the second, or vice versa. As a result the system’s state would be inconsistent.
And Paxos is the one that solves this problem. Here is where it differs from the previous protocols:
a. Availability is better. It is enough that a majority of the participants, not every single one of them, acknowledge an agree message. The same holds for the second phase. There is no single coordinator either, which gives the following advantage:
b. No forever-locked resources.
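The availability difference comes down to the acknowledgement rule. Here is a tiny sketch of just that rule, not of Paxos itself; the function names are mine and purely illustrative.

```python
def majority(total_participants: int) -> int:
    """Smallest number of participants that forms a majority."""
    return total_participants // 2 + 1


def majority_round_succeeds(acks: int, total_participants: int) -> bool:
    # Majority-based rule: a round succeeds once most participants acknowledge.
    return acks >= majority(total_participants)


def unanimous_round_succeeds(acks: int, total_participants: int) -> bool:
    # 2PC-like rule for comparison: every participant has to answer.
    return acks == total_participants


# With 5 participants and 2 of them unreachable, only 3 acknowledgements arrive.
print(majority_round_succeeds(3, 5), unanimous_round_succeeds(3, 5))
```

So a minority of unreachable participants doesn’t stop the majority-based round, while the unanimous rule is blocked by a single missing answer.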
Here is a nice description of these three types of distributed protocols.
Besides that, there is ongoing work on minimizing the number of messages exchanged while still providing strong consistency with distributed transactions.
If you don’t want to deal with distributed transactions, you should write request-resending logic, probably circuit-breaking logic, and most probably logic for rolling back the actions that already happened in the available services. The chances are that under these circumstances, when you don’t rely on the system’s resilience but, on the contrary, expect failure, this rollback logic will be business-driven. You will handle errors that business guys understand, not just purely technical errors like “service unavailable”.
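Here is a rough sketch of what such resending and rollback logic could look like. All names (`ServiceUnavailable`, `call_with_retry`, `run_business_operation`, the seat and card functions) are hypothetical, the services are plain in-process functions, and circuit breaking and back-off delays are left out to keep it short.

```python
class ServiceUnavailable(Exception):
    pass


def call_with_retry(action, attempts=3):
    """Resend the request a few times before giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except ServiceUnavailable as error:
            last_error = error  # a real client would wait before retrying
    raise last_error


def run_business_operation(steps):
    """steps is a list of (action, compensation) pairs.
    If a step keeps failing, undo the completed ones in reverse order."""
    completed = []
    for action, compensation in steps:
        try:
            call_with_retry(action)
        except ServiceUnavailable:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []

def reserve_seat():
    log.append("seat reserved")

def release_seat():
    log.append("seat released")

def charge_card():
    raise ServiceUnavailable("payment service is down")

def refund_card():
    log.append("card refunded")

succeeded = run_business_operation([(reserve_seat, release_seat),
                                    (charge_card, refund_card)])
print(succeeded, log)
```

Note that the compensation here, releasing the seat, is a business action rather than a technical rollback, which is the point made above.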
Synchronous communication is resource-intensive
The first service receiving the request waits for a response from the second service, which waits for a response from the third service, etc. So the first service waits as long as the rest of the services do their job. The situation gets even worse if some service slows down. As a result the whole system slows down and its throughput decreases dramatically as available resources just run out.
Since communication is synchronous you need to scale every single service. And the services receiving requests at the system’s entry point wait the longest, as I showed above, so they need scaling more than other services.
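A bit of arithmetic shows why the entry-point services suffer most. This is a simplified sketch assuming a straight chain of blocking calls; the function name and the millisecond figures are made up for illustration.

```python
def busy_time_ms(own_work_ms):
    """own_work_ms[i] is the processing time of the i-th service in a chain
    where each service synchronously calls the next one. Returns how long
    each service is occupied: its own work plus everything downstream."""
    total = 0
    busy = []
    for work in reversed(own_work_ms):
        total += work
        busy.append(total)
    return list(reversed(busy))


print(busy_time_ms([20, 30, 50]))   # [100, 80, 50]
print(busy_time_ms([20, 30, 500]))  # [550, 530, 500]
```

In the second call only the last service slowed down, yet the first service is now tied up for 550 ms per request instead of 100 ms.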
8 fallacies of distributed computing
I should mention them anyway, but I won’t delve into this; I’ll just give the floor to Arnon Rotem-Gal-Oz, who wrote a detailed review.
Services with asynchronous communication
Such services solve only the technical problems inherent to synchronous communication, such as resilience and resource cost. But there are more serious problems, such as logical coupling. It is exactly the same problem command messages suffer from, so let me talk about it in the next chapter.
Services with command messaging communication
So hopefully we’ve found out that synchronous communication, usually implemented with HTTP, is bad. Let’s see what messaging infrastructure has to offer. In their book, Enterprise Integration Patterns, Hohpe and Woolf talk about three kinds of messages: command message, event message and document message.
Command messaging lacks the aforementioned drawbacks of synchronous communication, and it is very close to the mindset of those who are used to synchronous communication. But command messaging shares one problem with synchronous and asynchronous communication: the resulting services are tightly coupled.
When service A tells service B to do something, service A is obviously aware of service B. So if service A had to tell service C to do some job too, we’d surely have to modify service A.
Besides that, service A expects some behavior from service B. By the very nature of such communication, service B performs some job in the context of service A. So if the requirements for service A change, the chances are that we’ll have to modify service B as well.
Now let’s say that service D wants to use service B’s functionality. But service D has its own context and its own requirements for service B. Very likely service B would need some modifications to satisfy service D’s requirements. After they are completed we need to make sure that the changes didn’t break service A’s functionality.
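The coupling is easy to spot even in a toy example. Below is a minimal sketch with a hypothetical in-memory bus standing in for a real messaging infrastructure; `InMemoryBus`, the recipient name "service-b" and the command payload are all assumptions made for illustration.

```python
class InMemoryBus:
    """Hypothetical stand-in for a messaging infrastructure."""

    def __init__(self):
        self._handlers = {}

    def register(self, recipient, handler):
        self._handlers[recipient] = handler

    def send_command(self, recipient, command):
        # A command is addressed to one specific recipient.
        self._handlers[recipient](command)


class ServiceB:
    def __init__(self):
        self.jobs = []

    def handle(self, command):
        # Service B does a job in the context of whoever sent the command.
        self.jobs.append(command["job"])


class ServiceA:
    def __init__(self, bus):
        self._bus = bus

    def do_something(self):
        # Service A names service B explicitly. Making service C do a job
        # as well means adding another send_command call right here.
        self._bus.send_command("service-b", {"job": "do-this-for-A"})


bus = InMemoryBus()
service_b = ServiceB()
bus.register("service-b", service_b.handle)
ServiceA(bus).do_something()
print(service_b.jobs)
```

The transport became asynchronous-friendly, but service A still knows who does the job and what exactly it should do.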
Centralized data
Sometimes a system takes a form not as perverted as in the “Wrong reuse” chapter but similar in spirit. The point is that all the data relating to some (usually noun-like) concept is located in a single place, in a single service. An attempt to decentralize it, to place it into the services that operate on this data and are truly its owners, is considered a vice. Very often I hear the argument that it violates the Single source of truth principle. Or, “Why should I split the data that is related to one concept?” But usually these folks don’t even know what they are talking about, as the principles of correct service boundary identification still apply. I’ll talk about that a bit later. For now, though, let’s take a look at the downsides of such an approach:
- When you change the logic of a service with centralized data, the chances are that you’ll break other services that query it. And usually there are quite a few clients.
- When one service is responsible for storing, creating, updating and representing data for a lot of services, complexity hits the ceiling.
- These services are inherently synchronous, as they are often used as data providers. I explained why this is an anti-pattern earlier, in the “Services with synchronous communication” chapter.
- Very likely you’ll need to update several entities in one business operation. And the chances are that you’ll end up with distributed transactions spanning several entities, hence several services. I wrote about them earlier as well.
- Pretty much the same point: very likely you’ll need to get data related to several entities. And it is very likely that they will be located in different services, which results in chatty communication.
I think a good service and a good class have something in common. One of the common traits is that data and behavior are put in the same place. This makes it impossible to mutate data at random, bypassing the behavior provided by the class’s interface. The exact same principle applies to a service.
By centralizing our data we CRUD-ify the service’s interface. We split the behavior from its data, turning a centralized data service into a database. We ignore the best practices of complex system design established since the Smalltalk days, dooming ourselves to procedural programming hell, but system-wide.
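The difference between a CRUD-ified interface and one that keeps data and behavior together can be sketched with two small classes. Both are hypothetical illustrations of mine (a seat inventory with invented names and rules), not code from any of the systems described here.

```python
class SeatInventoryCrud:
    """Data-centric interface: any caller can write any value."""

    def __init__(self):
        self.rows = {}

    def update(self, flight, fields):
        # No business rule is checked here; callers are trusted blindly.
        self.rows.setdefault(flight, {}).update(fields)


class SeatInventory:
    """Behaviour-centric interface: the rule lives next to the data."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._reserved = 0

    def reserve(self, seats):
        if seats <= 0 or self._reserved + seats > self._capacity:
            raise ValueError("not enough free seats")
        self._reserved += seats

    @property
    def free_seats(self):
        return self._capacity - self._reserved


crud = SeatInventoryCrud()
crud.update("XY-100", {"free_seats": -5})  # nonsense, yet accepted

inventory = SeatInventory(capacity=2)
inventory.reserve(2)
try:
    inventory.reserve(1)  # rejected by the rule inside the class
except ValueError as error:
    print(error)
```

With the first interface every caller has to remember the rule; with the second there is simply no way around it.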
A vivid example, an edge case of such a system, is a monolith. Usually one concept is represented by one class and one database table. So there are a lot of independent scenarios in which you need to mutate the same database record: physically, but not logically, the same unit of data. In a monolith where everything is intertwined this can quickly get out of control. The result is scaling anemia. No matter how many application nodes or db-instances you add, it just doesn’t help, as a certain piece of data is locked and requests simply cannot be processed in parallel.
An orchestrating service looks like a class with a lot of dependencies, where each dependency has only one implementation: the invoked service itself. As a result this service is tightly coupled with all the services it interacts with, resulting in chatty communication, very likely implemented with synchronous request-reply.
What happens to this governing authority when the number of services it orchestrates rises to ten? Twenty? One hundred? Business logic will inevitably leak into it. The very fact that this competence center decides which service to invoke and when already says as much. Do we really need such a complex service, where the cost of an error is so high? An error that can bring the whole system down! On the other hand, if we deliberately make it simple, with no business logic at all, the valid question to ask is: what do we need it for, then? So we find ourselves in a catch-22.
Besides that, the logic is inevitably spread over two services, the central governing service and the invoked service, because of the problem I’ve just mentioned: the synchronous nature of communication between them. For example, if you need to add a service (quite likely, by the way), you have to change the competence center. Here is a great demonstration of that.
Actually such a scheme reminds me of smart pipes, the kind of communication that Martin Fowler argues against. After all, the single purpose of such a service is message routing, but by definition it contains some logic. Not good.
I’ll let Nic Ferrier hit the last nail in the orchestration coffin.
System-wide Finite state machine
Usually this approach goes hand in hand with orchestration. The point is to define a separate service that tracks the state of some entity, or of a bunch of entities. This service is aware of the entity’s current state. It knows which state can come next and which cannot. So the behavior of such a service consists in changing the entity’s state in reaction to an inbound request, be it synchronous or asynchronous, a message or an event. All the other services do their job when the FSM-service initiates it and inform it upon finishing, instead of each managing its own subset of the possible entity states.
Sometimes it is really close to orchestration, so other names that I use for this kind of service are system-wide Process-manager or system-wide workflow.
Pretty often this approach is combined with the examples I gave in the “Wrong reuse” chapter, where some of the services are defined by nouns, and in the “Centralized data” chapter, where such a centralized data service might be a global Finite-state machine or a global Process-manager.
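To picture such a service, here is a minimal sketch of a system-wide state tracker. The booking states, the transition table and the class name `BookingStateService` are hypothetical and serve only as an illustration of the approach, not as a recommendation.

```python
# Hypothetical transition table: which state may follow which.
ALLOWED_TRANSITIONS = {
    "created": {"paid", "cancelled"},
    "paid": {"ticketed", "cancelled"},
    "ticketed": set(),
    "cancelled": set(),
}


class BookingStateService:
    """Knows the current state of every booking and polices transitions."""

    def __init__(self):
        self._states = {}

    def register(self, booking_id):
        self._states[booking_id] = "created"

    def transition(self, booking_id, new_state):
        current = self._states[booking_id]
        if new_state not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"{current} -> {new_state} is not allowed")
        self._states[booking_id] = new_state

    def state_of(self, booking_id):
        return self._states[booking_id]


fsm = BookingStateService()
fsm.register("booking-1")
fsm.transition("booking-1", "paid")      # e.g. reported by a payment service
fsm.transition("booking-1", "ticketed")  # e.g. reported by a ticketing service
print(fsm.state_of("booking-1"))
```

Every other service has to report to this one, so knowledge about payment and ticketing states ends up in a single place instead of in the services that own that behavior.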
I assume that at the end of the day correct system boundaries can take the form close to the aforementioned one. But it is absolutely wrong to start identifying service boundaries from this perspective.
Defining service boundaries along organisational structure
It makes sense to use an organizational structure for finding service boundaries only if this structure has clear responsibilities. On the one hand, an organizational structure is created to solve business problems effectively. On the other, quite often the selfish interests of conceited people interfere, a bunch of opposing clans appears, backstabbing takes place, intrigues are all over the place, etc. In other words, politics. Besides that, business units might change, new units might appear, some might be got rid of, while the organizational structure often stays the same or changes at a slower pace. Moreover, some lines of business can be outsourced. So service boundaries based on this brittle, ever-changing characteristic will not be stable.
Taking that into consideration, we should not start by blindly defining our service boundaries along the organizational structure, though it can give us some clues.
Keep Conway’s law in mind and use it for your own benefit. Understand your domain, identify its communication paths, and change the organizational structure if you need to and if you can. Only after that define your service boundaries based on the organizational structure.
Defining services around layers
The main goal of splitting the monolith into services is to make the whole system less coupled. Firstly, if we divide our system into services around layers we won’t make it less coupled: the volume of communication between services won’t change, but it will now go over the network. Secondly, from the organizational point of view it doesn’t seem smart either: if you want to add a field to a UI form, the chances are you’ll need to modify all the layers: the UI itself, the controller, the business logic and the data storage. Now, provided that your teams are aligned around services (which, with correct service boundaries, is a good idea), consider the amount of cross-team communication and the probability of misunderstandings. It’s significant.
In general, the main characteristics that classes and services must possess are the same: both should be loosely coupled and highly cohesive. So given the tight coupling inherent to layers, I won’t even try to divide them. There is no sense in trying to divide the indivisible. Think of vertical slicing instead. Following this approach you get cohesive pieces of functionality with all the layers in them, from UI to data storage.
- Don’t start with reuse in mind.
- Get to know your domain.
- Don’t do synchronous communication. If you think you really need it, chances are that your boundaries are wrong.
- Don’t do asynchronous communication.
- Don’t do command messaging communication.
- Don’t centralize your data.
- Don’t orchestrate your services.
- Don’t start identifying your boundaries with Finite-state machines in mind.
- Don’t trust an organizational structure, it might be flawed.
- Don’t create your services around layers.
In order to understand what criteria you should use to define your service boundaries I’ll first talk about characteristics I want my services to possess. Don’t miss my next post.