The story of a micro-service transformation at Hepsiburada

Cem Başaranoğlu · Published in hepsiburadatech · Dec 21, 2020
the erebor

A few months ago, we launched our new checkout ecosystem, “Erebor”, at Hepsiburada. In this article, I will talk about what we experienced during this micro-service transformation.

“A software ecosystem is the interaction of a set of actors on top of a common technological platform that results in a number of software solutions or services.” Software Ecosystem: Understanding an Indispensable Technology and Industry by David G. Messerschmitt and Clemens Szyperski

I prefer to use “ecosystem” to describe such large structures because of their depth, capabilities, and scope. That is exactly why I use this term in a few places in the rest of the article.

strong together

We work as vertical product teams at Hepsiburada. Each of these product teams works autonomously, like an independent startup, and together they build Hepsiburada. The Checkout team is just one of thirty product teams. We try to make our customers’ lives easier by managing the products and technologies behind the process that starts with the basket flow and ends when the order is completed. With Erebor, we divided our monolithic checkout system into nearly 40 micro-services. I will try to share what we learned on this journey in three chapters.

  • CHAPTER ONE: DEFINITION OF THE PROBLEM
  • CHAPTER TWO: CHOOSING THE RIGHT TECHNOLOGY SPECS
  • CHAPTER THREE: THE COMPLEXITY BEHIND THE SIMPLICITY

CHAPTER ONE: DEFINITION OF THE PROBLEM

In fact, every good solution starts with defining the problem well. The “ultimate” correctness of a solution depends on how precisely the problem is defined. Based on this, we gathered our problems under four different topics.

  • There is no bridge between the problem space and the solution space
  • Different loads in same spot
  • Survive in dependency hell
  • Stuck at the borders

We improved our entire ecosystem by solving each of these problems.

There is no bridge between the problem space and the solution space

problem and solution spaces

The problem space contains all stakeholders of a product. These stakeholders have comprehensive knowledge of their product and can produce many non-technical solutions related to it. The solution space, on the other hand, is the space in which engineers in different roles produce technical solutions. Over time, the productivity of our product and the quality of our code-base were adversely affected by the complex communication problem between these two spaces. Unfortunately, complexity became the standard for us rather than an option against simplicity.

DDD is a bridge between problem space and solution space

So, we used Domain-Driven Design (DDD) to solve this complex communication problem. In fact, we could have preserved our code-base for a certain period of time by applying different refactoring methodologies or extreme programming practices, but all such resistance mechanisms lose their effectiveness over time as long as there is no common language between these two spaces. All the resources about DDD led us to the following truth.

DDD builds a solid bridge between these two spaces by using the ubiquitous language.

There are many ways to design the ubiquitous language. We chose “the domain storytelling method” invented by Stefan Hofer and Henning Schwentner because it uses imagination and stories to simplify even very complex problems.

The best way to learn a language is to listen to other people speak that language. Try to repeat what you hear and mind their feedback. Gradually, you will progress from individual words to phrases and to complete sentences. The more you speak, the faster you will learn.

We established four golden rules to keep it all on track when applying DDD.

  • Learn the business first, and then design the domain.
  • Write your code the way you speak.
  • Don’t worry about tactical design; focus on strategic design first.
  • Use fitness functions as much as possible.
Tools of DDD

We learned that mistakes made in tactical design could be resolved quickly. However, mistakes made in strategic design unfortunately led to an anemic domain model. We have also implemented as many fitness functions as possible in order to keep the system robust, and we are currently trying to integrate the concept of “fitness function-driven development” into our system.
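
As a hedged illustration of what such a fitness function can look like (this is not our actual test suite, and the package paths are hypothetical), an architectural rule can be written as a plain Go test that fails the build when a domain package imports infrastructure code:

```go
package architecture_test

// Sample architectural fitness function: the basket domain package must not
// depend on infrastructure concerns (databases, messaging, drivers).
// All paths below are hypothetical; adjust them to the real module layout.

import (
	"go/parser"
	"go/token"
	"path/filepath"
	"strings"
	"testing"
)

var forbiddenPrefixes = []string{
	"erebor/internal/infrastructure", // hypothetical infrastructure packages
	"github.com/gocql/gocql",
	"go.mongodb.org/mongo-driver",
}

func TestBasketDomainHasNoInfrastructureDependencies(t *testing.T) {
	files, err := filepath.Glob("../internal/basket/domain/*.go") // hypothetical path
	if err != nil || len(files) == 0 {
		t.Skipf("no domain files found: %v", err)
	}
	fset := token.NewFileSet()
	for _, file := range files {
		f, err := parser.ParseFile(fset, file, nil, parser.ImportsOnly)
		if err != nil {
			t.Fatalf("parse %s: %v", file, err)
		}
		for _, imp := range f.Imports {
			path := strings.Trim(imp.Path.Value, `"`)
			for _, forbidden := range forbiddenPrefixes {
				if strings.HasPrefix(path, forbidden) {
					t.Errorf("%s imports forbidden package %s", file, path)
				}
			}
		}
	}
}
```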

As a result of all this, we divided our system into four main domains.

  • Basket
  • Delivery
  • Payment
  • Snapshot

The Domain Design (sample)

First of all, we designed our ubiquitous language. Then we determined our bounded contexts for each domain.

After deciding on our bounded contexts, we modeled the communication between these contexts using context mapping tools.

After the strategic design, we started to use tactical design tools. We defined our aggregates, entities, value objects, repositories, factories, and services.
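
As a minimal, hypothetical sketch (not the production model), the tactical building blocks above can be expressed in Go roughly as follows; the type names, fields, and invariants are assumptions made only for illustration:

```go
package basket

// Illustrative tactical design building blocks: an aggregate root, an entity,
// a value object and a repository. All names and rules are hypothetical.

import (
	"context"
	"errors"
)

// Money is a value object: immutable and compared by value.
type Money struct {
	Amount   int64 // minor units
	Currency string
}

// LineItem is an entity that only lives inside the Basket aggregate.
type LineItem struct {
	SKU      string
	Quantity int
	Price    Money
}

// Basket is the aggregate root; all invariants are enforced here.
type Basket struct {
	ID    string
	Items []LineItem
}

var ErrInvalidQuantity = errors.New("quantity must be positive")

// AddItem enforces the aggregate's invariants before mutating state.
func (b *Basket) AddItem(item LineItem) error {
	if item.Quantity <= 0 {
		return ErrInvalidQuantity
	}
	b.Items = append(b.Items, item)
	return nil
}

// Repository abstracts persistence away from the domain model.
type Repository interface {
	Get(ctx context.Context, id string) (*Basket, error)
	Save(ctx context.Context, basket *Basket) error
}
```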

Different loads in same spot

Our monolithic checkout service was used by about 10 different teams. Any availability problem in our service directly affected all of these teams. The metrics we collected led us to the following conclusions.

  • There is a large difference between the number of reads and writes (roughly 10/3), so we must scale the two sides independently.
  • Performance is critical. We can optimize the read and write sides independently, and we must support a lot of parallel operations on the same set of data.
  • We must normalize the write database and make writes efficient, but we don’t need normalized data on the read side (projection-per-client / projection-per-business).

Based on this data, we decided to implement CQRS with Event Sourcing. As you know, there are three types of CQRS.

  • A separated class structure using domain classes for commands and DTOs for returning read data, which introduces some duplication.
  • A separated model with different APIs and models for reads and writes respectively. In addition to optimized queries, this also enables caching, making it interesting for a high read load.
  • Separated storage optimized for queries, enabling even more scaling of reads and separate types of storage for writing and querying, e.g. a relational database and a NoSQL store. Synchronization of the read storage commonly runs in the background, causing eventual consistency on the read side. Together with the best scalability, this pattern also brings the highest complexity.
Separated storage

We chose the separated storage strategy because it solves our problems more cleanly than the others. But you know, ‘Everything in life is a trade-off.’
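
As a hedged sketch of the separated storage flavour (not the actual Erebor code), a command handler appends events to a write-side event store while a background projector folds them into a denormalized read model; all names and store implementations below are hypothetical:

```go
package cqrs

// Minimal CQRS-with-separated-storage sketch: the write side appends events,
// the read side serves projections that a projector keeps up to date.

import "context"

// Event is a domain event produced by the write side.
type Event struct {
	AggregateID string
	Type        string
	Payload     []byte
	Version     int
}

// EventStore is the write-side storage (Cassandra, in Erebor's case).
type EventStore interface {
	Append(ctx context.Context, events ...Event) error
	Load(ctx context.Context, aggregateID string) ([]Event, error)
}

// ProjectionStore is the read-side storage holding denormalized views.
type ProjectionStore interface {
	Upsert(ctx context.Context, aggregateID string, view []byte) error
	Get(ctx context.Context, aggregateID string) ([]byte, error)
}

// CommandHandler handles write requests and never reads from projections.
type CommandHandler struct{ Events EventStore }

func (h *CommandHandler) AddItemToBasket(ctx context.Context, basketID string, payload []byte) error {
	e := Event{AggregateID: basketID, Type: "ItemAddedToBasket", Payload: payload}
	return h.Events.Append(ctx, e)
}

// Projector rebuilds the read model from events, which makes the read side
// eventually consistent with the write side.
type Projector struct {
	Events      EventStore
	Projections ProjectionStore
	Build       func(events []Event) []byte // folds events into a view
}

func (p *Projector) Rebuild(ctx context.Context, basketID string) error {
	events, err := p.Events.Load(ctx, basketID)
	if err != nil {
		return err
	}
	return p.Projections.Upsert(ctx, basketID, p.Build(events))
}
```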

We recommend that you research the CQRS myths before applying CQRS. It is also worth reading what Greg Young has shared on Stack Overflow.

We had to find answers to some basic questions while implementing CQRS with Event Sourcing.

  • Did we really need to apply it to all domains?

No. In fact, this solution fully embodies “the complexity behind the simplicity”. If we had applied it to all domains, we would have lost our main motivation, simplification, and it would not have been sustainable for all domains due to its cost. We decided to implement this solution only in the basket domain, which takes the most load and needs it the most. In some other domains we applied CQRS with the separated model with different APIs, considering cost, feasibility, and sustainability.

  • How were we going to prevent losing events?

Processing data without any delay is a very critical issue for us, and our system is definitely not suitable for eventually consistent solutions on this path. We could have used the Microsoft Distributed Transaction Coordinator or the transactional outbox pattern to avoid losing events, and we could have distributed the load in a balanced way with consistent hashing. However, we did not choose them because they introduced small delays. We use two-phase commit (2PC) to protect the integrity of our events.

  • How would we ensure data consistency between the read and write sides?

In fact, we followed Greg Young’s advice at this step and used hash codes to verify that the state derived from our events matches the aggregate. In other words, we ensure that the hash code of the projection we create by folding our events is the same as the hash code of the aggregate. We also created different solutions for some exceptional situations, as follows.

— We store a summary of the events as metadata to ensure data integrity. This metadata holds the number of events on the aggregate and the hash code of the aggregate.

— We rebuild the projection from the events in case the projection data is not available, and we check the consistency and reliability of the new projection by comparing it with the metadata.

If the data is still unreliable, we rebuild the events from the aggregate itself: we destroy all existing events and build reliable projections again.
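
As a hypothetical sketch of this hash-based check (the hash algorithm, metadata shape, and fold function are assumptions, not the exact Erebor implementation), the projection built by folding events is trusted only if its hash and event count match the metadata stored alongside the aggregate:

```go
package consistency

// Sketch of the hash-based consistency check described above.

import (
	"crypto/sha256"
	"encoding/hex"
)

// Metadata summarizes the write side: how many events the aggregate has
// and the hash code of the aggregate's current state.
type Metadata struct {
	EventCount    int
	AggregateHash string
}

// Hash returns a stable hash for any serialized state.
func Hash(state []byte) string {
	sum := sha256.Sum256(state)
	return hex.EncodeToString(sum[:])
}

// FoldEvents builds projection state by applying events in order.
// The apply function is domain specific and hypothetical here.
func FoldEvents(events [][]byte, apply func(state, event []byte) []byte) []byte {
	state := []byte{}
	for _, e := range events {
		state = apply(state, e)
	}
	return state
}

// IsProjectionConsistent reports whether the projection derived from the
// events can be trusted; if it returns false, the events are rebuilt from
// the aggregate, as described above.
func IsProjectionConsistent(projection []byte, events [][]byte, meta Metadata) bool {
	if len(events) != meta.EventCount {
		return false
	}
	return Hash(projection) == meta.AggregateHash
}
```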

So what did we get in the end?

We solved many availability problems across our entire system with CQRS and Event Sourcing. In the basket domain, where we use these two solutions:

  • We can process 2.5M events arriving at approximately the same time within 60 ms.
  • We have a projection-store hit ratio of 98%.
  • The average response times of our back-end services are approximately 180–200 milliseconds.
a few samples

Survive in dependency hell

In our system, different micro-services had to call the same services repeatedly. This caused a “dependency hell” problem both at the network level and between micro-services.

For example, we needed data about the basket, delivery options, and payment options to complete the payment and create a pre-order state (called a snapshot).

As you can see, these services are stateless; they certainly don’t know each other’s results.

So how did we solve this “dependency hell” here?

Actually, we used a simple concept: services that need the same state within a given time window can share it with each other. We compress the shared state with Brotli and transmit it between services in an HTTP header. Thus, services that need the same state can use the data in the header instead of making another service call. We call it the dependency reducer.

after implementation of dependency reducer
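
As a hedged, self-contained sketch of this idea (not the production middleware), shared state can be Brotli-compressed, base64-encoded, and forwarded in a header so a downstream service can reuse it instead of re-fetching; the header name, the state shape, and the github.com/andybalholm/brotli package choice are all assumptions:

```go
package depreducer

// Sketch of a "dependency reducer": shared state is compressed with Brotli,
// base64-encoded and carried in an HTTP header between services.

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"io"
	"net/http"

	"github.com/andybalholm/brotli"
)

const stateHeader = "X-Checkout-State" // hypothetical header name

// EncodeState serializes, compresses and encodes state for transport.
func EncodeState(state any) (string, error) {
	raw, err := json.Marshal(state)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	w := brotli.NewWriter(&buf)
	if _, err := w.Write(raw); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// DecodeState reverses EncodeState into the given target.
func DecodeState(header string, target any) error {
	compressed, err := base64.StdEncoding.DecodeString(header)
	if err != nil {
		return err
	}
	raw, err := io.ReadAll(brotli.NewReader(bytes.NewReader(compressed)))
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, target)
}

// AttachState adds the shared state to an outgoing request so the callee
// does not have to fetch the same data again.
func AttachState(req *http.Request, state any) error {
	encoded, err := EncodeState(state)
	if err != nil {
		return err
	}
	req.Header.Set(stateHeader, encoded)
	return nil
}
```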

Stuck at the borders

journey of data centric solutions

We did not have adequate system and application metrics, so we were stuck with unreal limits for a long time.

— We could not observe the impact of the improvements we made.

— We could not track the results of the improvements we made for our customers.

— We couldn’t see our technical limits, so we couldn’t move forward.

As a result, we were often making the wrong decisions or wasting time on useless solutions. Once we had solved most of our technical problems, we decided to make our product data-centric so that we could determine our own limits.

I have to say that, as per Conway’s Law, no single product team within a large-scale technology company such as Hepsiburada can become data-centric on its own; the entire organization needs to become data-centric as fast as possible.

CHAPTER TWO: CHOOSING THE RIGHT TECHNOLOGY SPECS

We had to make the right, sustainable, and robust technological choices after separating all the domains from each other. In theory, we had divided our system into 40 different micro-services, but we had not yet decided how we would design them. We followed the rules below while making this decision:

  • Stay away from hype-driven development
  • Trust your feelings
  • Consider the problem space
  • Check the community
  • Consider the quality of developers you want to attract
  • Use technologies that fit our company’s core values

Our monolithic service was developed in C#, so we decided to use C# for the Erebor services where complex logic is concentrated. We developed our less complex services with Go, and we preferred Node.js for our cross-cutting concerns. Thus, the language distribution across our services ended up as follows.

legacy checkout
erebor

We used Apache Cassandra to write and store our events because of its flexible schema, high scalability and availability with no single point of failure, very high write throughput, and good read throughput. We also used it to store our projections.

our-cassandra-specs
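
Purely as a hedged sketch of what appending a domain event to Cassandra can look like (the keyspace, table, and column names are hypothetical, and the gocql driver is just one possible client, not necessarily the one used in Erebor):

```go
package eventstore

// Hypothetical sketch of appending a domain event to a Cassandra table.

import (
	"context"
	"time"

	"github.com/gocql/gocql"
)

type CassandraEventStore struct{ session *gocql.Session }

// NewCassandraEventStore connects to the cluster with quorum consistency.
func NewCassandraEventStore(hosts ...string) (*CassandraEventStore, error) {
	cluster := gocql.NewCluster(hosts...)
	cluster.Keyspace = "erebor_events" // hypothetical keyspace
	cluster.Consistency = gocql.Quorum
	session, err := cluster.CreateSession()
	if err != nil {
		return nil, err
	}
	return &CassandraEventStore{session: session}, nil
}

// Append writes one event; the partition key is the aggregate ID so that an
// aggregate's events live together and can be read back in version order.
func (s *CassandraEventStore) Append(ctx context.Context, aggregateID string, version int, eventType string, payload []byte) error {
	return s.session.Query(
		`INSERT INTO basket_events (aggregate_id, version, event_type, payload, created_at)
		 VALUES (?, ?, ?, ?, ?)`,
		aggregateID, version, eventType, payload, time.Now(),
	).WithContext(ctx).Exec()
}
```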

We preferred MongoDB (as a sharded cluster) to store and serve our aggregates. We use five different MongoDB clusters with approximately 100 nodes in total.

our-basket-store-mongodb-specs

We use an Apache Kafka cluster for our domain events. Kafka can handle high-velocity, high-volume data on modest hardware and supports throughput of thousands of messages per second. It also handles these messages with very low latency, in the range of milliseconds, which most of our new use cases demand.

our-kafka-specs

On the other hand, we preferred RabbitMQ for our integration events. As you know, RabbitMQ offers a variety of features that let you trade off performance against reliability, including persistence, delivery acknowledgements, publisher confirms, and high availability. Messages are routed through exchanges before arriving at queues, and RabbitMQ features several built-in exchange types for typical routing logic. For more complex routing, you can bind exchanges together or even write your own exchange type as a plugin.

our-rabbitmq-specs
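
As a hedged illustration of the domain-event / integration-event split (not the actual Erebor publishers), the sketch below sends a domain event to Kafka and an integration event to RabbitMQ; the topic, exchange, routing key, and broker addresses are assumptions, and the segmentio/kafka-go and rabbitmq/amqp091-go client libraries are just one possible choice:

```go
package events

// Sketch of publishing a domain event to Kafka and an integration event to
// RabbitMQ. Topic, exchange, routing key and broker addresses are hypothetical.

import (
	"context"

	amqp "github.com/rabbitmq/amqp091-go"
	"github.com/segmentio/kafka-go"
)

// PublishDomainEvent writes a basket domain event to a Kafka topic, keyed by
// aggregate ID so that events of the same basket stay ordered on one partition.
func PublishDomainEvent(ctx context.Context, basketID string, payload []byte) error {
	w := &kafka.Writer{
		Addr:     kafka.TCP("kafka-1:9092"), // hypothetical broker
		Topic:    "basket.domain-events",    // hypothetical topic
		Balancer: &kafka.Hash{},
	}
	defer w.Close()
	return w.WriteMessages(ctx, kafka.Message{Key: []byte(basketID), Value: payload})
}

// PublishIntegrationEvent routes an integration event through a RabbitMQ
// exchange so other bounded contexts can subscribe to it.
func PublishIntegrationEvent(ctx context.Context, routingKey string, payload []byte) error {
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672/") // hypothetical broker
	if err != nil {
		return err
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		return err
	}
	defer ch.Close()

	return ch.PublishWithContext(ctx, "checkout.integration", routingKey, false, false,
		amqp.Publishing{ContentType: "application/json", Body: payload})
}
```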

Below you can see all the products we use.

legacy checkout
erebor

CHAPTER THREE: THE COMPLEXITY BEHIND THE SIMPLICITY

After we launched our system, we faced different problems. Our new system was telling us to change our perspective because of the questions below.

  • Was a single product team enough for this many micro-services?

As you know, Amazon uses a simple rule called the “two-pizza rule” to maximize meeting efficiency. We believe this rule also applies to product teams. If our motivation is to increase the productivity of the product and the production capacity of the team, we should prefer vertical teams to horizontal ones. For this reason, we decided to transform horizontal teams like ours into vertical teams over time.

“Communication problems increase exponentially as team size increases. Ironically, the larger the team, the more time will be spent on communication instead of producing work.” (J. Richard Hackman)

  • How could so many micro-services be maintained?

In fact, we haven’t decided on a repository strategy yet; we are still discussing monorepo, multi-repo, or a hybrid. We are also still thinking about how to improve many other points, including CI/CD. I think we will learn the right answer by experience.

  • How would the learning curve be affected?

The biggest disadvantage of micro-services and event-based architectures is the learning challenge. The learning curve of a monolithic system is much gentler than that of micro-services, and using different languages and products together raises the learning threshold even further. If you’re a horizontally designed team like us, you probably won’t solve this problem in the long run. You can hold regular “lunch & learn” sessions to reduce this impact and help the team reach the necessary maturity, or you can try pair programming, which is the most effective solution.

  • How would we maintain the strength of these services?

Actually, we use the attributes described in “Building Evolutionary Architectures” for this. We test the attributes we have selected both technically and theoretically. They are: adaptability, autonomy, availability, configurability, correctness, effectiveness, durability, usability, failure transparency, fault tolerance, maintainability, manageability, scalability, stability, traceability, and testability. We have not yet automated the testing of these attributes, but we will implement “fitness function-driven development” as soon as possible.

  • How would we monitor so many micro-services?

Monitoring is an extremely difficult problem in microservices and event-based systems.

Each application will have unique needs relating to monitoring. There are a few common metrics you’ll want to record. They include:

  • Application Metrics

The system must be able to collect and serve top-level data. This data helps development teams and the organization understand the functional behavior of the system.

  • Platform Metrics

These metrics report on the nuts and bolts of your infrastructure and provide a dashboard for understanding low-level system performance and behavior.

  • System Events

Operations staff know there is a strong correlation between new code deployments and system failures. Scaling events, configuration updates, and other operational changes are also relevant and should be recorded; recording them makes it possible to correlate them with system behavior.

  • Business Metrics

These metrics should be collected to track user behavior. In addition, improvements can only be made on the basis of these metrics.

We collect application and platform metrics using an APM tool, and we use our in-house services to collect business metrics.
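
Purely as a generic illustration of the application/business metric split (we use an APM tool and in-house services, so the Prometheus Go client shown here is an assumption, not our stack), a checkout service could expose a latency histogram and an order counter like this:

```go
package main

// Generic illustration: exposing an application metric (request latency) and
// a business metric (completed orders) with the Prometheus Go client.

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Application metric: latency of checkout requests.
	checkoutDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "checkout_request_duration_seconds",
		Help:    "Duration of checkout requests.",
		Buckets: prometheus.DefBuckets,
	})

	// Business metric: completed orders, labeled by payment type.
	ordersCompleted = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "orders_completed_total",
		Help: "Number of completed orders.",
	}, []string{"payment_type"})
)

func main() {
	prometheus.MustRegister(checkoutDuration, ordersCompleted)

	http.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		defer func() { checkoutDuration.Observe(time.Since(start).Seconds()) }()

		// ... handle the checkout here ...
		ordersCompleted.WithLabelValues("credit_card").Inc()
		w.WriteHeader(http.StatusOK)
	})

	// Scrapers read application and business metrics from /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```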

In conclusion,

I would like to share some of the results we achieved after all these efforts, using the following two graphs.

2017–2019 legacy checkout — 2020 erebor in BF

We have measured that our new system can handle eight times the peak load shown in the graph above. Below, you can see the average response times of our system under heavy load.

2017–2019 legacy checkout — 2020 erebor in BF

I would like to thank my teammates and the whole team, who made an incredible effort throughout this process, never gave up, and delivered this system to our users with superhuman energy.

You can contact me at cem.basaranoglu@hepsiburada.com

Happy Coding!
