Building Out an Antifragile Microservice Architecture @ Andela: Design Considerations

On May 3rd 2016, I transitioned to a new role @ Andela. I was formerly a technical trainer in charge of training incredibly smart young minds to become world-class junior Ruby/Rails developers. In my new role, I will be in charge of migrating all Andela systems to an Antifragile Microservice Architecture with the help of 5 incredible developers. At this point, Brice Nkengsa (the Head of Engineering @ Andela) has already made things easier for us, since the monoliths have been split into backend and frontend apps with an API gateway in between. Check out Brice Nkengsa’s recent blog post on Migrating to a Microservice Architecture to learn more about the steps he took to achieve this.

Below are some design considerations made while building out the Antifragile Microservice Architecture.

Synchronous Inter-process Communication (IPC)

Since the services are micro, there will be a lot of them and they will need to communicate with each other. The key metric here is speed: we want to keep network latency as low as possible so that response times stay fast regardless of how many services a request ends up calling. Below are some of the options that were available to us:

  • REST (JSON or XML)
  • SOAP (XML): cross-platform
  • Thrift (binary): cross-platform, developed at Facebook
  • Java RMI (binary): JVM only
  • Avro (binary): cross-platform
  • Protocol buffers (binary): cross-platform, developed at Google

REST is perfect for exposing API endpoints to external applications, and we exposed our API gateway endpoints as RESTful resources. However, it is not ideal for IPC, where speed is of utmost importance. This is because serializing and deserializing a JSON payload is an expensive operation, and a JSON payload is quite large compared to its binary counterpart. XML is even worse than JSON in these regards. We didn’t go with SOAP, since it is XML-based. Java RMI would have been a viable option since it uses a binary format for its payload. However, we couldn’t go with it since it only runs on the JVM.

Thrift, Avro and Protocol buffers are cross-platform mechanisms whose payloads are in a binary format. Choosing one out of these 3 was tough. Binary payloads are very small and, unlike JSON, converting binary data to an object is comparatively cheap. We eventually settled on protocol buffers, since it has a very nice Interface Definition Language (IDL) and its approach to schema evolution made more sense to us.
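To make the size argument concrete, here is a minimal sketch that encodes the same record as JSON and as a fixed binary layout. It uses only Python's standard library (`json` and `struct`) as a stand-in: this is not the protocol buffers wire format, and the record and its field names are hypothetical.

```python
import json
import struct

# Hypothetical record, used only for illustration.
user = {"id": 1234, "active": True, "score": 98.5}

# Text encoding: field names and punctuation travel with every message.
json_bytes = json.dumps(user).encode("utf-8")

# Fixed binary layout: unsigned 32-bit id, bool, 64-bit float.
# "<" means little-endian with no padding, so the size is 4 + 1 + 8 bytes.
LAYOUT = struct.Struct("<I?d")
binary_bytes = LAYOUT.pack(user["id"], user["active"], user["score"])


def decode(payload: bytes) -> dict:
    """Rebuild the record from the fixed binary layout."""
    user_id, active, score = LAYOUT.unpack(payload)
    return {"id": user_id, "active": active, "score": score}


print(len(json_bytes), len(binary_bytes))
```

The binary form is smaller because the field names live in the shared layout rather than in each message. The trade-off is that both sides must agree on that layout, which is the job an IDL does in a real system.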

Microservice Frameworks/Toolkit

There are a lot of microservice frameworks out there; however, they limit you in one way or another. These limitations can take the form of constraining you to a specific platform (the Netflix microservice toolkit on the JVM) or to a specific language (Seneca for Node.js or Go-Kit for Golang). We wanted a framework/toolkit that supports at least the following languages: Golang, Node.js, Python, Ruby and Java. We eventually settled on gRPC (it was a match made in heaven).

In their own words gRPC is:

A high performance, open source, general RPC framework that puts mobile and HTTP/2 first.

It also uses protocol buffers as its default message format, which is a plus for us. Check out gRPC design principles and motivations to find out some of the reasons why we think gRPC is the best framework for building out microservices.

Event Driven Data Management

Since we are building an Antifragile microservice architecture, each service will have its own database rather than sharing a common database with other services. However, this came with two major challenges:

  • The first challenge is how to implement business transactions that maintain consistency across multiple services.
  • The second challenge is how to implement queries that retrieve data from multiple services.

Due to the above challenges, we had to implement an event-driven architecture (choreography). In this architecture, a microservice publishes an event when something notable happens, such as when it updates a business entity. Other microservices subscribe to those events. When a microservice receives an event, it can update its own business entities, which might lead to more events being published. The blog post below was an eye opener when we were doing research on event-driven architecture.

Once we agreed on implementing an event-driven architecture, it was time to choose a message broker that would enable services to publish and subscribe to events. Below were the strong candidates we considered.

  • NATS
  • RabbitMQ
  • Kafka
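Whichever broker sits in the middle, the publish/subscribe flow it has to support can be sketched in a few lines. The example below is a toy, in-memory illustration only: the `InMemoryBroker` class, the topic names and the order/credit services are all hypothetical, and delivery is synchronous, which a real broker would not be.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for a message broker: topic name -> list of handlers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver synchronously to every handler registered for the topic.
        for handler in list(self._subscribers[topic]):
            handler(event)


broker = InMemoryBroker()
orders: Dict[int, str] = {}         # data owned by a hypothetical order service
credit: Dict[int, int] = {7: 100}   # data owned by a hypothetical customer service


def create_order(order_id: int, customer_id: int, total: int) -> None:
    """Order service: record the order, then announce it."""
    orders[order_id] = "PENDING"
    broker.publish("order.created",
                   {"order_id": order_id, "customer_id": customer_id, "total": total})


def on_order_created(event: dict) -> None:
    """Customer service: update its own data, then publish the outcome."""
    customer = event["customer_id"]
    if credit.get(customer, 0) >= event["total"]:
        credit[customer] -= event["total"]
        broker.publish("credit.reserved", {"order_id": event["order_id"]})
    else:
        broker.publish("credit.rejected", {"order_id": event["order_id"]})


def on_credit_reserved(event: dict) -> None:
    orders[event["order_id"]] = "APPROVED"


def on_credit_rejected(event: dict) -> None:
    orders[event["order_id"]] = "REJECTED"


broker.subscribe("order.created", on_order_created)
broker.subscribe("credit.reserved", on_credit_reserved)
broker.subscribe("credit.rejected", on_credit_rejected)

create_order(1, 7, 60)   # enough credit
create_order(2, 7, 60)   # only 40 left, so this one is rejected
```

Neither service reads the other's data store. Each reacts to events and updates only what it owns, which is how a business transaction can span services without a shared database.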

In their own words, NATS is:

A central nervous system for modern, reliable and scalable cloud and distributed systems.

It was written in Golang and it’s incredibly fast. It is the fastest message broker I have seen so far (as depicted in the above chart). However, NATS made some design decisions that we were not very comfortable with. These include:

  • At most once delivery (fire and forget)
  • Lack of Persistence

In a fire-and-forget approach, messages are not guaranteed to be delivered. I think the decision to use fire and forget as its delivery pattern was due to the fact that NATS does not have a persistence layer.
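As a rough illustration of what at-most-once delivery without persistence means for a subscriber, consider the toy broker below. It is a sketch of the delivery semantics only, not of how NATS is implemented; the class and message names are hypothetical.

```python
from typing import Callable, List


class FireAndForgetBroker:
    """Toy broker with at-most-once delivery and no storage."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._handlers.append(handler)

    def publish(self, message: str) -> int:
        # Hand the message to whoever is connected right now, then drop it.
        for handler in self._handlers:
            handler(message)
        return len(self._handlers)  # number of subscribers that received it


broker = FireAndForgetBroker()
received: List[str] = []

delivered_first = broker.publish("user.created:1")   # nobody is listening yet
broker.subscribe(received.append)                    # subscriber connects late
delivered_second = broker.publish("user.created:2")
```

The first message is gone for good: nothing stored it, so a subscriber that connects late (or was briefly down) never sees it.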

RabbitMQ is a very mature, robust and easy-to-use messaging platform. Even though it is much slower than NATS, Kafka and Redis, it is ideal for most messaging situations. However, we didn’t go with it, since it has no support for event sourcing.

Kafka was the ideal choice for our use case, for the following reasons:

  • Kafka is incredibly fast (second only to NATS)
  • It has built-in persistence using the concept of logs
  • We could replay events, and it inherently supports the concept of event sourcing
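The persistence and replay points can be sketched with a toy append-only log. This is an in-memory illustration of the idea only, not the Kafka client API; the `EventLog` class and the account events are hypothetical.

```python
from typing import Dict, List


class EventLog:
    """Toy append-only log; a record's offset is its position in the list."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def append(self, event: dict) -> int:
        self._records.append(event)
        return len(self._records) - 1   # offset of the record just written

    def read_from(self, offset: int) -> List[dict]:
        # Reading never removes records, so any consumer can re-read them.
        return self._records[offset:]


def rebuild_balances(events: List[dict]) -> Dict[str, int]:
    """Derive current state by folding over the events in order."""
    balances: Dict[str, int] = {}
    for event in events:
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["amount"]
    return balances


log = EventLog()
log.append({"account": "a", "amount": 50})
log.append({"account": "b", "amount": 20})
last_offset = log.append({"account": "a", "amount": -30})

# A consumer that starts late (or lost its state) replays from offset 0.
late_consumer_view = rebuild_balances(log.read_from(0))

# A consumer that is already caught up only reads what is new.
new_records = log.read_from(last_offset)
```

Because consumers track their own position and the log keeps the records, a new service can be brought up later and still reconstruct its state from the full event history.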

I will be discussing event sourcing and how we are implementing it in subsequent posts since it is the major reason why we chose Kafka.

Testing Strategy

Testing a monolith is relatively easy since everything lives in one app. That’s not the case when you have tens or hundreds of microservices. How do we know that a change made to one microservice won’t break any other microservice or API gateway that depends on it? We approach testing in 3 phases:

  • Unit tests: All units in each microservice are fully tested. This is to ensure that each individual unit works on its own.
  • Component tests: This is basically an end-to-end test for each microservice. It tells us that all units in a microservice work well together without failing. All microservices, including the API gateway, have a full component test suite.
  • Acceptance tests: These are maintained in a separate repo, written in user story format and built using the godog Golang library. A user story can be in this format

We are still investigating patterns that will enable us to write very good test suites. It might be a never-ending quest.

CI/CD Pipeline

Each of our microservices integrates with CircleCI, which builds the service and runs its tests on every commit. In the CircleCI environment, the service connects to a local Postgres instance and an online Kafka instance, since it needs both to build and run its tests. We are still building out our CI/CD pipeline via ConcourseCI, which is triggered on any push to the develop branch.

ConcourseCI pulls the develop branch on every commit to it and runs its tests. If they pass, it builds a Docker image with a new semantic version tag and pushes the image to Google Container Registry. This is followed by a deploy to the staging environment, which already has the other microservices deployed. If the deploy is successful, ConcourseCI pulls the acceptance test repo and runs its test suites. Once these tests pass, the final Docker image is built and pushed to Google Container Registry. The final deploy to production will be manually triggered. Currently this is what our pipeline looks like.

Deployment Strategy

Deploying a monolithic application means running multiple, identical copies of a single, usually large application. You typically provision N servers (physical or virtual) and run M instances of the application on each one.

A microservices application consists of tens or even hundreds of services. Services are written in a variety of languages and frameworks. Each one is a mini-application with its own specific deployment, resource, scaling, and monitoring requirements.

Below are a few different microservice deployment patterns we considered:

  • Multiple Service Instances per Host Pattern
  • Service Instance per Host(VM) Pattern
  • Service Instance per Container Pattern
  • Serverless Deployment

Learn more about each of these patterns here. We eventually went with the Service Instance per Container Pattern (using Docker as the container engine) for the following reasons:

  • It gave us the flexibility to use any language (unlike Serverless Deployment)
  • It gave us the ability to run many microservices on one host (unlike the Service Instance per Host (VM) Pattern)
  • It gave us the ability to run each microservice in a sandbox with its own CPU and memory resources (unlike the Multiple Service Instances per Host Pattern)

Once we chose to use Service Instance per Container Pattern, it became apparent we had to choose a container orchestration platform. This platform will be responsible for the following:

  • Horizontally scaling each microservice
  • Automating rollout and rollback
  • Service discovery and load balancing
  • Restarting containers that fail, etc.
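Two of these responsibilities, scaling and restarting failed containers, come down to repeatedly comparing the desired state with the actual state and correcting the difference. The sketch below illustrates that idea in plain Python; it is not how any particular orchestrator is implemented, and the `Service`, `Container` and `reconcile` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    name: str
    healthy: bool = True


@dataclass
class Service:
    name: str
    desired_replicas: int
    containers: List[Container] = field(default_factory=list)
    started: int = 0  # counter used to give each new container a unique name


def reconcile(service: Service) -> None:
    """Drop failed containers, then start or stop containers to match the desired count."""
    service.containers = [c for c in service.containers if c.healthy]
    while len(service.containers) < service.desired_replicas:
        service.started += 1
        service.containers.append(Container(f"{service.name}-{service.started}"))
    # Scale in: remove any containers beyond the desired count.
    del service.containers[service.desired_replicas:]


svc = Service("auth", desired_replicas=3)
reconcile(svc)                        # initial rollout: 3 containers
svc.containers[0].healthy = False     # simulate a crash
reconcile(svc)                        # the failed container is replaced
svc.desired_replicas = 5              # horizontal scale-out
reconcile(svc)
names = [c.name for c in svc.containers]
```

An orchestration platform runs this kind of loop continuously and for every service, so operators declare what they want and the platform works toward it.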

We came across two options that made sense: Kubernetes and Marathon. Both are free and open source. We decided to go with Kubernetes since Google already has a hosted Kubernetes service (Google Container Engine).

Conclusion

It has been an interesting journey for us (the core team), migrating Andela systems to an Antifragile Microservice Architecture. We have been learning a lot and pushing the boundaries of our current abilities. We question each decision we make to ensure we are making the right ones. It has not been easy, but it has been fun and exciting.

I will be releasing a blog post every week to discuss some of the technologies outlined above and others I didn’t get the chance to touch on (circuit breakers, monitoring, metrics, etc.).

If you liked this, click the 💚 below so other people will see this here on Medium. Also, if you have any questions or observations, use the comment section to share your thoughts.