Microservice Orchestration

At Jexia we believe a microservice architecture is the best way to organize our backend cloud: it gives a nice separation of concerns and sets clear boundaries around the responsibilities of each service. With a microservice architecture, we can scale specific parts of the backend application depending on the load on each part. Last but not least, it is nice for us developers to have our own ‘pet projects’, which we can build the way we want to.

This is nice and all, but when put into practice, things begin to get more complex: Which service do I need to call to perform some task? My task involves using multiple services in a particular order, but which order? How can we roll back when one of the actions fails?
Of course we don’t want our frontend and SDK teams to be bothered with such trivial things. There are also ‘hypothetical’ scenarios, where the full documentation is not yet written — yet another potential stumbling block.

So we introduced an orchestration service. A single call to this orchestration service results in one or more requests to the backend microservices. This orchestration service needed to be fast, simple, dynamic, small, configurable, easy to use, working, etc. Basically no trade-offs and all good stuff included. We also chose to write it in Golang to prevent confusion within the backend team, as all other services are written in Golang already.

We liked the ideas behind Netflix Conductor, but it was too big and complex for our expected use cases (and was not written in Golang). Some frameworks and libraries met part of our requirements, but fell short in terms of flexibility, complexity, quality, etc. So we had to create our own.

Design choices

Our primary goal is flexibility — we want to avoid having to change the code every time a new service is added or a new task is requested. The modular design shown in the figure seems to be our best bet.

Architecture of our Microservice Orchestrator

It has three layers:

  • API (Input) modules
    These connect the system with the outside world. They implement the protocols like HTTP REST, GraphQL, web sockets — whatever is needed! The requests are then forwarded in an internal format to the Logic module.
  • Logic
    This part translates the converted requests into one or more calls to the microservices. The results are then gathered and reported back to the API module.
  • Caller (Output) modules
    These handle the interaction with the actual microservices. The Logic layer triggers the interaction and provides the caller with the needed parameters and details. Each caller handles its own protocol. This protocol can be a generic one, such as an HTTP JSON protocol for talking to multiple services, or a specialized protocol (e.g. using protobuf) to communicate with a specific service.
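
To make the three layers a bit more concrete, here is a minimal Go sketch. All names in it (Request, Caller, Logic, echoCaller) are made up for this illustration and are not taken from the actual orchestrator; the real Logic layer runs several calls per flow, while this one forwards to a single caller.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical internal format that an API module hands to
// the Logic layer after decoding its own protocol (REST, GraphQL, ...).
type Request struct {
	Flow   string
	Inputs map[string]string
}

// Caller is the output side: it knows how to talk to a backend service.
type Caller interface {
	Call(method string, params map[string]string) (map[string]string, error)
}

// Logic maps a flow name to the caller that serves it. This is a heavy
// simplification: one flow, one call.
type Logic struct {
	callers map[string]Caller
}

// Handle is what an API module would invoke for every incoming request.
func (l *Logic) Handle(req Request) (map[string]string, error) {
	c, ok := l.callers[req.Flow]
	if !ok {
		return nil, errors.New("unknown flow: " + req.Flow)
	}
	return c.Call(req.Flow, req.Inputs)
}

// echoCaller stands in for a real microservice and echoes its input.
type echoCaller struct{}

func (echoCaller) Call(method string, params map[string]string) (map[string]string, error) {
	return map[string]string{"method": method, "id": params["id"]}, nil
}

func main() {
	logic := &Logic{callers: map[string]Caller{"item details": echoCaller{}}}
	out, err := logic.Handle(Request{Flow: "item details", Inputs: map[string]string{"id": "1"}})
	fmt.Println(out, err)
}
```

The point of the split is that a new protocol on either side only needs a new API module or a new Caller implementation; the Logic type in the middle stays untouched.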

Logical flows

The logic layer is configurable using a custom DSL called Flows (our originality is the best, it’s true) which can be read at any time to update the logic (currently we read it at startup only).
The Flow DSL consists of service, middleware and flow blocks, as shown in the following example.

flows {
    service "security" {
        identifier: "auth"
    }

    service "item-manager" {
        identifier: "http"
        address: "http://item-manager.service.cloud/"
    }

    service "image-service" {
        identifier: "http"
        address: "http://images.service.cloud/"
    }

    service "data-store" {
        identifier: "json-rpc"
        name: "DataStoreService"
    }

    middleware "Authorize Token" {
        // returns an error when token is not valid,
        // aborting the rest of the flow
        service = "security"
        method = "TokenValid"
        request = {
            "token": "{{{input_token}}}"
        }
    }

    flow "item details" {
        input = ["id", "token"]
        middleware = ["Authorize Token"]
        order = ["details", "image URL"]

        call "details" {
            type = "fork"
            call "basic details" {
                service = "item-manager"
                method = "Item.Details"
                request = {
                    "id": "{{{input_id}}}"
                }
                response = ["name", "description", "imageId"]
            }
            call "relations" {
                service = "data-store"
                method = "Item.Relations"
                request = {
                    "id": "{{{input_id}}}"
                }
                response = ["relatedItems"]
            }
        }

        call "image URL" {
            type = "sync"
            service = "image-service"
            method = "URL"
            request = {
                "id": "{{{basic details:imageId}}}"
            }
            response = ["imageURL", "width", "height"]
        }

        output = {
            "name": "{{{basic details:name}}}"
            "description": "{{{basic details:description}}}"
            "image": "{{{image URL:imageURL}}}"
            "relatedItems": "{{{relations:relatedItems}}}"
        }
    }
}

Note that some fields like timeout, retries, etc. are not shown in the example to reduce the length a bit.

The services configure the callers (I guess, we could have named them more similarly…) and make them available to the flows. Each service might include parameters that are provided to the caller in order for it to communicate with the backend service.
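
As a rough illustration of how a service block could end up as a configured caller, the Go sketch below keeps a table of caller constructors keyed by identifier. The types and the factories table are assumptions made for this example; only the "http" identifier and the address come from the Flows example above.

```go
package main

import "fmt"

// ServiceConfig mirrors one service block: a name, the identifier of the
// caller to use, and free-form parameters such as "address".
type ServiceConfig struct {
	Name       string
	Identifier string
	Params     map[string]string
}

// Caller is a hypothetical, deliberately tiny caller interface.
type Caller interface {
	Describe() string
}

type httpCaller struct{ address string }

func (c httpCaller) Describe() string { return "http -> " + c.address }

// factories maps a caller identifier to a constructor. Only "http" is
// sketched here; "auth" and "json-rpc" would be registered the same way.
var factories = map[string]func(params map[string]string) Caller{
	"http": func(p map[string]string) Caller { return httpCaller{address: p["address"]} },
}

// BuildServices turns service blocks into callers, keyed by service name,
// so that flows can refer to them by name.
func BuildServices(cfgs []ServiceConfig) (map[string]Caller, error) {
	services := make(map[string]Caller, len(cfgs))
	for _, cfg := range cfgs {
		newCaller, ok := factories[cfg.Identifier]
		if !ok {
			return nil, fmt.Errorf("service %q: unknown caller %q", cfg.Name, cfg.Identifier)
		}
		services[cfg.Name] = newCaller(cfg.Params)
	}
	return services, nil
}

func main() {
	services, err := BuildServices([]ServiceConfig{{
		Name:       "item-manager",
		Identifier: "http",
		Params:     map[string]string{"address": "http://item-manager.service.cloud/"},
	}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(services["item-manager"].Describe())
}
```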

The flows are the main part of the Flows DSL. They define the service calls that should be executed in order to perform a certain task. Besides the list of calls, they also have some properties. These include required inputs and outputs, required middleware, call order, timeout information, etc.

Each call has a type; currently we support synchronous and forked calls.
A synchronous call has to finish before the next call starts. A forked call spawns multiple child calls concurrently and waits until all of them are finished (joined, hence the name forked).
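
The difference between the two call types can be sketched in Go with plain goroutines and a sync.WaitGroup. The Call type and the RunSync/RunFork functions are hypothetical names for this example, not the orchestrator's own API.

```go
package main

import (
	"fmt"
	"sync"
)

// Call is a hypothetical unit of work with a name and a result.
type Call struct {
	Name string
	Run  func() (string, error)
}

// RunSync executes the calls one after another and stops at the first error.
func RunSync(calls []Call) (map[string]string, error) {
	out := make(map[string]string)
	for _, c := range calls {
		v, err := c.Run()
		if err != nil {
			return out, fmt.Errorf("call %q: %w", c.Name, err)
		}
		out[c.Name] = v
	}
	return out, nil
}

// RunFork starts all child calls concurrently and waits until every one of
// them has finished (the join). It reports the first error it recorded.
func RunFork(calls []Call) (map[string]string, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex // guards out and firstErr
		firstErr error
	)
	out := make(map[string]string)
	for _, c := range calls {
		wg.Add(1)
		go func(c Call) {
			defer wg.Done()
			v, err := c.Run()
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("call %q: %w", c.Name, err)
				}
				return
			}
			out[c.Name] = v
		}(c)
	}
	wg.Wait()
	return out, firstErr
}

func main() {
	children := []Call{
		{Name: "basic details", Run: func() (string, error) { return "name, description", nil }},
		{Name: "relations", Run: func() (string, error) { return "relatedItems", nil }},
	}
	out, err := RunFork(children)
	fmt.Println(out, err)
}
```

In this sketch a failing child does not cancel its siblings; whether the real orchestrator cancels them is not something this example tries to answer.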

Just like flows, calls have a set of properties besides the type. These include request and response parameters that are passed from/to other calls, details on how to call the service, timeout, etc. For even more flexibility we added a templating engine, based on Mustache, to the request parameters. This makes it possible to combine static values with one or more variables.
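
To show what combining static values with variables could look like, here is a small Go sketch that resolves the triple-brace placeholders used in the example. It is not a Mustache implementation and not the orchestrator's engine: it only does plain variable substitution with a regular expression, and the Render function and the variable naming are assumptions for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placeholder matches {{{name}}} style variables, e.g. {{{input_id}}} or
// {{{basic details:imageId}}}.
var placeholder = regexp.MustCompile(`\{\{\{[^{}]+\}\}\}`)

// Render replaces every placeholder in tmpl with its value from vars and
// returns an error when a variable is not available.
func Render(tmpl string, vars map[string]string) (string, error) {
	var missing []string
	out := placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		key := strings.TrimSpace(m[3 : len(m)-3]) // strip the braces
		v, ok := vars[key]
		if !ok {
			missing = append(missing, key)
			return m
		}
		return v
	})
	if len(missing) > 0 {
		return "", fmt.Errorf("unresolved variables: %s", strings.Join(missing, ", "))
	}
	return out, nil
}

func main() {
	vars := map[string]string{"input_id": "1", "basic details:imageId": "42"}
	path, err := Render("items/{{{input_id}}}/images/{{{basic details:imageId}}}", vars)
	fmt.Println(path, err)
}
```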

Middleware is basically a synchronous call that can be reused, making some common tasks available to all flows that require them. For example, checking if the request contains a valid token is required for all flows that handle non-public information. This task will be used heavily in applications, where users need to sign in to access certain functionality/information. In our case the middleware calls for a flow are always executed before its regular list of calls.
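
The "middleware first, abort on error" behaviour can be sketched in a few lines of Go. The Step type, RunFlow, and the hard-coded token are hypothetical; a real token check would of course call the security service.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is a hypothetical stand-in for both middleware and regular calls:
// it reads and writes a shared set of flow variables.
type Step func(vars map[string]string) error

// RunFlow executes all middleware first, then the regular calls. Any
// error aborts the rest of the flow.
func RunFlow(middleware, calls []Step, vars map[string]string) error {
	for _, m := range middleware {
		if err := m(vars); err != nil {
			return fmt.Errorf("middleware: %w", err)
		}
	}
	for _, c := range calls {
		if err := c(vars); err != nil {
			return fmt.Errorf("call: %w", err)
		}
	}
	return nil
}

// AuthorizeToken mimics the "Authorize Token" middleware from the example.
// The accepted token value is made up for this sketch.
func AuthorizeToken(vars map[string]string) error {
	if vars["input_token"] != "secret" {
		return errors.New("token is not valid")
	}
	return nil
}

func main() {
	details := func(vars map[string]string) error {
		vars["name"] = "item " + vars["input_id"]
		return nil
	}
	vars := map[string]string{"input_id": "1", "input_token": "wrong"}
	err := RunFlow([]Step{AuthorizeToken}, []Step{details}, vars)
	fmt.Println("error:", err)
	fmt.Println("details call was skipped:", vars["name"] == "")
}
```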

The Flows DSL is a bit verbose, but after it is read into memory it is converted to arrays and maps that allow for fast lookup and execution. The current format was chosen to keep things as simple as possible for developers. Thanks to the human-readable format, they can easily understand what is happening in a flow and adjust it without any need to delve into the code.
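
What "arrays and maps" could look like in Go is sketched below. The Flow and CallDef structs are guesses at a reasonable in-memory shape, not the orchestrator's actual types; the order is a slice, the calls are a map, and resolving the order costs one map lookup per call.

```go
package main

import "fmt"

// CallDef is a hypothetical in-memory form of a call block.
type CallDef struct {
	Service string
	Method  string
}

// Flow is a hypothetical in-memory form of a flow block.
type Flow struct {
	Input []string
	Order []string           // call names in execution order
	Calls map[string]CallDef // call definitions keyed by name
}

// Resolve turns the ordered call names into call definitions and reports
// names in the order that have no matching call.
func (f Flow) Resolve() ([]CallDef, error) {
	resolved := make([]CallDef, 0, len(f.Order))
	for _, name := range f.Order {
		def, ok := f.Calls[name]
		if !ok {
			return nil, fmt.Errorf("order refers to unknown call %q", name)
		}
		resolved = append(resolved, def)
	}
	return resolved, nil
}

func main() {
	flows := map[string]Flow{
		"item details": {
			Input: []string{"id", "token"},
			Order: []string{"details", "image URL"},
			Calls: map[string]CallDef{
				"details":   {Service: "item-manager", Method: "Item.Details"},
				"image URL": {Service: "image-service", Method: "URL"},
			},
		},
	}
	defs, err := flows["item details"].Resolve()
	fmt.Println(defs, err)
}
```

Running a check like Resolve once at load time would also catch typos in the order list before the first request arrives.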

API Providers

As said before, the main task of the API is to match a request to a flow for the supported protocol.

For example an HTTP REST API recognizes https://myapp.jexia.com/item/1 and matches it to the item details flow, which gathers the information from multiple services. This information is then presented as a single resource to the API, which sends the response back to the user.
Another API, implementing GraphQL for example, processes https://myapp.jexia.com/graphql?query={item(id:1){name,description,image,relatedItems}}, which also activates the item details flow. (Yes, I know that it would be more efficient to only grab the data that is requested, but this is only a simple example. I could not think of something else that made more sense. How long until the weekend?!).

The outputs of the flow are received by the API when the flow has finished. Then, they are used to create the proper response format for the protocol it implements.

An API module typically opens its own listening interface and handles the incoming request in a way that is best for the protocol, priorities, and other requirements.

Each API module has its own configuration file, containing all routes and mappings to the logical flows. This way, adding a new route/flow does not require adding more code, recompiling, etc. In our case, it can be done by modifying the configuration. Note that this is similar to the Flows DSL.
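
A route-to-flow mapping for a REST module could be as simple as the Go sketch below. The Route struct, the {id} placeholder syntax, and the Match function are assumptions for this example; the actual configuration format is not shown in this article.

```go
package main

import (
	"fmt"
	"strings"
)

// Route is a hypothetical entry from an API module's configuration file.
type Route struct {
	Method  string
	Pattern string // e.g. "/item/{id}"
	Flow    string
}

// Match finds the first route that fits the request and extracts the
// flow inputs from the {placeholder} segments of the pattern.
func Match(routes []Route, method, path string) (string, map[string]string, bool) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	for _, r := range routes {
		if r.Method != method {
			continue
		}
		pattern := strings.Split(strings.Trim(r.Pattern, "/"), "/")
		if len(pattern) != len(parts) {
			continue
		}
		inputs := map[string]string{}
		matched := true
		for i, seg := range pattern {
			if strings.HasPrefix(seg, "{") && strings.HasSuffix(seg, "}") {
				inputs[seg[1:len(seg)-1]] = parts[i]
			} else if seg != parts[i] {
				matched = false
				break
			}
		}
		if matched {
			return r.Flow, inputs, true
		}
	}
	return "", nil, false
}

func main() {
	routes := []Route{{Method: "GET", Pattern: "/item/{id}", Flow: "item details"}}
	flow, inputs, ok := Match(routes, "GET", "/item/1")
	fmt.Println(flow, inputs, ok)
}
```

Adding a route then means adding one more Route entry to the configuration, with no change to the matching code.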

Service Callers

The service callers get signaled by the flows to send a request to the actual microservice and to wait for its response. Different protocols have different callers.
A caller can be generic, like an HTTP JSON caller that is able to communicate with any HTTP service that accepts JSON requests, or a caller can be specific, like a caller that uses a protobuf definition to communicate with the service.
The type of caller can be chosen depending on the project requirements (e.g. tight coupling of a service with the orchestrator for security reasons) and the situation at hand (e.g. the availability of existing services).
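
A generic HTTP JSON caller might look roughly like the Go sketch below, built only on the standard library. HTTPJSONCaller is a hypothetical type, and appending the method name to the service address is an assumption about URL layout made for this example. The demo function uses a throwaway httptest server in place of a real microservice.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// HTTPJSONCaller is a hypothetical generic caller: it POSTs the request
// parameters as JSON and decodes a JSON object from the response.
type HTTPJSONCaller struct {
	Address string
	Client  *http.Client
}

// Call sends params to <Address>/<method> and returns the decoded reply.
func (c *HTTPJSONCaller) Call(method string, params map[string]any) (map[string]any, error) {
	body, err := json.Marshal(params)
	if err != nil {
		return nil, fmt.Errorf("encode request: %w", err)
	}
	url := strings.TrimRight(c.Address, "/") + "/" + method
	resp, err := c.Client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("call %s: %w", method, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("call %s: unexpected status %d", method, resp.StatusCode)
	}
	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	return out, nil
}

// demo starts a local test server that plays the microservice and calls it.
func demo() (map[string]any, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var in map[string]any
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]any{"name": "item " + fmt.Sprint(in["id"])})
	}))
	defer srv.Close()

	caller := &HTTPJSONCaller{Address: srv.URL, Client: srv.Client()}
	return caller.Call("Item.Details", map[string]any{"id": "1"})
}

func main() {
	out, err := demo()
	fmt.Println(out, err)
}
```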

Roadmap / Food for thought

The described flows are currently implemented and being tested internally.
Naturally, we notice that with some additional features, this can become an even better orchestrator. Some of these are described in this final part of the article.

Realtime Communication

Providing a channel instead of the actual results opens up new possibilities with our backend. Things like updating graphs, health/status overviews, application/website activity are just some of the potential features that come to mind.

Besides these trivial examples, it is also nice for long-running jobs, like a request to count all the money you earned with your application.
Usually, requests are not meant to stay open for a couple of minutes (yes, we have very fast hardware, so counting all that money does not take weeks). An open channel, however, is perfect for reporting back with the results at a later time.
From a human’s point of view, waiting even a minute might be too long. The orchestrator, however, is likely to be used by other applications and services that don’t mind the wait.

The open channel connects the client to the (backend) service, via the orchestrator or directly. This decision is influenced by whether:

  • the data needs to be filtered (security, client interests, etc.)
  • the service actually supports real-time communication
  • the service is a message queue
  • etc.

Convenience aside, this reduces the load on the backend and the orchestrator, because updates are now pushed instead of the backend getting polled over and over again. As a result, the end user has a more pleasant experience.
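
The push idea maps nicely onto Go channels. The sketch below is only an in-process illustration with made-up names (Update, CountMoney); the transport towards a real client (web sockets, for instance) is left out entirely.

```go
package main

import "fmt"

// Update is a hypothetical progress message pushed to the client.
type Update struct {
	Total int
	Done  bool
}

// CountMoney is a stand-in for a long-running job. Instead of returning
// one final result, it returns a channel and pushes an update after every
// processed amount. The channel is closed when the job has finished.
func CountMoney(amounts []int) <-chan Update {
	updates := make(chan Update)
	go func() {
		defer close(updates)
		total := 0
		for i, a := range amounts {
			total += a
			updates <- Update{Total: total, Done: i == len(amounts)-1}
		}
	}()
	return updates
}

func main() {
	// The consumer just waits for pushes; nobody polls the job.
	for u := range CountMoney([]int{10, 20, 30}) {
		fmt.Printf("total so far: %d (done: %v)\n", u.Total, u.Done)
	}
}
```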

Benchmarking

Looking at our design, simplicity, and mad coding skills, we think that our orchestrator will perform very nicely.
But our CTO (and our other development teams as well I guess) wants proof… (where is the trust in your (fellow) developers nowadays?!)
So some benchmarks are planned to measure performance, robustness, scalability, etc.
We can’t provide these details yet, since this is still in the planning phase. We do promise to have a nice article about them when they become available, though.
For now, you have to put your trust in our skills (or keep checking this blog for the proof if you must, you unbeliever… or drop by at the office for some coffee).

Atomicity / Transactions

This mechanism states that either all service calls should be successful or nothing should have been changed. By introducing it, we ensure our microservice environment does not break due to the distributed nature of the architecture.

For example, if a flow consists of 3 calls, the first one succeeds and the second fails: the first one needs to be rolled back (or not made permanent yet) and the third should not get called.
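
One way to get the behaviour from this example is to pair every call with a compensating action and undo the completed calls in reverse order when a later one fails. Note that this is a sketch of that general technique, not a description of what the orchestrator does; the Step type and RunAtomic are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is a hypothetical call with a compensating action that reverts it.
type Step struct {
	Name string
	Do   func() error
	Undo func() error
}

// RunAtomic runs the steps in order. When a step fails, the steps that
// already succeeded are undone in reverse order and the remaining steps
// are never called.
func RunAtomic(steps []Step) error {
	for i, s := range steps {
		if err := s.Do(); err != nil {
			for j := i - 1; j >= 0; j-- {
				if uerr := steps[j].Undo(); uerr != nil {
					return fmt.Errorf("step %q failed (%v) and undoing %q failed: %w",
						s.Name, err, steps[j].Name, uerr)
				}
			}
			return fmt.Errorf("step %q failed, earlier steps undone: %w", s.Name, err)
		}
	}
	return nil
}

// demo runs a flow of 3 calls in which the second one fails and returns
// a trace of everything that was executed.
func demo() ([]string, error) {
	var trace []string
	record := func(msg string) func() error {
		return func() error { trace = append(trace, msg); return nil }
	}
	steps := []Step{
		{Name: "first", Do: record("do:first"), Undo: record("undo:first")},
		{Name: "second", Do: func() error { return errors.New("boom") }, Undo: record("undo:second")},
		{Name: "third", Do: record("do:third"), Undo: record("undo:third")},
	}
	err := RunAtomic(steps)
	return trace, err
}

func main() {
	trace, err := demo()
	fmt.Println(trace)
	fmt.Println(err)
}
```

The weak spot is visible in the code: an Undo can fail as well, which is exactly where the harder questions start.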

This is a very complex challenge for which we have some ideas, varying from manually fixing the database (a suitable solution since nothing ever goes wrong /mad coding skills FTW/, so this does not require an army of drones or any other form of overhead) to fancy consensus algorithms (lots of overhead…).
The first is implemented already (including a few drones, but they only fly, which is fine for now); the latter will be implemented only if none of the other solutions work as required.

A more likely, intermediate approach will be to make our platform robust enough to deal with partly changed data without breaking down.
This can be done by updating a temporary database/table and switching over once all services have reported success… or by making the changes in such a way that they are ‘backwards compatible’. As you see, we are currently playing with a few ideas.

Monitoring, Logging & Tracing

Keeping track of the usage of our backend, providing information when something fails (unlikely with our coding skills, but the CTO wants it…), and responding to undesired scenarios are all important. Now, the microservice orchestrator is basically the gateway between the internal cloud and the public world. This makes it a very convenient place to add these kinds of features. You won’t be surprised to learn this is exactly what we intend to do.

The incoming tasks in the Logic layer use the same (internal) format, so logging them in a specific format is easy. If needed, the actual requests and potential misbehavior can be monitored in the API modules. Later these log(event)s can be used to automatically respond to misbehavior. This can vary from limiting the rate/throughput to denying access completely.

This format can also be used to add tracing information, so all log messages of concurrent requests and actions can be easily filtered per request. We expect this to be an awesome time saver when debugging some (rare) issue in this concurrent environment.
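
Filtering log messages per request could work along the lines of the Go sketch below: a request ID travels in a context.Context and every log entry is tagged with it. The Log type, the in-memory storage, and the ID values are assumptions for this example only.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type ctxKey struct{}

// WithRequestID stores a request ID in the context.
func WithRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// RequestID returns the ID stored in the context, or "" when there is none.
func RequestID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// Entry is one log message tagged with the request it belongs to.
type Entry struct {
	RequestID string
	Message   string
}

// Log is a hypothetical in-memory log that is safe for concurrent use.
type Log struct {
	mu      sync.Mutex
	entries []Entry
}

// Printf records a message together with the request ID from ctx.
func (l *Log) Printf(ctx context.Context, format string, args ...any) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, Entry{RequestID(ctx), fmt.Sprintf(format, args...)})
}

// ForRequest returns only the messages that belong to one request.
func (l *Log) ForRequest(id string) []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	var out []string
	for _, e := range l.entries {
		if e.RequestID == id {
			out = append(out, e.Message)
		}
	}
	return out
}

// demo logs from two concurrent requests into the same log.
func demo() *Log {
	l := &Log{}
	var wg sync.WaitGroup
	for _, id := range []string{"req-1", "req-2"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			ctx := WithRequestID(context.Background(), id)
			l.Printf(ctx, "flow %q started", "item details")
			l.Printf(ctx, "flow %q finished", "item details")
		}(id)
	}
	wg.Wait()
	return l
}

func main() {
	fmt.Println(demo().ForRequest("req-1"))
}
```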

Last but not least, this information can be used to show the application usage on our dashboards in the office. This will keep the developers even more motivated…

Conclusion

If you resisted falling asleep and are still somehow reading this, I have an insider’s secret for you — Make sure to keep an eye on this blog, as we might open-source our microservice orchestrator for everyone to use (soon, but not a moment sooner ;-) )