40X Better Performance Thanks to Goroutine and Channel

Indra Saputra
Published in Inside Bukalapak
3 min read · Oct 18, 2018


This is the story of how we optimized one of our processes. When we first built it, the process took at least 10 minutes to do its job. Later, we used goroutines and channels to optimize it. The result was amazing: it now takes only 15 seconds on average, around 40 times better!

Background

In my company, one of the many things that must be obeyed is microservice standardization. We have many standards: some are technical, while others concern documentation. Both are important. The former help us build robust, reliable, resilient, and performant services. The latter help us understand what we have built, from a simple contributing guide to instructions on how to use the service.

Every deployed microservice must implement all of the standards. Some of our standards are metrics, logs, CI/CD, documentation (README.md), and resource limits. Unfortunately, the standards were written after we had migrated to a microservice architecture, so some deployed microservices may not implement them. For example, the standards say that a microservice must have these six fields in its request log: request_id, duration, message, actor, full_path, and host.

A standard is a standard: we must implement it to make our lives better. So, we began to collect the "nonstandard microservices". Since we have so many microservices and standards, checking them manually would be quite painful. We automated it!

The Program v1.0.0

Let's use the five standards mentioned above as the standards we need to check. The microservices are deployed on Kubernetes. Prometheus records the metrics. Logs are written to the ELK Stack. Each project's documentation (README.md) is hosted on GitHub. GitLab CI is our CI/CD platform. The resource limit definition lives in deployment.yml.

The good news is that all of these stacks provide APIs we can use to retrieve the data we need.

So, let’s get down to the code.

We expose a CheckStandards method and call it from the main program.
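The embedded snippets from the original post did not survive, so here is a minimal sketch of what the sequential version might have looked like. All type and function names, and the stubbed-out checks, are my assumptions rather than the actual Bukalapak code; in reality each check would make an HTTP call to Prometheus, ELK, GitLab CI, GitHub, or Kubernetes.

```go
package main

import "fmt"

// Service identifies a deployed microservice.
type Service struct{ Name string }

// Result reports which standards a service fails to implement.
type Result struct {
	Service string
	Failed  []string
}

// Stubs standing in for the real API calls to each dependency.
// checkLog returns false here so the example has something to report.
func checkMetric(s Service) bool        { return true }
func checkLog(s Service) bool           { return false }
func checkCICD(s Service) bool          { return true }
func checkDocumentation(s Service) bool { return true }
func checkResourceLimit(s Service) bool { return true }

// CheckStandards checks one service against the five standards
// and returns the list of standards it fails.
func CheckStandards(s Service) Result {
	res := Result{Service: s.Name}
	checks := []struct {
		name  string
		check func(Service) bool
	}{
		{"metric", checkMetric},
		{"log", checkLog},
		{"ci/cd", checkCICD},
		{"documentation", checkDocumentation},
		{"resource limit", checkResourceLimit},
	}
	for _, c := range checks {
		if !c.check(s) {
			res.Failed = append(res.Failed, c.name)
		}
	}
	return res
}

func main() {
	services := []Service{{"payment"}, {"product"}, {"search"}}
	// v1.0.0: check every service one by one, sequentially.
	for _, s := range services {
		res := CheckStandards(s)
		if len(res.Failed) > 0 {
			fmt.Printf("%s is nonstandard: %v\n", res.Service, res.Failed)
		}
	}
}
```

With hundreds of services and several HTTP calls per service, this sequential loop is where the 10 minutes went.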

That was our first try. We ran the program and got the result: it took at least 10 minutes.

We have hundreds of microservices running in production, and checking a service requires sending an HTTP request to each dependency's API. That was tolerable as long as we only needed to check everything once.

Weeks later, the requirement changed: the program should be able to run anytime we want. When you want the list of nonstandard microservices, waiting at least 10 minutes is tedious.

In Goroutine and Channel We Trust

We had an idea: each service could be checked concurrently instead of sequentially. Thanks to Golang, concurrency is easy. We used goroutines for concurrency and channels for communication.

We modified the CheckStandards method so that it takes a channel. The method no longer needs to return a value; it only needs to send its result to the channel. Here is the code.
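Again, the original embedded code is missing, so this is a sketch of the concurrent version under the same assumed names as before: CheckStandards now receives a channel and sends its result into it, main spawns one goroutine per service, and then receives exactly one result per service.

```go
package main

import "fmt"

// Service identifies a deployed microservice.
type Service struct{ Name string }

// Result reports which standards a service fails to implement.
type Result struct {
	Service string
	Failed  []string
}

// Stubs standing in for the real API calls to each dependency.
func checkMetric(s Service) bool { return true }
func checkLog(s Service) bool    { return false }

// CheckStandards no longer returns its result: it sends it to the
// channel, so each service can be checked in its own goroutine.
func CheckStandards(s Service, out chan<- Result) {
	res := Result{Service: s.Name}
	if !checkMetric(s) {
		res.Failed = append(res.Failed, "metric")
	}
	if !checkLog(s) {
		res.Failed = append(res.Failed, "log")
	}
	// ... the remaining standards are checked the same way ...
	out <- res
}

func main() {
	services := []Service{{"payment"}, {"product"}, {"search"}}
	out := make(chan Result)

	// Spawn one goroutine per service: all checks now run concurrently.
	for _, s := range services {
		go CheckStandards(s, out)
	}

	// Receive exactly one result per service; the receive blocks
	// until some goroutine has sent its result.
	for range services {
		res := <-out
		if len(res.Failed) > 0 {
			fmt.Printf("%s is nonstandard: %v\n", res.Service, res.Failed)
		}
	}
}
```

Because the results arrive in whatever order the goroutines finish, the total wall-clock time is roughly that of the slowest single service instead of the sum of all of them.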

We ran the program and the result was amazing: it took 15 seconds on average!

Conclusion

I have read many articles about concurrency in Golang. In this case, I didn't initially think I would use goroutines, for a couple of reasons. First, we only needed to run the program once. Second, the dependencies didn't break even when we accessed them over and over.

But the requirement changed: the program needed to run at any time, and 10 minutes was no longer a tolerable number. We had to change the implementation. At this point, goroutines and channels came to the rescue.

If performance matters, concurrency should be one of the things to consider. And when it comes to concurrency, I think Golang provides some good tools: goroutines and channels are easy to learn and use.

That said, the solution above is not the best way to use goroutines and channels, especially for a high number of requests: we can't control how many goroutines we spawn. It works well here only because the number of services is still small (hundreds).

Then, what about handling thousands or even millions of requests? Well, that should be another story to tell in a different post :D
