How we fine-tuned our Go application's performance

Sam Wang
honestbee-tw-engineering
4 min read · Mar 29, 2019

Before the article starts, I want to announce that honestbee has just open-sourced our Go worker-pool library, jobq. It's a very easy-to-use library that can save you time maintaining goroutines and channels. Feel free to comment, report issues, and contribute to make it better :)

The increased response time

In the honestbee app, one of the most important parts is address search, because it's the entry point to our services. To create more flexibility and convenience, we built a service called “Atlas” to act as a delegation server for postal code search and Google Maps API calls.

Recently we found that response time increased by around 100ms after we modified some code to use a session_token, following the new Google Place Autocomplete policy.

Average response time increased by 100ms

Then, as usual, the first response was to check what we did wrong in the code modification, and the suspect was definitely the session_token addition: one is generated for each API call to Atlas, and the token itself is a UUID created from a random source. On Unix-like systems, the random reader reads from /dev/urandom, which performs I/O.

Is that the root cause?

Solving a performance issue is always like a crime scene investigation: you always need to dig deeper. We found it was actually not caused by the UUID generation, because that only takes an extra 300 nanoseconds and very little memory.
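To rule out the UUID suspect, it helps to measure it directly. Below is a minimal sketch (not our production code) that builds a random version-4 UUID from crypto/rand — which reads from /dev/urandom on Unix-like systems — and times the average cost per token:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"time"
)

// newUUIDv4 builds a random (version 4) UUID string from crypto/rand.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4 bits
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	const n = 10000
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := newUUIDv4(); err != nil {
			panic(err)
		}
	}
	// The per-UUID cost is typically sub-microsecond, far too small to
	// explain a 100ms regression on its own.
	fmt.Printf("avg per UUID: %v\n", time.Since(start)/n)
}
```

A proper `testing.B` benchmark would give more rigorous numbers, but even this rough timing is enough to clear the UUID generation as the root cause.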

GC pause time increased
GC pause fraction raised

Based on the abnormal GC pause graphs above, we can be sure that memory was being held for longer periods and the original GC cycle could no longer keep up. It's very obvious from the New Relic graph below (the yellow part).

Redis response latency increased

We then checked AWS ElastiCache and, indeed, swap usage had risen to almost 1.5GB, lol. That's because the instance was running short of physical memory, so the system decided to use swap to compensate.

The Recovery

We decided to increase the memory size from cache.t2.small to cache.m1.medium, and also to adjust the GOGC percentage from 100 to 400, because during this period honestbee had also enabled SG search using text-based content and JP postal code search to improve accuracy, which increased the traffic (throughput) as well.

The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package’s SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.

It worked!! And the SLA was back to normal.

Response time recovered

Is it the end of the story?

No, because we could still observe long latency on Web external (the Google Maps API). After searching the code base, we found that the engineers had previously used a for loop to call the Google APIs and assemble the full results. That means a single search needs to make one Autocomplete request and then, based on the predictions, fetch the detailed lat/long and postal code from the Geocoding API — so the total latency is A+B+C.
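The sequential shape of the problem looks roughly like this sketch, where the three stub functions are hypothetical stand-ins (not the real client code) for the Autocomplete call (A), the place-detail lat/long lookup (B), and the postal code geocoding (C):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the three external calls; the sleeps
// simulate network latency to each Google API.
func autocomplete(q string) string  { time.Sleep(30 * time.Millisecond); return q + ":prediction" }
func placeDetail(p string) string   { time.Sleep(20 * time.Millisecond); return p + ":latlng" }
func geocodePostal(l string) string { time.Sleep(10 * time.Millisecond); return l + ":postal" }

func main() {
	start := time.Now()
	p := autocomplete("some address") // A
	l := placeDetail(p)               // B
	z := geocodePostal(l)             // C
	// Because each call waits for the previous one, the total is A+B+C.
	fmt.Println(z, "took about", time.Since(start).Round(10*time.Millisecond))
}
```

B genuinely depends on A's predictions, but when a search fans out into several independent predictions or query types, those branches do not need to run one after another.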

Solving it is actually quite simple: make the search-result gathering concurrent. This is where jobq comes in.

Set up the dispatcher with a limited size
Use jobs to handle the first level: the different types of address queries
The second level: reverse geocoding of the predictions

The reason we are not using dynamic adjustment on the jobq dispatcher is that each job also triggers I/O (Postgres & Redis); growing the pool automatically could exhaust file descriptors and lead to a panic.

With this approach, we successfully reduced the A+B+C time to a much flatter response time (depending on the latency of A, B, or C alone), saving around 1/3 of the response latency.

Conclusion

Monitoring performance is both interesting and tough, but once you solve a problem, you learn a lot from the experience. I did, this week.
