How we optimized our DNS server using go tools


background

Arash Cordi
Jan 11 · 18 min read

with tens of thousands of web sites using our authoritative dns server, we respond to millions of requests every day. dns attacks are getting more and more widespread these days, dns is a crucial part of our system and we must make sure that we can perform well under high pressure.

dnsflood is a small tool I found capable of producing huge number of udp requests.

# timeout 20s ./dnsflood example.com 127.0.0.1 -p 2053

monitoring our systems showed that memory usage of our service was growing so fast that we had stop our service or we would get into OOM errors. this looks like a memory-leak problem; there are different causes for “Kind-of” memory leaks and “Real ” memory leaks in go:

  • hanging goroutines
  • incorrect use of defer & finalizer
  • substring & subslices
  • global variables

this post has an in-dept explanation of different leak cases.

before jumping to any conclusions let’s do some profiling first

GODEBUG

different debug facilities can be enabled using GODEBUG environment variable by passing a list of comma seperated name=value pairs.

scheduler trace

scheduler trace can provide information about the runtime behaviour of goroutines. to enable scheduler trace, run the program with GODEBUG=schedtrace=100 the value means output period in ms.

$ GODEBUG=schedtrace=100 ./redins -c config.json
SCHED 2952ms: ... runqueue=3 [26 11 7 18 13 30 6 3 24 25 11 0]
SCHED 3053ms: ... runqueue=3 [0 0 0 0 0 0 0 0 4 0 21 0]
SCHED 3154ms: ... runqueue=0 [0 6 2 4 0 30 0 5 0 11 2 5]
SCHED 3255ms: ... runqueue=1 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 3355ms: ... runqueue=0 [1 0 0 0 0 0 0 0 0 0 0 0]
SCHED 3456ms: ... runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 3557ms: ... runqueue=0 [13 0 3 0 3 33 2 0 10 8 10 14]
SCHED 3657ms: ...runqueue=3 [14 1 0 5 19 54 9 1 0 1 29 0]
SCHED 3758ms: ... runqueue=0 [67 1 5 0 0 1 0 0 87 4 0 0]
SCHED 3859ms: ... runqueue=6 [0 0 3 6 0 0 0 0 3 2 2 19]
SCHED 3960ms: ... runqueue=0 [0 0 1 0 1 0 0 1 0 1 0 0]
SCHED 4060ms: ... runqueue=5 [4 0 5 0 1 0 0 0 0 0 0 0]
SCHED 4161ms: ... runqueue=0 [0 0 0 0 0 0 0 1 0 0 0 0]
SCHED 4262ms: ... runqueue=4 [0 128 21 172 1 19 8 2 43 5 139 37]
SCHED 4362ms: ... runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 4463ms: ... runqueue=6 [0 28 23 39 4 11 4 11 25 0 25 0]
SCHED 4564ms: ... runqueue=354 [51 45 33 32 15 20 8 7 5 42 6 0]

runqueue is the length of global runable goroutine queue. numbers in bracket are length of per process queue.

an ideal situation is where all processes are busy runnnig goroutines and a reasonable runqueue length evenly distributed between all processes:

SCHED 2449ms: gomaxprocs=12 idleprocs=0 threads=40 spinningthreads=1 idlethreads=1 runqueue=20 [20 20 20 20 20 20 20 20 20 20 20]

looking at our schedtrace output we can see there are time frames when almost all processes are idle. this means we are not utilizing full cpu power.

garbage collector trace

To enable the garbage collector (GC) trace, run the program with GODEBUG=gctrace=1 environment variable:

GODEBUG=gctrace=1 ./redins -c config1.json
.
.
.
gc 30 @3.727s 1%: 0.066+21+0.093 ms clock, 0.79+128/59/0+1.1 ms cpu, 67->71->45 MB, 76 MB goal, 12 P
gc 31 @3.784s 2%: 0.030+27+0.053 ms clock, 0.36+177/81/7.8+0.63 ms cpu, 79->84->55 MB, 90 MB goal, 12 P
gc 32 @3.858s 3%: 0.026+34+0.024 ms clock, 0.32+234/104/0+0.29 ms cpu, 96->100->65 MB, 110 MB goal, 12 P
gc 33 @3.954s 3%: 0.026+44+0.13 ms clock, 0.32+191/131/57+1.6 ms cpu, 117->123->79 MB, 131 MB goal, 12 P
gc 34 @4.077s 4%: 0.010+53+0.024 ms clock, 0.12+241/159/69+0.29 ms cpu, 142->147->91 MB, 158 MB goal, 12 P
gc 35 @4.228s 5%: 0.017+61+0.12 ms clock, 0.20+296/179/94+1.5 ms cpu, 166->174->105 MB, 182 MB goal, 12 P
gc 36 @4.391s 6%: 0.017+73+0.086 ms clock, 0.21+492/216/4.0+1.0 ms cpu, 191->198->122 MB, 210 MB goal, 12 P
gc 37 @4.590s 7%: 0.049+85+0.095 ms clock, 0.59+618/253/0+1.1 ms cpu, 222->230->140 MB, 244 MB goal, 12 P
.
.
.

as we can see here, inuse memory is increasing and the amount of time needed for gc to do it’s job is also increasing. this means that we are consuming more memory than gc can process.

more on GODEBUG and some other golang env variables here.

enabling the profiler

go tool pprof is the tool for analyzing and profiling data. there are two ways to set up pprof either by directly calling runtime/pprof functions like pprof.StartCPUProfile() in your code or setting up net/http/pprof http listener and getting data from there which is the way we used. pprof has very little overhead so it is safe to use in production, but profile endpoint should not be exposed publicly because they may reveal sensitive data.

for the second option all we need to to do is import “net/http/pprof” package:

import (
_ "net/http/pprof"
)

then add a http listener:

go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()

pprof has several default profiles:

  • allocs: A sampling of all past memory allocations
  • block: Stack traces that led to blocking on synchronization primitives
  • goroutine: Stack traces of all current goroutines
  • heap: A sampling of memory allocations of live objects.
  • mutex: Stack traces of holders of contended mutexes
  • profile: CPU profile.
  • threadcreate: Stack traces that led to the creation of new OS threads
  • trace: A trace of execution of the current program.

note: trace endpoint unlike all other endpoints is a trace profile not a pprof profile, you can view it using go tool trace instead of go tool pprof.

now that we have our tools up and running we can look at the available tools.

cpu profiler

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍$ go tool pprof http://localhost:6060/debug/pprof/profile?seconds=10

cpu profiler runs for 30 seconds by default (we can change this with seconds parameter) and collect samples every 100 milliseconds then it enters the interactive mode. there are several commands available the most common are top, list, web

use top n to view top hottest entries in text format, there are two options to sort the output, -cum for cumulative order and -flat

(pprof) top 10 -cum
Showing nodes accounting for 1.50s, 6.19% of 24.23s total
Dropped 347 nodes (cum <= 0.12s)
Showing top 10 nodes out of 186
flat flat% sum% cum cum%
0.03s 0.12% 0.12% 16.7s 69.13% (*Server).serveUDPPacket
0.05s 0.21% 0.33% 15.6s 64.51% (*Server).serveDNS
0 0% 0.33% 14.3s 59.10% (*ServeMux).ServeDNS
0 0% 0.33% 14.2s 58.73% HandlerFunc.ServeDNS
0.01s 0.04% 0.37% 14.2s 58.73% main.handleRequest
0.07s 0.29% 0.66% 13.5s 56.00% (*DnsRequestHandler).HandleRequest
0.99s 4.09% 4.75% 7.56s 31.20% runtime.gentraceback
0.02s 0.08% 4.83% 7.02s 28.97% runtime.systemstack
0.31s 1.28% 6.11% 6.62s 27.32% runtime.mallocgc
0.02s 0.08% 6.19% 6.35s 26.21% (*DnsRequestHandler).FindANAME
(pprof)

use list to investigate a single function.

(pprof) list handleRequest
Total: 24.23s
ROUTINE ======================== main.handleRequest in /home/arash/go/src/arvancloud/redins/redins.go
10ms 14.23s (flat, cum) 58.73% of Total
. . 35: l *handler.RateLimiter
. . 36: configFile string
. . 37:)
. . 38:
. . 39:func handleRequest(w dns.ResponseWriter, r *dns.Msg) {
10ms 610ms 40: context := handler.NewRequestContext(w, r)
. 50ms 41: logger.Default.Debugf("handle request: [%d] %s %s", r.Id, context.RawName(), context.Type())
. . 42:
. . 43: if l.CanHandle(context.IP()) {
. 13.57s 44: h.HandleRequest(context)
. . 45: } else {
. . 46: context.Response(dns.RcodeRefused)
. . 47: }
. . 48:}
. . 49:
(pprof)

web command can be used to generate a SVG graph of hotsopts and opens it in a browser.

(pprof)web handleRequest

having lots of time spent in GC functions like runtime.mallocgc often means heavy allocation, which can put extra pressure on garbage collector and increase latency.

having lots of time spent on synchronization mechanisms like runtime.chansend or runtime.lock could be a sign of contention

having lots of time spent in syscall.Read/Write mean exsessive use of io operations

memory profiler

$ go tool pprof http://localhost:6060/debug/pprof/allocs

by default it shows lifetime allocated memory. we can see number of allocated objects using -alloc_objects, other useful options are -iuse_objects and -inuse_space for inspecting live memory.

generally if you want to reduce memory consumption you need to look at -inuse_space but if you want to improve latency look at -alloc_objects after a suficient execution time/load.

identifying the bottleneck

it is important to first identify the type of bottleneck (cpu, io, memory) we are dealing with. aside from profilers, there’s another kind of tool available.

go tool trace can show what our goroutines are really doing in details. to collect trace sample we need to send a http request to trace endpoint:

$ curl http://localhost:6060/debug/pprof/trace?seconds=5 --output trace.out

generated file can be viewed using trace tool:

$ go tool trace trace.out
2019/12/25 15:30:50 Parsing trace...
2019/12/25 15:30:59 Splitting trace...
2019/12/25 15:31:10 Opening browser. Trace viewer is listening on http://127.0.0.1:42703

go tool trace is a web app which uses Chrome DevTools protocol, and only compatible with chrome browsers. main page looks something similar to this :

View trace (0s-409.575266ms)
View trace (411.075559ms-747.252311ms)
View trace (747.252311ms-1.234968945s)
View trace (1.234968945s-1.774245108s)
View trace (1.774245484s-2.111339514s)
View trace (2.111339514s-2.674030898s)
View trace (2.674031362s-3.044145798s)
View trace (3.044145798s-3.458795252s)
View trace (3.43953778s-4.075080634s)
View trace (4.075081098s-4.439271287s)
View trace (4.439271635s-4.814869651s)
View trace (4.814869651s-5.253597835s)

Goroutine analysis
Network blocking profile ()
Synchronization blocking profile ()
Syscall blocking profile ()
Scheduler latency profile ()
User-defined tasks
User-defined regions
Minimum mutator utilization

trace splits trace time to make sure your browser can handle it.

there are myriads of data here which makes it almost decipherable if we don’t now what we are looking for. let’s leave it for now.

the next link on the main page is “goroutine analysis” which shows different kind of goroutines running in the program during the trace period:

Goroutines:
github.com/miekg/dns.(*Server).serveUDPPacket N=441703
runtime.gcBgMarkWorker N=12
github.com/karlseguin/ccache.(*Cache).worker N=2
main.Start.func1 N=1
runtime.bgsweep N=1
arvancloud/redins/handler.NewHandler.func2 N=1
runtime/trace.Start.func1 N=1
net/http.(*conn).serve N=1
runtime.timerproc N=3
net/http.(*connReader).backgroundRead N=1
N=40

click on the first item with N=441703 and this is what we get:

goroutines analysis

this is very interesting. most of goroutines spend almost no time in execution and most of the time is spend in Sync block. let’s have a closer look at one of them:

goroutine trace diagram

it appears our goroutine is almost always inactive waiting on a rwlock. from here we can directly go to block tool; blocking profile is disabled by default, we first need to enable it in our code:

runtime.SetBlockProfileRate(1)

now we can get our block samples:

$ go tool pprof http://localhost:6060/debug/pprof/block
(pprof) top
Showing nodes accounting for 16.03wks, 99.75% of 16.07wks total
Dropped 87 nodes (cum <= 0.08wks)
Showing top 10 nodes out of 27
flat flat% sum% cum cum%
10.78wks 67.08% 67.08% 10.78wks 67.08% internal/poll.(*fdMutex).rwlock
5.25wks 32.67% 99.75% 5.25wks 32.67% sync.(*Mutex).Lock
0 0% 99.75% 5.25wks 32.67% arvancloud/redins/handler.(*DnsRequestHandler).Filter
0 0% 99.75% 5.25wks 32.68% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME
0 0% 99.75% 16.04wks 99.81% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest
0 0% 99.75% 10.78wks 67.08% arvancloud/redins/handler.(*DnsRequestHandler).Response
0 0% 99.75% 10.78wks 67.08% arvancloud/redins/handler.(*RequestContext).Response
0 0% 99.75% 5.25wks 32.67% arvancloud/redins/handler.ChooseIp
0 0% 99.75% 16.04wks 99.81% github.com/miekg/dns.(*ServeMux).ServeDNS
0 0% 99.75% 16.04wks 99.81% github.com/miekg/dns.(*Server).serveDNS
(pprof)

here we have two different lock (poll.fdMutex and sync.Mutex) responsible for almost 100% of blocks. this verifies our guess about lock contention, now we only need to find were these happens:

(pprof) svg lock

this command creates a vector graph of all nodes accounting for contention with focus on lock functions:

block endpoint svg graph

we can get the same result from our goroutine endpoint:

$ go tool pprof http://localhost:6060/debug/pprof/goroutine

and then:

(pprof) top
Showing nodes accounting for 294412, 100% of 294424 total
Dropped 84 nodes (cum <= 1472)
Showing top 10 nodes out of 32
flat flat% sum% cum cum%
294404 100% 100% 294404 100% runtime.gopark
8 0.0027% 100% 294405 100% github.com/miekg/dns.(*Server).serveUDPPacket
0 0% 100% 58257 19.79% arvancloud/redins/handler.(*DnsRequestHandler).Filter
0 0% 100% 58259 19.79% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME
0 0% 100% 293852 99.81% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest
0 0% 100% 235406 79.95% arvancloud/redins/handler.(*DnsRequestHandler).Response
0 0% 100% 235406 79.95% arvancloud/redins/handler.(*RequestContext).Response
0 0% 100% 58140 19.75% arvancloud/redins/handler.ChooseIp
0 0% 100% 293852 99.81% github.com/miekg/dns.(*ServeMux).ServeDNS
0 0% 100% 293900 99.82% github.com/miekg/dns.(*Server).serveDNS
(pprof)

almost all our goroutines are staying at runtime.gopark, this is go scheduler putting goroutines to sleep; a very common cause for this to happen is lock contention

(pprof) svg gopark
goroutines endpoint svg graph

here we have our two sources of contention:

UDPConn.WriteMsg()

looks like all responses ended up writing to the same fd (hence the lock), this makes sense because they all have the same source address.

we did a little experiment on different solutions, at the end we decided to use multiple listeners to balance the load, this way we let OS balance incoming request between different connection and reduce contention.

Rand()

looks like there’s a lock underneath generic math/rand functions (more on this here). this can be easily fixed using Rand.New() which creates a random generator without the locking wrapper

rg := rand.New(rand.NewSource(int64(time.Now().Nanosecond())))

this is a little better but creating a new source every time is expensive. can we do better? in our case we don’t really need random number. we just need an even distribution for our load balancing and it turns out that Time.Nanoseconds() can do just fine.

now that we have eliminated all extra blockings. let’s see the results:

goroutine analysis

looks better but still most of time is spent on sync block. let’s take a look at synchronization blocking profile from trace UI main page:

sync block time graph

let’s take a look at ccache promote function from pprof’s block endpoint:

(pprof) list promote
ROUTINE ======================== github.com/karlseguin/ccache.(*Cache).promote in ...
0 9.66days (flat, cum) 99.70% of Total
. . 155: h.Write([]byte(key))
. . 156: return c.buckets[h.Sum32()&c.bucketMask]
. . 157:}
. . 158:
. . 159:func (c *Cache) promote(item *Item) {
. 9.66days 160: c.promotables <- item
. . 161:}
. . 162:
. . 163:func (c *Cache) worker() {
. . 164: defer close(c.donec)
. . 165:

all ccache.Get() calls end up sending to a single c.promotables channel. cache being a crucial part of our service, we have to consider other options; Dgraph has an excellent article about state of cache in go, they also have an excellent cache module called ristretto. unfortunately ristretto doesn’t support Ttl based eviction yet, we could workaround this problem by using a very large MaxCost and keep a timeout value in our cache struct (we wish to keep stale data in cache). let’s see the result using ristretto:

goroutine analysis

Great!

we managed to reduce maximum goroutine runtime from 5000 ms to 22 ms. still most of the execution time is split between “sync block” and “”scheduler wait”. let’s see if there’s anything we can do about that:

sync block time graph

there’s little more we can do about fdMutex.rwlock let’s now focus on the other one : gcMarkDone which is responsible for 53% of block time. this function is part of go garbage collection process. having these on hot spot are often a sign that we are putting a lot of pressure on gc.

alloc optimization

at this point it may be useful to see how go garbage collection works; go uses a tri-color mark and sweep collector. it keeps tracking of everything that’s allocated and once it reaches twice (or whatever value GOGC is set to) the size of previous size, GC sweep starts. mark happens in three phases:

  • Mark Setup (STW)
  • Marking (concurrent)
  • Mark Termination (STW)

Stop The World phases put the entire exectution to halt, although they’re generally very short, tight loops can extend their duration. that’s because currently (go v1.13) goroutines are only preemptible at function call points, thus it is possible for a tight loop to cause arbitrarily long pause time since GC waits for all goroutines to stop.

during marking gc utilizes about 25% of GOMAXPROCS but additional goroutines can be forced into mark assist, this happens when a rapidly allocating goroutine outruns the background marker, in order to reduce the latency caused by gc we need minimize heap usage; there are two things to note:

  1. number of allocations matters more than size (e.g 1000 allocations of a 20 bytes struct puts much more stress on heap than a single 20000 bytes allocation)
  2. unlike languages like C/C++, not all allocations end up in heap. go compiler decides whether a variable escapes to heap or it can be allocated inside stack frame. unlike heap-allocated variables, stack-allocated variables do not put pressure on gc

for more information on go memory model and gc design check this presentation.

for alloc optimization we use a collection of go tools:

  • cpu profiler to find hot allocations
  • memory profiler to track allocations
  • tracer for GC patterns
  • escape analysis to find why allocations happen

let’s start with cpu profiler:

$ go tool pprof http://localhost:6060/debug/pprof/profile?seconds=20
(pprof) top 20 -cum
Showing nodes accounting for 7.27s, 29.10% of 24.98s total
Dropped 315 nodes (cum <= 0.12s)
Showing top 20 nodes out of 201
flat flat% sum% cum cum%
0 0% 0% 16.42s 65.73% github.com/miekg/dns.(*Server).serveUDPPacket
0.02s 0.08% 0.08% 16.02s 64.13% github.com/miekg/dns.(*Server).serveDNS
0.02s 0.08% 0.16% 13.69s 54.80% github.com/miekg/dns.(*ServeMux).ServeDNS
0.01s 0.04% 0.2% 13.48s 53.96% github.com/miekg/dns.HandlerFunc.ServeDNS
0.02s 0.08% 0.28% 13.47s 53.92% main.handleRequest
0.24s 0.96% 1.24% 12.50s 50.04% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest
0.81s 3.24% 4.48% 6.91s 27.66% runtime.gentraceback
3.82s 15.29% 19.78% 5.48s 21.94% syscall.Syscall
0.02s 0.08% 19.86% 5.44s 21.78% arvancloud/redins/handler.(*DnsRequestHandler).Response
0.06s 0.24% 20.10% 5.25s 21.02% arvancloud/redins/handler.(*RequestContext).Response
0.03s 0.12% 20.22% 4.97s 19.90% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME
0.56s 2.24% 22.46% 4.92s 19.70% runtime.mallocgc
0.07s 0.28% 22.74% 4.90s 19.62% github.com/miekg/dns.(*response).WriteMsg
0.04s 0.16% 22.90% 4.40s 17.61% github.com/miekg/dns.(*response).Write
0.01s 0.04% 22.94% 4.36s 17.45% github.com/miekg/dns.WriteToSessionUDP
1.43s 5.72% 28.66% 4.30s 17.21% runtime.pcvalue
0.01s 0.04% 28.70% 4.15s 16.61% runtime.newstack
0.06s 0.24% 28.94% 4.09s 16.37% runtime.copystack
0.01s 0.04% 28.98% 4.05s 16.21% net.(*UDPConn).WriteMsgUDP
0.03s 0.12% 29.10% 4.04s 16.17% net.(*UDPConn).writeMsg

we are particularly intereseted in functions linked with mallocgc, this is where mark assists happen

(pprof) svg mallocgc

we can track allocation using alloc endpoint, alloc_object option means total allocations count, there are other options for in use memory and allocation space.

$ go tool pprof -alloc_objects http://localhost:6060/debug/pprof/allocs
(pprof) top -cum
Active filters:
show=handler
Showing nodes accounting for 58464353, 59.78% of 97803168 total
Dropped 1 node (cum <= 489015)
Showing top 10 nodes out of 19
flat flat% sum% cum cum%
15401215 15.75% 15.75% 70279955 71.86% arvancloud/redins/handler.(*DnsRequestHandler).HandleRequest
2392100 2.45% 18.19% 27198697 27.81% arvancloud/redins/handler.(*DnsRequestHandler).FindANAME
711174 0.73% 18.92% 14936976 15.27% arvancloud/redins/handler.(*DnsRequestHandler).Filter
0 0% 18.92% 14161410 14.48% arvancloud/redins/handler.(*DnsRequestHandler).Response
14161410 14.48% 33.40% 14161410 14.48% arvancloud/redins/handler.(*RequestContext).Response
7284487 7.45% 40.85% 11118401 11.37% arvancloud/redins/handler.NewRequestContext
10439697 10.67% 51.52% 10439697 10.67% arvancloud/redins/handler.reverseZone
0 0% 51.52% 10371430 10.60% arvancloud/redins/handler.(*DnsRequestHandler).FindZone
2680723 2.74% 54.26% 8022046 8.20% arvancloud/redins/handler.(*GeoIp).GetSameCountry
5393547 5.51% 59.78% 5393547 5.51% arvancloud/redins/handler.(*DnsRequestHandler).LoadLocation

from here on we can use list on each function and see if we can reduce memory allocation. let’s check a few :

printf-like functions

(pprof) list handleRequest
Total: 97803168
ROUTINE ======================== main.handleRequest in /home/arash/go/src/arvancloud/redins/redins.go
2555943 83954299 (flat, cum) 85.84% of Total
. . 35: l *handler.RateLimiter
. . 36: configFile string
. . 37:)
. . 38:
. . 39:func handleRequest(w dns.ResponseWriter, r *dns.Msg) {
. 11118401 40: context := handler.NewRequestContext(w, r)
2555943 2555943 41: logger.Default.Debugf("handle request: [%d] %s %s", r.Id, context.RawName(), context.Type())
. . 42:
. . 43: if l.CanHandle(context.IP()) {
. 70279955 44: h.HandleRequest(context)
. . 45: } else {
. . 46: context.Response(dns.RcodeRefused)
. . 47: }
. . 48:}
. . 49:

line 41 is particularly interesting, even when debug is disabled there’s still memory allocation, we can use escape analysis to investigate it more

go escape analysis tool is actually a compiler flag

$ go build -gcflags '-m'

you can add another -m for more information

$ go build -gcflags '-m '

for a better interface use view-annotated-file.

$ go build -gcflags '-m'
.
.
.
../redins.go:39:20: leaking param: w
./redins.go:39:42: leaking param: r
./redins.go:41:55: r.MsgHdr.Id escapes to heap
./redins.go:41:75: context.RawName() escapes to heap
./redins.go:41:91: context.Request.Type() escapes to heap
./redins.go:41:23: handleRequest []interface {} literal does not escape
./redins.go:219:17: leaking param: path
.
.
.

here, all Debugf parameters escape to heap. this happens because of the way Debugf is defined:

func (l *EventLogger) Debugf(format string, args ...interface{})

all args parameters convert to interface{} type which always escape to heap. we can either remove debug logs or use a zero allocation log library like zerolog.

for more information about escape analysis see “allocation eficiency in golang”.

string manipulation

(pprof) list reverseZone
Total: 100817064
ROUTINE ======================== arvancloud/redins/handler.reverseZone in /home/arash/go/src/arvancloud/redins/handler/handler.go
6127746 10379086 (flat, cum) 10.29% of Total
. . 396: logger.Default.Warning("log queue is full")
. . 397: }
. . 398:}
. . 399:
. . 400:func reverseZone(zone string) string {
. 4251340 401: x := strings.Split(zone, ".")
. . 402: var y string
. . 403: for i := len(x) - 1; i >= 0; i-- {
6127746 6127746 404: y += x[i] + "."
. . 405: }
. . 406: return y
. . 407:}
. . 408:
. . 409:func (h *DnsRequestHandler) LoadZones() {
(pprof)

Since Go’s string is immutable, making temporary string cause allocations. Beginning with Go 1.10 there is a strings.Builder to efficiently build a string.

(pprof) list reverseZone
Total: 93437002
ROUTINE ======================== arvancloud/redins/handler.reverseZone in /home/arash/go/src/arvancloud/redins/handler/handler.go
0 7580611 (flat, cum) 8.11% of Total
. . 396: logger.Default.Warning("log queue is full")
. . 397: }
. . 398:}
. . 399:
. . 400:func reverseZone(zone string) string {
. 3681140 401: x := strings.Split(zone, ".")
. . 402: var sb strings.Builder
. 3899471 403: sb.Grow(len(zone)+1)
. . 404: for i := len(x) - 1; i >= 0; i-- {
. . 405: sb.WriteString(x[i])
. . 406: sb.WriteByte('.')
. . 407: }
. . 408: return sb.String()

since we don’t care about the value of reversed string we can eliminate Split() by simply reversing the whole string.

(pprof) list reverseZone
Total: 89094296
ROUTINE ======================== arvancloud/redins/handler.reverseZone in /home/arash/go/src/arvancloud/redins/handler/handler.go
3801168 3801168 (flat, cum) 4.27% of Total
. . 400:func reverseZone(zone string) []byte {
. . 401: runes := []rune("." + zone)
. . 402: for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
. . 403: runes[i], runes[j] = runes[j], runes[i]
. . 404: }
3801168 3801168 405: return []byte(string(runes))
. . 406:}
. . 407:
. . 408:func (h *DnsRequestHandler) LoadZones() {
. . 409: h.LastZoneUpdate = time.Now()
. . 410: zones, err := h.Redis.SMembers("redins:zones")

more on string manipulation here.

sync.Pool

(pprof) list GetASN
Total: 69005282
ROUTINE ======================== arvancloud/redins/handler.(*GeoIp).GetASN in /home/arash/go/src/arvancloud/redins/handler/geoip.go
1146897 1146897 (flat, cum) 1.66% of Total
. . 231:func (g *GeoIp) GetASN(ip net.IP) (uint, error) {
1146897 1146897 232: var record struct {
. . 233: AutonomousSystemNumber uint `maxminddb:"autonomous_system_number"`

. . 234: }
. . 235: err := g.ASNDB.Lookup(ip, &record)
. . 236: if err != nil {
. . 237: logger.Default.Errorf("lookup failed : %s", err)
(pprof) list GetGeoLocation
Total: 69005282
ROUTINE ======================== arvancloud/redins/handler.(*GeoIp).GetGeoLocation in /home/arash/go/src/arvancloud/redins/handler/geoip.go
1376298 3604572 (flat, cum) 5.22% of Total
. . 207:
. . 208:func (g *GeoIp) GetGeoLocation(ip net.IP) (latitude float64, longitude float64, country string, err error) {
. . 209: if !g.Enable || g.CountryDB == nil {
. . 210: return
. . 211: }
1376298 1376298 212: var record struct {
. . 213: Location struct {
. . 214: Latitude float64 `maxminddb:"latitude"`
. . 215: LongitudeOffset uintptr `maxminddb:"longitude"`
. . 216: } `maxminddb:"location"`
. . 217: Country struct {
. . 218: ISOCode string `maxminddb:"iso_code"`
. . 219: } `maxminddb:"country"`
. . 220: }

. . 221: // logger.Default.Debugf("ip : %s", ip)
. . 222: if err := g.CountryDB.Lookup(ip, &record); err != nil {
. . 223: logger.Default.Errorf("lookup failed : %s", err)
. . 224: return 0, 0, "", err
. . 225: }
. 2228274 226: _ = g.CountryDB.Decode(record.Location.LongitudeOffset, &longitude)
. . 227: // logger.Default.Debug("lat = ", record.Location.Latitude, " lang = ", longitude, " country = ", record.Country.ISOCode)
. . 228: return record.Location.Latitude, longitude, record.Country.ISOCode, nil
. . 229:}
. . 230:
. . 231:func (g *GeoIp) GetASN(ip net.IP) (uint, error) {

we use maxmiddb functions to get geolocation data. these functions take interface{} as parameters which as we saw earlier can cause heap escapes.

we can use sync.Pool to cache allocated but unused items for later reuse

type MMDBGeoLocation struct {
Coordinations struct {
Latitude float64 `maxminddb:"latitude"`
Longitude float64
LongitudeOffset uintptr `maxminddb:"longitude"`
} `maxminddb:"location"`
Country struct {
ISOCode string `maxminddb:"iso_code"`
} `maxminddb:"country"`
}

type MMDBASN struct {
AutonomousSystemNumber uint `maxminddb:"autonomous_system_number"`
}
func (g *GeoIp) GetGeoLocation(ip net.IP) (*MMDBGeoLocation, error) {
if !g.Enable || g.CountryDB == nil {
return nil, EMMDBNotAvailable
}
var record *MMDBGeoLocation = g.LocationPool.Get().(*MMDBGeoLocation)
logger.Default.Debugf("ip : %s", ip)
if err := g.CountryDB.Lookup(ip, &record); err != nil {
logger.Default.Errorf("lookup failed : %s", err)
return nil, err
}
_ = g.CountryDB.Decode(record.Coordinations.LongitudeOffset, &record.Coordinations.Longitude)
logger.Default.Debug("lat = ", record.Coordinations.Latitude, " lang = ", record.Coordinations.Longitude, " country = ", record.Country.ISOCode)
return record, nil
}

func (g *GeoIp) GetASN(ip net.IP) (uint, error) {
var record *MMDBASN = g.AsnPool.Get().(*MMDBASN)
err := g.ASNDB.Lookup(ip, record)
if err != nil {
logger.Default.Errorf("lookup failed : %s", err)
return 0, err
}
logger.Default.Debug("asn = ", record.AutonomousSystemNumber)
return record.AutonomousSystemNumber, nil
}

more on sync.Pool here.

there are many other small optimimazions possible, but at this point it seems we have done enough. for more information about memory optimization techniques check Allocation efficiency in high-performance Go services

results

two visualize the result of our memory optimizations we use a diagram in go tool trace called “Minimum Mutator Utilization”, here Mutator means not gc.

before optimizations:

here we have a window of about 500 milliseconds while we have almost no utilization(gc is consuming all resources) and we never get past 80% utilization in the long run. we want to have the zero utilization window as small as possible and get to 100% percent utilization as fast as possible, something like this:

after optimizations:

conclusion

by using go tools we managed to optimize our service to handle high load of requests and better utilize system recources.

you can check our source code on github. here is unoptimized version and here is the optimized version.

further readings

Arash Cordi

Written by

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade