Optimizing Elixir Phoenix Performance

TL;DR: How I upped my API performance from 15 to 1050 req/sec:

  • test in production mode (use MIX_ENV=prod)
  • remove unnecessary logging (use config :logger, level: :warn, compile_time_purge_level: :info)
  • use :observer.start() from IEx and look at the Processes tab
  • use :fprof from IEx for detailed call stack information (see below)
  • fix bugs

So I had written a REST API proxy using Phoenix. It accepted REST requests from a mobile app, decoded the attached JSON Web Token (JWT), pulled out the user ID, looked up the corresponding backend, and proxied the request to that backend, returning the result to the mobile app.
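In outline, the request handling looked something like this. This is a minimal sketch, not my actual code: Token and BackendRegistry are hypothetical stand-ins, and I'm calling hackney directly for the forwarding (only GET shown, for brevity):

defmodule RapiWeb.ProxyController do
  use RapiWeb, :controller

  # Token and BackendRegistry are hypothetical placeholders.
  def proxy(conn, _params) do
    with ["Bearer " <> jwt] <- get_req_header(conn, "authorization"),
         {:ok, user_id} <- Token.user_id(jwt),
         {:ok, backend_url} <- BackendRegistry.lookup(user_id),
         {:ok, status, _headers, ref} <-
           :hackney.request(:get, backend_url <> conn.request_path, [], ""),
         {:ok, body} <- :hackney.body(ref) do
      send_resp(conn, status, body)
    else
      _ -> send_resp(conn, 401, "unauthorized")
    end
  end
end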

That went quite well; writing it was a fun learning exercise. The server is now in ‘eat-your-own-dogfood’ test mode at my client’s company, and I figured it was looking good.

But while I had earlier benchmarked the API backend at 13,000 req/sec, my shiny Elixir Phoenix proxy could only do 15! Something clearly was wrong. So I decided to profile the darn thing. And here’s where the fun started.

First win

But first, I could immediately score a quick win: I am in Europe, and the backend servers were in AWS in the US. My local proxy had to go to the US for each backend request! So I replicated a backend server in a VM on my laptop, which upped the performance to 150 req/sec. But that wasn’t enough by far.

As it turned out, it’s not easy to find info on profiling a Phoenix app.

Development mode != production mode

My first Google hit was Profiling Elixir by Joseph Kain. It mentions ExProf and fprof. Joseph tried ExProf but found it lacking, so I skipped straight to fprof. Only… I could not tell from his post how to use it with Phoenix. It’s an Erlang tool, and the Erlang man pages were confusing to me; I did not know how to apply them to Elixir Phoenix.

Then on StackOverflow someone recommended observer, a graphical Erlang tool for inspecting what your program is doing. Run it as follows:

Rons-MBP:rapi raarts$ iex -S mix phx.server
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]
16:51:18.133 function=start_link/3 line=100 [info] Running RapiWeb.Endpoint with Cowboy using http://0.0.0.0:4000
Interactive Elixir (1.5.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> 16:51:20 - info: compiled 6 files into 2 files, copied 15 in 1.5 sec
iex(2)> :observer.start

And the observer window pops up:

The tab you’ll need the most is the Processes tab. In my testing it showed the following result (while I hammered my server with API requests):

Note the lines are ordered by Reds, which stands for ‘reductions’. This is an Erlang concept; it is not the same as CPU usage, but it’s close enough.
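(You can ask for a process’s reduction count yourself from IEx; the number below is just an illustration.)

iex(1)> Process.info(self(), :reductions)
{:reductions, 3462}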

A lot of questions popped up in my mind after seeing this:

  • What is the CodeReloader server, and why is it on top?
  • What is the file_server process, and why is it so high? I’m not doing any file I/O.
  • Why is Elixir.Logger so high? I don’t get any logging output during my test run, and I’m using lazy logging (see the example after this list).
  • What is the code server process doing during my test run?
  • Why are all those cowboy_protocol.init processes so high? I would expect initialisation to be quick and other functions to be more expensive.
  • The jose_server is the only one I understand: it decrypts the incoming JWT on each API call.
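For reference, ‘lazy logging’ means handing Logger a function instead of a string, so the message is only built when that level is actually logged:

require Logger
user_id = 42

# Eager: the string is always built, even when :debug is filtered out.
Logger.debug("decoded token for user #{user_id}")

# Lazy: the function body only runs when :debug is actually logged.
Logger.debug(fn -> "decoded token for user #{user_id}" end)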

So, it turns out I was running in :dev mode, and in dev mode the configuration is different. For example, you can edit your source code, and on the next HTTP request the Code Reloader will detect the change and quickly recompile the code. So when I hammered the server with requests, the Code Reloader was continuously checking all my source files, which also explains why the file_server just below it was so busy. I switched to production mode (MIX_ENV=prod) and performance tripled to 450 req/sec.
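Switching is just a matter of setting the environment variable when starting the server (depending on your config/prod.exs you may also need things like SECRET_KEY_BASE and PORT in the environment):

Rons-MBP:rapi raarts$ MIX_ENV=prod mix phx.server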

Closely inspecting ONE request

This particular improvement came to light when I used tcpdump to trace the network traffic resulting from ONE incoming request. tcpdump is a really useful tool; I use it a lot.
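Something along these lines, assuming the proxy listens on port 4000 and the backend VM is reachable on port 8080; adjust the ports and the interface to your own setup:

Rons-MBP:rapi raarts$ sudo tcpdump -n -A -i en0 'port 4000 or port 8080'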

My app is just an API proxy. The mobile apps talking to it get a JWT from an OAuth server, but a different token is used when talking to the backend, and a user session must be set up with that token. So on each incoming request my app needs to check whether that user already has an active session with her backend server, and if not, set one up on the fly. That session stays open for a while, so the backend token can be re-used on the next request.

Well, tcpdump showed that the session check did not work: for each incoming request a new session was set up with the backend, which cost an extra two API calls. Fixing that improved the performance to 750 req/sec.
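The exact fix depends on your app; here is a minimal sketch of the idea, with a hypothetical ETS-backed cache keyed by user ID (Rapi.SessionCache and the setup function are placeholders):

defmodule Rapi.SessionCache do
  @table :backend_sessions

  # Call once at application start.
  def init do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
  end

  # Re-use an open session; only pay for the two extra API calls
  # that set one up on a cache miss.
  def fetch(user_id, setup_fun) do
    case :ets.lookup(@table, user_id) do
      [{^user_id, token}] ->
        token

      [] ->
        token = setup_fun.()
        :ets.insert(@table, {user_id, token})
        token
    end
  end
end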

Logging costs CPU even if not displayed

The two top lines disappeared in production mode, but the Logger was still there. Strange, because I had set the log level as high as :error, and no output was being shown. Now, I like to do a lot of logging, and my log statements pretty much describe all the decisions made in the program, so there were tens of logging statements in the code path of each HTTP request. That tends to add up, especially at hundreds of requests per second.

Well, you can tell the Elixir compiler to completely remove those statements from production builds: use the compile_time_purge_level option in the logger configuration, as shown below. Using it improved the performance to 1050 req/sec.
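This is what it looks like in config/prod.exs (this was on Elixir 1.5; the purge happens at compile time, so force a recompile after changing it):

config :logger,
  level: :warn,
  compile_time_purge_level: :info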

Digging deeper using fprof

Since the :observer tool did not help me any further, I decided to get a more granular view of what happened using fprof. Here’s how to do that:

iex(5)> :fprof.trace([:start, {:procs, :all}])
:ok
(now generate load against the server, then stop the trace)
iex(6)> :fprof.trace(:stop)
:ok
iex(7)> :fprof.profile
...................................,
(and a whole lot more lines)
End of trace!
:ok
iex(8)> :fprof.analyse([callers: true, sort: :own, totals: true, details: true])
(followed by > 1000 lines of output)
Done!
:ok

Note that fprof generates huge output files (fprof.trace in the current directory), even when tracing for only a minute.
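If the defaults are too unwieldy, fprof can (per the Erlang documentation) trace just the processes you care about and write to a file of your choosing, along these lines:

iex(9)> :fprof.trace([:start, {:procs, pid}, {:file, 'one_request.trace'}])

where pid is the process you want to trace.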

Here is the first part of the output of the :fprof.analyse call:

iex(5)> :fprof.analyse([callers: true,sort: :own,totals: true,details: true])
Processing data...
Creating output...
%% Analysis results:
{ analysis_options,
 [{callers, true},
  {sort, own},
  {totals, true},
  {details, true}]}.
%                                                 CNT       ACC       OWN
[{ totals,                                    4774694, 17162.356, 14837.297}].  %%%
{[{{base64url,'-decode/1-lbc$^0/2-0-',2},      206492,     0.000,   829.718},
  {{base64url,decode,1},                          1083,  1191.294,     5.680}],
 { {base64url,'-decode/1-lbc$^0/2-0-',2},      207575,  1191.294,   835.398},  % <----------
 [{{base64url,'-decode/1-lbc$^0/2-0-',2},      206492,     0.000,   829.718},
  {{base64url,urldecode_digit,1},               206492,   350.686,   348.324},
  {garbage_collect,                                 620,     5.210,     5.210}]}.
{[{{hackney_bstr,'-to_lower/1-lbc$^0/2-0-',2}, 170031,     0.000,   685.779},
  {{hackney_bstr,to_lower,1},                     14079,  1070.261,    62.526}],
 { {hackney_bstr,'-to_lower/1-lbc$^0/2-0-',2}, 184110,  1070.261,   748.305},  % <----------
 [{{hackney_bstr,'-to_lower/1-lbc$^0/2-0-',2}, 170031,     0.000,   685.779},
  {{hackney_bstr,char_to_lower,1},               170031,   306.065,   305.583},
  {garbage_collect,                                 970,    15.891,    15.891}]}.

(Unfortunately this looks rather bad, and Medium does not support smaller fonts.)

The ACC column is accumulated time: for each function, the total time spent in it, including subfunctions. The OWN column is the same, but excluding subfunctions. Normally you want to sort by OWN to check which functions take up most of the CPU.

Note the output is divided into blocks, and in the middle of each block there’s a line ending with a ‘%’ sign. That is the function the block is about; the lines above it are its callers, and the lines below it are the functions it called. From here you’re on your own.

In my case the fprof output showed that there were no more performance outliers, so this is where I stopped.

This is my first story on Medium, so I’d appreciate comments.