Glicko2 Benchmarking
Erlang, OCaml and Golang
One of my hobby projects is running statistics on Quake Live Duel matches. I began collecting data around the 1st of February 2013, and by now I have scraped around 2.5 million duel matches. This allows me to play with different ranking methods on the players and gauge their skill.
The ranking system I use is called Glicko2, devised by Mark E. Glickman (http://glicko.net). The system is like the chess rating system Elo, except that it is newer and avoids certain problems with Elo.
When Elo was conceived, you had to be able to run the calculations by hand. Glicko2 assumes a computer and can thus carry out much harder calculations, so it tends to deliver better results. Glicko2 tracks three values for each player. His rating R starts at 1500. His rating deviation RD, starting at 350, is a measure of how much we trust the rating R: if the RD is small, we have a strong belief in the rating of the player; if it is high, we don't yet know a lot about that player. As the player plays more matches and we learn more about him, we shrink RD towards 0. Finally a value Sigma, the volatility, measures how much a player is fooling the rating system. This allows us to compensate for quickly improving players so they don't get "stuck" on a certain rating.
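Concretely, the tracked state fits in a small record. Here is a minimal sketch in Go (the type and field names are mine, and 0.06 is the starting volatility suggested in Glickman's paper):

```go
// Player holds the three values Glicko2 tracks between rating periods.
type Player struct {
	R     float64 // rating, starts at 1500
	RD    float64 // rating deviation: how much we trust R
	Sigma float64 // volatility: how erratic the player's results are
}

// NewPlayer returns the starting values for an unknown player.
func NewPlayer() Player {
	return Player{R: 1500, RD: 350, Sigma: 0.06}
}
```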
When computing a new rating for a player, we consider a week's worth of duels for that player. We update his R, RD and Sigma values based on the values from the previous week and the opponents he played against. If the player has a high RD, for instance, his rating moves more per win or loss, since we know less about him yet. This means we quickly find the skill level of a given player.
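With the volatility step skipped, the update is short. The sketch below is my own transcription of steps 2-4 and 6-8 of Glickman's paper, not the code from any of the repositories; the Rank and Outcome names are mine, and Sigma is held fixed instead of running the iterative step 5:

```go
import "math"

// Outcome is one game in the rating period: the opponent's R and RD,
// and the score from our player's perspective (1 win, 0 loss).
type Outcome struct {
	OpponentR  float64
	OpponentRD float64
	Score      float64
}

// g and e are the two helper functions from the Glicko2 paper.
func g(phi float64) float64 {
	return 1 / math.Sqrt(1+3*phi*phi/(math.Pi*math.Pi))
}

func e(mu, muJ, phiJ float64) float64 {
	return 1 / (1 + math.Exp(-g(phiJ)*(mu-muJ)))
}

// Rank computes the new rating after one rating period, with the
// volatility update (step 5 of the paper) omitted for brevity.
func Rank(p Player, week []Outcome) Player {
	mu := (p.R - 1500) / 173.7178
	phi := p.RD / 173.7178
	v, sum := 0.0, 0.0
	for _, o := range week {
		muJ := (o.OpponentR - 1500) / 173.7178
		phiJ := o.OpponentRD / 173.7178
		gj, ej := g(phiJ), e(mu, muJ, phiJ)
		v += gj * gj * ej * (1 - ej)
		sum += gj * (o.Score - ej)
	}
	v = 1 / v
	// A high RD (large phi) yields a larger new phi, and therefore a
	// bigger rating move per game, as described above.
	phiStar := math.Sqrt(phi*phi + p.Sigma*p.Sigma)
	phiNew := 1 / math.Sqrt(1/(phiStar*phiStar)+1/v)
	muNew := mu + phiNew*phiNew*sum
	return Player{R: 173.7178*muNew + 1500, RD: 173.7178 * phiNew, Sigma: p.Sigma}
}
```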
The system I have made is written in Erlang, and this choice has been very fruitful. First of all, the system was easy to write and scales well, even though I don't really need that. Second, the fault tolerance of Erlang has helped me a lot: the system has been very robust, and usually I don't have to pay attention to it. It handles the network being down or Quake Live being upgraded by itself.
Storage is handled by Postgres. If you need a database that just works, then picking Postgres is rarely a problem. Furthermore, the complete data set is less than 6 gigabytes, so almost any kind of store would work. But Postgres is a simple choice due to its reliability and feature set.
The Erlang system works well for fetching matches and for displaying match scores and so on. But I need to carry out some tuning of the Glicko2 parameters. This means running through my 2.5 million matches many, many times, so the speed at which I can run Glicko2 matters. I have an Erlang implementation of a simulated annealer and ranker. But heavyweight numerical processing is not an Erlang strength, so I need to find another language for that part.
So there are three things I need to know:
- How fast can I run the Glicko2 ranking code? That is, how quickly can I execute the main loop for a single simple ranking? This is needed to understand how much I could hope to gain by going to another language in the first place. If I can't get enough of a speedup, I can simply abort the process right here.
- I have 2.5 million matches and need to process them all. Here the problem switches from being CPU-bound to being memory-bound, and I need to find out how this change affects the processing speed in the other languages. Again, if I can't do it faster, then I need to abort the task.
- I need to drive the ranking of all 2.5 million matches from either a simulated annealer or a gradient search. This means potentially millions of optimization steps, each running through the 2.5 million matches. This is the long-term goal I wish to reach, and the part where speed matters far more than in the others.
A slightly smaller problem is that my current ranking runs require memory proportional to the number of matches. When I had 400,000 matches it was easy to fit in memory. But now I am running up against the limits of the machine doing the computations (it is an old machine without much memory).
In the following I present some timings. These are run on a Linux laptop plugged into power and running at full processor speed. The machine has an "Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz", an Ivy Bridge part with 4 cores and two hyperthreads per core.
I intend to tune on this machine, so it is paramount that the rankings run fast on it.
Erlang — erl-glicko2
The Erlang system itself can run a single Glicko2 round of one player against 3 opponents in 22us. The benchmark is carried out by running the test 300 times and averaging; this batch size seems to be quite stable.
The Erlang figure is with the standard BEAM bytecode interpreter, and the code is straightforward with no tuning whatsoever. With 2.5 million matches it takes just around a minute to run through them all, which is acceptable. But since I aim to run faster than that, I wanted to know how much faster I could go.
Compiling with HiPE brings this down to 8us. This is much better and far more in the area where I would like to be. But perhaps we can do better by switching languages.
OCaml — o-glicko2
One of my favorite languages when I need fast processing is OCaml. The language has a large number of beneficial properties: static typing, a native-code generator producing fast executables, a good module system and a nice ecosystem. So naturally, transcribing the code from Erlang to OCaml had to be tried.
The code is quite straightforward to translate from Erlang into OCaml. There are few places in the code base where we use anything but float types, so it is easy. The only slight problem was a missing parenthesis group which made a subcomputation produce the wrong result. Luckily, I have extensive tests, so it was caught quickly.
Note that this particular error is one of those which will not be caught by a type system. In numerical code everything is of type float anyway, so there is no way I can hope to catch this kind of error straight away. Type systems help a lot with symbolic processing, but this task does not have that property.
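To see why, consider a hypothetical subexpression (mine, not the actual line from the port). Both groupings have exactly the same type, so the type checker accepts either; only a test against known values tells them apart:

```go
// Both expressions are float64; the type checker cannot prefer one.
func subcomputation(a, b, c float64) (float64, float64) {
	wrong := a + b*c // missing parenthesis group
	right := (a + b) * c
	return wrong, right
}
```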
The OCaml code is written in idiomatic style: functional, with closures, and so on. I could opt for a more imperative style, which OCaml allows, but for what purpose?
I use the Core_Bench module from Jane Street to do my benchmarking. The nice thing is that this tool picks the batch size to use and applies statistical prediction to avoid drawing wrong conclusions. The OCaml bytecode interpreter clocks in at 28us. This result somewhat surprises me; I had expected the run to be faster than Erlang. But I guess more time has been spent optimizing the Erlang interpreter than the OCaml bytecode interpreter and code generator.
Running native code yields a time of around 2us. This number is really good: if we are to process 2.5 million matches, we can do so in 2.5 million times 2us (ignoring the cache hierarchy), or 5 seconds. Much better than the minute it would require in Erlang.
Yet, the problem here is that the OCaml runtime is not parallel. I would need to cut the data set into pieces and run an OCaml process per core; there is no need for that in Erlang. So even though the results clock in faster, I will need more work later on to process the data in parallel. Parallelism might become a problem going forward.
Golang — Glocko2
Naturally, I had to try Golang next. This language is interesting because it has nice semantics, fixing most of the things I hate about C. It compiles to native code with a standard, slightly optimizing compiler. And its runtime supports multiple cores, which is needed if I want parallelism inside a single process later on.
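As a sketch of what that buys: sharding a match list across goroutines inside one process is only a few lines. The Match type and the per-shard work are placeholders, and depending on the Go version, GOMAXPROCS may need to be raised before the goroutines actually spread over the cores:

```go
import "sync"

// Match is a placeholder for however a parsed duel is represented.
type Match struct{ WinnerID, LoserID int }

// processAll fans the match list out over a number of shards, each
// handled by its own goroutine, inside a single OS process.
func processAll(matches []Match, shards int) {
	var wg sync.WaitGroup
	chunk := (len(matches) + shards - 1) / shards
	for lo := 0; lo < len(matches); lo += chunk {
		hi := lo + chunk
		if hi > len(matches) {
			hi = len(matches)
		}
		wg.Add(1)
		go func(part []Match) {
			defer wg.Done()
			// rank the matches in this shard here (placeholder)
		}(matches[lo:hi])
	}
	wg.Wait()
}
```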
Writing the code in Golang is a chore. Productivity is definitely lower than in OCaml, since you have to type more and spend precious time reframing the nice functional problem as imperative code. Even though this was the last implementation I wrote, it still took about twice as long as the OCaml one.
Do note: my imperative skill set is there, but I don't write much imperative code nowadays. This could be a factor in the slower writing speed. However, writing last should be a help rather than a hindrance.
What is so nice about Golang is the tooling. I set up go test early on in my editor, so whenever I saved a file it would automatically compile and run the tests (see http://github.com/eaburns/Watch for the tool I use in Acme to do this). This meant I could start by writing down all the test cases and then work on the implementation afterwards. As more and more of the code base began to work, I had fewer and fewer failing test cases.
Benchmarking can be done with the Go testing tools as well. Like in OCaml, the tool computes a batch size and gives you statistically predicted results, but in this case it is built into the default tooling. I cannot stress enough how important it is for a language to give nice access to profiler and benchmark tooling in the default distribution. The fact that the build tool does testing and benchmarking by default is just awesome.
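Continuing the earlier sketch, a test and a benchmark for the hypothetical Rank function both live in an ordinary _test.go file and run with go test -bench .; the assertions are my own placeholders, not the real Glicko2 test vectors:

```go
package glicko2

import "testing"

// A win against an equal, uncertain opponent must raise the rating.
func TestWinRaisesRating(t *testing.T) {
	p := NewPlayer()
	week := []Outcome{{OpponentR: 1500, OpponentRD: 350, Score: 1}}
	if got := Rank(p, week); got.R <= p.R {
		t.Errorf("win should raise rating: got %v, had %v", got.R, p.R)
	}
}

// The testing package picks b.N itself, growing it until the timing
// stabilizes -- the batch-size prediction mentioned above.
func BenchmarkRank(b *testing.B) {
	p := NewPlayer()
	// The three opponents from the worked example in Glickman's paper.
	week := []Outcome{{1400, 30, 1}, {1550, 100, 0}, {1700, 300, 0}}
	for i := 0; i < b.N; i++ {
		Rank(p, week)
	}
}
```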
Being Go, the compile times are faster, but this doesn't matter for this problem, since compile times are "not noticeable" across the board. Go clocks in at 1us, about twice as fast as the OCaml solution. This is with the default compiler, written initially by Ken Thompson and improved by many other people.
Recap
So we have:
- OCaml bytecode: 28us
- Erlang bytecode: 22us
- Erlang HiPE: 8us
- OCaml native: 2us
- Go: 1us
No option is ruled out at this point. I already have an Erlang implementation, and the numbers may change around when we add the next layer: processing 2.5 million matches. Until I have added that layer and have the option of profiling, I'd rather not hand-optimize these results too much.
Tuning Tricks
When you revisit the same algorithm in multiple languages, you see possibilities for optimization all over the place. There are some subcomputations, the g and e functions, where I don't know whether it is worth computing them once and stashing the results in memory. I could probably lower memory write pressure and GC pressure by recomputing them whenever I need them.
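Continuing the earlier sketch, the "compute once and stash away" variant would hoist g and e out of the update like this, at the cost of two slice allocations per update; the recomputing variant simply calls g and e at each use site:

```go
// precompute stashes g(phi_j) and E(mu, mu_j, phi_j) for every
// opponent up front. Recomputing instead avoids these allocations
// and the GC pressure they cause.
func precompute(mu float64, week []Outcome) (gs, es []float64) {
	gs = make([]float64, len(week))
	es = make([]float64, len(week))
	for j, o := range week {
		muJ := (o.OpponentR - 1500) / 173.7178
		phiJ := o.OpponentRD / 173.7178
		gs[j] = g(phiJ)
		es[j] = e(mu, muJ, phiJ)
	}
	return gs, es
}
```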
Also, all of Glicko2 runs on a scaling factor of 173.7178. Before doing anything with the given R and RD values, you scale them down by this factor; all computations are carried out on the downscaled numbers, and the final step is to upscale everything again. A trick I am seriously considering is to scale everything down once before starting my runs. This avoids a downscale and an upscale in each loop, which would help a lot for the larger computations where many runs are needed.
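In the paper the conversion is mu = (R - 1500)/173.7178 and phi = RD/173.7178, with the inverse applied at the end. The idea would be to keep all stored ratings in the (mu, phi) form between runs; a sketch:

```go
const scale = 173.7178

// toGlicko2 converts the public (R, RD) pair to the internal
// (mu, phi) scale the algorithm computes on.
func toGlicko2(r, rd float64) (mu, phi float64) {
	return (r - 1500) / scale, rd / scale
}

// fromGlicko2 converts back for display. Keeping everything in
// (mu, phi) between runs skips both conversions per update.
func fromGlicko2(mu, phi float64) (r, rd float64) {
	return mu*scale + 1500, phi * scale
}
```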
One of the major Glicko2 steps is to find the root of a function. I am currently using a root finder called Ridders' method. It is quite fast, but it is also the major cost in the runs. When I first implemented the OCaml variant, I picked a different root finder by mistake, and that version ran in 0.6us, so this part of the code base is definitely the dominant one. It also suggests that the Golang implementation handles this part differently from the OCaml code, and that there is room for improvement in the OCaml code.
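For reference, here is the textbook form of Ridders' method as my own generic sketch, not the code from either repository: bisect, apply an exponential correction to the midpoint, and re-bracket.

```go
import "math"

// ridders finds a root of f in the bracket [a, b], where f(a) and
// f(b) must have opposite signs. Each iteration evaluates f twice
// and roughly doubles the number of correct digits.
func ridders(f func(float64) float64, a, b, tol float64) float64 {
	fa, fb := f(a), f(b)
	for i := 0; i < 60; i++ {
		m := 0.5 * (a + b)
		fm := f(m)
		// Exponential correction: project the midpoint towards the root.
		s := math.Sqrt(fm*fm - fa*fb)
		if s == 0 {
			return m
		}
		x := m + (m-a)*math.Copysign(1, fa-fb)*fm/s
		fx := f(x)
		if fx == 0 || math.Abs(b-a) < tol {
			return x
		}
		// Re-bracket: keep a sub-interval whose endpoints straddle the root.
		switch {
		case fm*fx < 0:
			a, fa, b, fb = m, fm, x, fx
		case fa*fx < 0:
			b, fb = x, fx
		default:
			a, fa = x, fx
		}
	}
	return 0.5 * (a + b)
}
```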
Parting words
In the next phase, the code to read, parse and compute on the 2.5 million matches has to be written. I have no time frame for doing so, as I am mostly doing this "for fun and entertainment". I am pretty sure the code bases can be optimized like mad, but there is little reason to do so before the other parts have been implemented. The problem will quickly become memory-bound, so the interesting work in speeding it up will be the in-memory representation.
My initial idea is to store the data in a vector-like format. In Erlang I use an ETS table, but this incurs a hash-table lookup a large number of times; my profiling shows I spend 50% of the time in ets:lookup_element/3. So to go faster, I need to pack the data better in memory. It might very well be that the numerical code is not the hottest path in this program at all, which is why I hesitate to optimize it.
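Continuing the earlier sketches, the packing I have in mind looks something like this: hand out dense integer ids once at load time, keep the mutable rating state in one contiguous slice, and index it directly in the hot loop (the types are hypothetical):

```go
// PlayerID is a dense index into the packed state slice.
type PlayerID int

type Store struct {
	ids   map[string]PlayerID // player name -> dense index (load time only)
	state []Player            // hot data, contiguous in memory
}

func NewStore() *Store {
	return &Store{ids: make(map[string]PlayerID)}
}

// Intern returns the dense id for a player, allocating state for
// newcomers. After loading, the ranking loop only touches s.state[id],
// so the hashing cost is paid once per player instead of per lookup.
func (s *Store) Intern(name string) PlayerID {
	id, ok := s.ids[name]
	if !ok {
		id = PlayerID(len(s.state))
		s.ids[name] = id
		s.state = append(s.state, NewPlayer())
	}
	return id
}
```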
These representation questions are also the reason I considered BER MetaOCaml, but I lost interest in using it until I know I can get decent speed on the other parts. There are ways to make this parallel even though the OCaml runtime is not; perhaps I can work around it, but I note the extra cost in time of doing so.
I also considered Haskell. Given Repa or Accelerate, you could probably speed up the computation and move it to the GPU. It is an interesting project, but it requires a completely different approach to the problem at hand. One could also use the Erlang OpenCL bindings to achieve something like this.
Finally, if this were a company, none of this would have been needed. I already have the Erlang code for tuning; I would just have had to lease the next machine size up at Amazon Web Services for 24 hours. I don't need to run tuning that often; once a year or so is perfect. And I can run the weekly updates in a minute. If I cut the data set into 10 pieces and load it from Postgres a little at a time, I could definitely do this inside a 3-minute window. So this is hardly a problem, and it is not worth doing from a cost/benefit perspective. And frankly, writing the code in Erlang is probably faster for me than writing it in OCaml or Go.