Benchmarking LightGBM: Float vs Double

Laurae
Data Science & Design
Jan 10, 2017

We have seen previously that LightGBM is extremely fast, much faster than xgboost with default settings in R. Recently, a switch from float to double (for prediction-related functions) was made in LightGBM to fix a prediction bug. Now that the change is official, let’s run some benchmarks to check for potential differences in performance.

Global overview

Let’s cut to the chase and go straight to the results of the benchmark. Regarding RAM usage, we went from approximately 1,425 MB (float) to 1,434 MB (double) during training, which is not a big deal for over 1 million observations.

View interactively online: https://plot.ly/~Laurae/13/

The difference may not look negligible, but for the most part it is. We are looking at a difference of up to 3.3 seconds on a single-threaded run of 200+ seconds, which is about a 1.7% performance loss from switching from float to double.

Remember that a double takes twice as much space in memory as a float (typically 8 bytes versus 4). As we increase the number of threads, the performance loss from switching from float to double shrinks to nothing, or even reverses.
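
As a quick sanity check, here is a minimal C++ snippet (my addition, not from the benchmark code) that prints the two sizes; on virtually every platform LightGBM runs on, it reports 4 and 8 bytes:

    #include <cstdio>

    int main() {
        // An array of doubles occupies twice the memory of a float array
        // with the same number of elements.
        std::printf("sizeof(float)  = %zu bytes\n", sizeof(float));
        std::printf("sizeof(double) = %zu bytes\n", sizeof(double));
        return 0;
    }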

Relative performance

Can you guess which one is better? Give it a try before looking.

View interactively online: https://plot.ly/~Laurae/15/

The higher the number of threads, the smaller the performance impact. This is good news, as long as you run many threads to keep the CPU busy. If you can’t multithread, you might end up with a slower program when switching from float to double. Otherwise, the difference is negligible, if not pure random noise.

The crucial question: can we explain this?

Caching fun

Do you know what caching is? That is what we will cover quickly here: why caching is important, and even crucial.

Why on earth would a double memory representation ever be better than a single-precision one? This is all about (CPU) caching.

If you cache well, you retrieve information very fast. For instance, in tight loops, caching can be the deciding factor in performance, speeding code up by 3X or sometimes even more (just for using a cache?!).
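
Here is a small C++ sketch (my addition, with arbitrary sizes) of one way to see this: it sums the same array once sequentially and once with a large stride. Both versions perform exactly the same additions, but the strided one defeats the cache and is typically several times slower:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Sum every element, visiting them `stride` apart and wrapping around,
    // so both calls below perform exactly the same number of additions.
    double sum_with_stride(const std::vector<double>& v, size_t stride) {
        double total = 0.0;
        for (size_t start = 0; start < stride; ++start)
            for (size_t i = start; i < v.size(); i += stride)
                total += v[i];
        return total;
    }

    int main() {
        std::vector<double> v(1 << 24, 1.0);  // 16M doubles = 128 MB, larger than any cache
        for (size_t stride : {1, 4096}) {
            auto t0 = std::chrono::steady_clock::now();
            double s = sum_with_stride(v, stride);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("stride %4zu: sum = %.0f, %.3f s\n", stride, s,
                        std::chrono::duration<double>(t1 - t0).count());
        }
        return 0;
    }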

In LightGBM’s case, we are talking about multithreading and caching. If you fill the cache with “bad” data, then you are filling it poorly and you lose time.

Caricatured explanation

Take this scenario:

  • You are expecting to cook some nuggets for dinner (what you are looking for)
  • The first (L1) “cache” you have is a plate of cooked cookies next to you
  • The second (L2) “cache” you have is a cooked pizza in the next room
  • The third (L3) “cache” you have is a cooked bowl of food one floor up
  • The fourth “cache” (RAM) is cooking the nuggets yourself in the kitchen

Questions:

  • Are you expecting to cook your nuggets in the kitchen faster than going upstairs to check the bowl? No, obviously. You would lose time.
  • Are you expecting to do the latter faster than moving to the next room? No, of course not. You would lose time.
  • Are you expecting to do the latter faster than checking the cookie plate next to you? No, that is 100% sure. You would lose time.
  • How much time would you lose by cooking the nuggets yourself versus taking them from the cookie plate (if they were there)? A lot!

This is approximately what a cache miss penalty is (in a very caricatured way). Cache misses happen so often that you do not even notice them yourself.
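
To see the hierarchy from the analogy in real numbers, here is a C++ sketch (my addition; the buffer sizes are placeholders for typical L1/L2/L3 capacities) that pointer-chases through buffers of increasing size. The time per access jumps each time the buffer outgrows a cache level:

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    // Chase a random single-cycle permutation through a buffer of n indices.
    // Every load depends on the previous one, so the time per step approximates
    // the latency of whichever cache level the buffer fits in.
    double ns_per_access(size_t n) {
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), 0);
        std::mt19937_64 rng(42);
        // Sattolo's algorithm: the permutation is one big cycle, so the chase
        // visits every slot before repeating.
        for (size_t i = n - 1; i > 0; --i)
            std::swap(next[i], next[rng() % i]);
        const size_t steps = 20000000;
        size_t p = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t s = 0; s < steps; ++s) p = next[p];
        auto t1 = std::chrono::steady_clock::now();
        volatile size_t sink = p; (void)sink;  // keep the chase from being optimized away
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    }

    int main() {
        // Sizes chosen to land in L1, L2, L3, and RAM on a typical CPU.
        for (size_t kb : {16, 128, 2048, 65536}) {
            size_t n = kb * 1024 / sizeof(size_t);
            std::printf("%6zu KB buffer: %5.1f ns per access\n", kb, ns_per_access(n));
        }
        return 0;
    }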

Good caching

Imagine now that you are not getting any cache misses, the perfect scenario. If your caches are limited, you can’t fit all those nuggets on the cookie plate. You would put the rest on your pizza, then in the bowl one floor up, until you have no choice but to cook nuggets yourself.

If you halve the caches in the previous scenario, you are more likely to have to go to the pizza, then upstairs, and in the end cook them yourself. When you switch from float to double, you double the memory required, so you can’t fit as many “nuggets” in cache as before, and you end up slower with doubles than with floats!
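
In code, the effect looks something like this C++ sketch (my addition, with an arbitrary size, and nothing to do with LightGBM’s internals): it sums the same number of values stored as float versus double. The double buffer is twice as many bytes, so it falls out of a given cache level at half the element count; the exact gap depends on your cache sizes and on compiler vectorization:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Time `passes` sweeps over n values of type T. Same element count either
    // way, but the double buffer is twice as large in bytes, so it stops
    // fitting in a given cache level at half the element count.
    template <typename T>
    double time_sum(size_t n, int passes) {
        std::vector<T> v(n, static_cast<T>(1));
        T total = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (int p = 0; p < passes; ++p)
            for (size_t i = 0; i < n; ++i) total += v[i];
        auto t1 = std::chrono::steady_clock::now();
        volatile T sink = total; (void)sink;  // keep the sum from being optimized away
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const size_t n = 1 << 20;   // 1M values: 4 MB as float, 8 MB as double
        const int passes = 200;
        std::printf("float : %.3f s\n", time_sum<float>(n, passes));
        std::printf("double: %.3f s\n", time_sum<double>(n, passes));
        return 0;
    }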

Bad caching

Imagine now that you are putting a lot of junk in your caches, like pizzas and cookies. You go to your cookie plate and see only cookies, so, unhappy, you move on to the pizza. You see only pizza, so you go upstairs. You still don’t find your nuggets, so you go to the kitchen to cook them yourself, and you have lost a lot of time!

When you halve the caches in this scenario, you may actually reach the kitchen faster than before: caching is so poor that you end up in the kitchen anyway, and having fewer objects to check along the way makes you faster. This is exactly what can happen when you multithread: if you fill the cache with junk, you get junk back, but with less junk to wade through, you are faster than with too much junk.

Next benchmarks?

I am still waiting for xgboost’s fast histogram method to work in R and be merged into the master branch. Then I will compile xgboost and start running benchmarks comparing xgboost and LightGBM in an apples-to-apples fashion!
