The “better than random” model

Shahar Cohen · Published in YellowBlog · Jul 13, 2017

Although the last thing you want to hear about a state-of-the-art machine learning model is that it is merely better than random, the path to a winning model often begins with a simple, naive solution that uses the most basic modeling techniques.

In machine learning we constantly face new problems and need to figure out how to solve them. The solution involves research, a process that carries a significant level of uncertainty. Even the best data science team might fail to deliver a valuable model, simply because a valuable model is sometimes infeasible. One effective way of hedging that risk is to start with a benchmark model. A good benchmark model has three characteristics:

i) It is simple and quick to build (quick meaning faster than a highly handcrafted model by at least an order of magnitude).

ii) It makes some positive contribution to the task being solved (even if it is not yet accurate or valuable enough).

iii) It offers some hints about what is working well and what isn’t.

Due to the second characteristic, we (at YellowRoad) sometimes call a good benchmark the “better than random” model.
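
To make this concrete, here is a minimal sketch of such a check on a binary classification task. It assumes a scikit-learn-style workflow; the synthetic dataset, the DummyClassifier used as the “random” reference, and the logistic regression used as the quick benchmark are all illustrative choices, not something prescribed in this post:

```python
# A minimal "better than random" check on a synthetic binary
# classification task (illustrative data, not a real project).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The "random" reference: predicts from the label distribution alone,
# ignoring the features entirely.
random_model = DummyClassifier(strategy="stratified", random_state=42)
random_model.fit(X_train, y_train)

# The quick benchmark: the most basic modeling technique that still
# actually looks at the features.
benchmark = LogisticRegression(max_iter=1000)
benchmark.fit(X_train, y_train)

for name, model in [("random", random_model), ("benchmark", benchmark)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    # "Better than random" here means an AUC clearly above 0.5.
    print(f"{name}: AUC = {auc:.3f}")
```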

Now let’s be clear: we will always search for a model that is valuable business-wise, and the “better than random” model is typically just a first baby step on the way. But in many cases it is a very important baby step. Here are the main reasons:

  1. The “better than random” model is often a good Go/No-Go criterion: if you succeed in building a (simple and quick) “better than random” model, chances are high that you can invest more (more data, more parameters, more complexity) and get a significantly better model. That is not a guarantee, but assuming you attained the “better than random” model easily (significantly faster than the entire project would take), having it removes a significant amount of risk. We do sometimes see valuable machine learning solutions that were obtained after failing to quickly produce a “better than random” model, but failing this way significantly increases the risk.
  2. Running the “better than random” model on real data and characterizing the cases in which it works well versus the cases in which it fails teaches us a lot about the task being solved, and it helps us on the way to a better model. In many cases, then, the “better than random” model is not just a risk-management step, but a constructive step toward a really successful model.
  3. Ensemble: in machine learning it is sometimes possible to generate an ensemble of relatively weak models and combine them into a single, surprisingly strong one. If you can generate a series of “better than random” models, you might get a very good result by combining their predictions (see the sketch after this list).
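
As a sketch of that third point, here is one simple way to combine several deliberately weak models, again assuming a scikit-learn-style setup; the specific weak learners and the soft-voting combiner are illustrative assumptions, not the only way to build such an ensemble:

```python
# Combining several weak, "better than random" models into one
# stronger model via soft voting (illustrative data and learners).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three deliberately weak learners, each expected to be only
# modestly better than random on its own.
weak_models = [
    ("stump", DecisionTreeClassifier(max_depth=1, random_state=0)),
    ("shallow_tree", DecisionTreeClassifier(max_depth=2, random_state=1)),
    ("naive_bayes", GaussianNB()),
]

# Soft voting averages the members' predicted class probabilities.
ensemble = VotingClassifier(estimators=weak_models, voting="soft")
ensemble.fit(X_train, y_train)

auc = roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1])
print(f"ensemble: AUC = {auc:.3f}")
```

When the weak models are genuinely diverse (they make different kinds of mistakes), the combined model often beats each of them individually.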

Next time a customer or a manager asks you how good the solution is, do not answer that it is better than random. No data scientist is likely to miss the joke, but do let the “better than random” model be part of your workflow.


Shahar Cohen is a data science researcher and entrepreneur, helping companies start up with AI.