An algorithm for caching predictive models

The most common task in machine learning is to construct a predictive model for a specific problem . The predictive model can be a simple binary classifier , a regression model or something more exotic. The bulk of machine learning literature is concerned exactly with how to construct such models .

In the last two decades the fast development of WWW and continuous digitization of our world pushed the limits of existing infrastructure and algorithms , giving rise to what we usually call “big data” . Big data can mean ,beside the obvious “larger” size , higher data velocity , heterogeneity and basically any property of data that when is above a certain threshold put problems to a classical system that works just fine otherwise .

In this paradigm shift machine learning was from the beginning on the first line of the front . The unprecedented availability of data created new and exciting opportunities , but also some specific challenges . More data usually mean better models , but scaling a learning algorithm to make it able to use all available data require changing the implementation (to use a distributed/cloud environment or GPUs) and sometimes even the underling principles (now simple optimisation methods like stochastic gradient descent are preferred to more sophisticated, but less scalable techniques like quasi-newton methods).

In this short text we try to describe and present a simple solution for a less known problem faced by developers of learning systems for big data : making predictions in real time when the number of models is very large . Here “very large” is equivalent to the fact that the total size of the models is larger then the available RAM .

Basically , we have the following scenario . We receive a labeled dataset and we want to make predictions for some unseen instances . The dataset can be naturally partitioned in a number of subsets such that is more convenient (from an accuracy and training time point of view ) to have one predictor per subset instead of only one predictor for all data . At prediction time the data arrive in an online fashion and we want the system to work in real time.So far , so good ; this looks like a standard machine learning problem. As we mentioned before the challenge is that the models can be just too big to store them all in the internal memory.

To make things more concrete, let’s look at the specific problem that we encountered . We have a bunch of ads for used cars . The price is the “label” and the features are things like year of production , mileage and motor power .The task is to predict the price for new ads that arrive in real time . In order to do this we split the data based on the car model and for each group we train a regression model (what kind of model is used is not important for this discussion). Then ,when a new ad is received, we check the car model and apply the appropriate predictor . Of course it’s nothing special about car ads , the ideas apply more generally , but to ease the presentation we stick to this context .

In order to make fast predictions , ideally we will want to keep all predictive models in RAM . Due to limited available memory ,this is not always possible (at least not under budget constraints … ).To minimise the model load operations (which are expensive, even on a machine with SSD) , we use the not-so-new idea of implementing a caching mechanism for the models, such that to make ready available those models that are most likely to be used . In other words, we try to optimise the prediction speed using an efficient memory allocation policy.

We define optimality with respect to the cache hit rate . This is usually a good choice . Even if the models size vary on a wide range , since larger models have a higher load time , hit rate is still relevant .

The approach we take is to estimate the probability of occurrence for each car model , based on the available data .To this end we make two assumptions about the data:

1.Each Ad is drawn independently from some unknown probability distribution . Less formal : the ads I saw so far don’t tell me nothing about what ad I will see next , beside the information on probability distribution they provide.

2.The distribution from which the ads are drawn is fixed or change slowly over time .

In many settings this assumptions are quite natural. But even if they are not exactly true, the algorithm still can work.

The algorithm has two main components : a probability estimator which try to establish the likelihood for ads and a caching decision method which say if we need to change the models loaded in memory.

Usually ,the probability of an event is approximated by the frequency of that event . In our case, because the probability distribution may change in time , it is desirable to weight the data differently ,based on time — an occurrence becomes less and less relevant as the time passes .Having a good memory it’s ok, having an eidetic memory it’s a nightmare !

It can be very useful to look at an example .Let’s consider we have received ,until some arbitrary time t, the following ads ( specified by the car model ) : VW-GOLF , OPEL-ASTRA, VW-GOLF, VW-PASSAT, VW-PASSAT and VW-GOLF .

The “probability” at time t for VW Golf will be approximated as :

P(t,VW-GOLF) = ((((d + 0)d + 1)d + 0) d + 0)d + 1

,where d is a subunitary “dumping factor “ (we use the value 0.999 ) which is needed to “discard old information” . Note that this quantity is not really the probability, since we don’t do normalisation , but it is all we want since the actual values are not important , we only need to be able to compare them .

The algorithm works as follows . As long as there is enough memory, we load additional models as soon as they are required for predictions. When some memory threshold is exceeded, we stop adding new models . In this case we replace a predictive model that is already loaded if we have a cache miss and find a new car model that is significantly more likely to occur than the least probable car that has it’s predictive model in RAM . Of course, after each received ad the probabilities are updated by multiplying all “probabilities” by the dumping factor and adding 1 to the “probability” of the observed model. Simple, as advertised!

The different parameters used — relevance threshold , dumping factor etc. are set on an empirical bases — good old fashion trial and error strategy .

The following Python code represents a basic simulation of the algorithm:

import random
# Here the items are just some int value
make_models = [int(random.gauss(5, 2)) for i in range(100)]
dict_models = {}
dict_stats = {}
# Relevance coefficient
_REL_COEFF = 1.2;
# Relevance increment
_REL_INC = 1.0;
# Damping factor
_DF = 0.999;
free_memory = 100
memory_limit = 50
_dictionary_ = {}
_dictStats_ = {}
min_fr = float("inf")
min_key = ""
# System receive a stream  of items
for make_model in make_models:
# Update the statistics
if make_model in _dictStats_:
_dictStats_[make_model] = _dictStats_[make_model] + _REL_INC
if make_model in _dictionary_:
if _dictStats_[make_model] <= min_fr + _REL_INC:
min_fr = min_fr + _REL_INC
min_fr = float("inf")
for (key,value) in _dictStats_.items():
if min_fr > value and key in _dictionary_:
min_fr = value
min_key = key

else:
_dictStats_[make_model] = _REL_INC
if min_fr > _dictStats_[make_model] and make_model in _dictionary_:
min_fr = _REL_INC
min_key = make_model

for key in _dictStats_.keys():
_dictStats_[key] = _dictStats_[key] * _DF
min_fr = min_fr * _DF

# Load/change the predictive model if required
if not (make_model in _dictionary_):
if free_memory - 10 > memory_limit:
_dictionary_[make_model] = make_model
free_memory -= 10
if _dictStats_[make_model] < min_fr:
min_fr = _dictStats_[make_model]
min_key = make_model
else:
if _dictStats_[make_model] > _REL_COEFF*min_fr:
del _dictionary_[min_key]
_dictionary_[make_model] = make_model

min_fr = float("inf")
for (key,value) in _dictStats_.items():
if min_fr > value and key in _dictionary_:
min_fr = value
min_key = key

Now a few concluding remarks . This caching policy is quite lightweight: updating the probabilities and finding the model that needs to be replaced are both linear in the number of models . Implementing the algorithm can be a easy high-school homework exercise . Yet , on the right time and place , it can be quite useful !

Finally , we lay down two interesting theoretical questions to keep you busy while waiting for green light:

  1. In respect to the hit rate metric and under the assumptions we made (feel free to make them quantitative in a reasonable way) , is any other algorithm , no matter how complex , but which uses the same data , significantly better then this one ?
  2. Is there a constant time algorithm that , in the same conditions, perform (asymptotically) just as well as this one?

P.S. I sort of feel that the answers are “no” and “no” .