Callbacks — A must-have tool for data scientists

Collaborate on and maintain data science code like a “PRO”grammer.

From a data science perspective, callbacks are a great pattern to reduce redundancy in code.

But that’s not how I was introduced to them. For me, the word “callback” brings to mind JavaScript, which is where I first heard it. I remember it something like…

…because callbacks in JavaScript are gently introduced like this:

$("button").click(function(){
    $("p").hide("slow", function(){
        alert("The paragraph is now hidden");
    });
});

which quickly turns into what is known as ‘callback hell’. Google that at your own risk.

For this article, if it’s rolling around your grey matter, get JavaScript out of your mind! Not all callbacks are asynchronous or have the (error, result) function signature.

So let’s define callbacks more generally:

a function, which is passed as an argument to another function, and executed later.
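
To make that definition concrete, here is a minimal Python sketch. The names run_later and greet are made up for illustration, and nothing here is asynchronous.

```python
def greet(name):
    """The callback: an ordinary function."""
    return f"Hello, {name}!"


def run_later(callback, names):
    """Accept a function as an argument and call it once per name."""
    results = []
    for name in names:
        # The caller decides *what* runs; run_later decides *when*.
        results.append(callback(name))
    return results


greetings = run_later(greet, ["Ada", "Grace"])
print(greetings)  # ['Hello, Ada!', 'Hello, Grace!']
```

Note that greet is passed without parentheses: the function itself is handed over, not its result.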

I used callbacks for years and never knew about the term ‘callback’.

R and Python (pandas) have a common use for callbacks that you’ve probably used before — the apply function.

Python

import numpy as np
import pandas as pd

d = pd.DataFrame({'a': range(1, 11), 'b': range(11, 21)})
d.apply(np.mean)
# a     5.5
# b    15.5
# dtype: float64

R

l <- list(a = 1:10, b = 11:20)
sapply(l, mean)
   a    b
 5.5 15.5
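
The same pattern shows up in Python's built-ins, even without pandas. As a small sketch (the word list is just sample data), both sorted and map take a function as an argument and call it on each element:

```python
words = ["pandas", "R", "callback"]

# sorted() calls the key function on every element to decide the order.
by_length = sorted(words, key=len)

# map() calls the given function on every element and yields the results.
lengths = list(map(len, words))

print(by_length)  # ['R', 'pandas', 'callback']
print(lengths)    # [6, 1, 8]
```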

When would you want to implement this pattern in your work?

Consider this motivating example: without apply, you’d have to write a new looping function every time you wanted to run a different function over a dimension of a data frame.

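For example (a sketch, assuming df is a pandas DataFrame):

def my_apply_mean(df):
    res = []
    # loop over the rows, computing the mean of each one
    for i in range(len(df)):
        res.append(df.iloc[i].mean())
    return res

def my_apply_median(df):
    res = []
    # identical loop; only the function being called differs
    for i in range(len(df)):
        res.append(df.iloc[i].median())
    return res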

Instead, apply accepts an arbitrary function as an argument and calls it on the data frame elements, like so:

def my_apply(df, func):
    res = []
    # func is the callback: any function that accepts a row
    for i in range(len(df)):
        res.append(func(df.iloc[i]))
    return res

The symptom to watch out for is several functions that share the same structure and differ only in which function they call at one point within that structure.
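
The same idea can be sketched without pandas at all. Here apply_rows is a hypothetical helper that works on a plain list of rows, and the callbacks come from Python's standard statistics module; the sample numbers are arbitrary.

```python
from statistics import mean, median


def apply_rows(rows, func):
    """Call func on each row and collect the results."""
    return [func(row) for row in rows]


rows = [[1.0, 2.0, 6.0], [4.0, 5.0, 9.0]]

row_means = apply_rows(rows, mean)      # one loop ...
row_medians = apply_rows(rows, median)  # ... reused with a different callback
row_ranges = apply_rows(rows, lambda row: max(row) - min(row))

print(row_means)    # [3.0, 6.0]
print(row_medians)  # [2.0, 5.0]
print(row_ranges)   # [5.0, 5.0]
```

The loop is written once, and each new summary costs a single line instead of a new function.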

Another alarm bell is multiple calls to functions with similar signatures inside another function:

def simulate(years, **kwargs):
    result = []
    for year in range(years):
        property_a = estimate_property_a(year,
                                         kwargs['other_property_a'])
        property_b = estimate_property_b(year,
                                         kwargs['input_property_b'])
        property_c = estimate_property_c(year,
                                         kwargs['some_property_c'])
        # repeated several more times...
        result.append({
            'property_a': property_a,
            'property_b': property_b,
            'property_c': property_c,
            # repeated several more times
        })
    return result

That’s not verbatim, but you get the idea. With callbacks, it could be refactored into something like this:

def do_yearly(funcs, years, **kwargs):
    output = []
    for year in range(years):
        for fun in funcs:
            # assumes each estimator accepts arbitrary keyword arguments
            result = fun(year, **kwargs)
            output.append(result)
    return output

def simulate(years, **kwargs):
    return do_yearly(funcs=[estimate_property_a,
                            estimate_property_b,
                            estimate_property_c],
                     years=years,
                     **kwargs)

The tricky part is getting the other arguments into each of the property estimators.

If each property estimator takes arbitrary keyword arguments, and there aren’t conflicts between their names, then the example above is sufficient. Instead of kwargs, another option is to pass a class instance that holds the required state. There are other design patterns that could be used, such as action/executor. At this stage, that looks like overkill to me.
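
Both options can be sketched in a few lines. Everything below is hypothetical: the estimator bodies, the growth_a and offset_b parameters, and the Scenario class are invented stand-ins, since the real estimators are not shown in this post.

```python
from dataclasses import dataclass


# Option 1: every estimator accepts **kwargs and ignores what it doesn't need.
def estimate_property_a(year, growth_a=0.0, **kwargs):
    return 100 * (1 + growth_a) ** year


def estimate_property_b(year, offset_b=0, **kwargs):
    return year + offset_b


def do_yearly(funcs, years, **kwargs):
    """Call every estimator for every year, keyed by function name."""
    output = []
    for year in range(years):
        output.append({fun.__name__: fun(year, **kwargs) for fun in funcs})
    return output


# Option 2: pass one object that holds the shared state.
@dataclass
class Scenario:
    growth_a: float = 0.0
    offset_b: int = 0


def estimate_a_from(year, scenario):
    return 100 * (1 + scenario.growth_a) ** year


def estimate_b_from(year, scenario):
    return year + scenario.offset_b


def do_yearly_with_state(funcs, years, scenario):
    """Same loop, but each callback reads what it needs from scenario."""
    return [[fun(year, scenario) for fun in funcs] for year in range(years)]


with_kwargs = do_yearly([estimate_property_a, estimate_property_b],
                        years=2, growth_a=0.5, offset_b=10)
with_state = do_yearly_with_state([estimate_a_from, estimate_b_from],
                                  years=2, scenario=Scenario(0.5, 10))
print(with_kwargs)
print(with_state)
```

The keyword version can misbehave if two estimators use the same keyword for different things, which is the conflict mentioned above. The state object makes each dependency explicit at the cost of one more type to maintain.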

At this point, this is all reflection as I haven’t done this refactor. If something has worked well for you in this situation, let me know!

Thanks to Spencer Cox and Lori Logan for improving this post. Special thanks to Yves Richard for the spirited discussion about callbacks, async, and function composition.