Introducing fklearn: Nubank’s machine learning library (Part I)

Nubank has just open-sourced fklearn, our machine learning Python library!

At Nubank we rely heavily on machine learning to make scalable, data-driven decisions. While there are many other ML libraries out there (we use XGBoost, LightGBM, and scikit-learn extensively, for example), we felt the need for a higher-level abstraction that would help us more easily apply these libraries to the problems we face. Fklearn effectively wraps these libraries into a format that makes their use in production more effective.

Fklearn currently powers a large set of machine learning models at Nubank, solving problems ranging from credit scoring to automated customer support chat responses. We built it with the following goals in mind:

  1. Validation should reflect real-life situations
  2. Production models should match validated models
  3. Models should be production-ready with few extra steps
  4. Reproducibility and in-depth analysis of model results should be easy to achieve

Early on we decided that functional programming would be a powerful ally in trying to achieve these goals.

F is for Functional

Here at Nubank we’re big fans of functional programming, and that isn’t limited to the Engineering chapter. But how does functional programming help Data Scientists?

Machine learning is frequently done with object-oriented Python code, and that’s the way we used to do it at Nubank as well. Back then, the process of building machine learning models and putting them into production was tiresome and often full of bugs. We’d deploy a model only to find that predictions made in production didn’t match the ones seen during validation. What’s more, validation was often impossible to reproduce, frequently being done in stateful Jupyter Notebooks.

Functional programming helps fix these issues by:

  • Making it easy to build pipelines where the data transformations that happen during training match the models in production.
  • Allowing for safer iteration in interactive environments (e.g. Jupyter Notebooks), preventing mistakes caused by stateful code and making research more reproducible.
  • Allowing us to write very generic validation, tuning and feature selection code that works across model types and applications, making us more efficient overall.

Let’s walk through an example to see how functional programming does this in practice. Say we’re trying to predict how much someone will spend on their credit card based on two variables: monthly income and previous bill amount. As the output of this model will be used for sensitive decision-making, we’d like to make sure it is robust to outliers in the input variables, which is why we decide to:

  1. Cap monthly income to 50,000, since income is self-reported and sometimes exaggerated.
  2. Limit the output range of the model to the [0, 20,000] interval.

And then use a simple linear regression model. Here’s what the code looks like:

Don’t be alarmed! We’ll go through the code step by step explaining some important fklearn concepts.

Learner functions

While in scikit-learn the main abstraction for a model is a class with methods fit and transform, in fklearn we use what we call a learner function. A learner function takes in some training data (plus other parameters), learns something from it and returns three things: a prediction function, the transformed training data, and a log. The first three lines of our example are initializing three learner functions: capper, linear_regression_learner, and prediction_ranger.

To better illustrate, here’s a simplified definition of the linear_regression_learner:
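The original gist is not reproduced here; the following is a close paraphrase of that simplified definition. The `LearnerReturnType` alias is defined inline for self-containment (fklearn ships its own version in `fklearn.types`):

```python
from typing import Callable, Dict, List, Tuple

import pandas as pd
from sklearn.linear_model import LinearRegression
from toolz import curry

# A learner returns: (prediction function, transformed training data, log)
LearnerReturnType = Tuple[Callable[[pd.DataFrame], pd.DataFrame], pd.DataFrame, Dict]


@curry
def linear_regression_learner(df: pd.DataFrame,
                              features: List[str],
                              target: str) -> LearnerReturnType:
    # learn something from the training data
    model = LinearRegression()
    model.fit(df[features], df[target])

    def predict_fn(new_df: pd.DataFrame) -> pd.DataFrame:
        # add the trained model's predictions as a new column
        return new_df.assign(prediction=model.predict(new_df[features]))

    log = {"linear_regression_learner": {
        "features": features,
        "target": target,
        "parameters": model.get_params()}}

    return predict_fn, predict_fn(df), log
```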

Notice the use of type hints! They help make functional programming in Python less awkward, along with the immensely useful toolz library.

As we mentioned, a learner function returns three things (a function, a DataFrame, and a dictionary), as described by the LearnerReturnType definition:

  • The prediction function always has the same signature: it takes in a DataFrame and returns a DataFrame (we use Pandas). It should be able to take in any new DataFrame (as long as it contains the required columns) and transform it (it is equivalent to the transform method of a scikit-learn object). In this case, the prediction function simply creates a new column with the predictions of the linear regression model that was trained.
  • The transformed training data is usually just the prediction function applied to the training data. It is useful when you want predictions on your training set, or for building pipelines, as we’ll see later.
  • The log is a dictionary, and can include any information that is relevant for inspecting or debugging the learner (e.g. what features were used, how many samples there were in the training set, feature importance or coefficients).

Learner functions show some common functional programming properties:

  • They are pure functions, meaning they always return the same result given the same input, and they have no side-effects. In practice, this means you can call the learner as many times as you want without worrying about getting inconsistent results. This is not always the case when calling fit on a scikit-learn object for example, as objects may mutate.
  • They are higher order functions, as they return another function (the prediction function). As the prediction function is defined within the learner itself, it can access variables in the learner function’s scope via its closure.
  • By having consistent signatures, learner functions (and prediction functions) are composable. This means building entire pipelines out of them is straightforward, as we’ll see soon.
  • They are curried, meaning you can initialize them in steps, passing just a few arguments at a time (this is what’s actually happening in the first three lines of our example). This is useful when defining pipelines, and when applying a single model to different datasets while getting consistent results.
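Currying in fklearn comes from toolz’s `@curry` decorator. A toy illustration (the function here is made up purely to show the mechanics) of initializing a function in steps:

```python
from toolz import curry


@curry
def add3(a, b, c):
    # a curried function can be called with fewer arguments than it needs;
    # it then returns a new function waiting for the rest
    return a + b + c


add_one = add3(1)          # still a function, waiting for b and c
add_one_two = add_one(2)   # still a function, waiting for c
result = add_one_two(3)    # fully applied at last: 1 + 2 + 3
```

A learner like `linear_regression_learner(features=..., target=...)` works the same way: it stays a function until the training DataFrame is finally passed in.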

It may take some time to wrap your head around all this, but don’t worry, you don’t need to be an expert in functional programming to use fklearn effectively. The key is understanding that models (and other data transformations) can be defined as functions following the learner abstraction.

Pipelines

Machine learning models rarely exist on their own, however. By focusing only on the model, Data Scientists tend to forget which transformations the data goes through before and after the ML part. These transformations often need to be exactly the same when training and deploying models, and Data Scientists may try to manually recreate their training pre- and post-processing steps in production, which leads to code duplication that is hard to maintain.

Learner functions are composable, meaning two or more learners combined can be seen as just a new, more complex learner. This means that no matter how many steps you have in your pipeline, your final model will behave just the same as a single one, and making predictions is as simple as calling the final prediction function on new data. Having all the steps in your modeling pipeline contained in a single, pure function also helps with validation and tuning, as we can pass it around to other functions without fear of side effects.

In our example, our pipeline consists of three steps: capping the income variable, running the regression, and then constraining the regression output to the [0, 20,000] range. After each learner is initialized, we build the pipeline and apply it to the training set using these two lines of code:

The learner variable now contains the pipeline resulting from composing the three learner functions, and is applied to the training data to yield the final prediction function. This function will apply all the equivalent steps in the pipeline to the test data, as the image below illustrates:

Example of how data flows through a pipeline when training, and through a prediction function when predicting. The prediction function itself is returned by the pipeline; it is the composition of the three prediction functions generated by each learner when the pipeline was first called on the training data. The logs are a combination of the logs coming from all learner functions in the pipeline.

What’s next?

We’ve seen how models and data transformation steps can be written as learner functions, and how functional pipelines in fklearn help us guarantee that transformations done during training and validation match those done in production.

In Part II of this blog post (coming soon) we’ll talk about model tuning and validation, and the tools fklearn provides to make those steps more effective.

In the meantime, we invite you to try fklearn for yourself! We don’t expect fklearn to replace the current standards in ML, but we hope it starts interesting conversations about the benefits of functional programming for Machine Learning.

Interested in Data Science and the exciting products being built at Nubank? We’re hiring!