Prototyping Data Science Features

By Gordon Shotwell, Lead Data Scientist

In Defense of Prototyping

Feature engineering is beset by the so-called “two languages problem,” in which the code for new machine learning features is first written in R or Python and then rewritten for production deployment. On the surface this approach seems incredibly wasteful: after all, if we could use a single language for developing features and deploying them, engineers could save a whole lot of development time.

What this critique ignores, in my opinion, is that prototyping and deploying features serve very different purposes, and it’s not really possible to accomplish both of those purposes with the same codebase. When you’re researching and developing features, the main goal is development speed: you want to develop and test new features as fast as possible so that you can discard bad ideas as fast as possible. In production, the main goals are stability and computational speed: you want low-dependency, fast code that is written to be maintained for a long period of time.

Prototypes and Production Have Different Goals

What we realized at Socure is that it’s hard to meet all of these goals with a single system. Data scientists are not good at writing performant, low-dependency code, and application engineers cannot develop features fast enough for us to meet our modeling targets. This says more about the people we can hire to do these jobs than about any particular tech stack. For example, Julia is a language designed to be expressive enough for research and fast enough for production, but it’s still extremely difficult to find Julia engineers who can quickly write expressive code that is stable, well tested, and has few dependencies.

Once we admitted that prototyping and deploying features were different jobs, we set out to build a prototyping system which allows us to move fast on the research side, and deploy accurately on the production side.

Features as Functions

Our feature prototyping system relies on function definitions. We have an internal R package which includes a function for each feature. Each function calculates exactly one feature, and references other features by calling other functions in the package.

age <- function(data,
                date = date(data),
                dob = dob(data)) {
  return(date - dob)
}

date <- function(data) {
  return(data$date)
}

dob <- function(data) {
  return(data$dob)
}

To calculate the age feature using this approach, you would call age(my_data) on a dataframe that includes date and dob columns.
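A quick illustration of a call like that, on some made-up data:

my_data <- data.frame(
  date = as.Date(c("2020-01-01", "2021-06-15")),
  dob  = as.Date(c("1980-01-01", "1990-06-15"))
)

age(my_data)
#> Time differences in days
#> [1] 14610 11323

This approach has a few main benefits.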

Encapsulated Features

Writing one function per feature means that you can add and modify features with confidence. For example, if we want a new age_in_years feature, we can add a new function to the package.

age_in_years <- function(data,
                         age = age(data)) {
  return(floor(age / 365))
}

Similarly, if you want to modify an upstream feature, you can do so with confidence that all of the downstream features will be calculated correctly.

Mining the Call Tree

Defining features in this way gives us a fantastically useful data structure: with some static code analysis we can generate a dependency graph of how our features interact with one another. In particular, by inspecting the formals of each feature function we can build up an edge list of all of our features, and end up with a graph of feature dependencies.
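Here is a rough sketch of how such an edge list might be derived; the feature_edges helper below is illustrative, not our production code.

feature_edges <- function(feature_names, env = parent.frame()) {
  edges <- list()
  for (f in feature_names) {
    # A default argument like `dob(data)` marks the called feature
    # as an upstream dependency of `f`
    for (arg in formals(get(f, envir = env))) {
      upstream <- if (is.call(arg)) as.character(arg[[1]]) else ""
      if (upstream %in% feature_names) {
        edges[[length(edges) + 1]] <- data.frame(from = upstream, to = f)
      }
    }
  }
  do.call(rbind, edges)
}

# With the functions defined above, this yields the edges
# date -> age, dob -> age, and age -> age_in_years
feature_edges(c("date", "dob", "age", "age_in_years"))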

Using this data structure, we can comprehensively answer questions about our features. For example:

  • Our model isn’t performing well; which data sources might be causing the problem?
  • We want to deprecate a data source; which customers might be affected?
  • We have a bug in how our scores are calculated; where in the chain of features is it occurring?

By mining the call tree we can see exactly which features, models, and customers lie downstream of a particular dataset, so this structure lets us answer these kinds of questions with much more certainty.
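For instance, once the edge list from the sketch above is loaded into a graph library like igraph, finding everything downstream of a data field is a one-line query:

library(igraph)

graph <- graph_from_data_frame(
  feature_edges(c("date", "dob", "age", "age_in_years"))
)

# Every feature reachable downstream of the dob field
names(subcomponent(graph, "dob", mode = "out"))
#> [1] "dob" "age" "age_in_years"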

Value Injection

As we add features, the whole system can become quite slow, because every feature is calculated from scratch and we may end up doing a lot of unnecessary computation. For example, if a dataset already has an age column, we should be able to calculate age_in_years without recalculating age. We solve this problem with a function that checks whether a feature is already present in the dataset and exits the feature function early if it is.

checkForValue <- function(df) {
  # Deparse the parent call to recover the name of the feature function
  call <- deparse(sys.calls()[[sys.nframe() - 1]])
  call <- paste(call, collapse = " ")
  # Everything up to the first "(" is the function name
  # (assumes the magrittr pipe is attached)
  feature_name <- stringr::str_extract(call, "^.*?\\(") %>%
    stringr::str_replace("\\($", "")

  if (feature_name %in% names(df)) {
    # Return the precomputed column from the parent call
    rlang::return_from(parent.frame(), df[[feature_name]])
  }
}

age <- function(data,
                date = date(data),
                dob = dob(data)) {
  checkForValue(data)
  return(date - dob)
}

What checkForValue does is inspect the parent call to extract the name of the calling function. If there’s a column in the dataset that matches that name, it returns that column from the parent function; otherwise the feature calculation continues as normal. This process lets us calculate features a bit more quickly, but it also lets us step into various parts of the chain. For example, we can investigate a bug in a feature by pulling a dataset with the features immediately upstream of that feature and injecting those values directly.
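A small sketch of the injection behavior, on invented data and assuming the checkForValue guard shown above:

df <- data.frame(age = c(400, 800))

# age() short-circuits on the existing age column, so no
# date or dob columns are needed at all
age_in_years(df)
#> [1] 1 2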

Communicating with Production

This framework for feature engineering also gives us a great way to ensure that features are deployed correctly. Because we know the dependencies of each feature, we can generate test files that cover all possible combinations of a feature’s inputs. For example, let’s say we had a feature for Floridians over the age of 65.

senior_floridian <- function(data,
                             state = state(data),
                             year_age = age_in_years(data)) {
  # 1 when the record is a Floridian over 65, otherwise 0
  out <- ifelse(state == "Florida" & year_age > 65, 1, 0)
  return(out)
}

When deploying this function, we generate a test file that includes every combination of the possible values for state and age. We then provide both the code for the new function and a test fixture that exhaustively defines the feature.

# Every combination of representative state and age values,
# including the NA edge cases
test_data <- expand.grid(list(
  state = c("Florida", "Montana", NA),
  age_in_years = c(0, 64, 65, 66, 100, NA)
))

# Assumes a state() accessor analogous to date() and dob(), and that
# age_in_years() carries the checkForValue guard so the precomputed
# age_in_years column is used rather than recalculated
test_data$senior_floridian <- senior_floridian(test_data)
write.csv(test_data, "senior-floridian-test-file.csv")

This process turns our prototyping exercise into a kind of test-driven development. When production engineers begin implementing a feature, they have a complete, exhaustive test file that fully defines its behavior. They can use that file to implement the feature, and then keep it in their automated test suite to ensure that the implementation doesn’t drift over time.
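On the production side, that check can be as simple as replaying the fixture; senior_floridian_prod here stands in for a hypothetical port of the feature.

fixture <- read.csv("senior-floridian-test-file.csv")

# The production implementation must reproduce the fixture exactly
stopifnot(identical(
  senior_floridian_prod(fixture),
  fixture$senior_floridian
))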
