Announcing Ballet: a new approach to collaborative feature engineering

micahjsmith
Data to AI Lab | MIT
2 min read · May 4, 2021
An abstract graphic connecting the logos for the Ballet project, MyBinder, and Project Jupyter, as well as icons connoting the crowd and big data.

Feature engineering, the process in which data scientists write code to transform raw variables into useful features for learning algorithms, has long been accepted in industry as a critical component of many successful data science pipelines. The academic community increasingly recognizes this as well.

To that end, a flurry of tools has sprung up to support this important process. The open-source and academic communities have given us feature stores, dataflow execution engines, and a variety of attempts to support the feature engineering process with automation, including the well-known featuretools library (which originated right here at the Data to AI Lab!). Many data-driven organizations also have their own internal tooling to support feature engineering.

Nevertheless, one challenge remains: how do developers write, organize, and maintain feature engineering code in the first place? Existing approaches range from collections of ad hoc SQL queries to end-to-end processing scripts and everything in between.

Today, we’re announcing Ballet, a research project from the Data to AI Lab that addresses this missing piece in feature engineering. Ballet is a lightweight framework for collaborative, open-source data science that supports developers in exploring possible features, writing new feature definitions, and contributing them back to a shared repository. Feature definitions are short, well-structured snippets of Python code that use Ballet’s feature engineering API and a library of flexible and powerful feature engineering primitives. These feature definitions are composed into a feature engineering pipeline, which can be applied to any data instances that follow the same schema.
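
To make the idea of composing feature definitions into a pipeline concrete, here is a minimal, standard-library-only sketch. It is not Ballet’s actual API: `FeatureDefinition`, `build_pipeline`, and the housing-style column names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]  # one data instance: column name -> raw value


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical stand-in for a feature definition (not Ballet's API)."""
    name: str
    inputs: Sequence[str]            # raw columns this feature reads
    transform: Callable[..., float]  # maps the input values to one feature value

    def apply(self, row: Row) -> float:
        return self.transform(*(row[col] for col in self.inputs))


def build_pipeline(
    features: Sequence[FeatureDefinition],
) -> Callable[[List[Row]], List[Row]]:
    """Compose feature definitions into one function over rows sharing a schema."""
    def pipeline(rows: List[Row]) -> List[Row]:
        return [{f.name: f.apply(row) for f in features} for row in rows]
    return pipeline


# Two short definitions, as separate collaborators might contribute them.
total_area = FeatureDefinition(
    name="total_area",
    inputs=["first_floor_sqft", "second_floor_sqft"],
    transform=lambda first, second: first + second,
)
age_at_sale = FeatureDefinition(
    name="age_at_sale",
    inputs=["year_sold", "year_built"],
    transform=lambda sold, built: sold - built,
)

pipeline = build_pipeline([total_area, age_at_sale])

rows = [
    {"first_floor_sqft": 900.0, "second_floor_sqft": 600.0,
     "year_sold": 2010.0, "year_built": 1995.0},
    {"first_floor_sqft": 1200.0, "second_floor_sqft": 0.0,
     "year_sold": 2008.0, "year_built": 2008.0},
]
feature_matrix = pipeline(rows)
```

Because each definition declares its own inputs and transformation, new definitions can be added to the list without touching existing ones, which is the property that makes a shared repository of features workable.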

Out of the box, Ballet comes with several tools to support developers in every stage of feature engineering.

  • Ballet’s interactive client can be used by developers in a notebook setting to explore the raw data, review or extend existing features written by teammates, or load the feature engineering pipeline.
  • A suite of feature API tests allows developers or test routines to ensure that feature implementations work well to extract feature values from unseen data instances.
  • Integrated streaming feature definition selection algorithms allow project owners and maintainers to reject features that don’t contribute to downstream ML performance.
  • A custom software development bot can be installed on project repositories to manage contributions by detecting features that fail validation and suggesting fixes.
  • A full-fledged development environment, built on top of JupyterLab and Binder, allows developers to spin up a notebook in the cloud and easily contribute back a feature definition without any low-level git operations.
  • And more!
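
As a rough illustration of what checking a feature against unseen data can involve, the sketch below runs a candidate feature over held-out rows and reports any row where it raises an error or yields a non-numeric or non-finite value. This is an assumption about the general kind of check, not Ballet’s actual test suite. The `validate_feature` helper and the column names are hypothetical.

```python
import math
from typing import Callable, Dict, List

Row = Dict[str, float]  # one data instance: column name -> raw value


def validate_feature(
    transform: Callable[[Row], float],
    held_out_rows: List[Row],
) -> List[str]:
    """Return failure messages; an empty list means every check passed."""
    failures = []
    for i, row in enumerate(held_out_rows):
        try:
            value = transform(row)
        except Exception as exc:  # a feature must not crash on unseen rows
            failures.append(f"row {i}: raised {type(exc).__name__}")
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            failures.append(f"row {i}: non-numeric value {value!r}")
        elif math.isnan(value) or math.isinf(value):
            failures.append(f"row {i}: non-finite value")
    return failures


# Hypothetical held-out data; the second row has no garage.
held_out = [
    {"lot_area": 8450.0, "garage_cars": 2.0},
    {"lot_area": 1200.0, "garage_cars": 0.0},
]


def lot_area_thousands(row: Row) -> float:
    return row["lot_area"] / 1000.0


def area_per_garage_car(row: Row) -> float:
    # Divides by zero when a home has no garage.
    return row["lot_area"] / row["garage_cars"]


good_failures = validate_feature(lot_area_thousands, held_out)
bad_failures = validate_feature(area_per_garage_car, held_out)
```

Here `good_failures` is empty, while `bad_failures` flags the row with no garage. A bot or maintainer could surface that kind of message to the contributor before the feature is merged.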

Interested in learning more about the Ballet project or using the framework on your own data? Check out our website and get in touch!

micahjsmith
Data to AI Lab | MIT

I am a PhD student in Computer Science at MIT in the Data to AI Lab, and an experienced software developer and ML engineer. Moderately caffeinated.