Open Sourcing TransmogrifAI
Automated Machine Learning for Structured Data
Despite huge progress in machine learning over the past decade, building production-ready machine learning systems is still hard. Three years ago, when we set out to build machine learning capabilities into the Salesforce platform, we learned that building enterprise-scale machine learning systems is even harder. To solve the problems we encountered, we built TransmogrifAI (pronounced trans-mog-ri-phi) — an end-to-end automated machine learning library for structured data that is used in production today to help power our Einstein AI platform. Today we are excited to share this project with the open source community and empower other developers and data scientists to build machine learning solutions at scale, fast.
When building machine learning capabilities for consumer products, data scientists usually focus on a handful of well understood use cases and datasets. In contrast, the diversity of the data and use cases at enterprise companies makes machine learning for enterprise products a whole other challenge. At Salesforce, our customers are looking to predict a host of outcomes — from customer churn, sales forecasts and lead conversions to email marketing click throughs, website purchases, offer acceptances, equipment failures, late payments, and much more. It is critical for enterprise customers that their data is protected and not shared with other organizations or competitors. This means that we have to build customer-specific machine learning models for any given use case. Even if we could build global models, it makes absolutely no sense to do so because every customer’s data is unique, with different schemas, different shapes, and different biases introduced by different business processes. In order to make machine learning truly work for our customers, we have to build and deploy thousands of personalized machine learning models trained on each individual customer’s data for every single use case!
The only way to achieve this without hiring an army of data scientists is through automation. Most auto-ML solutions today are either focused very narrowly on a small piece of the entire machine learning workflow, or are built for unstructured, homogeneous data such as images, voice, and language. But we needed a solution that could rapidly produce data-efficient models for heterogeneous structured data at massive scale. The dictionary defines Transmogrification as the process of transforming, often in a surprising or magical manner, which is what TransmogrifAI does for Salesforce — enabling data science teams to transform customer data into meaningful, actionable predictions. Today, thousands of customer-specific machine learning models have been deployed across the platform, powering more than 3 billion predictions every day.
In the rest of this post, we will describe the TransmogrifAI workflow, discuss the design choices under the hood, and point to further links to help get started using and contributing to this library.
The TransmogrifAI Workflow
The research and development that typically goes into building good machine learning models is considerable. The rigmarole of data preparation, feature engineering and model training is an iterative process that takes weeks or even months of a data scientist’s time, making it ripe for automation. TransmogrifAI is a library built on Scala and SparkML that does precisely this. With just a few lines of code, a data scientist can automate data cleansing, feature engineering, and model selection to arrive at a performant model from which she can explore and iterate further.
TransmogrifAI encapsulates five main components of the machine learning process:
Feature Inference: The first step in any machine learning pipeline is data preparation. The data scientist gathers all relevant data and flattens, joins and aggregates the different data sources to extract raw signals that might have predictive power. The extracted signals are then populated in a flexible data structure, commonly known as a DataFrame, from where they can be further manipulated downstream. While these data structures are simple and easy to manipulate, they don’t provide the data scientist with protection from downstream errors such as incorrect assumptions about types or nulls in the data. As a result, a data scientist runs the risk of running a pipeline overnight only to come in the next morning to find that it failed because she tried to multiply two strings.
We solve this in TransmogrifAI by allowing users to specify a schema for their data, and automatically extracting the raw predictor and response signals as “Features”. Features are strongly typed, and TransmogrifAI supports a rich and extensible feature type hierarchy. This hierarchy goes beyond the primitive types to support more nuanced types such as geo-location, phone numbers, zipcodes and more — differentiating between types that data scientists would want to treat differently. In addition to allowing for user-specified types, TransmogrifAI also does inference of its own. For instance, if it detects that a text feature with low cardinality is in fact a categorical feature in disguise, it catalogs this and treats it appropriately. The strongly-typed features allow developers to catch a majority of errors at compile-time rather than run-time. They are also key in the automation of type-specific downstream processing that is common to machine learning pipelines.
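To make this concrete, here is a minimal sketch of feature declaration and inference, loosely based on the project's Titanic example (the Passenger schema, its field names, and the passengersData DataFrame are illustrative placeholders, not part of the library):

```scala
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._

// An illustrative schema for the training records
case class Passenger(age: Option[Double], description: Option[String], survived: Double)

// Features can be declared explicitly against the schema, each with a strong type...
val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
val description = FeatureBuilder.Text[Passenger].extract(_.description.toText).asPredictor
val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse

// ...or inferred directly from a Spark DataFrame (passengersData is an assumed
// DataFrame here), with one column marked as the response
val (response, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
```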
Transmogrification (a.k.a. automated feature engineering): While having strongly typed features helps a great deal to reason about your data and minimize downstream errors, eventually all features need to be transformed into a numeric representation that exposes the regularities of the data in a way that can be easily exploited by machine learning algorithms. This process is known as feature engineering. There are countless ways to go about transforming the feature types described above, and doing it the right way is the art of data science.
As an example, let's ask ourselves how we would go about transforming a US state (e.g. CA, NY, TX, etc.) into a number. One way could be to map each state to a number between 1 and 50. The problem with this encoding is that it preserves no information about the geographical proximity of states. But proximity may well be an important feature when trying to model shopping behaviors. Another encoding we could try is to use the distance between the center of that state and the center of the US. This would solve the first problem, but still wouldn't encode information about whether a state is in the northern, southern, eastern, or western part of the country. This was a simple illustration for one feature — imagine doing this across hundreds or thousands! What makes this process especially challenging is that there is no single correct way, and successful approaches depend a great deal on the problem that we are trying to optimize.
Automatic engineering of dozens of different feature types into numeric vectors is what gives TransmogrifAI its name. TransmogrifAI comes with a myriad of techniques for all the supported feature types, ranging from phone numbers, email addresses, and geo-locations to free-form text. These transformations are not just about getting the data into a format that algorithms can use; TransmogrifAI also optimizes the transformations to make it easier for machine learning algorithms to learn from the data. For example, it might transform a numeric feature like age into the most appropriate age buckets for a particular problem — age buckets for the fashion industry might differ from those for wealth management.
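As a sketch of what this looks like in code (reusing the illustrative features declared earlier), the transmogrify shortcut vectorizes a heterogeneous set of features using type-appropriate defaults:

```scala
import com.salesforce.op._  // brings the transmogrify shortcut into scope

// Automated feature engineering: each feature is transformed according to its type
// (text is tokenized and hashed, categoricals are pivoted, numerics are imputed, etc.)
// and the results are assembled into a single numeric feature vector
val featureVector = Seq(age, description).transmogrify()
```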
But even with all of the above, feature engineering is an endless game. So in addition to providing default techniques, we put significant effort into making it easy to rapidly contribute and share feature engineering techniques so that developers can customize and extend the defaults in a reusable way.
Automated Feature Validation: Feature engineering can result in an explosion in the dimensionality of the data. And high dimensional data is often riddled with problems! For instance, the usage of particular fields in the data may have drifted over time, and models trained on these fields may perform poorly on fresh data. Another huge (and often overlooked) problem is that of hindsight bias or data leakage. This occurs when information that will not actually be present at prediction time leaks into the training examples. The result is models that look amazing on paper but that are entirely useless in practice. Consider a dataset containing information about deals, where the task is to predict the deals that are likely to close. Imagine a field in this dataset called the “Closed Deal Amount” that only gets populated after a deal closes. A blindly applied machine learning algorithm would consider this field to be highly predictive, since all closed deals would have a non-zero “Closed Deal Amount.” In reality though, this field will never be filled out for a deal that is still in the works, and the machine learning model will perform poorly for precisely the deals where the predictions actually matter! Such hindsight bias is especially problematic at Salesforce, where unknown and automated business processes often populate much of the customers’ data, making it very easy for a data scientist to confuse cause and effect.
TransmogrifAI has algorithms that perform automatic feature validation to remove features with little to no predictive power — features whose usage has drifted over time, features that exhibit zero variance, or features whose distribution in the training examples varies significantly from their distribution at prediction time. These algorithms are especially useful for preserving one’s sanity when working with high dimensional and unknown data that may be riddled with hindsight bias. They apply a slew of statistical tests based on feature types, and additionally make use of feature lineage to detect and discard such bias.
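A hedged sketch of the corresponding call, continuing the illustrative example (featureVector is the vectorized output of the previous step):

```scala
// Automated feature validation: statistically test the vectorized features against
// the response and drop those with little signal, label leakage, or unstable distributions
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)
```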
Automated Model Selection: The final stage of a data scientist’s process involves applying machine learning algorithms to the prepared data to build a predictive model. There are many different algorithms that one could try, each with a variety of knobs that can be tuned to varying degrees. Finding the right algorithm and parameter settings can make all the difference between a performant model and one that is no better than a coin toss.
The TransmogrifAI Model Selector runs a tournament of several different machine learning algorithms on the data and uses the average validation error to automatically choose the best one. It also automatically deals with the problem of imbalanced data by appropriately sampling the data and recalibrating predictions to match true priors. There is often a significant gap in the performance of the best and worst models a data scientist trains on her data, and exploring the space of possible models is critical to avoid leaving too much on the table.
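A sketch of what this looks like in code (continuing the illustrative example; exact return types of the selector have varied across versions of the library):

```scala
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector

// Automated model selection: run a tournament of candidate algorithms with
// cross-validation and keep the best-performing model
val prediction = BinaryClassificationModelSelector
  .withCrossValidation()
  .setInput(survived, checkedFeatures)
  .getOutput()
```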
Hyperparameter Optimization: Underlying all of the stages above is a hyperparameter optimization layer. In the machine learning community today, hyperparameters refer specifically to the tunable knobs on the machine learning algorithms. However, the reality is that all of the stages above come with a variety of knobs that matter. For example, during feature engineering, one might tune the number of binary variables that are pivoted out from a categorical predictor. The sampling rate for dealing with imbalanced data is yet another knob that can be adjusted. Tuning all of these parameters can be overwhelming for a data scientist, but can make the difference between a great model and one that is essentially a random number generator. This is why TransmogrifAI comes with techniques for automatically tuning these hyperparameters and a framework for extending to more advanced tuning techniques.
At Salesforce, such automation has brought down the total time taken to train models from weeks and months to a few hours. And the code that encapsulates all this complexity is quite simple. It takes just a few lines of code to specify the automated feature engineering, feature validation, and model selection above:
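(The sketch below is modeled on the project's Titanic example; the feature names, the passengersData dataset, and the schema behind them are illustrative placeholders.)

```scala
// Sketch of an end-to-end TransmogrifAI pipeline; names are illustrative
import com.salesforce.op._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector

// Automated feature engineering
val featureVector = Seq(pClass, name, sex, age, sibSp, parch, ticket, cabin, embarked).transmogrify()

// Automated feature validation
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val prediction = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up the TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(prediction).train()
```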
Design Choices
TransmogrifAI was built with the goal of improving machine learning developer productivity — not only through machine learning automation, but also through an API that enforces compile time type-safety, modularity, and reuse. Here are some of the notable design choices we made.
Apache Spark: We chose to build TransmogrifAI on top of Apache Spark for a number of reasons. First, we need to be able to handle large variations in the size of the data. While some of our customers and use cases require training models on tens of millions of records that need to be aggregated or joined, others depend on a few thousand records. Spark has primitives for dealing with distributed joins and aggregates on big data which were important for us. Second, we needed to be able to serve our machine learning models in both a batch and streaming setting. With Spark Streaming, it was easy to extend TransmogrifAI to work in both modes. Finally, by building on top of an active open source library, we could leverage the continuous improvements that are constantly being made to that library without having to reinvent the wheel for everything.
An Abstraction for Features: SparkML Pipelines introduced the abstractions of Transformers and Estimators for transforming DataFrames. TransmogrifAI builds on top of these abstractions (Transmogrification, Feature Validation, and Model Selection above, are all powered by Estimators), and in addition, introduces the abstraction of Features. A Feature is essentially a type-safe pointer to a column in a DataFrame and contains all the information about that column — its name, the type of data it contains, as well as lineage information about how it was derived.
Features then become the main primitive that developers interact with, and defining and manipulating features becomes more like working with variables in a programming language than manipulating columns in a DataFrame. Features are also shareable, allowing for collaboration and reuse amongst developers. In addition, TransmogrifAI provides the ability to easily define features that are the result of complex time-series aggregates and joins, but this could be the topic of another blog post altogether.
Type Safety: Features are strongly typed. This allows TransmogrifAI to do type checks on the entire machine learning workflow, and ensure that errors are caught as early as possible instead of hours into a running pipeline. Type-safety also comes with other niceties for developer productivity, including the ability for intelligent IDEs to suggest code completions: when working with a numeric feature, for example, the IDE can surface all the possible transformations defined for that type so you can select the one you would like to apply.
Type safety also increases the transparency around the expected inputs and outputs at every stage of a machine learning workflow. This in turn greatly reduces the amount of tribal knowledge that inevitably tends to accumulate around any sufficiently complex machine learning workflow.
Finally, the feature types are key for type-specific downstream processing, particularly for automated feature engineering and feature validation.
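For instance, here is a hedged sketch of how type information constrains what you can do with a feature (the fare and tip features are illustrative, and the exact set of operator shortcuts may differ slightly across versions):

```scala
// Arithmetic shortcuts are defined only for numeric feature types, so mixing
// incompatible types fails at compile time rather than hours into a pipeline run
val totalCost = fare + tip          // compiles: both are Real features
// val broken = fare + description  // does not compile: Real + Text is not defined
```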
Customizability & Extensibility: While developers can make use of the automated estimators to quickly spin up performant models, for users who desire more control, every single out-of-the-box estimator is parameterized and the parameters can be set and tuned directly by the data scientist herself. Additionally, one can easily specify custom transformers and estimators to be used in the pipeline. Specifying a custom transformer can be as easy as defining a lambda expression, and TransmogrifAI takes care of all the boiler plate for serializing and de-serializing transformers for you.
Scale & Performance: With automated feature engineering, data scientists can easily blow up the feature space, and end up with wide DataFrames that are hard for Spark to deal with. TransmogrifAI workflows address this by inferring the entire DAG of transformations that are needed to materialize features, and optimize the execution of this DAG by collapsing all transformations that occur at the same level of the DAG into a single operation. At the same time, since it is built on top of Spark, TransmogrifAI automatically benefits from ongoing improvements in the underlying Spark DataFrame optimization.
The result is that we can apply automated machine learning techniques on data with millions of rows and hundreds of columns, exploding the feature space to tens of thousands of columns in the process.
Empowering Everyone to TransmogrifAI
TransmogrifAI has been transformational for us, enabling our data scientists to deploy thousands of models in production with minimal hand tuning and reducing the average turn-around time for training a performant model from weeks to just a couple of hours. While this level of automation has been essential for us to scale for enterprise purposes, we believe that every business today has more machine learning use cases than it has data scientists, and automation is key to bringing the power of machine learning within reach.
Salesforce has been a long-time user of and contributor to Apache Spark, and we are excited to continue to build TransmogrifAI alongside the community. Machine learning has the potential to transform how businesses operate, and we believe that barriers to adoption can only be lowered through an open exchange of ideas and code. By working in the open we can bring together diverse perspectives to continue to push the technology forward and make it accessible to everyone.
For more information on how to get started with TransmogrifAI, please check out the project.
Acknowledgements: This post would be incomplete without an acknowledgement of the leadership and contributions of Leah McGuire, Matthew Tovbin, Kevin Moore, Michael Weil, Michael Loh, Mayukh Bhaowal, Vitaly Gordon, and the entire Einstein data science team.