TransmogrifAI: Building ML Apps simplified with AutoML

Ajay Borra
Salesforce Engineering
5 min read · Nov 5, 2018

@himanshu and I work on the IDE team at Salesforce and have been learning recently about TransmogrifAI, our new AutoML library. We’ve put together this blog to help others get started and hope you enjoy it!

Image Credits: https://unsplash.com/photos/FlHdnPO6dlw

Can you guess how long it takes to build a machine learning application? Days? Weeks? Months?

Generally, it takes months! We asked ourselves this question and set out to cut that time down to days, or even hours, to increase our productivity. TransmogrifAI (pronounced trans-mog-ri-phi) was born out of the need to reduce this time and to take over the mundane tasks involved in machine learning. It is an end-to-end AutoML library for structured data that includes a rich type system to minimize runtime errors, along with powerful features like automated feature engineering, feature selection, model selection, and hyper-parameter tuning. For more details on the key ideas and high-level design, please refer to Open Sourcing TransmogrifAI.

In this blog, we cover the magic behind TransmogrifAI. We'll demonstrate the level of effort needed to build a baseline machine learning application with Spark ML vs. TransmogrifAI:

  • Building & training a simple real estate app on the California Housing Dataset with Spark ML.
  • Building the same app using TransmogrifAI and diving into the internals to reveal the magic.

The complete source code used in this blog is hosted on GitHub.

California Housing Dataset

The California Housing Dataset includes summary statistics of the houses in a given California district, based on 1990 census data. These spatial data contain 20,640 observations on housing prices with 9 economic variables. The dependent variable is the median house value.

  • Median house value
  • Longitude
  • Latitude
  • Housing median age
  • Total rooms
  • Total bedrooms
  • Population
  • Households
  • Median income

Real Estate House Price Prediction using Apache SparkML

Let’s dive into the code and steps for building the model with Spark ML.

1) Set up the Spark session: getOrCreate() checks whether there is a valid thread-local or global default SparkSession and returns it if one is available. If not, the method creates a new SparkSession and assigns the newly created session as the global default.
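
A minimal sketch of this step, assuming Spark 2.x; the app name and local master are arbitrary choices for experimentation:

```scala
import org.apache.spark.sql.SparkSession

// Reuse the existing default SparkSession if one exists, otherwise create one.
val spark = SparkSession.builder()
  .appName("RealEstateHousePrices") // illustrative app name
  .master("local[*]")               // local mode for experimentation; omit when submitting to a cluster
  .getOrCreate()
```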

2) Read the dataset from CSV as a DataFrame: A DataFrame is a Dataset organized into named columns. It's conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
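
A sketch of loading the data; the path follows the data/cadata.csv location used later in this post, and the presence of a header row is an assumption:

```scala
// Read the CSV without schema inference so every column comes in as String;
// we cast to Double explicitly in the next step.
val housingDF = spark.read
  .option("header", "true")
  .csv("data/cadata.csv")

housingDF.printSchema()
housingDF.show(5)
```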


3) Cast to DoubleType: Since all the input features are numeric, cast each String column to Double.
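
A sketch of the cast, folding over every column; the column names come from whatever header the CSV carries:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast each String column to Double in place.
val doubleDF = housingDF.columns.foldLeft(housingDF) { (df, c) =>
  df.withColumn(c, col(c).cast(DoubleType))
}
doubleDF.printSchema()
```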


4) Create the feature and label set: Divide the DataFrame into features and a label. We use VectorAssembler to create the feature set. VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector in order to train ML models like logistic regression and decision trees.
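
A sketch of assembling the predictors into a single vector column; the column names below are assumptions based on the dataset description above:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// All columns except the dependent variable become predictors.
val featureCols = Array("longitude", "latitude", "housingMedianAge", "totalRooms",
  "totalBedrooms", "population", "households", "medianIncome")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

// Keep the median house value as the label alongside the assembled feature vector.
val assembledDF = assembler.transform(doubleDF)
  .withColumnRenamed("medianHouseValue", "label")
  .select("label", "features")
```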


5) Scale the features using StandardScaler:
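
A sketch of standardizing the feature vector; centering and scaling to unit variance is a common choice, not necessarily the exact configuration used in the original post:

```scala
import org.apache.spark.ml.feature.StandardScaler

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true) // center each feature
  .setWithStd(true)  // scale to unit standard deviation

val scaledDF = scaler.fit(assembledDF).transform(assembledDF)
```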


6) Split the dataset into training and test sets:
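
A sketch of the split; the 75/25 ratio and the seed are arbitrary choices:

```scala
// Randomly split the scaled data into a training and a test set.
val Array(trainDF, testDF) = scaledDF.randomSplit(Array(0.75, 0.25), seed = 42L)
```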

7) Build a linear regression model with the scaled features:
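
A sketch of fitting the regression on the scaled features; the hyper-parameters are illustrative defaults rather than the original post's settings:

```scala
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setFeaturesCol("scaledFeatures")
  .setLabelCol("label")
  .setMaxIter(100)

val lrModel = lr.fit(trainDF)

// Evaluate the fitted model on the held-out test set.
println(s"Test RMSE: ${lrModel.evaluate(testDF).rootMeanSquaredError}")
```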


The model can be used for predictions now.

It takes a fair amount of time and effort to create the simple app in the Apache Spark example above. In contrast, we can quickly build an app using the TransmogrifAI CLI module that comes out of the box with TransmogrifAI and explore the power of AutoML.

Real Estate House Price Prediction using TransmogrifAI

For the purpose of this blog, we are going to demonstrate how we can quickly generate a real estate housing price prediction application and train it using the California Housing dataset described above.

Code Generation

The fundamental idea behind code generation is type inference. Before going into further detail on how type inference works, let's quickly walk through the steps required to generate the app and train the model, diving into the internals of code generation along the way.

1) Clone TransmogrifAI & Build the CLI Module
Clone the TransmogrifAI project from GitHub:
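
The repository is hosted at github.com/salesforce/TransmogrifAI:

```bash
# Clone the TransmogrifAI repository and enter it.
git clone https://github.com/salesforce/TransmogrifAI.git
cd TransmogrifAI
```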

Check out the latest release branch (in this example, 0.4.0):
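
Something along these lines; the exact tag or branch name for the 0.4.0 release is an assumption, so verify it first:

```bash
# Check out the 0.4.0 release (confirm the exact name with `git tag` or `git branch -r`).
git checkout 0.4.0
```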

Build the TransmogrifAI CLI:
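
A sketch following the upstream CLI quick-start; verify the Gradle task name against the version you checked out:

```bash
# Build the CLI module with the Gradle wrapper.
./gradlew cli:shadowJar
# The upstream docs then alias the resulting jar so it can be invoked as `transmogrifai`.
```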

2) Fetch Dataset & Generate Real Estate App
Create the required directories and download the dataset:
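
A sketch of this step; the dataset URL is a placeholder for wherever you obtain the California housing CSV:

```bash
# Create a data directory and download the dataset into it.
mkdir -p data
curl -o data/cadata.csv <dataset-url>   # <dataset-url> is a placeholder
```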

Generate and build the real estate app using the California housing dataset:
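
A sketch following the CLI usage described in the upstream docs; the id/response column placeholders, the schema class name (HouseData), and the project name (RealEstate) are assumptions for this dataset:

```bash
# Generate a new TransmogrifAI project from the CSV, inferring types automatically (--auto).
transmogrifai gen --input data/cadata.csv \
  --id <id-column> --response <label-column> \
  --auto HouseData RealEstate --overwrite
# The generated project is then built with its own Gradle wrapper from inside the new
# project directory, e.g. ./gradlew compileTestScala installDist
```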

3) Training

Run training on the dataset with the generated app. This step takes about five to ten minutes:
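
A sketch of the training invocation in the shape the upstream docs describe for generated projects; the project directory, main class, model location, and reader name are placeholders for this example:

```bash
# Train the generated app on the housing data and save the fitted model.
cd realestate
./gradlew -q sparkSubmit \
  -Dmain=com.salesforce.app.RealEstate \
  -Dargs="--run-type=train --model-location=/tmp/realestate-model \
          --read-location HouseData=`pwd`/../data/cadata.csv"
```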

After running the above steps, the generated app prints a training summary. In our run, it showed that the model trained successfully and that a gradient boosted tree (gbtr_ee53956069aa) was a good fit for the dataset, along with its training and test scores.

Type Inference

So far we have seen that, using the TransmogrifAI CLI module, we can quickly get an application off the ground and train it without writing any code. Now let's dive a bit deeper and explore type inference.

In the example above, we leveraged the automatic type inference capability of TransmogrifAI, which assigns a type to each field in the CSV file and detects the schema of the dataset. In this example, type assignment happens in two phases when we invoke transmogrifai gen --input data/cadata.csv with the --auto flag. In the first phase, the input dataset is passed to the Spark CSV module, which scans the entire dataset and assigns a primitive Spark type (string, long, double) to each field to construct the schema of the dataset. This schema is passed on to the second phase, which maps these primitive types to the rich data types supported by TransmogrifAI, described in detail here. For the curious mind, here's the link to the code that handles automatic type inference.

In general, if you want more control over which type gets assigned to a field during code generation, provide the --schema flag to the transmogrifai gen command. This flag allows users to pass in an Apache Avro schema file with field-to-primitive-type mappings. In this case, the Avro schema passed to the CLI module is used to infer the primitive types of the fields, which are then mapped to the rich types in the TransmogrifAI type hierarchy.

Both cases mentioned above cover the most common scenario, where the schema of the dataset is flat without any nested objects. There can be use cases, however, where the data-reading part of the pipeline has to deal with complex records and nested schemas. For these scenarios, we recommend using the Custom Reader capabilities of the framework together with a customized Apache Avro schema file. You can tailor the dataset parsing to your needs and map the resulting types to the TransmogrifAI type hierarchy.

The resulting fields, composed of TransmogrifAI types, are used to generate the feature abstraction for each field in the dataset. This schema of feature abstractions is what enables the AutoML pipeline.
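
To make the feature abstraction concrete, here is a hedged Scala sketch in the style of TransmogrifAI's FeatureBuilder API; the HouseData record and its fields are illustrative, not the exact code the CLI generates:

```scala
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._

// An illustrative record type for a few of the housing fields.
case class HouseData(medianHouseValue: Double, medianIncome: Double, population: Double)

// The dependent variable becomes a response feature...
val medianHouseValue = FeatureBuilder.RealNN[HouseData]
  .extract(_.medianHouseValue.toRealNN)
  .asResponse

// ...and the remaining fields become predictor features.
val medianIncome = FeatureBuilder.Real[HouseData]
  .extract(_.medianIncome.toReal)
  .asPredictor

val population = FeatureBuilder.Real[HouseData]
  .extract(_.population.toReal)
  .asPredictor
```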

The real estate house price prediction use case, with its handful of numeric features, is a simple illustration of the power of TransmogrifAI. TransmogrifAI shines even more on datasets with diverse feature types that need sophisticated feature engineering, and on real-world data prone to hindsight bias or data leakage. Stay tuned for future blog posts where we uncover different aspects of the automated data pipeline capabilities offered by TransmogrifAI, including feature engineering, feature selection, model selection, and hyper-parameter tuning.
