Featuretools on Spark

Apache Spark is a framework for distributed computing and big data processing.

Data and Problem

Featuretools is a Python library for automated feature engineering.

Distributed Approach

When one problem is too big, make lots of little problems.
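
As a rough illustration of "lots of little problems", the sketch below splits transaction rows into a fixed number of partitions so that every row for a given customer lands in the same partition. It is not the article's code: the MD5-based hashing scheme, the `customer_id` and `amount` fields, and the helper names `partition_id` and `partition_rows` are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def partition_id(customer_id: str, n_partitions: int) -> int:
    """Map a customer ID to a stable partition number in [0, n_partitions)."""
    # MD5 is used only as a deterministic hash (an assumption, not the source's method).
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions


def partition_rows(rows, n_partitions):
    """Group rows by the partition of their customer, keeping each customer whole."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_id(row["customer_id"], n_partitions)].append(row)
    return dict(partitions)


# Hypothetical toy data standing in for a large transactions table.
rows = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 30.0},
    {"customer_id": "c3", "amount": 7.5},
]
partitions = partition_rows(rows, n_partitions=4)
```

Because a customer's rows never straddle two partitions, each partition can be turned into features independently of the others.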

Architecture Choices

Distributed architecture for running feature engineering on Spark.

Spark Cluster

Running instances in EC2 dashboard.
Dashboard showing the Spark cluster running.

PySpark Implementation

Pseudocode for calculating a feature matrix for one partition.
Code to parallelize feature matrix calculation.
Basic overview of Spark job at localhost:4040.
Information on Stages tab of job dashboard.
A subset of the 230 features for one partition of customers.
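
The pattern these captions describe (one function that builds a feature matrix for a single partition, mapped over all partitions in parallel) can be sketched with the standard library alone. This is a stand-in under stated assumptions, not the article's implementation: a thread pool takes the place of Spark executors, three hand-written aggregates take the place of the features Featuretools would generate, and `partition_to_feature_matrix`, `run_all_partitions`, and the toy data are hypothetical names and values.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def partition_to_feature_matrix(rows):
    """Build one feature row per customer for a single partition.

    Stand-in for the real per-partition step; the three aggregates
    below are illustrative, not the 230 features from the article.
    """
    amounts_by_customer = {}
    for row in rows:
        amounts_by_customer.setdefault(row["customer_id"], []).append(row["amount"])
    return {
        customer: {
            "COUNT(transactions)": len(amounts),
            "SUM(transactions.amount)": sum(amounts),
            "MEAN(transactions.amount)": mean(amounts),
        }
        for customer, amounts in amounts_by_customer.items()
    }


def run_all_partitions(partitions, max_workers=4):
    """Map the per-partition function over every partition and merge the results."""
    # The thread pool is an assumption standing in for a Spark cluster.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(partition_to_feature_matrix, partitions))
    feature_matrix = {}
    for partial in results:
        feature_matrix.update(partial)  # customers never span partitions
    return feature_matrix


# Hypothetical toy partitions: each customer appears in exactly one.
partitions = [
    [{"customer_id": "a", "amount": 10.0}, {"customer_id": "a", "amount": 30.0}],
    [{"customer_id": "b", "amount": 5.0}],
]
feature_matrix = run_all_partitions(partitions)
```

In PySpark the same shape would typically be written by passing the list of partitions to `SparkContext.parallelize` and applying the per-partition function with `RDD.map`, with each worker writing or returning its own feature matrix.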

While this calculation would have been possible on a single machine, parallelizing feature engineering is an efficient way to scale to larger datasets. Furthermore, the partition-and-distribute framework applies in many other situations and can bring significant efficiency gains.


If building meaningful, high-performance predictive models is something you care about, then get in touch with us at Feature Labs. While this project was completed with the open-source Featuretools, the commercial product offers additional tools and support for creating machine learning solutions.




