Featuretools on Spark

Apache Spark is a framework for distributed computing and big data processing.

Data and Problem

Featuretools is a Python library for automated feature engineering.

Distributed Approach

When one problem is too big, make lots of little problems.
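As a minimal sketch of this idea, the snippet below splits a list of customers into a fixed number of smaller, independent groups by hashing each customer ID. It assumes customers are identified by string IDs; the function names, the use of SHA-256, and the number of partitions are illustrative assumptions, not details taken from the original project.

```python
import hashlib
from collections import defaultdict


def assign_partition(customer_id: str, n_partitions: int) -> int:
    """Map a customer ID to a stable partition number in [0, n_partitions).

    hashlib is used instead of the built-in hash() because Python randomizes
    string hashes per process, which would make assignments unrepeatable.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions


def split_customers(customer_ids, n_partitions):
    """Group customer IDs by partition number (hypothetical helper)."""
    partitions = defaultdict(list)
    for customer_id in customer_ids:
        partitions[assign_partition(customer_id, n_partitions)].append(customer_id)
    return dict(partitions)


# Illustrative IDs only; real customer IDs would come from the dataset.
example_ids = [f"customer_{i}" for i in range(100)]
example_partitions = split_customers(example_ids, n_partitions=4)
```

Because every customer lands in exactly one group, each group can be processed on its own and the results combined afterwards.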

Architecture Choices

Distributed architecture for running feature engineering on Spark.

Spark Cluster

Running instances in EC2 dashboard.
Dashboard showing the Spark cluster running.

PySpark Implementation

Pseudocode for calculating a feature matrix for one partition.
Code to parallelize feature matrix calculation.
Basic overview of Spark job at localhost:4040.
Information on Stages tab of job dashboard.
A subset of the 230 features for one partition of customers.
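The pattern described in the captions above can be sketched roughly as follows: one function handles a single partition, and Spark maps that function over the list of partition numbers. This is a sketch under stated assumptions rather than the project's actual code. `partition_to_feature_matrix` is a hypothetical placeholder (in practice it would load the partition's data, build an EntitySet, and call `featuretools.dfs`), the output path format is invented for illustration, and `run_partitions_locally` is only a serial stand-in for checking the logic without a cluster.

```python
def partition_to_feature_matrix(partition_id):
    """Hypothetical placeholder for the per-partition work.

    A real version would read the partition's data, run Featuretools on it,
    and save the resulting feature matrix. Here it only returns the
    (made-up) location where that matrix would be written.
    """
    return f"p{partition_id}/feature_matrix.csv"


def run_partitions_locally(partition_ids, task):
    """Serial stand-in with the same shape as the Spark job."""
    return [task(partition_id) for partition_id in partition_ids]


def run_partitions_on_spark(partition_ids, task, app_name="feature-partitions"):
    """Distribute one task per partition with Spark's RDD API.

    Requires pyspark; the import is local so the rest of the sketch
    runs without it. The app name is an arbitrary assumption.
    """
    from pyspark import SparkConf, SparkContext

    ids = list(partition_ids)
    sc = SparkContext.getOrCreate(SparkConf().setAppName(app_name))
    try:
        # One slice per partition so each Spark task handles one partition.
        rdd = sc.parallelize(ids, numSlices=max(1, len(ids)))
        return rdd.map(task).collect()
    finally:
        sc.stop()


# Check the logic without a cluster; swap in run_partitions_on_spark to distribute.
local_results = run_partitions_locally(range(3), partition_to_feature_matrix)
```

Keeping the per-partition function free of Spark-specific code means it can be tested on a single machine first and then handed to the cluster unchanged.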

While this calculation would have been possible on a single machine, parallelizing feature engineering is an efficient way to scale to larger datasets. Furthermore, the partition-and-distribute framework applies to many other situations and can deliver significant efficiency gains.


If building meaningful, high-performance predictive models is something you care about, then get in touch with us at Feature Labs. While this project was completed with the open-source Featuretools, the commercial product offers additional tools and support for creating machine learning solutions.




Feature Labs Engineering Blog

Will Koehrsen


Data Scientist at Cortex Intel, Data Science Communicator
