Split-Apply-Combine Strategy for Data Mining

Anurag Pandey
Published in Analytics Vidhya
9 min read · Oct 26, 2018


In a typical exploratory data analysis, we divide the data set at some level of granularity and then aggregate the data at that level to understand its central tendency. A famous (must-read) paper by Hadley Wickham describes this split-apply-combine strategy as one of the most common patterns in data analysis. Be it marketing segmentation or behavioral research, we use this technique at some point during our analysis.

Introduction

This article illustrates the split-apply-combine strategy, in which we break a big problem into small, manageable pieces (Split), operate on each piece independently (Apply), and then put all the pieces back together (Combine). Many existing tools support this strategy: the GroupBy function in SQL and Python, LOD expressions in Tableau, and the plyr functions in R, to name a few. In this article, we will discuss not only the implementation of this strategy but also some relevant applications of it in feature engineering.
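To make the three steps concrete before turning to any particular tool, here is a minimal sketch in plain Python. The `records` data, the segment labels, and the revenue values are hypothetical and exist only for illustration; the choice of the mean as the per-group summary is likewise an assumption, picked because it is a common measure of central tendency.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records of (segment, revenue); values are illustrative only.
records = [("A", 10.0), ("B", 20.0), ("A", 30.0), ("B", 60.0), ("A", 50.0)]

# Split: partition the records into pieces by their grouping key.
groups = defaultdict(list)
for segment, revenue in records:
    groups[segment].append(revenue)

# Apply: summarize each piece independently (here, with the mean).
# Combine: collect the per-group results into a single structure.
result = {segment: mean(values) for segment, values in groups.items()}

print(result)
```

Tools such as pandas wrap these same three steps behind a single call: with a DataFrame `df` holding these columns, `df.groupby("segment")["revenue"].mean()` would compute the same per-segment averages.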

In Python, we do this with the pandas GroupBy operation, which involves one or more of the three steps of the split-apply-combine strategy. Let us start by defining each of the three steps:
