Filtering vs. Enriching Data in Apache Spark

Gokul Prabagaren
Capital One Tech
Feb 5, 2019

I know it’s a vague heading… In the broader scheme of big data processing, there are various options available, each with its own tradeoffs, for how data can be handled. The intent of this blog is to address that vagueness by comparing two options my team tried out, and to explain why we came to prefer one over the other.

Data is in Capital One’s DNA. All our decisions are data-driven, and as a financial company with many regulatory requirements, we are obligated to provide granular processing data rather than just high-level details.

But before we jump into more details on the topic, here is a quick gist of Apache Spark for beginners. (Skip to the next section if you are already familiar with Apache Spark.)

Quick Intro to Apache Spark

Apache Spark is an open source big data framework for analyzing massive amounts of data at scale using computing clusters. The main advantage of Spark is its in-memory computation, which increases the speed of application processing. Spark also supports a variety of workloads, such as batch, machine learning, and streaming, on the same infrastructure once it is established. Spark started at UC Berkeley as an academic project and was later contributed to the Apache Software Foundation in June 2013. As of this writing, the latest version of Spark is 2.4.

Our Application Use Case #1: Filter

Capital One is a heavy user of Apache Spark for its batch and streaming workloads. The application I work on is one of the core credit card transaction processing engines for computing rewards for our card customers. Our batch pipeline has a number of Spark jobs, and the focus for this blog is on the first version of a job named Filter.

Functionality of Filter

Filter takes credit card transactions as input, applies a bunch of business logic, and filters out non-relevant transactions for earning rewards.

High Level Workflow of Filter Job

Issues With Filter

We had this version in production for a few months and ran into an issue while debugging a data problem. Each stage in the above pipeline is a Spark inner join transformation between two datasets. The final output persisted the qualified transactions with their rewards calculated. However, we weren’t able to trace, at the transaction level, which particular business rule made a transaction unqualified for earning rewards. This was because all intermediate join result sets are computed in memory and passed to the next stage when a Spark action is performed.
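
To make the pattern concrete, here is a minimal sketch of what one filter-style stage looks like in the DataFrame API. The DataFrame names, column names, and paths are illustrative assumptions, not our actual code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("FilterStageSketch").getOrCreate()

    // Hypothetical inputs: the day's transactions and the accounts that pass
    // the Account Eligibility rule.
    val transactions     = spark.read.parquet("/data/transactions")
    val eligibleAccounts = spark.read.parquet("/data/eligible_accounts")

    // One filter-style stage: an inner join keeps only transactions whose
    // account is eligible; every other row silently disappears, along with
    // the reason it was dropped.
    val afterEligibility = transactions.join(eligibleAccounts, Seq("account_id"), "inner")

Each subsequent stage repeats this join-and-shrink pattern, so only the surviving rows are left to persist at the end.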

Natural questions arose immediately. Could we run Spark action counts at the end of each stage and cache the dataset? In fact, we were already doing this. But the issue was that counts could only tell us how many transactions were filtered out by a stage’s business logic, not the details of each filtered transaction.
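
As a rough sketch of that per-stage instrumentation (continuing the hypothetical DataFrames above), caching and counting only surfaces aggregate numbers:

    // Cache the intermediate result and trigger an action to count it.
    val afterEligibility = transactions
      .join(eligibleAccounts, Seq("account_id"), "inner")
      .cache()

    val remaining = afterEligibility.count()
    println(s"Transactions remaining after Account Eligibility: $remaining")
    // The count tells us how many rows were filtered out, but not which
    // transactions were dropped or why.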

Sample Data Representation of the Workflow (with Counts)

Spark Data Representation of Filter Job

Assume ten transactions are fed as input. After applying the business logic (Account Eligibility), only five transactions move to the next stage. In the next stage, the same thing happens: three more transactions are filtered out, leaving two qualified transactions with calculated rewards.

How Did We Overcome This Problem?

As previously stated, the core issue is business logic filtration using Spark’s in-memory inner join. It makes it hard to get at the granular details of each transaction and determine why and how the business rules were applied when debugging. After much discussion, the team came up with a different design pattern: enriching the data rather than filtering it at each step. Which brings us to…

Our Application Use Case #2: Enrichment

Our new job is Enrichment. For this job, instead of applying a Spark inner join and business logic filtration in one step, we do it in two steps:

  • First step — data enrichment: each stage enriches the input with the required data from its business logic DataFrame using a Spark left outer join.
  • Second step — business logic filtration using the data points gathered in the first step (a sketch of both steps follows this list).
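
Here is a rough sketch of the two steps in the DataFrame API. The DataFrame names, columns, and paths are illustrative assumptions (including a second hypothetical rule dataset, eligibleMerchants), not our actual code:

    import org.apache.spark.sql.functions.{col, lit}

    // Step 1 — enrichment: left outer joins keep every transaction and add a
    // flag column per business rule instead of dropping rows.
    val enriched = transactions
      .join(eligibleAccounts.withColumn("is_account_eligible", lit(true)),
            Seq("account_id"), "left_outer")
      .join(eligibleMerchants.withColumn("is_merchant_eligible", lit(true)),
            Seq("merchant_id"), "left_outer")
      .na.fill(false, Seq("is_account_eligible", "is_merchant_eligible"))

    // Persist the enriched dataset so every transaction's rule outcomes are
    // queryable later.
    enriched.write.mode("overwrite").parquet("/data/enriched_transactions")

    // Step 2 — filtration: apply the business rules using the flags gathered above.
    val qualified = enriched.filter(col("is_account_eligible") && col("is_merchant_eligible"))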

Sample Data Representation of Workflow (With Counts) Using Enrichment

Spark Data Representation of Enrichment Job

Same example as before: ten transactions are fed as input. After applying a left outer join with the business logic dataset (Account Eligibility), the job gathers the columns required for filtering in later stages. Transactions are not filtered out; instead, we enrich the original input dataset with the required data columns from each stage. After applying business logic filtration in the last stage, we still end up with only two qualifying transactions. The difference is the additional enriched data gained along the way.

Advantages of Enrichment over Filter

Now, at each stage, the required data is captured via a left outer join and added to the original dataset, preserving state information for more detailed analysis and debugging later. The same data columns/flags are used to apply the business logic, and the final stage yields the same result but with more granularity. With this approach we were able to see the state of every business logic column/flag for each transaction.
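
For example, with the flags persisted, a single query over the hypothetical enriched dataset sketched earlier can answer why a specific transaction did not earn rewards, or how many transactions failed each rule:

    // Why didn't this one transaction earn rewards? (id is illustrative)
    enriched
      .filter(col("transaction_id") === "txn-00123")
      .select("transaction_id", "is_account_eligible", "is_merchant_eligible")
      .show()

    // How many transactions failed each combination of rules?
    enriched.groupBy("is_account_eligible", "is_merchant_eligible").count().show()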

Making the Switch

When we deployed Enrichment to production, we wanted to make sure the output was the same. To verify, we ran both jobs against the same input dataset and deployed comparison jobs to compare the results of Filter and Enrichment. Over that period, our comparison jobs consistently showed matching results from the Filter and Enrichment jobs. This gave us the confidence to move forward with our new Enrichment pattern. After successfully verifying the Enrichment job’s performance in production, we replaced Filter and have been successfully running Enrichment in production for a year now. Filter has been deprecated and sunset from our overall processing workflow.
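
A rough sketch of such a comparison job, assuming both outputs are written to (hypothetical) locations with the same schema:

    import org.apache.spark.sql.functions.col

    val filterOutput = spark.read.parquet("/data/filter_output")
    val enrichmentOutput = spark.read.parquet("/data/enrichment_output")
      .select(filterOutput.columns.map(col): _*)   // compare only the columns Filter also produces

    // Rows present in one output but not the other indicate a behavioral difference.
    val onlyInFilter     = filterOutput.except(enrichmentOutput)
    val onlyInEnrichment = enrichmentOutput.except(filterOutput)

    println(s"Only in Filter output:     ${onlyInFilter.count()}")
    println(s"Only in Enrichment output: ${onlyInEnrichment.count()}")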

Enrich vs Filter?

Our choice of Apache Spark for our platform modernization efforts definitely yielded good results in terms of the performance and accuracy of our overall operation. Our platform processes millions of transactions a day, awarding millions of miles, cash, and points to customers every day. Given the high bar for processing volume and accuracy we expected of our platform, it’s not really a surprise that we leaned towards enriching our data over filtering it.

Merits of Enrichment Data Pattern

Persists the same set of records and grows the dataset by columns. This granular data tracking enables better debugging and greater accuracy in identification and reporting.

Merits of Filtering Data Pattern

Faster computation due to in-memory processing, and provides high-level metrics.

Hope this comparison helps with the tradeoffs in your own use case decisions.

DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2019 Capital One.

Gokul Prabagaren
Capital One Tech

Master Software Engineer @ Capital One. Developing and maintaining the code and infrastructure of Apache Spark applications for Capital One’s Credit Card Earn Engine.