A Look into PayPal’s Contributions to Apache DataFu

Eyal Allweil
Jan 15 · 6 min read

Raw customer data, with more than one row per customer
REGISTER datafu-pig-1.5.0.jar;
IMPORT 'datafu/dedup.pig';
data = LOAD 'customers.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
dedup_data = dedup(data, 'id', 'date_updated');
STORE dedup_data INTO 'dedup_out';
“Deduplicated” data, with only the most recent record for each customer
dedup_data = dedup(data, 'id', 'date_updated');
dedup_data = dedup(data, '(id, name)', 'date_updated');
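To see the effect of the two calls above outside of Pig, here is a minimal plain-Python sketch of the same idea: keep one row per key, choosing the row with the greatest value in the ordering field. The function name `dedup_rows` and the sample rows are illustrative assumptions. This is not how DataFu implements the macro, only a model of its result.

```python
def dedup_rows(rows, key_fields, order_field):
    """Keep one row per key: the row with the largest value of order_field."""
    latest = {}
    for row in rows:
        # A tuple key covers both the single-column and the composite case.
        key = tuple(row[f] for f in key_fields)
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical rows; ISO-formatted dates compare correctly as strings.
customers = [
    {"id": 2, "name": "julia", "purchases": 1, "date_updated": "2018-01-01"},
    {"id": 2, "name": "julia", "purchases": 3, "date_updated": "2018-02-01"},
    {"id": 4, "name": "alice", "purchases": 5, "date_updated": "2018-01-15"},
]

# Single-column key, like dedup(data, 'id', 'date_updated')
by_id = dedup_rows(customers, ["id"], "date_updated")

# Composite key, like dedup(data, '(id, name)', 'date_updated')
by_id_and_name = dedup_rows(customers, ["id", "name"], "date_updated")

print(by_id)
```

In this sample both calls return the same two rows, because each id maps to a single name. The composite key only changes the result when the same id appears with different names.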

REGISTER datafu-pig-1.5.0.jar;
IMPORT 'datafu/sample_by_keys.pig';
data = LOAD 'customers.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, updated: chararray);
customers = LOAD 'sample.csv' AS (cust_id: int);
sampled = sample_by_keys(data, customers, id, cust_id);
STORE sampled INTO 'sample_out';
Only customers 2, 4, and 6 appear in our new sample
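Conceptually, the result described above is the subset of rows whose key appears in the list of sampled keys. The plain-Python sketch below shows that filtering step only. The name `filter_by_keys` and the sample rows are assumptions made for illustration, and the sketch says nothing about how the macro performs the join on a cluster.

```python
def filter_by_keys(rows, row_key, wanted_keys):
    """Return only the rows whose row_key value is in wanted_keys."""
    wanted = set(wanted_keys)  # set membership keeps each lookup cheap
    return [row for row in rows if row[row_key] in wanted]


# Hypothetical customers with ids 1 through 7.
customers = [{"id": i, "name": "customer_%d" % i} for i in range(1, 8)]

# The keys we want to keep, playing the role of sample.csv.
sample_ids = [2, 4, 6]

sampled = filter_by_keys(customers, "id", sample_ids)
print([row["id"] for row in sampled])
```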

  1. We will change date_updated for record 2, julia
  2. We will change purchases and date_updated for record 4, alice
  3. We will add a new row, record 8, amanda
REGISTER datafu-pig-1.5.0.jar;
IMPORT 'datafu/diff_macros.pig';
data = LOAD 'dedup_out.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
changed = LOAD 'dedup_out_changed.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
diffs = diff_macro(data, changed, id, '');
DUMP diffs;
diffs = diff_macro(data, changed, id, '');
diffs = diff_macro(data, changed, id, 'date_updated');
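The difference between the two calls above is the last argument, which names a field to leave out of the comparison. The plain-Python sketch below models that behavior on the three changes listed earlier. The function name `diff_rows`, the labels `missing`, `added`, and `changed`, the output shape, and the sample values are all assumptions for illustration; the macro's real output format may differ.

```python
def diff_rows(old_rows, new_rows, key, ignored=()):
    """Compare two row sets by key, skipping any field named in ignored."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    result = []
    for k in sorted(old.keys() | new.keys()):
        if k not in new:
            result.append((k, "missing", []))
        elif k not in old:
            result.append((k, "added", []))
        else:
            fields = [f for f in old[k]
                      if f not in ignored and old[k][f] != new[k].get(f)]
            if fields:  # identical rows are not reported
                result.append((k, "changed", fields))
    return result


# Hypothetical values for the records mentioned in the text.
before = [
    {"id": 2, "name": "julia", "purchases": 3, "date_updated": "2018-02-01"},
    {"id": 4, "name": "alice", "purchases": 5, "date_updated": "2018-01-15"},
]
after = [
    {"id": 2, "name": "julia", "purchases": 3, "date_updated": "2018-03-01"},
    {"id": 4, "name": "alice", "purchases": 9, "date_updated": "2018-03-02"},
    {"id": 8, "name": "amanda", "purchases": 1, "date_updated": "2018-03-03"},
]

all_diffs = diff_rows(before, after, "id")
without_dates = diff_rows(before, after, "id", ignored=("date_updated",))
print(all_diffs)
print(without_dates)
```

With `date_updated` ignored, record 2 drops out of the result because its date was the only thing that changed, while record 4 is still reported for its `purchases` field.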

data = LOAD 'transactions.csv' USING PigStorage(',') AS (name: chararray, transaction_id: int);
grouped = GROUP data BY name;
counts = FOREACH grouped {
    distincts = DISTINCT data.transaction_id;
    GENERATE group, COUNT(distincts) AS distinct_count;
};
DUMP counts;

REGISTER datafu-pig-1.5.0.jar;
DEFINE CountDistinctUpTo3 datafu.pig.bags.CountDistinctUpTo('3');
DEFINE CountDistinctUpTo5 datafu.pig.bags.CountDistinctUpTo('5');
data = LOAD 'transactions.csv' USING PigStorage(',') AS (name: chararray, transaction_id: int);
grouped = GROUP data BY name;
counts = FOREACH grouped GENERATE group, CountDistinctUpTo3($1) AS cnt3, CountDistinctUpTo5($1) AS cnt5;
DUMP counts;
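The idea behind a capped distinct count can be sketched in plain Python: track the distinct values seen so far and stop as soon as the cap is reached, so a large group never has to be processed in full. The function name `count_distinct_up_to` and the sample data are assumptions for illustration, and this is not DataFu's implementation.

```python
def count_distinct_up_to(values, limit):
    """Count distinct values, stopping once limit distinct values are seen."""
    seen = set()
    for value in values:
        if len(seen) >= limit:
            break  # cap reached; the rest of the input is never examined
        seen.add(value)
    return len(seen)


# Hypothetical transaction ids for two customers.
many = [101, 101, 102, 103, 104, 105, 106]  # 6 distinct ids
few = [201, 201, 202]                       # 2 distinct ids

cnt3_many = count_distinct_up_to(many, 3)
cnt5_many = count_distinct_up_to(many, 5)
cnt3_few = count_distinct_up_to(few, 3)
print(cnt3_many, cnt5_many, cnt3_few)
```

The result is the true distinct count when it is below the cap, and the cap itself otherwise, which is enough when the question is only whether a group has at least that many distinct values.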


PayPal Engineering

The PayPal Engineering Blog
