From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World — Part 1

Michael Zeltser
The PayPal Technology Blog
9 min read · Nov 8, 2016

Part 1: Reactive Manifesto’s Invisible Hand

Let me first set up the context for my story. I’ve been with PayPal for five years. I’m an architect, part of the team responsible for PayPal’s Tracking domain. Tracking is commonly and historically understood as the measurement of customer visits to web pages. With the customer’s permission, our platform collects all kinds of signals from PayPal web pages, mobile apps, and services, for a variety of reasons. Most prominent among them are measuring new product adoption, A/B testing, and fraud analysis.

We collect several terabytes of data on our Hadoop systems every day. That is not an enormous amount by industry standards, but it is big enough to have been one of the primary motivations for adopting Hadoop at PayPal four years ago.

Story

This particular story starts in the middle of December 2015, a week or so before holidays. I was pulled into a meeting to discuss a couple of new projects.

At the time, PayPal was feverishly preparing to release a brand spanking new mobile wallet app. It had been in development for a long time and promised a complete change in the PayPal experience on smartphones. Lots of attention was being paid to this rollout by all levels of leadership — basically, a Big Deal!

Measuring new product adoption happens to be one of the main raisons d’être of the tracking platform, so following rollouts is nothing new for us. This time, however, we were ready to provide adoption metrics in near real time. By the launch date at the end of February, product managers were able to report the new app’s adoption progress “as of 30 minutes ago” instead of the customary “as of yesterday”.

This was our first full production Fast Data application. You might be wondering why the title says four weeks when mid-December to the end of February is ten weeks.

It did in fact take a small team — Sudhir R., Loga V., Shunmugavel C., Krishnaprasad K. — only four weeks to make it happen.

The first full-volume live test happened during the Super Bowl, when PayPal aired its ad during the game. Smaller-volume tests had happened even earlier, in January, when the ad was first showcased on some news channels.

Just one year earlier there would have been no chance we could deliver something like that in four weeks, but in December 2015 we were on a roll. We had a production-quality prototype running within two weeks, and a full-volume live test happened two weeks later. In another six weeks, PayPal’s largest release of the last several years was being measured by this analytical workflow.

So what changed during a single year?
Allow me to get really nerdy for a second. On a fundamental level, what happened is that we embraced (at first without even realizing it) the Reactive Manifesto in two key components of our platform.

What changed between December 2014 and December 2015

The original Tracking platform design followed the common industry standard of 4–5 years ago. We collected data in real time, buffered it in memory and small files, and finally uploaded consolidated files onto a Hadoop cluster. At that point a variety of MapReduce jobs would crunch the data and publish an “analyst friendly” version to a Teradata repository, from which our end users could query the data using SQL.

Sounds straightforward? It is anything but. There are several pitfalls and bottlenecks to consider.
The first is data movement. We are a large company, with lots of services and lots of different types of processing distributed across multiple data centers. On top of all of this, we are also paranoid about the security of anything and everything.

We ended up keeping a dedicated Hadoop cluster in a live environment for the sole purpose of collecting data. Then we would copy files to our main analytical Hadoop cluster in a different data center.

[Figure: fpti-etl]

It worked, but… All the waiting for big-enough files and the scheduling of file transfers added up to hours. We tinkered with different tools, open source and commercial, looking for a goldilocks solution that would address all the constraints and challenges. At the end of the day, most of those tools failed to account for the reality on the ground.

Until, that is, we discovered Kafka. That was the first reactive domino to fall.

Kafka and moving data across data centers

There are three brilliant design decisions that made Kafka the indispensable tool it is today:

  • multi-datacenter topology out of the box;
  • short-circuiting internal data movement from network socket to OS file system cache and back to network socket;
  • sticking with strictly sequential I/O for reads and writes.

It fits our needs almost perfectly:

  • Throughput? Moving data between datacenters? Check.
  • Can configure topology to satisfy our complicated environment? Flexible persisted buffer between real-time stream and Hadoop? Check.
  • Price-performance? High availability? Check.
[Figure: fpti-kafka]

Challenges? A few.
The only real trouble we had with Kafka itself was the maturity level of its producer API and MirrorMaker. There are lots of knobs and dials to figure out, and they change from release to release. Some producer API behaviors are finicky. We had to invest some time into making sure we were not losing data between our client and the Kafka broker when the broker went down or the connection was lost.
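
For illustration, here is a minimal sketch of the kind of producer settings that tuning involves, assuming the “new” Java producer API; the broker addresses, topic name, and retry count are placeholders rather than our production configuration, and exact property names vary a little between Kafka releases.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

object ReliableProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092") // placeholder brokers
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all") // wait for in-sync replicas before treating a send as successful
    props.put("retries", "5") // retry transient broker/connection failures instead of dropping

    val producer = new KafkaProducer[String, String](props)

    producer.send(
      new ProducerRecord[String, String]("tracking-events", "visitorId", "event payload"),
      new Callback {
        // Surface failures explicitly so events are not silently lost when a broker goes down.
        override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
          if (exception != null) exception.printStackTrace()
      })

    producer.close() // blocks until buffered sends have completed
  }
}
```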

MirrorMaker requires careful configuration to make sure events are not double-compressed. It took some experimentation in a live environment to arrive at the proper capacity for the MirrorMaker instances. Even with all the hiccups, and considering we were using a pre-GA open-source tool, it was a rewarding and successful endeavor.

Most of our troubles did not come from Kafka. Our network infrastructure, and especially our firewall capacity, was not ready for the awesome throughput Kafka unleashed. We were able to move very high volumes of data at sub-second latency between geographically distant data centers, but every so many hours the whole setup would slow to a crawl.

It took us some time to understand that it was not a Kafka or MirrorMaker problem (as we naturally suspected at first). As it happened, we were the first team to attempt that kind of volume of streaming traffic between our data centers. The prevailing traffic patterns at the time were all bulk oriented: file transfers, DB imports/exports, and so on. Sustained real-time traffic immediately uncovered bottlenecks and temporary saturation points that had been hidden before. It was a good learning experience for us and our network engineers.

Personally, I was very impressed with Kafka’s capabilities and naturally wanted to know how it was written. That led me to Scala. Long story short, I had to rediscover functional programming and, with it, a totally different way of thinking about concurrency. That led me to Akka and the Actor model — hello, Reactive Programming!

From MapReduce to Spark

Another major challenge that plagued us at the time was MapReduce inefficiency, both in the physical sense and in the developer-productivity sense. There are, literally, books written on MapReduce optimization. We know why it is inefficient and how to deal with it — in theory.

In practice, however, developers tend to use MapReduce in the simplest way possible, splitting workflows into multiple digestible steps, saving intermediate results to HDFS, and incurring a horrifying amount of I/O in the process.

[Figure: fpti-mr]

MapReduce is just too low level and too dumb. Mixing complex business logic with low-level MapReduce optimization techniques is asking too much. In architectural lingo: the buildability of such a system is very problematic. Or in English: it takes way too long to code and test. The predictable result is tons of long-running, brittle, hard-to-manage, and inefficient workflows.

Plus, it is Java! Who in their right mind would use Java for analytics? Analytics is a synonym for SQL, right? We ended up writing convoluted pipes in MapReduce to clean up and reorganize the data, then dumped it into Teradata so somebody else could do agile analytical work in SQL while we were busy figuring out MapReduce. Pig and Hive helped to a degree, but only just.

We tried exploring alternatives like “SQL on Hadoop” and the rest. There are lots of cool tools and lots of vendors; the Hadoop ecosystem is rich, but hard to navigate. None of them addressed our needs comprehensively. The goldilocks solution for Big Data seemed out of reach.

Until, that is, we stumbled into Spark. And that was the second reactive domino to fall.

By then it had become clear to us that MapReduce was a dead end. But Spark? Switching to Scala? What about the existing, hard-earned skill set? People had just started to get comfortable with MapReduce, Pig, and Hive; switching was very risky. All valid arguments. Nevertheless, we jumped.

What makes Spark such a compelling system for big data?
The RDD (Resilient Distributed Dataset) concept addressed most of the MapReduce inefficiency issues, both physical and productivity-related.

Resilient: resiliency is key. The main reason behind MapReduce’s excessive I/O is the need to preserve intermediate data to protect against node crashes. Spark solves the same problem with minimal I/O by keeping a history of transformations (functions) instead of the actual transformed data chunks. Intermediate data can still be checkpointed to HDFS if the process is very long, and the developer can decide case by case whether a checkpoint is needed. Once a dataset is read into cache, it can stay there as long as needed, or as long as the cache size allows. Again, the developer controls the behavior case by case.
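
A minimal sketch of what that looks like in practice, with a made-up input path and record layout: the lineage of map and filter calls is what lets Spark recompute lost partitions, while cache() and checkpoint() remain explicit, per-dataset decisions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))
    sc.setCheckpointDir("/tmp/checkpoints") // an HDFS path on a real cluster

    // The input path and record layout here are invented for illustration.
    val events = sc.textFile("/data/tracking/events")
      .map(_.split("\t"))
      .filter(_.nonEmpty)

    events.cache()      // keep the parsed data in memory across the actions below
    events.checkpoint() // optional: materialize once if the lineage grows very long

    println(events.count())                        // first action: computes and caches
    println(events.map(_.head).distinct().count()) // second action: reuses the cache

    sc.stop()
  }
}
```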

Distributed Dataset: one thing that always bugged me about MapReduce is its inability to reason about my data as a dataset. Instead you are forced to think in terms of a single key-value pair, small chunk, block, split, or file. Coming from SQL, it felt like going back 20 years. Spark solved this perfectly. A developer can now reason about the data as a Scala immutable collection (RDD API), a data frame (DataFrame API), or both interchangeably (Dataset API). This flexibility, and the abstraction from the distributed nature of the data, leads to huge productivity gains for developers.
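
Here is a hedged sketch of those interchangeable views, assuming a Spark 2.x SparkSession; the Event case class, its field names, and the input path are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for tracking events.
case class Event(visitorId: String, page: String, ts: Long)

object DatasetApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dataset-api-sketch").getOrCreate()
    import spark.implicits._

    // One logical dataset, three interchangeable views of it.
    val events = spark.read.json("/data/tracking/events.json").as[Event] // typed Dataset[Event]

    // Treat it like a Scala collection of case classes...
    val mobile = events.filter(e => e.page.startsWith("mobile"))

    // ...or like a data frame with columns and SQL-style aggregation...
    mobile.toDF().groupBy("page").count().show()

    // ...or drop down to the underlying RDD when that is more convenient.
    val sampleVisitors = events.rdd.map(_.visitorId).take(10)
    println(sampleVisitors.mkString(", "))

    spark.stop()
  }
}
```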

And finally, no more Map + Reduce shackles. You can chain whatever functions you need; Spark will optimize your logic into a DAG (directed acyclic graph, similar to how SQL is interpreted) and run it in one swoop, keeping data in memory as much as physically possible.
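
A small sketch of such a chained pipeline (the paths and parsing logic are illustrative): what would be several MapReduce jobs with HDFS writes in between becomes one job planned as a single DAG, with only the final result written out.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ChainedPipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chained-pipeline-sketch"))

    // Several logical "steps" chained into one job; no intermediate HDFS round trips.
    sc.textFile("/data/tracking/raw")                     // hypothetical input
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(fields => (fields(1), 1L))                     // e.g. key by event type
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .saveAsTextFile("/data/tracking/adoption-counts")   // hypothetical output

    sc.stop()
  }
}
```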

[Figure: fpti-spark-batch]

“Minor” detail — nobody knew Scala here. So to get all the efficiency and productivity benefits we had to train people in Scala and do it quickly.

It is a bit more than just a language, though. Scala, at first glance, has enough similarity to Java that any Java coder can fake it. (Plus, however cumbersome, Java can be used as a Spark language, and so can Python.)

The real problem was changing the mindset of the Java MapReduce developer to a functional way of thinking. We needed a clean break from old Java/Python MR habits. What habits? Most of all, mutability: Java beans, mutable collections and variables, loops, and so on. We were going to switch to Scala and do it right.

It started slowly. We designed a curriculum for our Java MapReduce developers, starting with a couple of two-hour sessions: one on functional programming and Scala basics, and one on Spark basics. After some trial and error, the FP/Scala part was extended to a four-hour learning-by-example session, while Spark barely needed two hours, if that. As it happens, once you understand the functional side of Scala, especially immutable collections, collection functions, and lambda expressions, Spark comes naturally.
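
To make the contrast concrete, here is a tiny, contrived example of the same aggregation written both ways; the visit data and names are made up.

```scala
object MindsetSketch {
  val visits = Seq(("mobile", 1), ("web", 3), ("mobile", 2))

  // Old habit: a mutable accumulator updated inside an explicit loop.
  def countsMutable: Map[String, Int] = {
    val acc = scala.collection.mutable.Map.empty[String, Int]
    for ((channel, n) <- visits) acc(channel) = acc.getOrElse(channel, 0) + n
    acc.toMap
  }

  // Functional style: immutable collections, collection functions, and lambdas.
  def countsImmutable: Map[String, Int] =
    visits.groupBy(_._1).map { case (channel, rows) => channel -> rows.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    println(countsMutable)   // Map(mobile -> 3, web -> 3)
    println(countsImmutable) // same result, no mutation
  }
}
```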

It snowballed very quickly: within a couple of months we had the majority of our MapReduce engineers retrained and sworn off MapReduce forever. Spark was in.

Spark, naturally, is written in Scala and Akka — 2:0 for team Reactive.

Stay tuned for Part 2: “Lambda Architecture meets reality” coming soon

Part 2 Preview

  • Fast Data stack
  • Micro and Macro batches
  • Invisible hand keeps on waving
