Is Apache Flink Europe’s Wild Card into the Big Data Race?
How an ultra-fast data engine for Hadoop could secure Europe’s place in the future of open-source
There are inherent parallels between the evolution of data analytics and professional racing that sometimes make it unfairly straightforward to come up with a headline. When a vendor releases an important feature, one might reasonably observe that it’s lapping ahead of the competition. And the market consolidation occurring today can be illustrated, with similar ease, as the wear of the chase separating the laggards from the leaders.
But for all the times that it helped cut a corner in yours truly’s own race against the editorial schedule, never did the analogy fit better than now. The initial burst of acceleration that sent the analytics bandwagon roaring along its velocious course is rapidly heading to an effective plateau where processing times will consistently remain below the threshold which users consider a delay for any given task.
In the not-so-distant future of perfectly streamlined information access, a salesperson could expect to have a report about complex customer behavior patterns generated by the time they finish typing up a response to their boss’s request. And a trading program might be able to react in under a second to a stock change that once took several moments to internalize. What these opportunities often lack in individual significance they more than make up for in sheer number, opening the door for an appreciable improvement in the flow of everyday life.
As is the case with so many of the other major technology trends that emerged in recent years, the wind is blowing from the open-source community, where two initiatives are separately trying to resolve the practical problems in meeting the shifting definition of real-time. With their dozens of corporate backers and hundreds of contributors, progress is a matter of when rather than if. But with the answers already in sight, a ragtag team of new-generation European academics proposes an alternate route.
Their wild card lies buried deep in the Apache Software Foundation’s unassuming web page for incubating projects, the official cradle of the open-source community. Wedged between two tools for handling some of the more mundane tasks involved in processing copious amounts of unstructured information, Flink makes little effort to distinguish itself. The technology carries an almost painstakingly generic description explaining that it’s a “distributed, general-purpose data processing framework,” much like dozens of others out there, that supports “iterations, incremental iterations and programs consisting of large DAGs of operation.” One would easily be excused for not giving it more than a passing glance.
But closer inspection reveals that there’s much more to Flink than what initially meets the eye. Taking its name from the German word for “speedy,” the project exploits the power of a fresh perspective — and a few old tricks — to smooth out some of the rougher edges in forcing the concept of real-time onto mountains of unwieldy data.
Before diving into the details, a little context is needed. Since information has to be kept somewhere to be accessed, storage is an unavoidable consideration in data processing. And when the results have to meet any reasonable definition of real-time, the speed at which the data can be shuffled to and from the storage medium becomes an overriding priority.
Because of that, Flink and its better-known alternatives in the open-source ecosystem stop halfway on the route data ordinarily takes in its journey through the bowels of a distributed computer cluster, only moving information from where it lands in memory to disk when absolutely necessary. Hard drives are so much slower than the other components involved in the analytic heavy lifting that cutting that last stage off the data travel path saves enough overhead to propel real-time processing into the realm of the possible with a single stroke.
That’s where the road starts to diverge for Flink and the rest of the distributed in-memory analytics pack. Like the competition, the project is following its own unique trajectory, one that traces back two years to the Technical University of Berlin. That’s when a group of database and distributed systems researchers first identified the need to import one of the most important lessons from the old world of information management into the new world of massively parallel data crunching.
And so Stratosphere, the precursor to Flink, was born. What the fledgling project brought to the table was not particularly creative but entirely groundbreaking: a cost-based optimizer of the kind found in relational platforms storing regular structured data, adapted for use with unstructured workloads. The technology tailors the processing route for a particular dataset based on its specific properties to streamline what is both literally and effectively the last mile to real-time analytics. Still caught up in the early buzz that latches itself onto every emerging trend, the broader ecosystem, most of it across the pond, was light-years away from reaching that point.
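To make the idea of cost-based optimization concrete, here is a minimal sketch in Python. It is not Flink's or Stratosphere's actual optimizer: the `Dataset` class, the three candidate strategies and the cost formula (estimated megabytes sent over the network) are all assumptions chosen for illustration. The point is only that the same join can be executed in different ways, and that size estimates for the inputs decide which way is cheapest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    size_mb: float  # estimated size, the kind of statistic an optimizer relies on


def join_plans(left, right, workers):
    """Return candidate plans as (strategy, estimated MB sent over the network).

    The cost model is a deliberate simplification (an assumption of this sketch).
    """
    return [
        # Ship a full copy of one input to every worker; the other input stays put.
        ("broadcast " + left.name, left.size_mb * workers),
        ("broadcast " + right.name, right.size_mb * workers),
        # Re-partition both inputs by join key; each record moves roughly once.
        ("repartition both", left.size_mb + right.size_mb),
    ]


def choose_plan(left, right, workers):
    """Pick the cheapest candidate under the toy cost model."""
    return min(join_plans(left, right, workers), key=lambda plan: plan[1])


clicks = Dataset("clicks", 50_000.0)
countries = Dataset("countries", 2.0)
orders = Dataset("orders", 40_000.0)

plan_small = choose_plan(clicks, countries, workers=20)
plan_large = choose_plan(clicks, orders, workers=20)
print(plan_small)  # a tiny lookup table is cheapest to broadcast
print(plan_large)  # two large inputs are cheapest to re-partition
```

The user writes the same join in both cases; only the estimated sizes differ, and the chosen execution strategy changes with them.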
But the open-source community has a reputation for leapfrogging expectations when it puts its collective mind to it, so the Stratosphere team made sure to move fast. A roster of big-name sponsors including IBM, HP and T-Mobile lined up behind the project in quick succession, and in April this year, it changed its name and entered incubation with the Apache Software Foundation.
Today, Flink is not the only option for cost-based optimization against unstructured data, but then the project has also evolved beyond the one-trick pony stage. The focus on efficiency at the core of the initiative is emerging as a base for addressing a much broader and more relevant challenge than putting the last polish on the analytics pipeline: unifying the processing of real-time and historical data.
That’s one of the bigger issues with Apache Spark, the Goliath to Flink’s David. Also born out of academia, the project provides an engine for analyzing existing information at real-time speeds and comes with a component that extends that functionality to data that is itself real-time, such as tweets. The problem is that Spark Streaming falls squarely into the category of what is known as a bolt-on.
The technology runs on top of the core project and shares the same interface, but it also inherits the underlying architecture, which was built to handle batches of information as opposed to a sequence of data points flowing in one at a time. Spark Streaming provides a workaround, accumulating data for a shortened time window before pushing it down for processing, but that still leaves a delay of several seconds until ingestion — a far cry from real-time.
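The cost of that workaround can be illustrated with a toy simulation in Python. This is a sketch under stated assumptions, not Spark Streaming's API: the function names are hypothetical, the batch interval of two seconds is picked arbitrarily, and processing is treated as instantaneous so that only the waiting time introduced by batching is measured.

```python
def micro_batch_delays(arrival_times, batch_interval):
    """For each event, return how long it waits before its batch is processed.

    Events arriving in [k * interval, (k + 1) * interval) are handled together
    when the window closes at (k + 1) * interval. Processing time is assumed
    to be zero (a simplification of this sketch).
    """
    delays = []
    for t in arrival_times:
        window_close = (int(t // batch_interval) + 1) * batch_interval
        delays.append(window_close - t)
    return delays


def per_event_delays(arrival_times):
    """In a one-at-a-time model, each event is handled as soon as it arrives."""
    return [0.0 for _ in arrival_times]


arrivals = [0.25, 0.5, 1.75, 2.0, 3.5]  # arrival times in seconds
batched = micro_batch_delays(arrivals, batch_interval=2.0)
streamed = per_event_delays(arrivals)

print("micro-batch waits:", batched)
print("per-event waits:  ", streamed)
```

Under this model an event can wait up to a full batch interval before anything happens to it, regardless of how fast the engine underneath is.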
While still a long way from reaching its goal of unifying data analytics, Flink suffers from no such limitation. It’s also well-equipped for what Spark does best, featuring iteration operators that make it relatively straightforward to loop an algorithm over the same batch of information in search of useful patterns. Taken together with the built-in optimizer, which doubles as an abstraction layer insulating workloads from changes to the underlying infrastructure, that adds up to a tightly integrated environment for ingesting every type of unstructured data imaginable in a highly efficient manner.
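What an iteration operator does can be sketched in a few lines of Python. This is not Flink's API: `iterate`, `propagate_min_label` and the tiny example graph are hypothetical names and data introduced for illustration. The sketch loops one step function over the same dataset until the result stops changing, here to find groups of connected vertices in a graph.

```python
def iterate(initial, step, max_iterations):
    """Apply `step` repeatedly until the data stops changing or the limit is hit.

    Returns the final data and the number of passes that were executed.
    """
    current = initial
    for passes in range(1, max_iterations + 1):
        updated = step(current)
        if updated == current:
            return current, passes
        current = updated
    return current, max_iterations


# Hypothetical input: an undirected graph given as a list of edges.
EDGES = [(1, 2), (2, 3), (4, 5)]


def propagate_min_label(labels):
    """One pass: every vertex adopts the smallest label seen across its edges."""
    new_labels = dict(labels)
    for a, b in EDGES:
        smallest = min(labels[a], labels[b])
        new_labels[a] = min(new_labels[a], smallest)
        new_labels[b] = min(new_labels[b], smallest)
    return new_labels


start = {vertex: vertex for vertex in (1, 2, 3, 4, 5)}
components, rounds = iterate(start, propagate_min_label, max_iterations=10)
print(components, "after", rounds, "passes")
```

Vertices that end up with the same label belong to the same connected group. In an engine with native iteration support, the loop itself becomes part of the job that the system schedules and optimizes, rather than a driver program resubmitting one job per pass.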
And that is potentially Flink’s greatest strength. For comparison, achieving similar functionality currently requires Spark users to set up a separate installation of Apache Storm, a real-time event processor that was itself in incubation until graduating some two weeks ago. That means buying extra hardware, configuring that hardware, configuring the software running on top and learning the nuances of a new tool, a difficult task made even more challenging by the fact that Spark only partially supports YARN.
YARN is the nerve center of the platform that almost single-handedly turned data into a buzzword, and it serves as the foundation for Flink, Spark and Storm along with a dozen other open-source projects. That platform, Apache Hadoop, owes its prominence to a unique capacity for cost-effectively handling vast pools of unstructured information, a proposition that Spark makes exceptionally complicated to realize today.
Without the efficient multitenancy afforded by YARN, a Hadoop cluster can only feasibly support a single processing paradigm. So if a user wants to deploy two different models for handling their data — say, Spark and Storm — they have to deploy each on its own dedicated Hadoop installation, which significantly increases the amount of time and resources that have to go into the project.
The issue is only temporary, however, with Hadoop distributor Hortonworks planning to have Spark fully integrated with YARN by early 2015. But that narrow window gives the growing group of contributors behind Flink, which already supports the technology, an invaluable opportunity to catch up on weak areas such as fault tolerance and solidify their vision for unified analytics.
Much more hangs in the balance than just the success of Flink as a project. If the team behind the platform executes its roadmap correctly and, just as importantly, fast enough, their efforts could help shift Europe’s largely passive role in the open-source community’s quest to deliver on the “data-driven” hype into a much more active one. That Hortonworks co-founder Alan Gates is listed as a sponsor of Flink on the Apache Incubator index indicates that the industry is already starting to recognize the tremendous opportunity at hand.
Banner via crazyoctopus
Apache Flink logo via @ApacheFlink
Spark Streaming slide via P. Taylor Goetz