Apache Flink
What is Apache Flink?
Apache Flink is another new-generation, general-purpose big data processing engine that aims to unify different workloads. Does that sound like Apache Spark? Exactly. Flink is trying to address the same problem that Spark is trying to solve: both systems aim to be a single platform where you can run batch, streaming, interactive, graph processing, ML, and other workloads. So Flink does not differ much from Spark in ideology, but the two differ a great deal in implementation details.
Why Apache Flink?
Apache Flink is based on a streaming-first model: it processes data as a continuous stream rather than in batches, and support for iterative algorithms is built directly into its query optimizer. Flink's pipelined architecture can therefore process streaming data with lower latency than micro-batch architectures such as Spark's. Flink is the suggested choice where you need true real-time processing.
The following are the key features of Apache Flink:
=> High throughput, and hence high performance
=> Low latency
=> Fault tolerance
=> One runtime for both streaming and batch processing
=> Programs are optimized before execution
Architecture, APIs, and Language Support
Apache Flink has a master-slave architecture in which a cluster can have multiple masters and several slaves, as project requirements dictate. It also supports scaling up and down as and when required.
Apache Flink has APIs such as the DataSet API for batch processing, the DataStream API for real-time processing, the Table API for SQL-style processing, Gelly for graph processing, and FlinkML for machine learning algorithms.
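As an illustration, here is a minimal word-count sketch against the Scala DataStream API (the socket source on localhost:9999 is hypothetical, and exact method names such as keyBy vary slightly across Flink versions):

    import org.apache.flink.streaming.api.scala._

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Read an unbounded stream of lines from a socket (hypothetical source).
        val text = env.socketTextStream("localhost", 9999)

        val counts = text
          .flatMap(_.toLowerCase.split("\\W+"))
          .filter(_.nonEmpty)
          .map(word => (word, 1))
          .keyBy(_._1) // partition the running stream by word
          .sum(1)      // maintain a running count per word

        counts.print()
        env.execute("Streaming word count")
      }
    }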
It has language support for Java, Scala, and Python. Because Flink itself is written in Java and Scala, those two languages have the richest set of built-in functions and API support, so they are usually the better choice for writing jobs and big data logic.
Comparison between Spark and Flink.
Apache Flink and Spark are major technologies in the big data landscape. There is some overlap (and confusion) about what each does and how they differ. This post compares Spark and Flink to look at what they do, how they are different, what people use them for, and what streaming is.
How is Apache Flink different from Apache Spark, and, in particular, Apache Spark Streaming?
At first glance, Flink and Spark appear to be much the same. The main difference is that Flink was built from the ground up as a streaming product, whereas Spark added streaming onto its product later.
Let’s take a look at the technical details of both.
The Flink Streaming Model
To say that a program processes streaming data means it opens a file and never closes it. That is like keeping a TCP socket open, which is how Syslog works. Batch programs, on the other hand, open a file, process it, then close it.
Flink says it has developed a way to checkpoint streams without having to open and close them. To checkpoint means to record where one has left off and then resume from that point. Flink also runs a type of native iteration that lets its machine learning algorithms run faster than Spark's. That is not insignificant, as ML routines can take many hours to run.
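Turning checkpointing on is a one-line call against the execution environment. A minimal sketch (the 10-second interval is illustrative):

    import org.apache.flink.streaming.api.scala._

    object CheckpointingSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Snapshot all operator state every 10 seconds. On failure, the job
        // resumes from the latest completed checkpoint rather than reopening
        // and replaying the stream from the beginning.
        env.enableCheckpointing(10000)
      }
    }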
Flink versus Spark in Memory Management
Flink takes a different approach to memory management. Flink pages out to disk when memory is full, which is what Windows and Linux do too. Spark, by contrast, can crash a node when it runs out of memory, though it does not lose data, since it is fault-tolerant. To address that issue, Spark has a newer project called Tungsten.
Flink and Spark Machine Learning
What Spark and Flink do well is take machine learning algorithms and run them over a distributed architecture. That, for example, is something the Python scikit-learn framework and CRAN (the Comprehensive R Archive Network) cannot do: those are designed to work on one dataset, on one system. Spark and Flink can scale out and process enormous training and testing datasets over a distributed system. It is worth noting, however, that Flink's graph processing and ML libraries are still in beta, as of January 2017.
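To give a feel for the (still beta) FlinkML batch API, here is a minimal linear-regression sketch with made-up in-memory data; names such as MultipleLinearRegression, setIterations, and setStepsize follow the FlinkML documentation of that era:

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.LabeledVector
    import org.apache.flink.ml.math.DenseVector
    import org.apache.flink.ml.regression.MultipleLinearRegression

    object FlinkMLSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Tiny made-up training set: the label is roughly 2 * the feature.
        val training = env.fromElements(
          LabeledVector(2.1, DenseVector(1.0)),
          LabeledVector(3.9, DenseVector(2.0)),
          LabeledVector(6.0, DenseVector(3.0)))

        val mlr = MultipleLinearRegression()
          .setIterations(100)
          .setStepsize(0.1)

        mlr.fit(training)

        // Predict for an unseen feature vector; print() runs the job.
        mlr.predict(env.fromElements(DenseVector(4.0))).print()
      }
    }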
Command Line Interface (CLI)
Spark has CLIs in Scala, Python, and R. Flink does not really have a CLI, but the distinction is subtle.
To have a Spark CLI means a user can start up Spark, obtain a SparkContext, and write programs one line at a time. That makes walking through data and debugging easier. Working through data interactively, running map and reduce processes in stages, is how data scientists work.
Flink has a Scala shell too, but it is not the same. With Flink, you write code and then run print() to submit it in batch mode and wait for the output.
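A sketch of what such a session looks like in Flink's Scala shell (started with bin/start-scala-shell.sh, which pre-binds a batch environment named benv):

    // Inside Flink's Scala shell, benv is the pre-bound batch environment.
    val text = benv.fromElements("to be", "or not to be")

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)

    // Nothing has executed yet; print() submits the job and waits for output.
    counts.print()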
Again, this might be a matter of semantics. You could argue that spark-shell (Scala), pyspark (Python), and sparkR (R) are batch too. Spark is said to be "lazy": when you create an object, it only creates a pointer to it. It does not run any operation until you ask it to do something like count(), which requires materializing the object in order to measure it. Only then does Spark submit the work to its batch engine.
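A sketch of that laziness in spark-shell, where sc is the pre-bound SparkContext:

    // Returns immediately: squares is only a pointer (a lineage of
    // transformations), and no cluster work has happened yet.
    val nums = sc.parallelize(1 to 1000000)
    val squares = nums.map(n => n.toLong * n)

    // count() is an action, so only now does Spark submit a job to its engine.
    squares.count()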
Both frameworks, of course, support batch jobs, which is how most people would run their code once they have written and debugged it. With Flink, you can run Java and Scala code in batch. With Spark, you can run Java, Scala, and Python code in batch.
Support for Other Streaming Products
Both Flink and Spark work with Kafka, the streaming platform originally developed at LinkedIn. Flink also works with Storm topologies.
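A minimal sketch of wiring a Kafka topic into a Flink stream; the broker address, group id, and topic name here are made up, and in older releases the connector class carries a version suffix (e.g. FlinkKafkaConsumer09):

    import java.util.Properties
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

    object KafkaSourceSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        val props = new Properties()
        props.setProperty("bootstrap.servers", "localhost:9092") // made-up broker
        props.setProperty("group.id", "flink-demo")              // made-up group

        // Consume the "events" topic as an unbounded stream of strings.
        val stream = env.addSource(
          new FlinkKafkaConsumer[String]("events", new SimpleStringSchema(), props))

        stream.print()
        env.execute("Kafka source sketch")
      }
    }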
Cluster Operations
Spark and Flink can each run locally or on a cluster. To run Hadoop or Flink on a cluster, one usually uses YARN. Spark is often run with Mesos. If you want to use YARN with Spark, then you have to download a version of Spark that has been compiled with YARN support.