Apache Flink Specifying Keys

M Haseeb Asif
Big Data Processing
3 min readMar 14, 2020


KeyBy is one of the most commonly used transformation operators for data streams. It is used to partition the data stream based on certain properties or keys of the incoming data objects. Once we apply keyBy, all the data objects with the same key are grouped together. It is analogous to GROUP BY in traditional SQL.

Internally, keyBy is implemented with hash partitioning. Each keyBy can cause a network shuffle that repartitions the stream across different nodes. This requires a lot of network communication, which makes it an expensive operation.
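To make the idea of hash partitioning concrete, here is a minimal plain-Java sketch. It is a simplification, not Flink's actual algorithm (Flink uses murmur hashing and key groups internally); the `partitionFor` helper is hypothetical. The point it illustrates is that every record with the same key deterministically maps to the same partition, which is why keyBy implies a shuffle.

```java
import java.util.Objects;

// Simplified sketch of hash partitioning: every record with the same
// key always maps to the same parallel partition.
public class HashPartitionSketch {
    // hypothetical helper: picks a partition index for a key
    static int partitionFor(Object key, int parallelism) {
        // Math.floorMod keeps the result non-negative even when
        // the hash code is negative
        return Math.floorMod(Objects.hashCode(key), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // the same key lands on the same partition every time
        System.out.println(partitionFor("user-1", parallelism));
        System.out.println(partitionFor("user-1", parallelism));
        System.out.println(partitionFor("user-2", parallelism));
    }
}
```

Because the mapping depends only on the key and the parallelism, records for one key from any upstream task end up at the same downstream task, which is exactly what the grouping guarantees require.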

KeyBy shuffles the stream so that values with the same key are grouped together

Flink's data model is not based on key-value pairs, so you do not need to physically pack the data set types into keys and values. Keys are "virtual": they are defined as functions over the actual data to guide the grouping operator.
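The idea of a "virtual" key defined as a function over the data can be sketched without the Flink runtime. The snippet below mirrors what Flink's `KeySelector` concept does: the key is never stored in the record, it is computed from it on demand. The `Event` type and its fields are hypothetical example names, not part of any Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of "virtual" keys: the key is computed by a function over the
// record, in the spirit of Flink's KeySelector.
public class VirtualKeySketch {
    // hypothetical example record
    record Event(String userId, String page, long timestamp) {}

    // groups records by a key derived from each record on the fly
    static <T, K> Map<K, List<T>> groupBy(List<T> records, Function<T, K> keySelector) {
        Map<K, List<T>> groups = new HashMap<>();
        for (T r : records) {
            groups.computeIfAbsent(keySelector.apply(r), k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("alice", "/home", 1L),
                new Event("bob", "/cart", 2L),
                new Event("alice", "/cart", 3L));
        // the key (userId) is extracted by a function, not stored separately
        Map<String, List<Event>> byUser = groupBy(events, Event::userId);
        System.out.println(byUser.keySet());
    }
}
```

In real Flink code the same shape appears as `stream.keyBy(event -> event.getUserId())`, with the lambda playing the role of the key-extracting function.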

There are various ways to specify the key when applying the keyBy transformation. The easiest is probably providing the field index for tuples. You can also provide multiple indexes to group items by a composite key. For example, in the following code the first keyBy uses the Integer (the first tuple field) as the key, and the second keyBy uses the Integer and String together as a composite key.

DataStream<Tuple3<Integer, String, Long>> input = // [...]
KeyedStream<Tuple3<Integer, String, Long>, Tuple> keyed = input.keyBy(0); // key on the Integer field
KeyedStream<Tuple3<Integer, String, Long>, Tuple> composite = input.keyBy(0, 1); // composite key (Integer, String)
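What the composite key buys you can be shown with a small plain-Java sketch (no Flink runtime). The `Tuple3` and `Key` records below are local stand-ins for Flink's tuple types, used only for illustration: grouping by the pair (f0, f1) produces finer groups than grouping by f0 alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of composite keying, mirroring input.keyBy(0, 1) on a
// Tuple3<Integer, String, Long>: records group by the PAIR (f0, f1).
public class CompositeKeySketch {
    record Tuple3(int f0, String f1, long f2) {} // stand-in for Flink's Tuple3
    record Key(int f0, String f1) {}             // composite key over the first two fields

    static Map<Key, List<Tuple3>> groupByCompositeKey(List<Tuple3> stream) {
        Map<Key, List<Tuple3>> groups = new HashMap<>();
        for (Tuple3 t : stream) {
            // two records share a group only if BOTH f0 and f1 match
            groups.computeIfAbsent(new Key(t.f0(), t.f1()), k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Tuple3> stream = List.of(
                new Tuple3(1, "a", 10L),
                new Tuple3(1, "b", 20L),
                new Tuple3(1, "a", 30L));
        // (1, "a") and (1, "b") form distinct groups even though f0 matches
        System.out.println(groupByCompositeKey(stream).size()); // prints 2
    }
}
```

With keyBy(0) alone, all three records above would fall into a single group; adding the second field splits them into two.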

