Apache Flink Tutorials — Transformations on Single DataSet

Transformation operators play an important role in processing data in Apache Flink, as they do in Apache Spark. For most data sources, they are needed to derive the desired target datasets. Flink provides many built-in transformation operators for this purpose, such as map, flatMap, and filter. This post showcases the use of these built-in transformation operators in Apache Flink. A brief introduction to the operators can be found in the DataSet API programming guide. Corresponding documents in Chinese, contributed by the Apache Flink Taiwan User Group, are also available. This post takes a deeper look at the transformations available on DataSets.

MovieLens Data Sets

Throughout this post, data sets from MovieLens, which are available here, are used to demonstrate the transformations. The data set includes the following four CSV files:

  • movies.csv: three-field movie information, including movieId, title, and genres.
  • ratings.csv: users’ 5-star ratings of the available movies.
  • tags.csv: free-text tagging activities.
  • links.csv: identifiers used to link to other sources of movie data, such as IMDB and TMDB.

The above dataset files are all encoded as UTF-8.

Example: Retrieve The Set of Genres

Here, movies.csv serves as the data source of the example program, which derives the set of distinct genres.

First, we obtain an ExecutionEnvironment and use it to load the initial dataset.

import org.apache.flink.api.scala._

// obtain an ExecutionEnvironment
val env = ExecutionEnvironment.getExecutionEnvironment
// load the initial dataset
val movieDataset = env.readTextFile("/path/to/movies.csv")

In the above snippet, we load the initial dataset via readTextFile rather than readCsvFile, even though the dataset is in CSV format. The reason is that the file is written as comma-separated values with a single header row, and columns that contain commas (`,`) are escaped using double quotes (`"`). Applying readCsvFile may therefore produce an unexpected DataSet: the title field in movies.csv is very likely to be parsed as multiple fields when it contains commas.

To get the distinct genres, we apply a series of transformations as follows.
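To make the parsing concern concrete, here is a minimal plain-Scala sketch (no Flink required) that splits one line on every comma. The sample row is only an illustrative line in the movies.csv layout, and `CsvSplitSketch` and its members are hypothetical names that are not part of the original program.

```scala
object CsvSplitSketch {
  // Illustrative row in the movies.csv layout: movieId,title,genres.
  // The title contains a comma, so it is wrapped in double quotes.
  val sampleLine: String =
    "11,\"American President, The (1995)\",Comedy|Drama|Romance"

  // Naive split on every comma, ignoring the quoting.
  def naiveFields(line: String): Array[String] = line.split(',')

  // movieId is the first piece and genres the last one,
  // because neither of them contains a comma.
  def idAndGenres(line: String): (String, String) = {
    val fields = naiveFields(line)
    (fields.head, fields.last)
  }

  def main(args: Array[String]): Unit = {
    println(naiveFields(sampleLine).length) // 4 pieces instead of 3 fields
    println(idAndGenres(sampleLine))        // (11,Comedy|Drama|Romance)
  }
}
```

The title is broken into two pieces, yet the first and last pieces are still the movieId and the genres, which is all the genre example needs.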

movieDataset
  // keep only movieId (first field) and genres (last field)
  .map { line =>
    val fieldsArray = line.split(',')
    (fieldsArray.head, fieldsArray.last)
  }
  // drop the header row, whose movieId cannot be parsed as a Long
  .filter { tup => scala.util.Try(tup._1.toLong).isSuccess }
  // split the pipe-separated genres into individual elements
  .flatMap { tup => tup._2.split('|') }
  .distinct()
  .print()

The map operator extracts movieId and genres, both as Strings, from each line read from movies.csv. Splitting on every comma is safe here even when a quoted title contains commas, because movieId is always the first piece and genres the last. The filter operator then removes the header line by checking whether movieId can be converted to a Long. The flatMap operator extracts the individual genres to which each movie belongs. Finally, the distinct operator yields the final results, which are listed as follows.
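The same chain of steps can be tried without a Flink cluster. The following sketch mirrors map, filter, flatMap, and distinct on an ordinary Scala collection. It is an analogy under stated assumptions rather than the Flink program itself: the sample lines are illustrative rows in the movies.csv layout, and `GenrePipelineSketch` is a hypothetical name.

```scala
import scala.util.Try

object GenrePipelineSketch {
  // Illustrative lines in the movies.csv layout, including the header row.
  val sampleLines: Seq[String] = Seq(
    "movieId,title,genres",
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy",
    "2,Jumanji (1995),Adventure|Children|Fantasy",
    "11,\"American President, The (1995)\",Comedy|Drama|Romance"
  )

  def distinctGenres(lines: Seq[String]): Set[String] =
    lines
      // map: keep the first and last comma-separated pieces
      .map { line =>
        val fields = line.split(',')
        (fields.head, fields.last)
      }
      // filter: the header row has a non-numeric movieId and is dropped
      .filter { case (id, _) => Try(id.toLong).isSuccess }
      // flatMap: one output element per genre
      .flatMap { case (_, genres) => genres.split('|').toSeq }
      // distinct: a Set keeps each genre once
      .toSet

  def main(args: Array[String]): Unit =
    distinctGenres(sampleLines).foreach(println)
}
```

A Set is used in place of distinct, so the order in which the genres are printed is not significant, just as the order of the Flink output is not.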

Animation
Film-Noir
Thriller
War
(no genres listed)
Action
Adventure
Children
Comedy
Crime
Documentary
Drama
Fantasy
Horror
IMAX
Musical
Mystery
Romance
Sci-Fi
Western

Rich Functions

Instead of writing lambda functions, one may want to put complicated business logic into a separate function and pass it to Flink to improve the readability of a program. Rich functions are of great help here.

All transformations that take a lambda function as an argument can take a rich function instead. Taking the map transformation from the example above, we may write

import org.apache.flink.api.common.functions.RichMapFunction

class GenreMapFunction extends RichMapFunction[String, (String, String)] {
  override def map(in: String): (String, String) = {
    val fieldArray = in.split(',')
    (fieldArray.head, fieldArray.last)
  }
}

and pass an instance of it to the map transformation, such as

movieDataset.map(new GenreMapFunction())

In Apache Spark, mapPartitions is often said to perform better than map when transforming RDDs. Similarly, one may want to use Flink's mapPartition rather than the map transformation operator, as follows.

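import org.apache.flink.api.common.functions.RichMapPartitionFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

class MovieMapPartFunction extends RichMapPartitionFunction[String, (String, String)] {
  // values is a java.lang.Iterable, hence the asScala conversion below
  override def mapPartition(
      values: java.lang.Iterable[String],
      out: Collector[(String, String)]): Unit = {
    values.asScala.foreach { value =>
      val fields = value.split(',')
      out.collect((fields.head, fields.last))
    }
  }
}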

Then we can pass an instance of it to the mapPartition transformation, such as

movieDataset.mapPartition(new MovieMapPartFunction())
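The difference in call shape between the two operators can be sketched in plain Scala. This is only an illustration under stated assumptions: the two-way partitioning and the sample rows are made up, `MapPartitionSketch` is a hypothetical name, and nothing here measures performance.

```scala
object MapPartitionSketch {
  // Hypothetical split of the input lines into two partitions.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("1,Toy Story (1995),Adventure|Animation",
        "2,Jumanji (1995),Adventure|Children"),
    Seq("3,Grumpier Old Men (1995),Comedy|Romance")
  )

  // map-style function: invoked once per element.
  def parseLine(line: String): (String, String) = {
    val fields = line.split(',')
    (fields.head, fields.last)
  }

  // mapPartition-style function: invoked once per partition. It receives
  // all elements of that partition and may emit any number of results.
  def parsePartition(lines: Iterator[String]): Iterator[(String, String)] =
    lines.map(parseLine)

  // Three calls to parseLine, one per element.
  def viaMap: Seq[(String, String)] =
    partitions.flatten.map(parseLine)

  // Two calls to parsePartition, one per partition.
  def viaMapPartition: Seq[(String, String)] =
    partitions.flatMap(p => parsePartition(p.iterator))
}
```

Both paths yield the same pairs. The per-partition form only pays off when there is work that can be done once per partition instead of once per element.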

Similarly, we can write a rich filter function rather than a lambda function, as follows.

import org.apache.flink.api.common.functions.RichFilterFunction

class GenreFilterFunction extends RichFilterFunction[(String, String)] {
  // keep a record only if its movieId can be parsed as a Long
  override def filter(in: (String, String)): Boolean =
    scala.util.Try(in._1.toLong).isSuccess
}

and pass an instance of it to the filter transformation, such as

movieDataset.map(new GenreMapFunction())
  .filter(new GenreFilterFunction())