A comparison between RDD, DataFrame and Dataset in Spark from a developer’s point of view

Gaurav Gangwar
Nov 1, 2018


APIs in Spark are great and contribute to the awesomeness of Spark, a very helpful framework used to process big data.

However, it can be difficult to master all the APIs and the differences among them.

In this article, I would like to offer a comparison in Scala between RDD (Resilient Distributed Dataset), DataFrame and Dataset, which are three ways to work with immutable distributed collections.

For those who do not know Spark, it could be a way to enter this world.

Comparison with some code

I would like to start by presenting some code that is really close to what I work on every day.

Let’s suppose we are dealing with an A/B test on a video website. The variant is the background color:

  • 50% of the population has a blue background
  • 50% of the population has a green background

We would like to know which background color users prefer for watching videos.

Every time a user watches a video, we save the information.

Our data look like this:
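
(The original table is not reproduced here; the lines below are a purely hypothetical sample with the same shape: a user identifier, a background color and a number of watched videos.)

```
user1,blue,3
user2,green,1
user3,blue,4
user4,green,2
user5,blue,5
```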

From this data, we want to find the winner between the blue background and the green one.

We can make a comparison by doing this with RDD, DataFrame and Dataset using Spark 2.2 in Scala.

RDD

At the first line, we create an RDD from the file path:
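
(The original gist is not reproduced here. The following is a minimal sketch in Spark 2.2 Scala, assuming a SparkSession named spark, a comma-separated file at a hypothetical path and the hypothetical sample above; it is laid out so that the line numbers mentioned in this section match.)

```scala
val abTestRdd = spark.sparkContext.textFile("/data/ab_test.csv") // hypothetical path

val coloredViews = abTestRdd
  .map(_.split(","))                         // separate the items of each line
  .map(items => (items(1), items(2).toInt))  // keep (color, count) pairs

val viewsByColor = coloredViews.reduceByKey(_ + _) // sum the counts per background color

val result = viewsByColor.collect() // bring the result back to the driver
```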

If you are not used to developing with Scala and its type inference system, you may not have noticed that our RDD is typed. This is an RDD of Strings. Type inference in Scala is a useful facility, but it is not an obligation; we are not forced to use it.

We could have written this instead:
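
(Again a sketch, this time with the type annotation written out explicitly.)

```scala
import org.apache.spark.rdd.RDD

val abTestRdd: RDD[String] = spark.sparkContext.textFile("/data/ab_test.csv")
```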

At the third line, we separate the items from our string.

Finally, at the seventh line, we can reduce by key (the background color).

And we can collect the information at the ninth line:
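
(Output sketch, using the hypothetical sample above; the exact figures of course depend on the real data.)

```scala
result.foreach(println)
// (blue,12)
// (green,3)
```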

With this data set, “blue” is the winner.

Another way to write this code (to be closer to real life) would be:
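
(The original snippet is not shown either; one plausible reading is to chain all the transformations and the final action, then pick the winner directly.)

```scala
val winner = spark.sparkContext
  .textFile("/data/ab_test.csv")
  .map(_.split(","))
  .map(items => (items(1), items(2).toInt))
  .reduceByKey(_ + _)
  .collect()
  .maxBy(_._2) // the (color, total) pair with the highest total
```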

DataFrame

At the first line, we do some imports:
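
(As before, a sketch rather than the original gist, with the same assumed spark session and hypothetical path, laid out to match the line numbers mentioned below.)

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(StructField("userId", StringType, nullable = true),
  StructField("color", StringType, nullable = true), StructField("count", IntegerType, nullable = true)))
val abTestDf = spark.read
  .schema(schema)           // apply our custom schema
  .csv("/data/ab_test.csv") // hypothetical path, read as a CSV file

val result = abTestDf.groupBy("color").sum("count")
result.show()
```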

At the third line, we construct a schema to read our file. A schema corresponds to a StructType with some StructFields. The StructFields are typed and can be nullable. We define a “userId” as a nullable string type, a “color” as a nullable string type and a “count” as a nullable integer type.

At the fifth line, we use this schema and Spark’s ability to read CSV files to load our file. In fact, we can notice that our file can simply be handled as a CSV file.

The schema can also be inferred by using the “inferSchema” option. We can read a CSV file this way:
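
(A sketch of that variant; without a header line, the inferred columns would simply be named _c0, _c1 and _c2.)

```scala
val abTestDfInferred = spark.read
  .option("inferSchema", "true") // Spark scans the file and guesses the column types
  .csv("/data/ab_test.csv")
```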

At the ninth line, we count the number of videos watched for each color (group and sum) and get the output.

We can notice that we have a schema that gives us a better view of what we are doing. It is more expressive. The problem is that our DataFrame is not typed.

Dataset

We make some imports and create the custom schema.
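
Once more, here is a sketch rather than the original gist, with the same assumptions as before and a layout that matches the line numbers mentioned below:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._ // provides the Encoder needed by .as[AbTest]
val schema = StructType(Seq(StructField("userId", StringType, nullable = true),
  StructField("color", StringType, nullable = true), StructField("count", IntegerType, nullable = true)))
case class AbTest(userId: String, color: String, count: Int)

val abTestDs = spark.read.schema(schema).csv("/data/ab_test.csv").as[AbTest]

val result = abTestDs.groupBy("color").sum("count")
```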

At the fifth line, we create an “AbTest” case class to link our data to a class.

At the seventh line, we link the file result to our class. This way, we transform our DataFrame to a Dataset. The main difference in this case is that our Dataset is typed.

It means that if we try to filter or select a nonexistent attribute of our AbTest class, we will get an error at compile time saying that “value x is not a member of AbTest”.
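
For example, this line, with a hypothetical nonexistent duration attribute, would be rejected:

```scala
abTestDs.filter(_.duration > 10) // does not compile: value duration is not a member of AbTest
```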

Finally, we can group, sum and get the output:
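
(Output sketch with the hypothetical sample data.)

```scala
result.show()
// +-----+----------+
// |color|sum(count)|
// +-----+----------+
// | blue|        12|
// |green|         3|
// +-----+----------+
```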

Comparison with a table

So what are the differences between these 3 APIs?

History

Let’s begin with a bit of history.

RDD was the first API.

Then, to make things easier, DataFrame was added. It is built upon RDD and was inspired by SQL. You can deal with data by selecting columns, grouping them, etc.

Dataset is an improvement of DataFrame with type-safety. It is an extension of the DataFrame API. It was added in Spark 1.6 as an experimental API.

With Spark 2.0, Dataset and DataFrame are unified. “DataFrame” is an alias for “Dataset[Row]”.

In dynamically typed languages such as Python, DataFrame still exists. In Java, DataFrame no longer exists in Spark 2.

Immutability

Immutability is at the heart of Spark. All the APIs deal with immutable objects.

We have two types of operations with our collections: transformations and actions.

An action gives the final output. It can print the information, save the results or turn the results into a regular object of the language you are using with Spark (a “List” in Java, for instance).

A transformation converts a collection into another one. If a transformation fails, it is easy to restart just that transformation. Immutability is better for fault tolerance, and it is also better for distribution.
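
As a small illustration, reusing the hypothetical abTestRdd from the RDD sketch, transformations are lazy and only the action triggers the actual computation:

```scala
val colors   = abTestRdd.map(_.split(",")(1)) // transformation: lazy, nothing is executed yet
val distinct = colors.distinct()              // still a transformation, still lazy
val output   = distinct.collect()             // action: only now does Spark actually run the job
```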

I will not give further details, but yes immutability is very important in the Spark world.

Schema

This is the main difference between RDD and DataFrame/Dataset.

RDD has no schema. It fits well with unstructured data.

DataFrame/Dataset are more for structured data. The schema gives an expressive way to navigate inside the data.
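
With the hypothetical DataFrame from the sketch above, for instance, the schema can be inspected directly:

```scala
abTestDf.printSchema()
// root
//  |-- userId: string (nullable = true)
//  |-- color: string (nullable = true)
//  |-- count: integer (nullable = true)
```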

Level

RDD is a low level API whereas DataFrame/Dataset are high level APIs.

With RDD, you have more control over what you do.

Performance

A DataFrame/Dataset tends to be more efficient than an RDD.

What happens inside Spark core is that a DataFrame/Dataset is converted into an optimized RDD. Spark analyses the code and chooses the best way to execute it.

For instance, grouping data before filtering it is not efficient: you group a bigger bunch of data than you need, because you only filter it afterwards. If you do this with an RDD, Spark will execute it exactly as written.

But, if you do the same with a DataFrame/Dataset, Spark will optimize the execution and will filter your data before grouping them. This is what we call the “Catalyst optimization”.
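
Here is a sketch of that idea with the hypothetical DataFrame and RDD from earlier; the filter only uses the grouping column, so Catalyst should be able to push it below the aggregation even though it is written afterwards:

```scala
import org.apache.spark.sql.functions.col

// DataFrame: Catalyst can reorder the plan and filter before grouping.
val blueOnly = abTestDf
  .groupBy("color").sum("count")
  .filter(col("color") === "blue")
blueOnly.explain(true) // the optimized plan should show the filter pushed below the aggregate

// RDD: Spark executes exactly what is written, grouping everything before filtering.
val blueOnlyRdd = coloredViews.reduceByKey(_ + _).filter { case (color, _) => color == "blue" }
```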

Type-safe and errors

RDDs and Datasets are type-safe (for typed languages).

From the previous examples, if you are not used to developing with Scala, it could appear a bit less obvious because of Scala type inference.

Once again, type inference in Scala is a useful functionality that you are not obliged to use. In Java, it does not exist; we have no choice and have to write the type explicitly, along the lines of “JavaRDD<String> lines = javaSparkContext.textFile(path)”.

On the other hand, a DataFrame is not typed. This is a great plus for Dataset, which is typed. As we have already seen, with a DataFrame you can select a nonexistent column (df.select(“nonExistentColumn”)) and only notice your mistake when you run your code. With a Dataset, you get a compile-time error.

Remember that “DataFrame” is an alias for “Dataset[Row]”. This can be useful for a first migration from 1.6 to 2, for instance in Java, which no longer has DataFrame in the latest versions. But after that, it is a better option to type your Datasets with something other than “Row”.
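
For reference, in the Spark sources “DataFrame” is just a type alias, and moving from Dataset[Row] to a dedicated case class is a small step (using the hypothetical abTestDf and AbTest from the sketches above):

```scala
import org.apache.spark.sql.{Dataset, Row}

// In the org.apache.spark.sql package object: type DataFrame = Dataset[Row]
val firstStep: Dataset[Row]     = abTestDf             // keeping Row: still untyped
val betterStep: Dataset[AbTest] = abTestDf.as[AbTest]  // a dedicated type restores compile-time checks
```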

This way, we can get real analysis errors at compilation time. Once again, this is not true for languages that do not compile.

Conclusion

RDDs are typed and offer a way to get analysis errors at compile time. RDD is a low-level API with no automatic performance optimization. It is also less expressive and is more useful for unstructured data.

DataFrame is more expressive and more efficient (Catalyst Optimizer). However, it is untyped and can lead to runtime errors.

Dataset looks like DataFrame, but it is typed. With Datasets, you get compile-time errors. And that is the point.

**CC Credit Zenika**
