Apache Spark: Window function vs Struct function

Artsiom Yudovin
Analytics Vidhya
Published in
3 min readJun 22, 2020

The goal of this article is to compare the performance of two ways of processing data. The first way is based on the Window function. The second way is based on Struct. These two ways of processing data sometimes can help to achieve the same results but in different ways.

The window function performs a calculation over a group of records called windows that are in some relation to the current record.

The struct function is used to append a StructType column to a DataFrame.

The goal is to find the last parent for each child.

Let’s look on a data:

Figure 1. Input Data

The parent field is StringType. The time field is TimeStampType. The child field is StringType.

Each child should belong to several parents. In this case, we need to find the last parent for each child. The last parent is the parent that has the last timestamp. In results, we will have pairs as parent-child.

Figure 2. Output data

Let’s try to implement it on Apache Spark!

There are several ways to implement it on Apache Spark. From my point of view, two the most readable and the most elegant solutions can be used.

Window function

The solution looks like:

Firstly, the string with the child list has been converted to the array using the UDF function.

After that, the data should be filtered in case of an empty array is existing. The previous steps are necessary to clean data. Now data is ready and we can explode our array. Because of this, we get pair as parent-child but the goal is to find the last parent for our child. Window function helps us with this deal. max($”utc_timestamp”) over byChild returns the max utc_timestamp over a group of records that are related to the current child. As soon as max utc_timestamp has been found for each child, we just need to compare current utc_timestamp with max utc_timestamp. If they are the same, it means we found what we were looking for. It’s the last parent.

Struct function

The solution looks like:

Here, we made the same steps to clear data. The following steps are to prepare a pair of utc_timestamp with parent by using struct function. It’s similar to the tuple. After that, we can find a parent with maximum utc_timestanp by a child using a simple max function. In the case of using the max function with the struct column, this function compares values by the first field in the struct and if these values are the same, the function will compare by the second field in the struct. In the last step, the parent can be gotten from the struct with the following code: select($”p.parent”).

That’s it. We got the same result as in the previous option.

Summary

The story shows two options of transformation data to find the last parent for the child. Both options look good and you can choose any option you prefer more. From my point of view, the second option looks more readable but it requires knowledge about how the struct column works with aggregation functions.

--

--