EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Deep Dive into Apache Spark Array Functions

A practical guide to using array functions

Neeraj Bhadani

Published in

Expedia Group Technology

13 min readJul 16, 2020

An array of solar panels in the field on bright sunny day. — Photo by Chelsea on Unsplash

In this post, we’ll learn about Apache Spark array functions using examples that show how each function works. You may also be interested in my earlier posts on Apache Spark.

Note: the examples that follow use the Scala API. For your easy reference, a Zeppelin notebook exported as a JSON file and also a Scala file are available on GitHub.

Setup

First we need to import the functions we will explore in this blog.

Let’s create some Spark DataFrames that we’ll use to learn about the various array functions.

Here we have created two DataFrames df and full_df which contain two columns and three columns respectively.

Check schema

Let’s check the schema of the above DataFrame full_df

The output shows that col1 is string type and array_col1 and array_col2 are array type:

root
|-- col1: string (nullable = true)
|-- array_col1: array (nullable = true)
|    |-- element: integer (containsNull = true)
|-- array_col2: array (nullable = true)
|    |-- element: integer (containsNull = true)

Similarly for DataFramedf, the schema looks like this:

root
|-- col1: string (nullable = true)
|-- array_col2: array (nullable = true)
|    |-- element: integer (containsNull = true)

In the examples that follow we will use df for functions that take a single array as input and df_full for functions that take two arrays as input.

Array functions

array_contains

If we need to find a particular element is present in array, we can use array_contains function. This function returns true if the value is present in array and false otherwise.

Output:

+----+------------------+------+
|col1|        array_col2|result|
+----+------------------+------+
|   x|   [1, 2, 3, 7, 7]|  true|
|   z|[3, 2, 8, 9, 4, 9]|  true|
|   a|      [4, 5, 2, 8]| false|
+----+------------------+------+

We are looking for value ‘3’ in array column array_col2 and getting true for the first two rows where ‘3’ is present and false for last column because ‘3’ is not present.

array_distinct

This function returns only distinct values from an array and removes duplicate values.

Output:

+----+------------------+---------------+
|col1|        array_col2|         result|
+----+------------------+---------------+
|   x|   [1, 2, 3, 7, 7]|   [1, 2, 3, 7]|
|   z|[3, 2, 8, 9, 4, 9]|[3, 2, 8, 9, 4]|
|   a|      [4, 5, 2, 8]|   [4, 5, 2, 8]|
+----+------------------+---------------+

Duplicate values got removed and only distinct values are present from array column result.

array_except

This function returns the elements from first array which are not present in second array. This is logically equivalent to set subtract operation.

Output:

+----+------------------+------------------+---------+
|col1|        array_col1|        array_col2|   result|
+----+------------------+------------------+---------+
|   x|   [4, 6, 7, 9, 2]|   [1, 2, 3, 7, 7]|[4, 6, 9]|
|   z|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7, 5, 1]|
|   a|      [3, 8, 5, 3]|      [4, 5, 2, 8]|      [3]|
+----+------------------+------------------+---------+

Column result contains elements that are only present in array_col1 but not in array_col2. For example, in the first row the result column contains [4, 6, 9] because these elements are present in array_col1 but not in array_col2.

array_intersect

This function returns common elements from both arrays. This is logically equivalent to set intersection operation.

Output:

+----+------------------+------------------+------+
|col1|        array_col1|        array_col2|result|
+----+------------------+------------------+------+
|   x|   [4, 6, 7, 9, 2]|   [1, 2, 3, 7, 7]|[7, 2]|
|   z|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|   [4]|
|   a|      [3, 8, 5, 3]|      [4, 5, 2, 8]|[8, 5]|
+----+------------------+------------------+------+

Column result contains the elements that are common in both the array columns (array_col1 and array_col2). For example, in the first row the result column contains [7, 2] because these elements are present in botharray_col1 and array_col2 .

array_join

This Function joins all the array elements based on delimiter defined as the second argument.

Output:

+----+------------------+-----------+
|col1|        array_col2|     result|
+----+------------------+-----------+
|   x|   [1, 2, 3, 7, 7]|  1,2,3,7,7|
|   z|[3, 2, 8, 9, 4, 9]|3,2,8,9,4,9|
|   a|      [4, 5, 2, 8]|    4,5,2,8|
+----+------------------+-----------+

Column result contains a string which is a concatenation of all the array elements using a specified delimiter (comma in this example).

Note: if there are any null values then we can replace with third argument (nullReplacement) with any string value.

array_max

This function returns the maximum value from an array.

Output:

+----+------------------+------+
|col1|        array_col2|result|
+----+------------------+------+
|   x|   [1, 2, 3, 7, 7]|     7|
|   z|[3, 2, 8, 9, 4, 9]|     9|
|   a|      [4, 5, 2, 8]|     8|
+----+------------------+------+

Column result contains the maximum value from each array in a row. For example, in the first row the result column contains ‘7’ because this is the maximum element in array [1, 2, 3, 7, 7].

array_min

This function returns the minimum value from an array.

Output:

+----+------------------+------+
|col1|        array_col2|result|
+----+------------------+------+
|   x|   [1, 2, 3, 7, 7]|     1|
|   z|[3, 2, 8, 9, 4, 9]|     2|
|   a|      [4, 5, 2, 8]|     2|
+----+------------------+------+

Column result contains the minimum value from each array in a row. For example, in the first row the result column contains ‘1’ because this is the minimum element in array [1, 2, 3, 7, 7].

array_position

This function returns the position of first occurrence of a specified element. If the element is not present it returns 0.

Let’s try to find the position of element say ‘7’ from column array_col2 .

Output:

+----+------------------+------+
|col1|        array_col2|result|
+----+------------------+------+
|   x|   [1, 2, 3, 7, 7]|     4|
|   z|[3, 2, 8, 9, 4, 9]|     0|
|   a|      [4, 5, 2, 8]|     0|
+----+------------------+------+

In the first row we get position ‘4’ because ‘7’ occurs the first time in position four. For the rest of the rows, we get ‘0’ because ‘7’ is not present.

array_remove

This function removes all the occurrences of an element from an array.

Let’s remove the element ‘7’ from column array_col2.

Output:

+----+------------------+------------------+
|col1|        array_col2|            result|
+----+------------------+------------------+
|   x|   [1, 2, 3, 7, 7]|         [1, 2, 3]|
|   z|[3, 2, 8, 9, 4, 9]|[3, 2, 8, 9, 4, 9]|
|   a|      [4, 5, 2, 8]|      [4, 5, 2, 8]|
+----+------------------+------------------+

All occurrences of element ‘7’ are removed from array.

array_repeat

This function creates an array that is repeated as specified by second argument.

Output:

+----+------------------+----------------------------------------+ |col1|array_col2        |result                                  | +----+------------------+----------------------------------------+ |x   |[1, 2, 3, 7, 7]   |[[1, 2, 3, 7, 7], [1, 2, 3, 7, 7]]      | |z   |[3, 2, 8, 9, 4, 9]|[[3, 2, 8, 9, 4, 9], [3, 2, 8, 9, 4, 9]]| |a   |[4, 5, 2, 8]      |[[4, 5, 2, 8], [4, 5, 2, 8]]            | +----+------------------+----------------------------------------+

Array from array_col2 got repeated 2 times in result column. For example, in the first row the result column contains the array [1, 2, 3, 7, 7] twice.

array_sort

This function sorts the elements of an array in ascending order.

Output:

+----+------------------+------------------+
|col1|        array_col2|            result|
+----+------------------+------------------+
|   x|   [1, 2, 3, 7, 7]|   [1, 2, 3, 7, 7]|
|   z|[3, 2, 8, 9, 4, 9]|[2, 3, 4, 8, 9, 9]|
|   a|      [4, 5, 2, 8]|      [2, 4, 5, 8]|
+----+------------------+------------------+

Array in column result got sorted in ascending order. For example, in the last row the column result contains [2, 4, 5, 8] which is sorted in ascending order.

array_union

This function returns the union of all elements from the input arrays.

Output:

+------------------+------------------+------------------------+ |array_col1        |array_col2        |result                  | 
+------------------+------------------+------------------------+ 
|[4, 6, 7, 9, 2]   |[1, 2, 3, 7, 7]   |[4, 6, 7, 9, 2, 1, 3]   | 
|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7, 5, 1, 4, 3, 2, 8, 9]| 
|[3, 8, 5, 3]      |[4, 5, 2, 8]      |[3, 8, 5, 4, 2]         | 
+------------------+------------------+------------------------+

Note: dropped the col1 to fit the result here in code block.

Column result contains the union of arrays from column array_col1 and array_col2 and contains distinct values only.

arrays_overlap

This function checks if at least one element is common/overlapping in arrays. It returns true if at least one element is common in both array and false otherwise. It returns null if at least one of the arrays is null.

Output:

+----+------------------+------------------+------+
|col1|array_col1        |array_col2        |result|
+----+------------------+------------------+------+
|x   |[4, 6, 7, 9, 2]   |[1, 2, 3, 7, 7]   |true  |
|z   |[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|true  |
|a   |[3, 8, 5, 3]      |[4, 5, 2, 8]      |true  |
+----+------------------+------------------+------+

All the values in result column are true because we have at-least one element common in array_col1 and array_col2 for all rows. For example, in the first row the result column is true because the elements ‘2’ and ‘7’ are present in both columns array_col1 and array_col2.

arrays_zip

This function merges the i-th element of an array and returns array<struct>.

Since both the array columns have same numbers of values, let’s remove some values from one array column and see how it behaves with different values in the array with zip operation.

First, we will remove element ‘2’ from array column array_col2 and then try to zip column array_col1 with newly created column new_array_col

Output:

+------------------+---------------+-------------------------------+
|array_col1        |new_array_col  |result                         |
+------------------+---------------+-------------------------------+
|[4, 6, 7, 9, 2]   |[1, 3, 7, 7]   |[[4, 1], [6, 3] ..., [2,]]     |
|[7, 5, 1, 4, 7, 1]|[3, 8, 9, 4, 9]|[[7, 3], [5, 8] ..., [1,]]     |
|[3, 8, 5, 3]      |[4, 5, 8]      |[[3, 4], ... [3,]]             |
+------------------+---------------+-------------------------------+

In first row, first element of result column is [4, 1] which is a zip of first element from array array_col1 (4) and new_array_col (1). Also, last element of result column is [2,] (which is a zip of 5-th element) and second value is blank because there is no 5-th element in first row of column new_array_col.

Let’s check the type of the result columns as well.

Output:

root
|-- result: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- array_col1: integer (nullable = true)
|    |    |-- new_array_col: integer (nullable = true)

Type of the result column is array<struct>.

concat

This function concatenates all the elements of both arrays into a single one.

Output:

+------------------+------------------+----------------------------+
|array_col1        |array_col2        |result                      |
+------------------+------------------+----------------------------+
|[4, 6, 7, 9, 2]   |[1, 2, 3, 7, 7]   |[4, 6, 7, 9, ..., 3, 7, 7]  |
|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7, 5, 1, 4, ..., 9, 4, 9]  |
|[3, 8, 5, 3]      |[4, 5, 2, 8]      |[3, 8, 5, 3, 4, 5, 2, 8]    |
+------------------+------------------+----------------------------+

Column result contains the array which is a concatenation of arrays in columns array_col1 and array_col2.

Note: In order to fit the output we removed a few elements from result column from the display.

element_at

This function returns the element at a specified index.

Let’s try to get the first element from each array.

Output:

+----+------------------+------+
|col1|array_col2        |result|
+----+------------------+------+
|x   |[1, 2, 3, 7, 7]   |1     |
|z   |[3, 2, 8, 9, 4, 9]|3     |
|a   |[4, 5, 2, 8]      |4     |
+----+------------------+------+

Column result contains the first element from each array. For example, in the first row the result contains ‘1’ because this is first element in the array [1, 2, 3, 7, 7].

flatten

This function returns a single array from array of an arrays. If an array is more than 2 levels deep, it removes one level of nesting from an array.

Let’s first generate the nested array using the function array_repeat as discussed above and then flatten the nested array.

Output:

+-----------------------------------+------------------------------+ 
|repeat                             |result                        | 
+-----------------------------------+------------------------------+ 
|[[1, 2, 3, 7, 7], [1, 2, 3, 7, 7]] |[1, 2, 3, 7, 7, 1, 2, 3, 7, 7]| 
|[[3, 2, 8, 9, 4], [3, 2, 8, 9, 4]] |[3, 2, 8, 9, 4, 3, 2, 8, 9, 4]| 
|[[4, 5, 2, 8], [4, 5, 2, 8]]       |[4, 5, 2, 8, 4, 5, 2, 8]      | 
+----------------------------------------+-------------------------+

Column result contains all the values from an array of arrays from column repeat but in a single array.

map_from_arrays

This function creates a map column. Elements of the first column will be used for keys and second column will be used for values.

Output:

+------------------+------------------+--------------------+ 
|        array_col1|        array_col2|              result| 
+------------------+------------------+--------------------+ 
|   [4, 6, 7, 9, 2]|   [1, 2, 3, 7, 7]|[4 -> 1, 6 -> 2, ...| 
|[7, 5, 1, 4, 7, 1]|[3, 2, 8, 9, 4, 9]|[7 -> 3, 5 -> 2, ...| 
|      [3, 8, 5, 3]|      [4, 5, 2, 8]|[3 -> 4, 8 -> 5, ...| 
+------------------+------------------+--------------------+

Column result contains the map generated from both the input arrays. The first element in first row is 4 -> 1 where ‘4’ is a key which is the first element from first column array_col1 and ‘1’ is the key’s value which is the first element from second column array_col2.

reverse

This function reverses the order of elements in input array.

Output:

+----+------------------+------------------+ 
|col1|        array_col2|            result| 
+----+------------------+------------------+ 
|   x|   [1, 2, 3, 7, 7]|   [7, 7, 3, 2, 1]| 
|   z|[3, 2, 8, 9, 4, 9]|[9, 4, 9, 8, 2, 3]| 
|   a|      [4, 5, 2, 8]|      [8, 2, 5, 4]| 
+----+------------------+------------------+

Column result contains the reverse of array present in column array_col2. For e.g in first row, result contains [7, 7, 3, 2, 1] which is reverse of array [1, 2, 3, 7, 7] from column aray_col2 .

shuffle

This function shuffles the elements of array randomly.

Output:

+----+------------------+------------------+ 
|col1|        array_col2|            result| 
+----+------------------+------------------+ 
|   x|   [1, 2, 3, 7, 7]|   [2, 7, 1, 7, 3]| 
|   z|[3, 2, 8, 9, 4, 9]|[3, 8, 9, 4, 9, 2]| 
|   a|      [4, 5, 2, 8]|      [8, 4, 2, 5]| 
+----+------------------+------------------+

Column result contains shuffled elements from column array_col2. In other words, order of elements in result column is random. For example, in the first row the result column contains [2, 7, 1, 7, 3] which is the shuffled output of array [1, 2, 3, 7, 7] from column array_col2.

Note: Try executing the shuffle function multiple times. The order of values in the result column will be different for each execution.

size

This function returns a number of elements in an array or map.

Output:

+----+------------------+------+ 
|col1|        array_col2|result| 
+----+------------------+------+ 
|   x|   [1, 2, 3, 7, 7]|     5| 
|   z|[3, 2, 8, 9, 4, 9]|     6| 
|   a|      [4, 5, 2, 8]|     4| 
+----+------------------+------+

Column result contains the size(number of elements) of an array in column array_col2. For example, in the first row the result column contains ‘5’ because number of elements in [1, 2, 3, 7, 7] is 5.

slice

This function slices the array into a sub-array. We can specify the start of the index as second argument and number of elements as third argument.

Note: Arrays in spark start with index 1. It also supports negative indexing to access the elements from last.

Let’s try to create a sub-array of 3 elements starting from index 2.

Output:

+----+------------------+---------+ 
|col1|        array_col2|   result| 
+----+------------------+---------+ 
|   x|   [1, 2, 3, 7, 7]|[2, 3, 7]| 
|   z|[3, 2, 8, 9, 4, 9]|[2, 8, 9]| 
|   a|      [4, 5, 2, 8]|[5, 2, 8]| 
+----+------------------+---------+

In first row, result contains sub-array [2, 3, 7] which is created with 3 elements from index 2 in [1, 2, 3, 7, 7].

sort_array

This function sorts the array in ascending order by default. However, we can sort in descending order with second arg as asc=false.

Output:

+----+------------------+------------------+ 
|col1|        array_col2|            result| 
+----+------------------+------------------+ 
|   x|   [1, 2, 3, 7, 7]|   [7, 7, 3, 2, 1]| 
|   z|[3, 2, 8, 9, 4, 9]|[9, 9, 8, 4, 3, 2]| 
|   a|      [4, 5, 2, 8]|      [8, 5, 4, 2]| 
+----+------------------+------------------+

Column result is sorted in descending order. For example, in the first row the result column contains [7, 7, 3, 2, 1] which is the descending sorted result of array[1, 2, 3, 7, 7] from column array_col2 .

explode

This function creates a new row for each element of an array or map.

Let’s first create new column with fewer values to explode.

Output:

+----+---------+ 
|col1|slice_col| 
+----+---------+ 
|   x|   [1, 2]| 
|   z|   [3, 2]| 
|   a|   [4, 5]| 
+----+---------+

slice_col contains 2 elements in an array. So upon explode, this generates 2 rows for each array.

Now let's try to explode the column slice_col.

Output:

+----+---------+------+ 
|col1|slice_col|result| 
+----+---------+------+ 
|   x|   [1, 2]|     1| 
|   x|   [1, 2]|     2| 
|   z|   [3, 2]|     3| 
|   z|   [3, 2]|     2| 
|   a|   [4, 5]|     4| 
|   a|   [4, 5]|     5| 
+----+---------+------+

Upon explode, 2 rows are generated for each element of an array in column slice_col.

posexplode

This function creates a new row for each element with position of an array or map.

Let’s first create new column with fewer values to explode.

Now let’s try to explode the slice_col with a position as well.

Output:

+----+---------+---+---+ 
|col1|slice_col|pos|col| 
+----+---------+---+---+ 
|x   |[1, 2]   |0  |1  | 
|x   |[1, 2]   |1  |2  | 
|z   |[3, 2]   |0  |3  | 
|z   |[3, 2]   |1  |2  | 
|a   |[4, 5]   |0  |4  | 
|a   |[4, 5]   |1  |5  | 
+----+---------+---+---+

Here 2 more columns got generated as pos (represent position) and col (represent element value).

I hope you have enjoyed learning about working with various Array functions in Apache Spark. For easy reference, a notebook containing the examples above is available on GitHub.

Reference

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions

EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Deep Dive into Apache Spark Array Functions

A practical guide to using array functions

Setup

Check schema

Array functions

array_contains

array_distinct

array_except

array_intersect

array_join

array_max

array_min

array_position

array_remove

array_repeat

array_sort

array_union

arrays_overlap

arrays_zip

concat

element_at

flatten

map_from_arrays

reverse

shuffle

size

slice

sort_array

explode

posexplode

Reference

Written by Neeraj Bhadani