The Impact of Spark Filter on Filtered Views

Mithlesh Vishwakarma · Globant · Jul 6, 2023

Apache Spark can read data from a view whose definition already includes a filter. How does Spark combine the view's filter with filters applied at read time? There are several possibilities: the read filter may also target the partition column, it may target a different column entirely, or it may fall outside the view's range. Let's explore these scenarios and understand their performance implications.

Setting up

Let’s consider an example using a table called events that stores information about events. The events table has a column called event_date which represents the date of the event. Assume the data is partitioned based on the event_date column.

CREATE TABLE events (
  event_id INT,  -- Spark SQL has no PRIMARY KEY constraint; event_id is a plain column
  event_type STRING,
  event_date DATE
)
USING PARQUET
PARTITIONED BY (event_date);  -- the partition column must be declared in the schema

We create a view called current_month_events that contains data only for the current month using the following command:

CREATE VIEW current_month_events AS
SELECT *
FROM events
-- trunc(CURRENT_DATE, 'MONTH') is the first day of the current month,
-- so the view covers exactly the current calendar month
WHERE event_date >= trunc(CURRENT_DATE, 'MONTH')
  AND event_date < trunc(CURRENT_DATE, 'MONTH') + INTERVAL 1 MONTH;
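
A few illustrative rows make the use cases below concrete (the values are made up for demonstration; any rows with similar dates will do):

INSERT INTO events VALUES
  (1, 'corporate', CURRENT_DATE),                    -- inside the view's window
  (2, 'meetup',    CURRENT_DATE),                    -- inside the view's window
  (3, 'webinar',   CURRENT_DATE - INTERVAL 2 MONTH); -- outside the view's window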

Use Case 1: Spark filter applied over the view partition column

If you read data from current_month_events and apply a filter on the event_date column, Spark combines the view's predicate with yours and reads only the partitions that satisfy both. This improves performance, as Spark doesn't need to read all the data from the events table.

For example, the following command reads data from current_month_events and filters it down to events that occurred today:

SELECT *
FROM current_month_events
WHERE event_date = CURRENT_DATE;

This command reads only the single event_date = CURRENT_DATE partition from the events table, not the whole month, leading to improved performance.
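
You can verify the pruning with EXPLAIN. In the physical plan, the date predicates should appear in the scan's PartitionFilters rather than as a post-scan filter. This is a sketch; the exact plan text varies with the Spark version and data source:

EXPLAIN FORMATTED
SELECT *
FROM current_month_events
WHERE event_date = CURRENT_DATE;
-- Look for a line like the following on the FileScan node:
--   PartitionFilters: [isnotnull(event_date), (event_date = ...), ...]
-- Only the matching partitions are read from storage.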

To summarize: a view over a table partitioned on event_date is not a new table and holds no data of its own; it is a stored query with the same schema as the original table. When you read the view with Spark and filter on the event_date partition column, Spark merges the view's predicate with yours and reads only the matching partitions, improving performance.

Use Case 2: View filter on the partition column, but Spark read filter on a different column

If the view filter is on the partition column but the Spark read filter is not, Spark still prunes partitions using the view's predicate, but it must read the complete month of data and then apply your filter row by row.

Consider the current_month_events view again. Its filter is on the event_date partition column, so only the current month's data is available to Spark. Now filter on a non-partition column:

SELECT *
FROM current_month_events
WHERE event_type = 'corporate';

Because event_type is not the partition column, Spark reads the complete month of partitions and applies the event_type filter on top of the loaded rows. Partition pruning from the view still helps, but every row within the month is scanned; if no row matches the filter, the query returns an empty result.
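
EXPLAIN makes the contrast with Use Case 1 visible: only the view's month predicate lands in PartitionFilters, while the event_type predicate shows up with the pushed data filters and in a Filter step above the scan. Again a sketch; output differs across versions and data sources:

EXPLAIN FORMATTED
SELECT *
FROM current_month_events
WHERE event_type = 'corporate';
-- PartitionFilters: only the view's event_date range (month pruning)
-- PushedFilters:    [IsNotNull(event_type), EqualTo(event_type,corporate)]
-- A Filter node above the scan re-applies the event_type predicate.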

Use Case 3: Filter on a day outside the range of the view query filter

If you apply a filter on a day that falls outside the range of the view's filter, Spark returns no data. The view's filter defines the data visible through the view, so asking for days outside that window yields an empty result.

For instance, current_month_events only contains data for the current month, but the following query asks for events that occurred two months ago:

SELECT *
FROM current_month_events
WHERE event_date = CURRENT_DATE - INTERVAL 2 MONTH;

This command tries to retrieve events that happened two months ago, but the current_month_events view only exposes data for the current month. The two predicates cannot both be true, so Spark returns no data.
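
A quick sanity check with the sample rows inserted earlier: the base table does contain a row from two months ago, but the view hides it:

SELECT COUNT(*) AS matching_events
FROM current_month_events
WHERE event_date = CURRENT_DATE - INTERVAL 2 MONTH;
-- Returns 0: the view's month window and this predicate are contradictory,
-- so no partition can satisfy both.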

Use Case 4: Spark adds a filter on top of the view's existing filter

When Spark applies an additional filter to a view that already has one, the query plan conceptually has two steps: the first reads the data through the view, and the second applies the additional filter. (These are plan operators, not Spark "stages" in the shuffle-boundary sense; a filter alone never introduces a stage boundary.)

The first step scans the view. The view's filter is applied here, so only the view's data flows downstream.

The second step is a filter operation. Because the Spark filter is not already satisfied by the view's filter, it is applied to the rows produced by the first step. In practice, Spark's Catalyst optimizer usually collapses the two predicates into a single Filter over the scan and pushes any partition predicates into the scan itself.

The following is a conceptual outline of the plan when Spark applies an additional filter on a view that already has one:

Step 1: Read data through the view
- Scan the events table for view `current_month_events`
- Apply the view filter `event_date >= trunc(CURRENT_DATE, 'MONTH')` and `event_date < trunc(CURRENT_DATE, 'MONTH') + INTERVAL 1 MONTH`
Step 2: Apply the Spark filter
- Apply `event_type = 'corporate'` to the rows produced by Step 1
Result: rows that satisfy both the view filter and the Spark filter

Note: the exact plan may vary depending on the Spark configuration and the data being processed.
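
To see both filters in one place, EXPLAIN EXTENDED prints the analyzed and optimized logical plans. In the optimized plan you would typically see a single Filter node whose condition ANDs the view's predicate with the added event_type predicate. A sketch; the plan text varies by version:

EXPLAIN EXTENDED
SELECT *
FROM current_month_events
WHERE event_type = 'corporate';
-- Optimized logical plan, roughly:
--   Filter ((event_date >= ...) AND (event_date < ...) AND (event_type = corporate))
--   +- Relation events[event_id, event_type, event_date] parquet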

Conclusion

  • Spark can read data from a view that already has a filter applied during its creation.
  • There are different scenarios depending on how the Spark filter is applied.
  • In some cases, Spark will only read the data for the specified partition, improving performance.
  • In other cases, Spark will read the complete month of data and then apply the filter.
  • If you apply a filter on a day that falls outside the range of the view query filter, Spark will not return any data.
  • When Spark adds an additional filter to a view that already has one, the plan conceptually has two steps: the view scan (with the view's filter) and the additional filter.

Key points

  • The performance implications of these different scenarios can vary depending on the size of the data set and the specific Spark configuration.
  • It is important to understand how Spark handles filters when working with views, in order to optimize performance and avoid unexpected results.
