NO NEED FOR ARCHIVING DATA ANYMORE

Saurabh Kishore Tiwari
The Thought Mill
2 min read · Jan 30, 2024

LAYERS: A LIGHTWEIGHT UTILITY BUILT FOR PARTITION PRUNING

Big data processing with Apache Spark often means reading massive datasets and applying filter conditions to extract the relevant information. By default, however, Spark lists and scans every file under the base path before applying those filters, which hurts performance, especially on large volumes of data.

The Challenge

Consider a scenario where you have a DataFrame (df) in Spark and want to apply filters on both partition and non-partition columns:

val dfAfterApplyingFilters = df
  .filter(col("partn_col1") === "val1"
    && col("partn_col2") =!= 123.5
    && col("non_partn_col").isin(234.6, 345.7))

In this case, Spark scans all files in the base path before applying the filters, which can be time-consuming and inefficient for big data sets.
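
To make the problem concrete, here is a minimal plain-Spark sketch of the kind of read the snippet above assumes. The base path is the same illustrative one, and inputFiles (standard Spark API) shows how many files Spark resolved for the DataFrame, i.e. the size of the full listing.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("baseline-read").getOrCreate()

// Read everything under the base path; Spark lists every file here,
// even though the filters above only need a handful of partitions.
val df = spark.read.orc("base_path")

// inputFiles returns the files backing the DataFrame, making the full listing visible.
println(df.inputFiles.length)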

More examples can be found in the test suite of the library introduced below: https://github.com/Saurabh975/layers/blob/main/src/test/scala/io/github/saurabh975/layers/util/FilterPathsBasedOnPredicateTest.scala

Introducing Layers

Enter Layers, a utility designed to optimize Spark reads. Instead of handing Spark an entire base path, the library resolves only the relevant partition paths up front, so Spark spends its time processing data in parallel rather than listing and scanning files it will later filter out.

How Layers Works

Let’s dive into an example of how Layers optimizes Spark reads. Instead of asking Spark to prune partitions after it has listed everything, you resolve the relevant paths first and hand only those to the read:

val basePath: String = "base_path" // path from which you want to read
val predicates: Map[Column, List[Predicate]] = Map(
  Column("partn_col1") -> List(equal("val1")),
  Column("partn_col2") -> List(notEqual(123.5)))

val filteredPaths: List[Path] = FilterPathsBasedOnPredicate.filter(spark, List(basePath), predicates)

val dataWithFilteredPaths = ORCReader.read(spark,
  Map("basePath" -> basePath),
  filteredPaths.map(_.toString): _*)

val dfAfterApplyingFilters = dataWithFilteredPaths
  .filter(col("non_partn_col").isin(234.6, 345.7))
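
Conceptually, the read step boils down to the standard Spark pattern of passing an explicit list of partition directories while keeping the basePath option, so the partition columns stay in the schema. A rough plain-Spark sketch of that idea (not necessarily how ORCReader is implemented):

// Plain-Spark equivalent of reading only the pruned paths (a sketch, not the
// library's internals): the "basePath" option keeps partn_col1/partn_col2 as
// columns even though concrete partition directories are passed in.
val prunedDf = spark.read
  .option("basePath", basePath)
  .orc(filteredPaths.map(_.toString): _*)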

Operators Supported

Layers supports a variety of operators for both numeric and string values (a small sketch follows the list):

- For numeric values (Int, Long, Float, Double only):
  - `<`
  - `<=`
  - `>`
  - `>=`
  - `between`
  - `equal`
  - `notEqual`
  - `in`
- For String values:
  - `equal`
  - `notEqual`
  - `in`
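
To make this concrete, here is a hypothetical predicate map mixing several of these operators. The equal and notEqual constructors are the ones used in the example above; the between and in calls are assumptions about how those operators are constructed, so check the test suite linked earlier for the exact signatures.

// Hypothetical sketch: equal and notEqual match the earlier example; between and
// in are assumed constructor forms for the operators listed above. Column names
// are illustrative.
val morePredicates: Map[Column, List[Predicate]] = Map(
  Column("partn_col1") -> List(equal("val1")),
  Column("partn_col2") -> List(notEqual(123.5), between(100.0, 200.0)),
  Column("partn_col3") -> List(in("a", "b", "c")))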

Additional Utilities

In addition to path optimization, Layers provides read utilities for various file formats (a quick example follows the list):

  1. ORCReader for reading ORC files
  2. ParquetReader for reading Parquet files
  3. JSONReader for reading JSON files
  4. CSVReader for reading CSV files
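
For instance, reading Parquet instead of ORC should just be a matter of swapping the reader in the earlier snippet. The sketch below assumes ParquetReader shares ORCReader's (spark, options, paths) call shape, which is worth confirming against the test suite linked earlier.

// A sketch only: assumed to mirror ORCReader.read(spark, options, paths*) from the example above.
val parquetDf = ParquetReader.read(spark,
  Map("basePath" -> basePath),
  filteredPaths.map(_.toString): _*)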

Adding Layers to your project

1. build.sbt

libraryDependencies += "io.github.saurabh975" %% "layers" % "1.0.0"

2. pom.xml

<dependency>
  <groupId>io.github.saurabh975</groupId>
  <artifactId>layers_2.13</artifactId>
  <version>1.0.0</version>
</dependency>
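
Once the dependency is on the classpath, the utilities used above need to be imported. The package below is inferred from the test path linked earlier, so treat it as an assumption and confirm against the repository.

// Package inferred from the linked test file (io.github.saurabh975.layers.util.*);
// the exact import paths should be confirmed against the repository.
import io.github.saurabh975.layers.util.FilterPathsBasedOnPredicate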

Conclusion

With Layers, you can take control of your Spark reads, optimizing performance and processing big data efficiently. Embrace the power of distributed computing and let Spark focus on what it does best.

Ready to elevate your Spark experience? Try Layers today!
