NO NEED FOR ARCHIVING DATA ANYMORE
LAYERS — LIGHT UTIL BUILT FOR PARTITION PRUNING
Big data processing with Apache Spark often involves reading massive datasets and applying filter conditions to extract relevant information. However, the default behavior of Spark involves scanning all files in the base path before applying filters, leading to suboptimal performance, especially when dealing with large volumes of data.
The Challenge
Consider a scenario where you have a DataFrame (df) in Spark and you want to apply filters on specific columns:
val dfAfterApplyingFilters = df
  .filter(col("partn_col1") === "val1"
    && col("partn_col2") =!= 123.5
    && col("non_partn_col").isin(234.6, 345.7))
In this case, Spark scans all files in the base path before applying the filters, which can be time-consuming and inefficient for big data sets.
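The scan-everything behavior stems from how Hive-style layouts encode partition values directly in directory names, which means matching on path segments can rule out whole partitions without reading a single row. Below is a minimal, Spark-free Scala sketch of that idea; the paths are hypothetical and the substring matching is deliberately naive (a real implementation would parse the `key=value` segments):

```scala
// Sketch of partition pruning as pure path matching (no Spark involved).
object PartitionPruningSketch {
  // Hypothetical listing of a Hive-style partitioned base path.
  val allPaths: List[String] = List(
    "base_path/partn_col1=val1/partn_col2=123.5",
    "base_path/partn_col1=val1/partn_col2=456.7",
    "base_path/partn_col1=val2/partn_col2=123.5"
  )

  // Keep only partitions where partn_col1 == "val1" and partn_col2 != 123.5.
  // Naive substring matching; real code would parse each key=value segment.
  def prune(paths: List[String]): List[String] =
    paths.filter(p =>
      p.contains("partn_col1=val1") && !p.contains("partn_col2=123.5"))

  def main(args: Array[String]): Unit =
    prune(allPaths).foreach(println)
}
```

Only `base_path/partn_col1=val1/partn_col2=456.7` survives, so a reader handed this pruned list never touches the other two partitions.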
More examples can be found here: https://github.com/Saurabh975/layers/blob/main/src/test/scala/io/github/saurabh975/layers/util/FilterPathsBasedOnPredicateTest.scala
Introducing Layers
Enter Layers, a utility designed to optimize Spark reads. Instead of letting Spark list and scan every file under the base path, Layers resolves the partition predicates up front and hands Spark only the paths that can match, while the data itself is still processed in parallel through Spark’s distributed computing capabilities.
How Layers Works
Let’s dive into an example of how Layers can optimize Spark reads. Instead of relying on Spark to filter the DataFrame, Layers enables you to provide only the relevant paths:
val basePath: String = "base_path" // path from which you want to read

val predicates: Map[Column, List[Predicate]] = Map(
  Column("partn_col1") -> List(equal("val1")),
  Column("partn_col2") -> List(notEqual(123.5)))

val filteredPaths: List[Path] = FilterPathsBasedOnPredicate.filter(spark, List(basePath), predicates)

val dataWithFilteredPaths = ORCReader.read(spark,
  Map("basePath" -> basePath),
  filteredPaths.map(_.toString): _*)

// Only the non-partition filter still needs to run inside Spark
val dfAfterApplyingFilters = dataWithFilteredPaths
  .filter(col("non_partn_col").isin(234.6, 345.7))
Operators Supported
Layers supports a variety of operators for both numeric and string values, including:
- For numeric values (Int, Long, Float, and Double only)
- `<`
- `<=`
- `>`
- `>=`
- `between`
- `equal`
- `notEqual`
- `in`
- For String values
- `equal`
- `notEqual`
- `in`
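Conceptually, each operator reduces to a boolean predicate over a partition value, and a partition is kept only when it satisfies every predicate listed for its column. The self-contained Scala sketch below models a few of the numeric operators this way; it is an illustration of the idea, not the library's actual API:

```scala
// Modeling the numeric operators as predicates over a partition value.
// This mirrors the concept only; Layers' real Predicate type may differ.
object NumericOperatorSketch {
  type Pred = Double => Boolean

  def lt(bound: Double): Pred               = _ < bound
  def gte(bound: Double): Pred              = _ >= bound
  def between(lo: Double, hi: Double): Pred = v => v >= lo && v <= hi
  def in(values: Set[Double]): Pred         = values.contains
  def notEqual(x: Double): Pred             = _ != x

  // A partition value survives only if it satisfies every predicate.
  def satisfiesAll(value: Double, preds: List[Pred]): Boolean =
    preds.forall(_(value))
}
```

For example, the partition value `123.5` satisfies `between(100.0, 200.0)` combined with `notEqual(150.0)`, but fails as soon as `notEqual(123.5)` is added to the list.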
Additional Utilities
In addition to path optimization, Layers provides read utilities supporting various file formats:
- ORCReader for reading ORC files
- ParquetReader for reading Parquet files
- JSONReader for reading JSON files
- And CSVReader for reading CSV files
Including it in your project
1. build.sbt
libraryDependencies += "io.github.saurabh975" %% "layers" % "1.0.0"
2. pom.xml
<dependency>
  <groupId>io.github.saurabh975</groupId>
  <artifactId>layers_2.13</artifactId>
  <version>1.0.0</version>
</dependency>
Conclusion
With Layers, you can take control of your Spark reads, optimizing performance and processing big data efficiently. Embrace the power of distributed computing and let Spark focus on what it does best.
Ready to elevate your Spark experience? Try Layers today!