Project Tungsten and Catalyst SQL optimizer

saurabh goyal
Nov 30, 2018 · 3 min read

Project Tungsten/off-heap Searlizier

The goal of tungsten substantially improve the memory and CPU efficiency of the Spark applications and push the limits of the underlying hardware. The focus on CPU efficiency is motivated by the fact that Spark workloads are increasingly bottlenecked by CPU and memory use rather than IO and network communication.

Memory Optimization

As there are many memory overheads while writing the object to java heap.

Consider a simple string “abcd” that would take 4 bytes to store using UTF-8 encoding. JVM’s native String implementation, however, stores this differently to facilitate more common workloads. It encodes each character using 2 bytes with UTF-16 encoding, and each String object also contains a 12 byte header and 8 byte hash code.

Manual memory management by leverage application semantics, which can be very risky if you do not know what you are doing, is a blessing with Spark. We used knowledge of data schema (DataFrames) to directly layout the memory ourselves. It not only gets rid of GC overheads but lets you minimize the memory footprint. Schema information help to serialized data in less memory.

There are encoders available for Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ for serializing data.

Aggregation and sorting operation can be done over serialized data itself.

Code Generation

Code generation can be used to optimize the CPU efficiency of internal components. code generation is to speed up the conversion of data from in-memory binary format to wire-protocol for the shuffle. As mentioned earlier, the shuffle is often bottlenecked by data serialization rather than the underlying network. With code generation, we can increase the throughput of serialization, and in turn, increase shuffle network throughput.

The code generated serializer exploits the fact that all rows in a single shuffle have the same schema and generates specialized code for that. This made the generated version over 2X faster to shuffle than the Kryo version.

Code Generation also improves efficiency for generating better and optimized bytecodes for relational expression.

https://spoddutur.github.io/spark-notes/deep_dive_into_storage_formats.html

Encoders

Catalyst Query optimizer

Catalyst Compiles Spark SQL programs to an RDD. It optimizes relational expression on DataFrame/DataSet to speed up data processing.

Structured Data is easy to optimize.

Catalyst has knowledge of all the data types and knows the exact schema of our data and has detailed knowledge of computation of we like to perform which helps it to optimize the operations.

Optimizations by Catalyst.

  1. Reordering Operations.

The laziness of transformation operations gives us the opportunity to rearrange/reorder the transformations operations before they are executed.

Catalyst can decide to rearrange the filter operations pushing all filters as early as possible so that expensive operation like join/count is performed on fewer data.

2. Reduce the amount of data we must-read.

Skip reading in, serializing and sending around parts of the dataset that aren’t needed for our computations. It is difficult to find the part of data which are not required inside the RDD because it is not structured but in structured we can easily remove columns which are not required.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade