Spark can analyze data stored in files of many different formats: plain text, JSON, XML, Parquet, and more. But just because you can get a Spark job to run on a given input format doesn’t mean you’ll get the same performance with all of them. In fact, the performance difference can be quite substantial.
When you are designing the datasets for your Spark application, you need to ensure that you are making the best use of the file formats available in Spark.
Spark file format considerations
- Spark is optimized for read throughput with Apache Parquet and ORC: its vectorized readers speed up decoding, and reading only the needed columns reduces disk I/O. Columnar formats work well.
- Use the Parquet file format and make use of compression.
- There are different file formats and built-in data sources that can be used in Apache Spark. Use splittable file formats.
- Ensure that there are not too many small files. If you have many small files, it might make sense to do compaction of them for better performance.
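As a rough sketch of why the small-files point matters, the arithmetic of compaction is simple: given a total input size and a target output file size (the numbers below are hypothetical, not from this article), you can estimate how many compacted files a job should produce.

```python
import math

def plan_compaction(total_bytes: int, target_file_bytes: int) -> int:
    """Estimate how many output files a compaction job should produce
    so that each file is close to the target size."""
    if total_bytes <= 0:
        return 0
    return max(1, math.ceil(total_bytes / target_file_bytes))

# e.g. thousands of small files totalling ~300 GB, targeting ~3 GB files:
total = 300 * 1024**3
target = 3 * 1024**3
print(plan_compaction(total, target))  # -> 100
```

Turning thousands of tiny files into ~100 large ones means far fewer tasks and file-open operations for the downstream Spark job.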
In many scenarios, however, you won’t have your data in these formats, and as a data engineer it’s your job to get the data ready for downstream applications.
In this blog, we will look at such an example and use StreamSets Data Collector to prepare our dataset.
Let’s say we have several thousand small files in Google Cloud Storage. We want to convert these numerous small files to large Parquet files so we can more quickly analyze the data.
To do this, we set up two pipelines. The first pipeline reads from Google Cloud Storage and writes large Avro files to a local file system. The second, the Parquet Conversion pipeline, reads the Avro files as whole files and transforms each file to a corresponding Parquet file, which is written back to GCS.
Pipeline 1: Write Avro pipeline
The Write Avro pipeline can be pretty simple: it just needs to write large Avro files. You can make it complicated by performing additional processing, but at its most basic, all you need is an origin and an Avro-enabled file-based destination to write the Avro files.
In the Local FS destination, we carefully configure the following properties:
- Directory Template — This property determines where the Avro files are written. The directory template that we use determines where the origin of the second pipeline picks up the Avro files. We’ll use
- File Prefix — This is an optional prefix for the output file name. When writing to a file, the Local FS destination creates a temporary file that uses `tmp_` as a file name prefix. To ensure that the second pipeline picks up only fully-written output files, we define a file prefix for the output files. It can be something simple, like
- Read Order — To read the files in the order they are written, we use `Last Modified Timestamp`.
- Max File Size — The Avro files are converted to Parquet files in the second pipeline, so we want to ensure that these output files are fairly large. Let’s say we want the Parquet files to be roughly 3 GB in size, so we set Max File Size to 3 GB.
- Important: To transform the Avro files, the Whole File Transformer requires Data Collector memory and storage equivalent to the maximum file size. Consider this requirement when setting the maximum file size.
- Data Format — We select `Avro` as the data format, and specify the location of the Avro schema to be used.
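StreamSets directory templates can embed time functions such as `${YYYY()}` and `${MM()}` so that output is partitioned by time. As an illustration only — the function below is a simplified stand-in for Data Collector’s own template expansion, and the template path is hypothetical — here is how such a template maps to a concrete output directory:

```python
from datetime import datetime, timezone

def expand_template(template: str, when: datetime) -> str:
    """Simplified stand-in for Data Collector's directory-template
    expansion: replaces a few time tokens with zero-padded values."""
    tokens = {
        "${YYYY()}": f"{when.year:04d}",
        "${MM()}": f"{when.month:02d}",
        "${DD()}": f"{when.day:02d}",
        "${HH()}": f"{when.hour:02d}",
    }
    for token, value in tokens.items():
        template = template.replace(token, value)
    return template

template = "/data/avro/${YYYY()}-${MM()}-${DD()}"
print(expand_template(template, datetime(2019, 7, 4, tzinfo=timezone.utc)))
# -> /data/avro/2019-07-04
```

Whatever template you choose here is the directory the second pipeline’s origin must point at, so it is worth fixing it early and treating it as part of the contract between the two pipelines.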
Pipeline 2: Parquet Conversion pipeline
The Parquet Conversion pipeline is simple as well. It uses the whole file data format, so no processing of file data can occur except the conversion of the Avro files to Parquet.
Note the following configuration details:
In the Directory origin, we configure the following properties:
- Files Directory — To read the files written by the first pipeline, point this to the directory used in the Local FS Directory Template property:
- File Name Pattern — To pick up all output files, use a glob for file name pattern:
- We use the `avro_` prefix in the pattern to avoid reading the active temporary files generated by the Local FS destination.
- Data Format — We use `Whole File` to stream the data to the Whole File Transformer.
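The effect of the prefix convention can be sketched with Python’s `fnmatch`: a pattern anchored on the output prefix picks up finished files while skipping the `tmp_`-prefixed file the Local FS destination is still writing. The file names below are hypothetical, and `avro_` is assumed as the configured output prefix:

```python
from fnmatch import fnmatch

# Hypothetical directory listing: finished output files plus an
# in-progress temporary file written by the Local FS destination.
files = [
    "avro_sdc-001.avro",
    "avro_sdc-002.avro",
    "tmp_avro_sdc-003.avro",  # still being written; must be skipped
]

pattern = "avro_*"
ready = [name for name in files if fnmatch(name, pattern)]
print(ready)  # -> ['avro_sdc-001.avro', 'avro_sdc-002.avro']
```

Because the pattern is anchored at the start of the name, the temporary file never matches, so the second pipeline can only ever see fully-written output.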
We configure the other properties as needed. The Whole File Transformer processor converts Avro files to Parquet in memory, then writes temporary Parquet files to a local directory. These whole files are then streamed from the local directory to GCS. When using the Whole File Transformer, we must ensure that the Data Collector machine has the required memory and storage available. Since we set the maximum file size in the Write Avro pipeline to 3 GB, we must ensure that Data Collector has 3 GB of available memory and storage before running this pipeline. In the Whole File Transformer, we also have the following properties to configure:
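One way to sanity-check part of that requirement before starting the pipeline is to verify free space on the filesystem that holds the temporary file directory. This helper is a hypothetical sketch — it checks only disk space, not the Data Collector JVM’s available memory — with the 3 GB figure taken from the Write Avro pipeline above:

```python
import shutil
import tempfile

def has_room_for(temp_dir: str, max_file_bytes: int) -> bool:
    """Return True if the filesystem holding temp_dir has at least
    max_file_bytes of free space -- the minimum needed to stage one
    converted file on disk. (Does not check JVM heap.)"""
    return shutil.disk_usage(temp_dir).free >= max_file_bytes

MAX_FILE_SIZE = 3 * 1024**3  # 3 GB, matching the Write Avro pipeline
print(has_room_for(tempfile.gettempdir(), MAX_FILE_SIZE))
```

Running a check like this as a pre-flight step avoids discovering mid-pipeline that the transformer cannot stage a file.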
- Job Type — We’re using the `Convert Avro to Parquet` job.
- Temporary File Directory — This is a directory local to Data Collector for the temporary Parquet files. We can use the default,
We can configure other temporary Parquet file properties and Parquet conversion properties as well, but the defaults are fine in this case.
Conclusion: In this blog, we learned how to leverage the Whole File Transformer processor to convert Avro files to Parquet and meet our Spark application’s requirements.