Data Preparation: Getting Spark ready with StreamSets Data Collector

Rishi Jain
Feb 1 · 4 min read

Spark can analyze data stored in files in many different formats: plain text, JSON, XML, Parquet, and more. But just because you can get a Spark job to run on a given input format doesn’t mean you’ll get the same performance with all of them. In fact, the performance difference can be quite substantial.

When you are designing the datasets for your Spark application, you need to ensure that you are making the best use of the file formats available to Spark.
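To see why the format matters, here is a minimal PySpark sketch (not from the original post; the bucket paths and the event_type column are hypothetical, and reading gs:// paths assumes the GCS Hadoop connector is configured). The same aggregation over a JSON copy of the data must parse every record in full, while the Parquet copy lets Spark read only the one column it needs.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-comparison").getOrCreate()

    # JSON: every record is parsed in full, even though we only need one column.
    json_df = spark.read.json("gs://my-bucket/events-json/")
    json_df.groupBy("event_type").count().show()

    # Parquet: columnar layout lets Spark scan just the event_type column.
    parquet_df = spark.read.parquet("gs://my-bucket/events-parquet/")
    parquet_df.groupBy("event_type").count().show()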

Spark file format considerations

  • Spark is optimized for read throughput with Apache Parquet and ORC, and its vectorized readers reduce disk I/O; columnar formats work well.
  • Use the Parquet file format and make use of compression.
  • Apache Spark supports many different file formats and built-in data sources; prefer splittable file formats.
  • Ensure that there are not too many small files. If you have many small files, it usually makes sense to compact them for better performance (see the sketch after this list).
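As a minimal sketch of the last two points (the paths, partition count, and codec are illustrative assumptions, not from the post), a Spark job can compact many small input files into a handful of large, compressed, splittable Parquet files:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    # Read many small input files (illustrative location).
    raw = spark.read.json("gs://my-bucket/raw-small-files/")

    # Repartition down to a few large output files and write compressed Parquet;
    # snappy is Spark's default Parquet codec and the files remain splittable.
    (raw.repartition(16)
        .write.mode("overwrite")
        .option("compression", "snappy")
        .parquet("gs://my-bucket/compacted-parquet/"))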

In many real-world scenarios, however, your data won’t arrive in this form, and as a data engineer it’s your job to get it ready for downstream applications.

In this blog, we will look at one such example and use StreamSets Data Collector to prepare our dataset.

Let’s say we have several thousand small files in Google Cloud Storage. We want to convert these numerous small files into large Parquet files so we can analyze the data more quickly.

To do this, we set up two pipelines. The first pipeline reads from Google Cloud Storage and writes large Avro files to a local file system. The second, Parquet Conversion, pipeline reads the Avro files as whole files and transforms each file into a corresponding Parquet file, which is written back to GCS.


Pipeline 1: Write Avro Pipeline


The Write Avro pipeline can be pretty simple: it just needs to write large Avro files. You can make it complicated by performing additional processing, but at its most basic, all you need is an origin and an Avro-enabled file-based destination to write the Avro files.

In the Local FS destination, we carefully configure the following properties:

  • Directory Template — This property determines where the Avro files are written. The directory template we use here also determines where the origin of the second pipeline picks up the Avro files.
  • File Prefix — This is an optional prefix for the output file name. When writing a file, the Local FS destination first creates a temporary file with a temporary-file name prefix. To ensure that the second pipeline picks up only fully written output files, we define a file prefix for the output files; it can be something simple.
  • Read Order — We configure the read order so that the files are read in the order they are written.
  • Max File Size — The Avro files are converted to Parquet files in the second pipeline, so we want these output files to be fairly large. Let’s say we want the Parquet files to be roughly 3 GB in size, so we set Max File Size to 3 GB.
  • Important: To transform the Avro files, the Whole File Transformer requires Data Collector memory and storage equivalent to the maximum file size. Consider this requirement when setting the maximum file size.
  • Data Format — We select Avro as the data format and specify the location of the Avro schema to be used (a minimal example schema follows this list).
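For reference, this is roughly what an Avro schema and a file written against it look like. The sketch uses the fastavro package and made-up field names; neither comes from the original post, and the Write Avro pipeline only needs the location of such a schema file.

    from fastavro import writer, parse_schema

    # A tiny, illustrative Avro schema.
    schema = parse_schema({
        "type": "record",
        "name": "Event",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "event_type", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    })

    records = [{"id": 1, "event_type": "click", "ts": 1612137600}]

    # Write an Avro data file; the Write Avro pipeline produces files like this,
    # only much larger (up to the configured Max File Size).
    with open("sample.avro", "wb") as out:
        writer(out, schema, records)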

Pipeline 2: Parquet Conversion Pipeline


The Parquet Conversion pipeline is simple as well. It uses the whole file data format, so no processing of the file data can occur other than the conversion of the Avro files to Parquet.

Note the following configuration details:

In the Directory origin, we configure the following properties:

  • Files Directory — To read the files written by the first pipeline, point this to the directory used in the Local FS destination’s Directory Template property.
  • File Name Pattern — To pick up all output files, use a glob for the file name pattern. We include the output file prefix in the pattern to avoid reading the active temporary files generated by the Local FS destination.
  • Data Format — We use Whole File to stream the data to the Whole File Transformer.

We configure the other properties as needed.

The Whole File Transformer processor converts Avro files to Parquet in memory, then writes temporary Parquet files to a local directory. These whole files are then streamed from the local directory to GCS. When using the Whole File Transformer, we must ensure that the Data Collector machine has the required memory and storage available. Since we set the maximum file size in the Write Avro pipeline to 3 GB, we must ensure that Data Collector has 3 GB of available memory and storage before running this pipeline. (A conceptual sketch of this conversion follows the property list below.)

In the Whole File Transformer, we also have the following properties to configure:

  • Job Type — We’re using the Avro to Parquet job.
  • Temporary File Directory — This is a directory local to Data Collector for the temporary Parquet files; we can use the default.

We can configure other temporary Parquet file properties and Parquet conversion properties as well, but the defaults are fine in this case.
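To make the memory requirement concrete, here is a rough, illustrative sketch of the kind of in-memory Avro-to-Parquet conversion the Whole File Transformer performs. This is not the StreamSets implementation; fastavro and pyarrow are assumed third-party packages.

    from fastavro import reader
    import pyarrow as pa
    import pyarrow.parquet as pq

    def avro_to_parquet(avro_path, parquet_path):
        # All records are materialized in memory before writing, which is why
        # Data Collector needs memory and storage on the order of the maximum
        # file size (3 GB in this example).
        with open(avro_path, "rb") as src:
            records = list(reader(src))
        table = pa.Table.from_pylist(records)
        pq.write_table(table, parquet_path, compression="snappy")

    avro_to_parquet("sample.avro", "sample.parquet")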

Conclusion: In this blog, we learned how to leverage the Whole File Transformer processor to convert Avro files to Parquet and meet our Spark application’s requirements.
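Once the large Parquet files land in GCS, a Spark job can consume them directly. A minimal sketch (the bucket path and column name are hypothetical, and reading gs:// paths assumes the GCS Hadoop connector is configured):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("analyze-parquet").getOrCreate()

    # Large, splittable, columnar files: Spark reads only the columns it needs.
    events = spark.read.parquet("gs://my-bucket/parquet-output/")
    events.groupBy("event_type").count().show()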
