From CSV to Parquet: A Journey Through File Formats in Apache Spark with Scala

Ajaykumar Dev
3 min read · Apr 1, 2024


Firstly, we will learn how to read data from different file formats.

import org.apache.spark.sql.SparkSession

// Create a SparkSession
val spark = SparkSession.builder()
  .appName("ReadDiffFileFormatsExample")
  .master("local[*]")
  .getOrCreate()

// Read csv file into DataFrame
val csvDf = spark.read.option("header", "true").csv("/FileStore/tables/orders.csv")

// Read Parquet file into DataFrame
val parquetDf = spark.read.parquet("/FileStore/tables/orders.parquet")

// Read JSON file into DataFrame
val jsonDf = spark.read.json("/FileStore/tables/orders.json")

// Read Avro file into DataFrame
val avroDf = spark.read.format("avro").load("/FileStore/tables/orders.avro")

// Show DataFrame sample data from csvDf
csvDf.show()

// Show DataFrame sample data from parquetDf
parquetDf.show()

// Show DataFrame sample data from jsonDf
jsonDf.show()

// Show DataFrame sample data from avroDf
avroDf.show()

// Stop the SparkSession
spark.stop()

1. Creating SparkSession:

  • We start by creating a SparkSession named spark using SparkSession.builder(). This is the entry point to Spark SQL functionality.

2. Reading CSV file into DataFrame:

  • We use spark.read.option("header", "true").csv("/FileStore/tables/orders.csv") to read a CSV file (orders.csv) into a DataFrame named csvDf. Setting the "header" option to "true" treats the first row as the column header.
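
Note that with only the header option, every column is read as a string. Below is a minimal sketch of two common ways to get typed columns; the column names (order_id, order_date, order_status) are assumptions for illustration.

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, TimestampType}

// Option 1: let Spark infer column types (adds an extra pass over the file)
val inferredDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/FileStore/tables/orders.csv")

// Option 2: provide an explicit schema (faster and predictable; column names are assumed)
val ordersSchema = StructType(Seq(
  StructField("order_id", IntegerType, nullable = false),
  StructField("order_date", TimestampType, nullable = true),
  StructField("order_status", StringType, nullable = true)
))

val typedCsvDf = spark.read
  .option("header", "true")
  .schema(ordersSchema)
  .csv("/FileStore/tables/orders.csv")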

3. Reading Parquet file into DataFrame:

  • We read a Parquet file (orders.parquet) into a DataFrame named parquetDf using spark.read.parquet("/FileStore/tables/orders.parquet").
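
Parquet files embed their own schema, so no header or type options are needed; being columnar, they also let Spark read only the columns you select. A short sketch (the selected column names are assumptions):

// The schema travels with the Parquet file itself
parquetDf.printSchema()

// Column pruning: only the selected columns are actually read from disk
val slimDf = spark.read
  .parquet("/FileStore/tables/orders.parquet")
  .select("order_id", "order_status") // assumed column names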

4. Reading JSON file into DataFrame:

  • We read a JSON file (orders.json) into a DataFrame named jsonDf using spark.read.json("/FileStore/tables/orders.json").
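
spark.read.json expects JSON Lines by default, i.e. one JSON object per line. If the file is a single pretty-printed document or a top-level array, the multiLine option is needed; a minimal sketch:

// Default: JSON Lines, one object per line
val jsonLinesDf = spark.read.json("/FileStore/tables/orders.json")

// Pretty-printed JSON or a top-level array: parse the file as a whole
val multiLineDf = spark.read
  .option("multiLine", "true")
  .json("/FileStore/tables/orders.json")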

5. Reading Avro file into DataFrame:

  • We read an Avro file (orders.avro) into a DataFrame named avroDf using spark.read.format("avro").load("/FileStore/tables/orders.avro").
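
The "avro" source is not bundled with core Spark; it lives in the external spark-avro module (on Databricks it is already available). A minimal sketch of one way to pull it in for a local session, where the artifact version is an assumption and should match your Spark and Scala versions:

import org.apache.spark.sql.SparkSession

// spark.jars.packages must be set before the session (and its SparkContext) is created.
// The coordinate below is an assumption: pick the spark-avro version matching your cluster.
val sparkWithAvro = SparkSession.builder()
  .appName("ReadAvroExample")
  .master("local[*]")
  .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
  .getOrCreate()

val avroOrders = sparkWithAvro.read.format("avro").load("/FileStore/tables/orders.avro")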

6. Showing DataFrame sample data:

  • We use the show() method to display a sample of data from each DataFrame (csvDf, parquetDf, jsonDf, avroDf).
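
show() prints 20 rows by default and truncates long cell values; both can be adjusted, for example:

csvDf.show(5)                   // first 5 rows
csvDf.show(5, truncate = false) // first 5 rows without truncating long values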

7. Stopping SparkSession:

  • Finally, we stop the SparkSession using spark.stop() to release the resources.

This code snippet demonstrates how to read various file formats into Spark DataFrames, which can then be used for further processing and analysis. It provides a basic understanding of working with different file formats in Apache Spark using Scala.

Secondly, we will learn how to write data to different file formats.

import org.apache.spark.sql.SparkSession

// Create a SparkSession
val spark = SparkSession.builder()
  .appName("WriteDiffFileFormatsExample")
  .master("local[*]")
  .getOrCreate()

// Read data into DataFrame
val df = spark.read.option("header", "true").csv("/FileStore/tables/input_data.csv")

// Write DataFrame to CSV with partitioning by "date" column
df.write
  .format("csv")
  .partitionBy("date")
  .option("header", "true")
  .save("/FileStore/tables/output_csv")

// Write DataFrame to Parquet with partitioning by "date" column
df.write
  .format("parquet")
  .partitionBy("date")
  .save("/FileStore/tables/output_parquet")

// Write DataFrame to JSON with partitioning by "date" column
df.write
  .format("json")
  .partitionBy("date")
  .save("/FileStore/tables/output_json")

// Write DataFrame to Avro with partitioning by "date" column
df.write
  .format("avro")
  .partitionBy("date")
  .save("/FileStore/tables/output_avro")

// Stop the SparkSession
spark.stop()

In this code snippet:

  • We first create a SparkSession.
  • Then, we read the input data from a CSV file into a DataFrame df.
  • We write df into different file formats (CSV, Parquet, JSON, Avro) with partitioning by the "date" column (a sketch of reading the partitioned output back follows this list).
  • For each file format, we specify the format using .format("format_name"), partitioning column using .partitionBy("column_name"), and output directory using .save("output_path").
  • Finally, we stop the SparkSession to release resources.
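
Partitioning pays off on the read side: filtering on the partition column lets Spark scan only the matching date=... subdirectories instead of the whole output. A minimal sketch against the Parquet output above, where the example date value is an assumption:

import org.apache.spark.sql.functions.col

// Read back the partitioned Parquet output
val partitionedDf = spark.read.parquet("/FileStore/tables/output_parquet")

// Filtering on the partition column prunes to the matching date=... folders
val oneDayDf = partitionedDf.filter(col("date") === "2024-04-01") // assumed date value

oneDayDf.explain() // the physical plan lists PartitionFilters on the date column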

When writing DataFrames to files, you can also control whether to append data to an existing output directory or overwrite it completely using the mode option. The available modes are:

  • append: Appends the output data to the existing directory.
  • overwrite: Overwrites the existing directory with the new output data.
  • ignore: Ignores writing and does nothing if the output directory already exists.
  • error (default): Throws an error if the output directory already exists.

Here is an example:

// Assuming df is your DataFrame

// Write DataFrame to CSV with append mode
df.write
  .format("csv")              // Specify the format as CSV
  .option("header", "true")   // Include headers in the output
  .mode("append")             // Specify the mode as append
  .save("path/to/output/csv") // Specify the output directory

In conclusion, we’ve learned to efficiently read and write data across multiple file formats in Apache Spark using Scala, while also leveraging partitioning to optimize data storage and retrieval.

Thank you for reading!

If you like this post:

  • Please show your support with a clap 👏 or several claps!
  • Feel free to share this guide with your friends.
  • Your feedback is invaluable — it inspires and guides my future posts.
  • Or drop me a message: https://www.linkedin.com/in/ajaykumar-dev/
