Introduction to PySpark JSON API: Read and Write with Parameters

Handling JSON Data in PySpark: Techniques and Strategies

Ahmed Uz Zaman
7 min read · Apr 9, 2023

Intro

PySpark provides a DataFrame API for reading and writing JSON files. You can use the json method of the DataFrameReader (spark.read) to read a JSON file into a DataFrame, and the json method of the DataFrameWriter (df.write) to write a DataFrame back out as a JSON file.

Some Spark-specific features related to JSON include:

  • Automatic schema inference: by default, Spark scans the JSON data and infers the schema from its structure, so you usually do not have to declare it up front.
  • Support for reading and writing compressed JSON files, such as gzip or bzip2.
  • Support for nested JSON structures and arrays, allowing complex JSON data to be processed (see the short sketch right after this list).
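
To make these points concrete, here is a minimal sketch; the file paths are placeholders and nested.json is assumed to contain nested objects or arrays. Spark infers the schema in both cases and decompresses gzip files transparently:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("JSONFeatures").getOrCreate()

# schema inference is automatic; printSchema() reveals nested structs and arrays
df_nested = spark.read.json("path/to/nested.json")
df_nested.printSchema()

# compressed files (for example .json.gz) are decompressed transparently on read
df_gz = spark.read.json("path/to/file.json.gz")
df_gz.show(5)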

The advantages of using Spark for reading and writing JSON files include:

  • Scalability: Spark can process large JSON files distributed across multiple nodes in a cluster, allowing for faster processing times.
  • Flexibility: Spark can handle a variety of data formats, including nested JSON structures and arrays, giving users greater flexibility in their data processing workflows.
  • Performance optimizations: Spark includes several optimizations for reading and writing data, such as data partitioning and data compression, which can improve performance when working with large JSON datasets.

Examples: Read and Write JSON using PySpark into a DataFrame

How to read a single JSON file

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("ReadJSON").getOrCreate()

# read a JSON file into a DataFrame
df = spark.read.json("path/to/file.json")

# show the first 10 rows of the DataFrame
df.show(10)

How to read multiple JSON files

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("ReadMultipleJSONFiles").getOrCreate()

# define a list of file paths
file_paths = ["path/to/file1.json", "path/to/file2.json", "path/to/file3.json"]

# read the JSON files into a single DataFrame
df = spark.read.json(file_paths)

# show the first 10 rows of the DataFrame
df.show(10)

Note that when reading multiple JSON files, Spark will automatically concatenate the data from all files into a single DataFrame, assuming that the schema of each file is the same. If the schemas differ, you may need to specify the schema explicitly using the reader's schema method, as shown in the next section.

Read a JSON file with a user-defined Schema

Suppose we have a JSON file data.json with the following contents:

{"name": "John", "age": 30, "city": "New York"}
{"name": "Jane", "age": 25, "city": "Los Angeles"}
{"name": "Bob", "age": 40, "city": "Chicago"}

We can use the following code to read the file with a user-defined schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create a SparkSession
spark = SparkSession.builder.appName("ReadJSONWithInferredSchema").getOrCreate()

# define the schema for the JSON data
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

# read the JSON file with the specified schema
df = spark.read.schema(schema).json("data.json")

# show the DataFrame
df.show()

In this example, we first define the schema for the JSON data using the StructType and StructField classes. We specify the field names and data types for each field.

Then we pass the schema to the reader with the schema method before calling json, so the file is read with the specified schema. The resulting DataFrame will have columns name, age, and city, with data types StringType, IntegerType, and StringType, respectively.

If the JSON file contains fields that are not defined in the schema, they are simply dropped. If a value cannot be parsed into the declared type, the behavior depends on the mode option described later: the default PERMISSIVE mode nulls out the offending fields, DROPMALFORMED discards the record, and FAILFAST raises an error.
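
As a small illustration of the first point, here is a hedged sketch (not part of the original example) that reads the same data.json with a schema that omits city, which simply yields a DataFrame without that column:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# schema that leaves out "city": the extra field in data.json is dropped on read
partial_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

spark.read.schema(partial_schema).json("data.json").show()
# only the name and age columns appear in the output

# with FAILFAST, a record that cannot be parsed into the declared types raises an error instead
# spark.read.schema(partial_schema).option("mode", "FAILFAST").json("data.json")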

Read all JSON files within a Directory

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("ReadAllJSONFiles").getOrCreate()

# define the directory path
dir_path = "path/to/json/files/*.json"

# read all JSON files in the directory into a single DataFrame
df = spark.read.json(dir_path)

# show the first 10 rows of the DataFrame
df.show(10)

In this example, we define the directory path using a wildcard (*) to include all JSON files in the specified directory. Then we call read.json with that glob pattern to read every matching file into a single DataFrame. You can also pass the directory path itself without the wildcard, and Spark will pick up the JSON files inside it.

Note that when reading all files from a directory, Spark will automatically concatenate the data from all files into a single DataFrame, assuming that the schema of each file is the same. If the schemas differ, you may need to specify the schema explicitly using the reader's schema method.

Write a JSON file to a Directory

To write a DataFrame as a JSON file, you can use the write method of the DataFrame, specifying the file format as "json". Here's an example:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("WriteJSONFile").getOrCreate()

# create a sample DataFrame
data = [("John", 30, "New York"), ("Jane", 25, "Los Angeles"), ("Bob", 40, "Chicago")]
df = spark.createDataFrame(data, ["name", "age", "city"])

# define the directory path where you want to save the file
dir_path = "path/to/json/files"

# write the DataFrame as a JSON file to the specified directory
df.write.json(dir_path)

# show the first 10 rows of the DataFrame
df.show(10)

In this example, we first create a sample DataFrame with three columns name, age, and city. Then we define the directory path where we want to save the file.

Finally, we use the write method with the json file format to write the DataFrame to the specified directory. Spark writes one part file per partition of the DataFrame, plus a _SUCCESS marker file, so the directory will contain files with names like part-00000-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.json (the xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx part is a unique identifier) rather than a single file named after the path.
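
If you need the output to land in a single part file, one common approach (shown here as a brief sketch, not something the original example does) is to reduce the DataFrame to one partition before writing:

# coalesce to a single partition so the output directory contains one part file
# (fine for small outputs; all data is funneled through a single task)
df.coalesce(1).write.mode("overwrite").json(dir_path)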

Writing the above JSON with Options

Note that you can also specify other options in the write method, such as mode (append, overwrite, ignore, or errorifexists, which is the default), compression (for example gzip, snappy, or none), and dateFormat (the default is yyyy-MM-dd). For example, to compress the output file using gzip, you can use the following code:

df.write.option("compression", "gzip").json(dir_path)

Parameters/ Options while Reading JSON

When reading JSON files using PySpark, you can specify various parameters using options in the read method. Here are some of the most commonly used parameters:

  • path: The path to the JSON file or directory containing JSON files. This parameter is required.
  • schema: The schema to use when reading the JSON file. You can define it with the StructType class and pass it to the reader's schema method; if you do not supply one, Spark infers the schema automatically from the data.
  • multiLine: Specifies whether the JSON file contains multiline records. Set this parameter to True if the JSON file contains records that span multiple lines.
  • mode: Specifies the behavior when encountering corrupt records. Possible values are PERMISSIVE (default), DROPMALFORMED, and FAILFAST.
  • columnNameOfCorruptRecord: The name of the column to use for corrupt records. This parameter is used only when the mode parameter is set to PERMISSIVE.
  • dateFormat: The date format to use when parsing dates in the JSON file. The default format is yyyy-MM-dd.
  • timestampFormat: The timestamp format to use when parsing timestamps in the JSON file. The default in recent Spark versions is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].

Here’s an example of how to read a JSON file with some of these parameters:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("ReadJSONFileWithParameters").getOrCreate()

# define the JSON file path
json_file_path = "path/to/json/file.json"

# define the schema for the JSON data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

# read the JSON file with parameters
df = spark.read \
    .option("multiLine", "true") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "corrupt_record") \
    .option("dateFormat", "MM/dd/yyyy") \
    .schema(schema) \
    .json(json_file_path)

# show the DataFrame
df.show()

In this example, we pass the path to the JSON file directly to the json method and define the schema using the schema method. We also set the multiLine, mode, columnNameOfCorruptRecord, and dateFormat options. Note that you can specify other parameters as well, depending on your needs.
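
One detail worth adding, as a hedged sketch rather than part of the original example: for the corrupt_record column to actually show up in the result, the user-defined schema must include it as a StringType field with the same name passed to columnNameOfCorruptRecord:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# include the corrupt-record column in the schema so PERMISSIVE mode has somewhere to put bad rows
schema_with_corrupt = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("corrupt_record", StringType(), True)
])

df = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "corrupt_record") \
    .schema(schema_with_corrupt) \
    .json(json_file_path)

# rows that failed to parse keep their raw text in corrupt_record; clean rows leave it null
df.filter(df.corrupt_record.isNotNull()).show(truncate=False)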

Parameters/ Options while Writing JSON

When writing JSON files using PySpark, you can specify various parameters using options in the write method. Here are some of the most commonly used parameters:

  • path: The path to the output JSON file or directory. This parameter is required.
  • mode: Specifies the behavior when the output path already exists. Possible values are append, overwrite, ignore, and errorifexists (also spelled error), which is the default.
  • compression: Specifies the compression codec to use when writing the output file. Possible values include none (default), gzip, bzip2, snappy, lz4, and deflate.
  • dateFormat: The date format to use when writing dates in the output file. The default format is yyyy-MM-dd.
  • timestampFormat: The timestamp format to use when writing timestamps in the output file. The default in recent Spark versions is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].
  • lineSep: The separator to use between lines in the output file. The default is the newline character (\n).
  • encoding: The character encoding to use when writing the output file. The default is UTF-8.

Here’s an example of how to write a PySpark DataFrame as a JSON file with some of these parameters:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("WriteJSONFileWithParameters").getOrCreate()

# create a sample DataFrame
data = [("John", 30, "New York"), ("Jane", 25, "Los Angeles"), ("Bob", 40, "Chicago")]
df = spark.createDataFrame(data, ["name", "age", "city"])

# define the output JSON file path
json_file_path = "path/to/output/file.json"

# write the DataFrame as a JSON file with parameters
df.write \
    .option("compression", "gzip") \
    .option("dateFormat", "MM/dd/yyyy") \
    .option("lineSep", "\r\n") \
    .json(json_file_path)

# show the first 10 rows of the DataFrame
df.show(10)

In this example, we pass the output path directly to the json method and set the compression, dateFormat, and lineSep options. Note that you can specify other parameters as well, depending on your needs.
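
The mode setting from the list above is not shown in that example; as a brief sketch, it is set with the writer's mode method before calling json:

# overwrite any existing output at the target path instead of failing
df.write.mode("overwrite").option("compression", "gzip").json(json_file_path)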

Conclusion

In this article, we have learned how to use the PySpark JSON API to read and write data, which we can then use for data transformations, data analysis, data science, and more. Do check out my other articles on the PySpark DataFrame API, SQL basics, and built-in functions. Enjoy reading.

Ahmed Uz Zaman

Lead QA Engineer | ETL Test Engineer | PySpark | SQL | AWS | Azure | Improvising Data Quality through innovative technologies | linkedin.com/in/ahmed-uz-zaman/