What and How to Use the from_json and explode Functions on a JSON Column in CSV Files with PySpark

M. Cagri AKTAS
7 min read · Oct 13, 2023


PySpark is a powerful tool for processing large datasets (you can check my other article if you wish), and it provides various functions for working with structured data, including JSON. In this article, we will explore how to use two essential functions, from_json and explode, to manipulate JSON data stored inside CSV files with PySpark.

If you want to check all the code (I put four files in the repository), you can visit my GitHub repository.

  1. Why are we doing this?
  2. What's the JSON format?
  3. What are the from_json and explode functions in PySpark?
  4. Summary

Why are we doing this?

Actually, I work in Data Engineering, and I set myself a challenge with the tmdb_5000_credits.csv dataset. It felt comfortable because the .csv file looked so basic :). However, there was a problem: the CSV file had an unusual column format due to the JSON data inside it. I wanted to explore and see every cast and crew person! But when I tried to read the CSV file, the cast and crew columns came back as strings, so I couldn't read or explode them, and I couldn't loop over the column to split out every person! I thought there must be a better way, and I found a solution. When I read the .csv file with pandas and then create a PySpark DataFrame, I can easily extract these columns. I got excited when I saw the information about each person row by row, because I only had to call two functions and PySpark did the work for us! Anyway, let's start with the basics of the JSON format. :)

What's the JSON format?

JSON stands for "JavaScript Object Notation". It is a lightweight format for storing and transporting data.

  • Data is in name/value pairs
  • Data is separated by commas
  • Curly braces hold objects
  • Square brackets hold arrays
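
The four rules above fit in a couple of lines of plain Python (a made-up example, not a row from the dataset):

```python
import json

# One JSON object: name/value pairs separated by commas,
# curly braces for the object, square brackets for the array.
person = '{"name": "Sam Worthington", "gender": 2, "roles": ["Jake Sully"]}'
obj = json.loads(person)

print(obj["name"])      # Sam Worthington
print(obj["roles"][0])  # Jake Sully
```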

If I had to list the advantages of the JSON format, they would probably be the following:

Structure: JSON lets you store data in complex shapes that don't fit into traditional flat CSV files. As you can see, you can save a lot of information about your cast in just one row! This gives us a great advantage, and I'll show you how to extract this information person by person.

83 people * 7 fields (like columns) = 581 pieces of information in a single cell!

Flexibility: JSON is a flexible and versatile data format that can handle various data types, including arrays and nested objects, which makes it suitable for a wide range of structures.

Readability: JSON is human-readable, which makes it easier for humans to understand and work with the data directly.

Interoperability: JSON is a widely supported data format, and many programming languages and tools have libraries for parsing and manipulating JSON data. As you will see, we'll use PySpark in Python.

What are the from_json and explode functions in PySpark?

Firstly, we can pull this dataset from Erkan Şirin's dataset repository:

The dataset has four columns. The third column, named "cast", holds JSON records with seven fields each, and the fourth column, named "crew", holds JSON records with six fields each. I have attached a screenshot for you; you can check it again at the top.

Our dataset’s first view:

df.show(10)
df.printSchema()

+--------+--------------------+--------------------+--------------------+
|movie_id| title| cast| crew|
+--------+--------------------+--------------------+--------------------+
| 19995| Avatar|[{"cast_id": 242,...|[{"credit_id": "5...|
| 285|Pirates of the Ca...|[{"cast_id": 4, "...|[{"credit_id": "5...|
| 206647| Spectre|[{"cast_id": 1, "...|[{"credit_id": "5...|
| 49026|The Dark Knight R...|[{"cast_id": 2, "...|[{"credit_id": "5...|
| 49529| John Carter|[{"cast_id": 5, "...|[{"credit_id": "5...|
| 559| Spider-Man 3|[{"cast_id": 30, ...|[{"credit_id": "5...|
| 99861|Avengers: Age of ...|[{"cast_id": 76, ...|[{"credit_id": "5...|
+--------+--------------------+--------------------+--------------------+
only showing top 10 rows

root
|-- movie_id: long (nullable = true)
|-- title: string (nullable = true)
|-- cast: string (nullable = true)
|-- crew: string (nullable = true)
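
The schema above is exactly the problem: cast and crew come back as plain strings. You can reproduce this behaviour with a toy CSV in pandas and the json module (hypothetical data, not the real tmdb file):

```python
import io
import json
import pandas as pd

# Toy stand-in for tmdb_5000_credits.csv (made-up rows, not the real data).
raw = pd.DataFrame({
    "movie_id": [19995],
    "title": ["Avatar"],
    "cast": [json.dumps([
        {"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"},
        {"cast_id": 3, "character": "Neytiri", "name": "Zoe Saldana"},
    ])],
})

buf = io.StringIO()
raw.to_csv(buf, index=False)  # the JSON survives thanks to CSV quoting
buf.seek(0)

df_pd = pd.read_csv(buf)
cell = df_pd.loc[0, "cast"]
print(type(cell))             # <class 'str'> -- the JSON comes back as text

cast = json.loads(cell)       # plain-Python equivalent of what from_json will do
print(cast[1]["name"])        # Zoe Saldana
```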

Then we must create schemas for the cast and crew columns, because we'll extract every field into its own column. I'll create the cast column's schema first!

PS: I inspected the .csv file first and then mapped each field to a DataType.

from pyspark.sql.types import (ArrayType, StructType, StructField,
                               IntegerType, StringType)

cast_schema = ArrayType(StructType([
    StructField('cast_id', IntegerType(), True),
    StructField('character', StringType(), True),
    StructField('credit_id', StringType(), True),
    StructField('gender', IntegerType(), True),
    StructField('id', IntegerType(), True),
    StructField('name', StringType(), True),
]))

from_json function: when we use from_json from Apache Spark's DataFrame API, the function parses the JSON string in a column (using our schema) and creates a new, structured column. Notice the difference between the cast and crew columns below; our cast column is ready to explode now. :)

from pyspark.sql.functions import from_json, explode, col

df = df.withColumn("cast", from_json(col("cast"), cast_schema))

df.show(5)
# The JSON keys are gone from the cast column!
+--------+--------------------+--------------------+--------------------+
|movie_id| title| cast| crew|
+--------+--------------------+--------------------+--------------------+
| 19995| Avatar|[{242, Jake Sully...|[{"credit_id": "5...|
| 285|Pirates of the Ca...|[{4, Captain Jack...|[{"credit_id": "5...|
| 206647| Spectre|[{1, James Bond, ...|[{"credit_id": "5...|
| 49026|The Dark Knight R...|[{2, Bruce Wayne ...|[{"credit_id": "5...|
| 49529| John Carter|[{5, John Carter,...|[{"credit_id": "5...|
+--------+--------------------+--------------------+--------------------+
only showing top 5 rows

Now you can compare the parsed column (cast) with the still-unparsed column (crew).

explode function: when your information sits in an array column, the "explode" function creates one row per element, splitting out each person in that column for you. Do you see what a cool thing it is? :)

df = df.select("movie_id", "title", explode(col("cast")).alias("cast"), "crew")

df.show(5)
# Did you see the cast column? The square brackets are gone now that we've used explode.
+--------+------+--------------------+--------------------+
|movie_id| title| cast| crew|
+--------+------+--------------------+--------------------+
| 19995|Avatar|{242, Jake Sully,...|[{"credit_id": "5...|
| 19995|Avatar|{3, Neytiri, 52fe...|[{"credit_id": "5...|
| 19995|Avatar|{25, Dr. Grace Au...|[{"credit_id": "5...|
| 19995|Avatar|{4, Col. Quaritch...|[{"credit_id": "5...|
| 19995|Avatar|{5, Trudy Chacon,...|[{"credit_id": "5...|
+--------+------+--------------------+--------------------+
only showing top 5 rows

So we're ready to build our columns! Let's create them :)

# Building our columns!
df = df.withColumn("cast_id", col("cast.cast_id"))
df = df.withColumn("character", col("cast.character"))
df = df.withColumn("credit_id", col("cast.credit_id"))
df = df.withColumn("gender", col("cast.gender"))
df = df.withColumn("id", col("cast.id"))
df = df.withColumn("name", col("cast.name"))

# I just want to show cast.*; as you can see, you can easily select all fields with that!
df = df.select("movie_id", "title", col("cast.*"), "crew")

# If you scroll to the right side you'll see the crew column is still different.
+--------+------+-------+-------------------+--------------------+------+-----+------------------+--------------------+
|movie_id| title|cast_id| character| credit_id|gender| id| name| crew|
+--------+------+-------+-------------------+--------------------+------+-----+------------------+--------------------+
| 19995|Avatar| 242| Jake Sully|5602a8a7c3a368553...| 2|65731| Sam Worthington|[{"credit_id": "5...|
| 19995|Avatar| 3| Neytiri|52fe48009251416c7...| 1| 8691| Zoe Saldana|[{"credit_id": "5...|
| 19995|Avatar| 25|Dr. Grace Augustine|52fe48009251416c7...| 1|10205| Sigourney Weaver|[{"credit_id": "5...|
| 19995|Avatar| 4| Col. Quaritch|52fe48009251416c7...| 2|32747| Stephen Lang|[{"credit_id": "5...|
| 19995|Avatar| 5| Trudy Chacon|52fe48009251416c7...| 1|17647|Michelle Rodriguez|[{"credit_id": "5...|
+--------+------+-------+-------------------+--------------------+------+-----+------------------+--------------------+
only showing top 5 rows

root
|-- movie_id: long (nullable = true)
|-- title: string (nullable = true)
|-- cast_id: integer (nullable = true)
|-- character: string (nullable = true)
|-- credit_id: string (nullable = true)
|-- gender: integer (nullable = true)
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- crew: string (nullable = true)

We have done everything, and now we can analyze our dataset very easily. :)

Lastly, let's apply the same process to the crew column. (Note: run this on the original DataFrame, before exploding cast; otherwise every crew row would be duplicated once per cast member.)

crew_schema = ArrayType(StructType([
    StructField("credit_id", StringType(), True),
    StructField("department", StringType(), True),
    StructField("gender", IntegerType(), True),
    StructField("id", IntegerType(), True),
    StructField("job", StringType(), True),
    StructField("name", StringType(), True),
]))

df = df.withColumn("crew", from_json(col("crew"), crew_schema))

df = df.select("movie_id", "title", explode(col("crew")).alias("crew"))

df = df.withColumn("credit_id", col("crew.credit_id"))
df = df.withColumn("department", col("crew.department"))
df = df.withColumn("gender", col("crew.gender"))
df = df.withColumn("id", col("crew.id"))
df = df.withColumn("job", col("crew.job"))
df = df.withColumn("name", col("crew.name"))

df = df.select("movie_id", "title", col("crew.*"))

df.show(5)
df.printSchema()

+--------+------+--------------------+----------+------+----+--------------------+-----------------+
|movie_id| title| credit_id|department|gender| id| job| name|
+--------+------+--------------------+----------+------+----+--------------------+-----------------+
| 19995|Avatar|52fe48009251416c7...| Editing| 0|1721| Editor|Stephen E. Rivkin|
| 19995|Avatar|539c47ecc3a36810e...| Art| 2| 496| Production Design| Rick Carter|
| 19995|Avatar|54491c89c3a3680fb...| Sound| 0| 900| Sound Designer|Christopher Boyes|
| 19995|Avatar|54491cb70e0a26748...| Sound| 0| 900|Supervising Sound...|Christopher Boyes|
| 19995|Avatar|539c4a4cc3a36810c...|Production| 1|1262| Casting| Mali Finn|
+--------+------+--------------------+----------+------+----+--------------------+-----------------+
only showing top 5 rows

root
|-- movie_id: long (nullable = true)
|-- title: string (nullable = true)
|-- credit_id: string (nullable = true)
|-- department: string (nullable = true)
|-- gender: integer (nullable = true)
|-- id: integer (nullable = true)
|-- job: string (nullable = true)
|-- name: string (nullable = true)

Summary

Actually, these are basic functions, but I believe they are essential for data engineering, because when you collect data from logs or other sources you might end up with a dataset in JSON format, who knows? :)

I hope this article helps you… Cheers in peace, everyone… ^^
