Analyzing 100GB Datasets With Spark-SQL on a Laptop

Katherine Wang
Oct 9, 2021

Recently, I found myself needing to analyze the United States mortgage application and approval data from the Consumer Financial Protection Bureau. While I had experience with SQL, mostly using MySQL, I had never worked with a dataset as big as this one (25.6M rows, ~10GB), much less on my MacBook Air with only 8GB of memory. So I consulted the Internet on how to load 10GB of data into MySQL and found a few links that I thought might help. But when I tried it out, the load was incredibly slow and eventually failed after several hours of waiting. I knew that, even if it had succeeded, queries would have taken a very long time to run.

Then I was introduced to Spark (more accurately, Spark-SQL), and it handled everything I needed. I was able to load the data in 3 minutes and return most queries in a few seconds. Most importantly, when I duplicated the data 10 times to get a 110GB file, I was still able to load the converted data and run queries on my laptop in less than a minute (after a one-time conversion to ORC format, described below).

Apache Spark, often shortened to Spark, is an analytics engine created at UC Berkeley’s AMPLab by Dr. Matei Zaharia and popularized by his company Databricks. It is perhaps the most popular big data analytics tool used in companies of all sizes. However, scouring the Internet, I couldn’t find a single article that advocated its use in my predicament: to analyze large datasets on a weak laptop. That’s why I decided to write about it.

Let’s get started!

Download Spark

First, go to spark.apache.org and download Spark according to the site's directions. For most of us, it is enough to click the link (which reads spark-3.1.2-bin-hadoop3.2.tgz) under "Download Apache Spark". Note that as new releases become available, the link may change.

Navigate to where you saved the download; you should see the newly downloaded file spark-3.1.2-bin-hadoop3.2.tgz. Unzip it, and you will get a folder called spark-3.1.2-bin-hadoop3.2.
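If you prefer to unzip from the Terminal instead of double-clicking the archive, the standard tar command below should do it (run it from whatever folder the download landed in):

$ tar -xzf spark-3.1.2-bin-hadoop3.2.tgz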

Spark only has a command-line interface, so you may want to rename the folder to a shorter name, though this is not necessary. Personally, I changed it to “spark” for simplicity. Also, feel free to move the folder wherever you want. From now on, let’s assume spark is located at ~/spark (on a Mac laptop).

Go to ~/spark, and create an empty folder dataset. Then type ./bin/spark-shell to start the engine!

$ cd ~/spark
$ mkdir dataset
$ ./bin/spark-shell
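(Optional: if the shell ever runs out of memory on a small laptop, spark-shell accepts the standard --driver-memory flag, so you can start it with something like the command below. The 4g value is only a suggestion; adjust it to whatever your machine can spare.)

$ ./bin/spark-shell --driver-memory 4g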

You will see a screen like the following:

The Spark start-up screen, ending with the "scala> " prompt.

The output screen may look complicated, but you do not need to be concerned with the messages at the top for now. Just start using Spark by typing commands after the shell prompt at the bottom. Spark is written in a programming language called Scala, which is why the shell prompt is “scala>”.
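To confirm everything is working, you can type a trivial command after the prompt. In spark-shell, a SparkSession has already been created for you and stored in the variable spark, so the line below should simply print the Spark version (3.1.2 in our case):

spark.version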

Using Spark-SQL

Before we start, download the file to be analyzed from here. Unzip the file and move it to ~/spark/dataset, i.e., the folder we just created. The unzipped file is named 2020_lar.txt (a pipe-delimited text file), and it contains all mortgage applications made in the United States in 2020.

There are 5 steps we need to perform to get our first query running. In total, they take about 10 minutes:

  1. Read in the CSV file. This step creates a Spark dataframe and took me about 3 minutes.
  2. Write the dataframe out as an ORC file. ORC format is an optimized data format that helps speed up analysis. This is a crucial step for performance in handling large datasets. This step took me about 4 minutes.
  3. Read in data again from the new ORC file. This step is very fast.
  4. Create a database view (named mortgage). This is like a DB table/view, against which queries can be run.
  5. Run SQL queries.

The first two steps only need to be done once. The next time you start Spark to rerun your queries (or run new ones), you can begin from step 3. Let's look at each step in detail:

Read Data in from CSV File: Use the following command to load 2020_lar.txt.

val mortgage = spark.read.format("csv").option("sep", "|").option("inferSchema", "true").option("header", "true").load("./dataset/2020_lar.txt")

This command may seem cryptic; in fact, a lot of Spark commands look like this. However, it is mostly self-explanatory: it reads a file in CSV format with field separator "|"; the file has a header row, and Spark should infer a schema (the name and type of each column) from the data; the input file is found at ./dataset/2020_lar.txt. Once read, the result is returned as a dataframe in the variable mortgage.

More importantly, you usually only need to copy the command and replace a few parts: the input file path, the separator, and the variable name to assign the dataframe to. Other alterations can be made depending on your situation, but these are the basics.
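If the one-liner is hard to read, the same command can be spread across several lines purely for readability; this is just a reformatted version of the command above (paste it into the shell as one block, or use the :paste REPL command so it is treated as a single expression):

val mortgage = spark.read.
  format("csv").                  // the input is a delimited text file
  option("sep", "|").             // fields are separated by "|"
  option("inferSchema", "true").  // let Spark guess each column's type
  option("header", "true").       // the first row holds the column names
  load("./dataset/2020_lar.txt")  // path to the mortgage file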

This step took 3 minutes to finish.

Write ORC File: To improve query performance, it is important to convert and store the data in ORC format. The following command achieves that by writing the dataframe (i.e., the CSV's contents) out in the new format. This may take another 3–4 minutes to finish. The option mode("overwrite") means any existing output is overwritten. By the way, the "2020_lar.orc" written here is a folder, not a single file; if you look inside, you will see the many files Spark has written. We do not need to be concerned with such details if we only want to run queries.

mortgage.write.format("orc").mode("overwrite").save("./dataset/2020_lar.orc")

Read ORC File: The file we just wrote needs to be read back into Spark, so run this next:

val mortgage = spark.read.orc("./dataset/2020_lar.orc")

Create Database View: To run queries in Spark, you will need to create a database table. In Spark's terms, this is a view, or TempView. That can be done with the command below; the table/view name, mortgage, is the quoted string in the command.

mortgage.createOrReplaceTempView("mortgage")
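If you want to double-check that the view was registered (an optional sanity check, not required for the steps below), you can ask Spark to list its tables; the temporary view mortgage should appear in the output:

spark.sql("show tables").show()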

Run Spark-SQL Queries: This is where the real fun begins. As its name suggests, Spark-SQL supports ANSI SQL syntax. To run a query, you type in

spark.sql("[ANSI SQL Statement]").show([n])

Replace the quoted text with the SQL statement you want to run, and [n] with the number of rows of the result you want to show. If you do not pass in a number, Spark defaults to showing 20 rows.
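One small detail: show() shortens long string values by default. If a column gets cut off, show() also accepts a second argument that turns truncation off, for example:

spark.sql("[ANSI SQL Statement]").show(20, false)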

Let’s run our first query to see how many rows the mortgage dataset has:

spark.sql("select count(*) as record_count from mortgage").show()

This returns the number of rows in the view mortgage (25.6M). To find loan application count grouped by race (applicant_race_1), we run this SQL:

spark.sql("select applicant_race_1, count(*) from mortgage group by 1 order by 1").show(5)

We are only printing the first 5 rows in the result set.

We are done, in less than 12 minutes! The actual running time is in fact less than 10 minutes. As a tip, you can also put the 6 commands in a text file and paste them all into Spark at once, instead of copying and pasting each command individually.
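For example, if you save the commands in a file such as ~/spark/queries.scala (the file name here is just an example), the Scala shell's :load command will run them in order:

:load ./queries.scala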

Running Queries with Joins

Spark-SQL also supports joins. If there is another CSV file you want to use, return to step 1 and load it just as we loaded the mortgage file. However, if the other file is small, you may not need to write it out in ORC format, so you can skip steps 2 and 3 above and carry out only steps 1 and 4.

For our case, note that the output of the last query has a column applicant_race_1, which displays race IDs instead of race names. The mortgage data also has a column called action_taken, which designates the action taken on each application (Accepted, Declined, etc.). To display names instead of IDs, I loaded two small CSV files (race.csv and action.csv, whose contents are included at the end of this article). I then created two views, race and action, so they can be used in joins.

val race = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("./dataset/race.csv")
race.createOrReplaceTempView("race")

val action = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("./dataset/action.csv")
action.createOrReplaceTempView("action")

race.printSchema()

The last command, printSchema(), prints the table schema. It is usually useful right after you have loaded a CSV file, so you can check whether Spark has correctly interpreted each column (it usually does).
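For race.csv above, the output should look roughly like this (the exact nullability flags may differ):

root
 |-- id: integer (nullable = true)
 |-- race: string (nullable = true)
 |-- real_race: string (nullable = true)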

With these preparations, we can now ask for mortgage application count grouped by race and action.

val summary_by_race = spark.sql("select race.race, action.action, count(*) as count from mortgage inner join race on race.id=mortgage.applicant_race_1 inner join action on action.id=mortgage.action_taken group by race.race, action.action order by race.race, action.action")
summary_by_race.show(5)

As you can see, we are using standard SQL syntax.

Finally, to save your work to an external file, use the following write command. Similar to the read command, this includes specifications for headers, separators, and output file name.

summary_by_race.coalesce(1).write.option("header", "true").option("sep", ",").mode("overwrite").csv("./dataset/mortgage_by_race_2020.csv")

There are a few caveats worth explaining. After the command runs, if you check your disk, ./dataset/mortgage_by_race_2020.csv is actually a folder. Only when you go into the folder do you see a CSV file with a cryptic name; that is the output file you want to (rename and then) use. For those curious as to why this is the case: Spark is a distributed processing engine, so it runs multiple processes in parallel, which is another reason why it is so fast. Usually, each process writes its own CSV file, which is why Spark creates a folder to host all the files it writes. The call coalesce(1) tells Spark to collapse the data into a single partition before writing, which is why there is only one file in the folder ./dataset/mortgage_by_race_2020.csv.
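If you want to verify the export without leaving Spark, you can simply read the folder back; Spark treats the folder as a single CSV source and picks up the part file inside it:

spark.read.option("header", "true").csv("./dataset/mortgage_by_race_2020.csv").show(5)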

Running Queries with a 110GB Dataset

As a test of Spark-SQL's processing power, I duplicated my original file 10 times to get an input file 110GB in size. This time, I kept the CSV (and ORC) files on an external hard drive, which makes reads, writes, and queries slower. Nonetheless, as the following screenshot shows, loading the data took less than an hour, and writing to ORC took 76 minutes.

After the ORC file is created, subsequent analysis is incredibly fast: it took only 50 seconds to go from loading the ORC file to finishing the first query!

Appendix

The content of action.csv and race.csv is included below, formatted as comma-separated values:

action.csv

id,action
1,Accepted
2,Not Accepted
3,Declined
4,Withdraw
5,Incomplete
6,Purchased
7,Preapproval Denied
8,Preapproval not Accepted

race.csv

id,race,real_race
1,Native,Native
2,Asian,Asian
21,Indian,Asian
22,Chinese,Asian
23,Filipino,Asian
24,Japanese,Asian
25,Korean,Asian
26,Vietnamese,Asian
27,Other Asian,Asian
3,Black,Black
4,Pacific Islander,Pacific Islander
41,Hawaiian,Pacific Islander
42,Guamanian/Chamorro,Pacific Islander
43,Samoan,Pacific Islander
44,Other Pacific Islander,Pacific Islander
5,White,White
6,Not Available,Not Available
7,Not Applicable,Not Applicable

Acknowledgment: I’d like to thank Dr. Ganesh Mani and Shubha Chakravarthy for introducing me to the FFIEC’s HMDA mortgage dataset, and for their continued support.
