Simple ETL using Apache Spark & PostgreSQL

Arif Zainurrohman
Published in Nerd For Tech · 3 min read · Mar 14, 2021

Spark & PostgreSQL

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing.

In this practice, I want to show how well Apache Spark performs on a simple ETL job.

1. Import all the libraries that we need.

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import psycopg2
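
The later steps also assume a live SparkContext named sc. A minimal sketch of that setup, assuming a local run (the app name and master are my assumptions, not from the original):

# Assumed setup (not shown in the original): create a SparkSession and expose
# its SparkContext as `sc`, since step 3 calls SQLContext(sc).
spark = SparkSession.builder.appName("simple-etl").master("local[*]").getOrCreate()
sc = spark.sparkContext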

2. Create a connection to PostgreSQL.

conn = psycopg2.connect(host='localhost', database='postgres', user='postgres', password='postgres')
cur = conn.cursor()
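
One thing worth adding (not in the original): psycopg2 does not close the connection or cursor for you, so it is good practice to release them once the load is done, for example right after the commit in step 6:

# Assumed cleanup step, run after the final commit in step 6.
cur.close()
conn.close()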

3. Load the data.

sqlctx = SQLContext(sc)
pop_data = sqlctx.read.csv('ratings.csv')
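
Because the CSV is read without options, the header row ends up as data and every column arrives as a string named _c0, _c1, and so on, which is why step 5 has to filter out the 'userId' row. A hedged alternative (the options are standard PySpark; using them here is my suggestion):

# Assumed variant: treat the first row as a header and infer column types,
# so the DataFrame gets real column names (userId, movieId, rating, timestamp)
# instead of _c0, _c1, ...
pop_data = sqlctx.read.csv('ratings.csv', header=True, inferSchema=True)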

4. Create a table in PostgreSQL.

cur.execute("""CREATE TABLE movie
(rating numeric NOT NULL,
count_movie integer NOT NULL
);""")

5. Do some transformation. In this case, I simply keep the rows where _c0 is not equal to 'userId' (dropping the header row) and count the movies grouped by rating.

clean = pop_data[pop_data['_c0'] != 'userId']
a = [tuple(x) for x in clean.groupby("_c2").count().collect()]
b = ','.join(['%s'] * len(a))

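To make the placeholder trick concrete: a is the list of (rating, count) tuples collected to the driver, and b is a matching string of %s placeholders, one per row; psycopg2 then adapts each tuple into a parenthesized SQL row. A small illustration with made-up values (the numbers are assumptions, only to show the shapes):

# Illustration with invented rows, not the real ratings data.
example = [('3.5', 100), ('4.0', 250)]            # shape of a
placeholders = ','.join(['%s'] * len(example))    # -> '%s,%s'
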
6. Finally, insert the data into the table, and don't forget to commit.

q = "INSERT INTO movie (rating,count_movie) VALUES {}".format(b)
cur.execute(q, a)
conn.commit()
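
A hedged alternative for the bulk insert: psycopg2 ships execute_values in psycopg2.extras, which expands the VALUES list for you, so the manual placeholder string is not needed (swapping in this helper is my suggestion, not what the article uses):

from psycopg2.extras import execute_values

# Assumed variant of step 6: execute_values expands the single %s placeholder
# into one row per tuple in a, so building b by hand is unnecessary.
execute_values(cur, "INSERT INTO movie (rating, count_movie) VALUES %s", a)
conn.commit()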

In this case, I use 25,000,095 rows of data after cleaning.

image (by author)

Spark splits the work into two jobs: the job with job id = 0 creates only 1 stage with 1 task, and the job with job id = 1 creates 2 stages with 208 tasks.

image (by author)

Summary metrics for job id = 0.

image (by author)

Stage 1 for job id = 1 with 8 tasks.

image (by author)

Stage 2 for job id = 1 with 208 tasks.

image (by author)

Job id = 0 (loading the CSV file) needs 0.3 s.

image (by author)

Job id = 1 (the transformation and the insert into PostgreSQL) needs 8 s.

image (by author)

Result in PostgreSQL.

image (by author)

Conclusion

This practice outlined a basic Apache Spark ETL process using PySpark, from a single source to a database. In my experience, the same job with SSIS, ADF, or a .sh file on RedHat would likely need more than 8.3 s, and it would be interesting to compare them on a more complex case. Here I only used two lines of processing code (one for cleaning and one for transformation), and I am keen to try other transformations.

Reference

Create an Apache Spark ETL: A Simple Guide — Learn | Hevo (hevodata.com)

Apache Spark — Introduction — Tutorialspoint
