Data processing with Spark: data catalog

Petrica Leuca
Jul 22, 2022

In my previous article I showcased how the Spark API can be used to read and write data and which options there are for saving it. Spark also has a SQL API, which comes into play once we work with the catalog (metastore).

Introduction to the data catalog

So what is this data catalog we all hear about? A data catalog is just what it sounds like: the registry of your data. It contains descriptions, definitions and, where applicable, relations between datasets. In the database world a data catalog comes installed automatically: each database provider has a so-called metadata layer where you can find relevant information about your data and your system. By having a data catalog you can provide that information to your data users, and it becomes easier to run an impact analysis when something changes.
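This metadata layer is easy to see in any database. As a small illustration (the users table below is invented for the example, not part of the Spark setup), SQLite keeps its catalog in the sqlite_master table, which records the name, type and definition of every object in the database:

import sqlite3

# An in-memory SQLite database; every database engine maintains a metadata layer like this
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# sqlite_master is SQLite's built-in catalog: object type, name and the SQL that defined it
for row in conn.execute("SELECT type, name, sql FROM sqlite_master"):
    print(row)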

Spark comes with a default catalog in non-persistent mode, backed by an Apache Derby database. This setup is recommended only for unit testing and local use, because Apache Derby runs in single-user mode (it does not support more than one connection at a time). But let's try it out!

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; it is our entry point to the catalog
spark_session = SparkSession.builder.appName("My Spark ETL Session").getOrCreate()
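With a session in hand, we can ask the catalog what it knows. As a minimal sketch (the demo_numbers table below is made up for illustration), we register a small managed table and then read its metadata back, both through the catalog API and through SQL:

# List the databases known to the catalog; a fresh session only has "default"
print(spark_session.catalog.listDatabases())

# Register a small DataFrame as a managed table so the catalog has something to describe
spark_session.range(5).write.saveAsTable("demo_numbers")

# The catalog now exposes the table and its columns...
print(spark_session.catalog.listTables())
print(spark_session.catalog.listColumns("demo_numbers"))

# ...and the same metadata is reachable through the SQL API
spark_session.sql("SHOW TABLES").show()
spark_session.sql("DESCRIBE TABLE demo_numbers").show()

Because the default catalog is non-persistent, this metadata disappears when the session is stopped; only the table files written to the local spark-warehouse directory remain.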
