List Tables & Databases in Apache Spark

Swaroop
4 min readJul 26, 2020

--

Tables (or views) are needed if you want to perform SQL like queries on data in Spark.

Databases in Spark

Tables exist in Spark inside a database. So, We need to first talk about Databases before going to Tables. If we don’t specify any database, Spark uses the default database.

We can see the list of available databases with listDatabases:

>>> spark.catalog.listDatabases()
[Database(name='default', description='default database', locationUri='file:/~/projects/pyspark-youtube/spark-warehouse')]

As we have not created any new databases, we only see the default database in the list.

We can also use Spark SQL to get the same information.

>>> spark.sql('show databases').show()
+------------+
|databaseName|
+------------+
| default|
+------------+

Apart from the ‘default’ database, there is also a global database global_temp which is not listed here. There can be tables created in the global database too.

When you start a Spark application, default is the database Spark uses. We can see this with currentDatabase

>>> spark.catalog.currentDatabase()
'default'

We can create new databases as well. This can be done only with SQL. There is no equivalent catalog API for this.

>>> spark.sql('create database freblogg')

And now, listing databases will show the new database as well.

>>> spark.sql('show databases').show()
+------------+
|databaseName|
+------------+
| default|
| freblogg|
+------------+

Create Tables in Spark

I have a file, shows.csv with some of the TV Shows that I love.

# shows.csv
Name,Release Year,Number of Seasons
The Big Bang Theory,2007,12
The West Wing,1999,7
The Secret Circle,2011,1
The Good Place,2016,4
Sherlock,2010,4,
Death Note,2006,1
How I Met Your Mother,2005,9
Friends,1994,10
Castle,2009,8
Elementary,2012,7

To create a view called shows from that, we will do

>>> tv_shows = spark.read.option('header', 'true').csv('shows.csv')
>>>
>>> tv_shows.show(2)
+-------------------+------------+-----------------+
| Name|Release Year|Number of Seasons|
+-------------------+------------+-----------------+
|The Big Bang Theory| 2007| 12|
| The West Wing| 1999| 7|
+-------------------+------------+-----------------+
only showing top 2 rows
>>>
>>> tv_shows.createOrReplaceTempView('shows')

That creates a table/view called shows in the default database.

We can create a view in the global_temp table as well as follows:

>>> tv_shows.createOrReplaceGlobalTempView('shows')

Tables in Spark

Tables in Spark can be of two types. Temporary or Permanent. Both of these tables are present in a database. To list them we need to specify the database as well.

>>> spark.catalog.listTables('default')
[Table(name='shows', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

Note that in the above list database is shown as None . This is because the default database is more metaphorical. The database by itself actually does not exist.

We can do this listing from Spark SQL as well.

>>> spark.sql('show tables from default').show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | shows| true|
+--------+---------+-----------+

To list tables from global_temp , we can do it in the same way.

>>> spark.sql('show tables from global_temp').show()
+-----------+---------+-----------+
| database|tableName|isTemporary|
+-----------+---------+-----------+
|global_temp| shows| true|
| | shows| true|
+-----------+---------+-----------+

Note that, it also shows the tables in default database when we do this.

That is how we can see the tables present in various various databases of Spark.

I have made a youtube video about this topic. Do check it out for some more details and examples:

Attribution

Spark Logo from https://commons.wikimedia.org/wiki/File:Apache_Spark_logo.svg

Used under License: Apache software foundation / Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

--

--