How to view all databases, tables, and columns in Databricks

Kristo Raun · Published in Helmes People · May 4, 2021 · 4 min read

Problem statement

Most mainstream database systems offer a simple way of listing all of the columns and tables in a database. A common standard is the information_schema, with views for schemas, tables, and columns.

Using Databricks, you do not get such a ready-made set of objects. What you have instead is:

  • The SHOW DATABASES command for listing all databases/schemas
  • The SHOW TABLES command for listing all tables within a database
  • The SHOW COLUMNS command for listing all columns within a table, which, importantly, only includes the top-level name for nested columns

This short tutorial will show how to get a Spark SQL view representing all column names – including nested columns, with dot notation – and the table and database (schema) they belong to.

We will use Python/PySpark.

The following has been tested to work equally on Databricks 6.6 (Azure) and Databricks 8.1 (Community Edition).

TL;DR: skip to the bottom for the full script.

We can do it for you :) See our data and analytics services.

Imports & functions

First, we import StructType, which we will need in order to obtain nested columns.

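A minimal sketch of the imports (StructField is included as well, since the first function below takes one as input):

```python
from pyspark.sql.types import StructField, StructType
```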

Next, we define two recursive functions.

The first one takes a StructField as input (e.g. each of the StructFields of a schema) and returns either the column name (if it is a first-level column) or a list of column names (if the field is a StructType, i.e. a collection of nested columns).

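A sketch matching that description (the name get_column_names and the prefix argument are illustrative, not necessarily the original code):

```python
def get_column_names(field: StructField, prefix: str = ""):
    """Return the column name for a StructField, or a nested list of
    dot-notation names if the field is itself a StructType."""
    name = prefix + field.name
    if isinstance(field.dataType, StructType):
        # Recurse into the struct, prefixing children with the parent's name.
        return [get_column_names(child, name + ".") for child in field.dataType.fields]
    return name
```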

The second one is a helper function (copied from here) for flattening the list of nested columns into a single-level list.

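A standard recursive version of such a flattening helper:

```python
def flatten(nested):
    """Flatten an arbitrarily nested list into a single-level list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat
```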

List of tables per database

We will first create a list of all databases:

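For example (indexing by position sidesteps the fact that the output column of SHOW DATABASES is named databaseName on older Spark versions and namespace on newer ones):

```python
# Collect all database names into a plain Python list.
databases = [row[0] for row in spark.sql("SHOW DATABASES").collect()]
```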

And we will iterate over this list to create a SQL view, per database, of all the tables contained in that database:

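A sketch of that loop (the show_tables_ view-name prefix is illustrative; toDF normalizes the column names, which differ across Spark versions):

```python
for db in databases:
    # One temp view per database, holding the output of SHOW TABLES.
    (spark.sql(f"SHOW TABLES IN {db}")
        .toDF("database", "tableName", "isTemporary")
        .createOrReplaceTempView(f"show_tables_{db}"))
```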

Example output of this type of view for an example database called aws:

[Screenshot: the per-database view for the aws database, with database, tableName, and isTemporary columns]

The isTemporary flag is important because simply running SHOW TABLES on a database also lists all temporary views. We can filter on isTemporary = false to return only actual tables. That is what the following snippet does: it unions all of these per-database views into a single view, which contains all databases and all tables.

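One way to write this is to generate a single UNION ALL statement over the per-database views created above, filtering out temporary views in each branch:

```python
# Union the per-database views, keeping only real (non-temporary) tables.
union_sql = " UNION ALL ".join(
    f"SELECT database, tableName FROM show_tables_{db} WHERE isTemporary = false"
    for db in databases
)
spark.sql(union_sql).createOrReplaceTempView("allTables")
```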

Output:

[Screenshot: the allTables view listing every database and its tables]

List of Columns

In the last step, we will iterate over every table, obtaining its schema. We will then iterate over every item in the schema, creating a list of columns. Finally, we create a view based on the database, table name, and column name information.

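Putting the pieces together, the final loop could look like this (it builds on the allTables view and the two functions sketched above):

```python
rows = []
for t in spark.table("allTables").collect():
    # Read each table's full schema and flatten its (possibly nested) fields.
    schema = spark.table(f"{t.database}.{t.tableName}").schema
    for column in flatten([get_column_names(f) for f in schema.fields]):
        rows.append((t.database, t.tableName, column))

# Register the result as a queryable view.
spark.createDataFrame(rows, ["database", "tableName", "columnName"]) \
    .createOrReplaceTempView("allColumns")
```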

The resulting view allColumns will include all StructType nested fields in dot notation; see the payload fields from the example:

[Screenshot: the allColumns view, with the nested payload fields listed in dot notation]

Full script and conclusion

As mentioned above, this script works on at least Databricks 6.6 and 8.1 (the latest at the time of writing). It was tested on both smaller and larger instances; the longest execution time was about one minute, on a workspace with many tables with complex schemas and row counts around a hundred million. If your environment holds billions of rows, the script may take longer to run.

Another note, and an important one: the resulting view is not dynamic, i.e. it is not updated when new objects are created or when changes are made to the databases/tables/columns. You will need to run the script again.

Finally, the output could be more descriptive, e.g. by including metadata such as the column type, table partitioning, or the size of the table on disk. You could also include row counts for all tables, but that would be quite expensive for larger tables and would probably increase the execution time considerably.

Summary of future improvements:

  • Dynamic view – updating when schema changes
  • Add metadata, such as:
      • Column type
      • Table size on disk
      • Row count (optional)

Full script:
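Assembled from the sketches above (function and view names are illustrative, and spark refers to the SparkSession that Databricks notebooks provide):

```python
from pyspark.sql.types import StructField, StructType


def get_column_names(field: StructField, prefix: str = ""):
    """Return the column name for a StructField, or a nested list of
    dot-notation names if the field is itself a StructType."""
    name = prefix + field.name
    if isinstance(field.dataType, StructType):
        return [get_column_names(child, name + ".") for child in field.dataType.fields]
    return name


def flatten(nested):
    """Flatten an arbitrarily nested list into a single-level list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# 1. All databases/schemas in the workspace.
databases = [row[0] for row in spark.sql("SHOW DATABASES").collect()]

# 2. One temp view of tables per database, with normalized column names.
for db in databases:
    (spark.sql(f"SHOW TABLES IN {db}")
        .toDF("database", "tableName", "isTemporary")
        .createOrReplaceTempView(f"show_tables_{db}"))

# 3. Union everything into a single view of all non-temporary tables.
union_sql = " UNION ALL ".join(
    f"SELECT database, tableName FROM show_tables_{db} WHERE isTemporary = false"
    for db in databases
)
spark.sql(union_sql).createOrReplaceTempView("allTables")

# 4. Flatten every table's schema into (database, table, column) rows.
rows = []
for t in spark.table("allTables").collect():
    schema = spark.table(f"{t.database}.{t.tableName}").schema
    for column in flatten([get_column_names(f) for f in schema.fields]):
        rows.append((t.database, t.tableName, column))

spark.createDataFrame(rows, ["database", "tableName", "columnName"]) \
    .createOrReplaceTempView("allColumns")
```

Once the script has run, the view can be queried like any other table, e.g. spark.sql("SELECT * FROM allColumns WHERE columnName LIKE '%id%'").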

Leave a comment with your thoughts – is this useful? Do you see a need for any other improvements? What tools do you use for data analytics?
