Catalog Tables in Redshift: Enhancing Governance of Your Data Warehouse

Practical strategies to improve data efficiency and control.

Alice Thomaz
7 min read · Jan 28, 2024

With the evolution of our DBT project in 2023, we recognized the essential need to enhance the governance of our Data Warehouse. In this article, I intend to explore the role of system catalog tables in Redshift. These tables are crucial for effective environment control, playing a vital role in maintaining and governing the data that fuels our system.

Our approach splits these tables into two groups. Some system tables are fully accessible, presenting no incompatible types or functions. For those that are incompatible, our team developed specific procedures that take a daily snapshot of the source data, establishing a solid foundation for generating insights and giving us tighter control over information critical to our project.

Given the importance of these catalog tables in the context of Redshift, it becomes clear that adopting them not only eases environment maintenance but also strengthens data governance, ensuring the consistency and reliability essential to the ongoing success of our Data Warehouse. By providing a comprehensive, structured view of the metadata and objects in the database, these tables act as a central directory, offering essential information about schemas, tables, views, and functions.

>> Compatible Tables

The tables highlighted below work smoothly with conventional SQL commands, such as CREATE TABLE and INSERT INTO, so they can be used in DBT in the usual way. We therefore treat them as regular DBT sources and materialize them as tables, consistently reflecting the current state of the environment.
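
To illustrate, wrapping one of these tables in DBT can be as simple as a plain model over a source. The sketch below is illustrative only, assuming a source named pg_catalog that points at the system table; the actual source definitions live in the linked repository.

```sql
-- models/catalog/pg_class_info.sql — minimal sketch (model and source names are assumptions)
{{ config(materialized='table') }}

select
    reloid,
    relname,
    relnamespace,
    relcreationtime
from {{ source('pg_catalog', 'pg_class_info') }}
```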

The following examples are presented in a standard format: the table name, a brief description, the columns used by the team, the GitHub link to the DBT source code, and the link to the official AWS documentation.

pg_class_info

A Redshift system table that provides the creation date of tables in the cluster. Essential for monitoring older tables in the testing environment that could be deprecated.

  • reloid: Table ID.
  • relname: Table name.
  • relnamespace: Schema ID.
  • relcreationtime: Date and time of table creation.

GitHub Link — DBT: pg_catalog_pg_class_info_source

AWS Link: PG_CLASS_INFO

pg_namespace

A Redshift system table that provides the ID and name of schemas present in the cluster. Essential for correlating data with pg_class_info.

  • oid: Schema ID.
  • nspname: Schema name.

GitHub Link — DBT: pg_catalog_pg_namespace_source

AWS Link: PG_NAMESPACE
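
Combined, these two tables make it easy to flag aging objects. A minimal sketch, assuming test schemas share a "test" prefix and a 90-day threshold:

```sql
-- Illustrative query: test-schema tables created more than 90 days ago
select
    n.nspname as schema_name,
    c.relname as table_name,
    c.relcreationtime
from pg_class_info c
join pg_namespace n on n.oid = c.relnamespace
where n.nspname like 'test%'
  and c.relcreationtime < dateadd(day, -90, current_date)
order by c.relcreationtime;
```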

svv_datashare_objects

A Redshift system table that provides information about datashares, including their respective schemas and objects. In the Afya ecosystem, we maintain an AWS account per business unit, making datashare management crucial for the company’s routine processes.

  • share_type: Datashare type, either OUTBOUND or INBOUND.
  • share_name: Datashare name.
  • object_type: The type of the specified object, such as a schema, table, or view.
  • object_name: Name of the object prefixed with its schema, for example schema.view or schema.table.
  • producer_account: The ID of the producer account that shared the data.
  • producer_namespace: The unique identifier of the data-sharing producer cluster.
  • include_new: Specifies whether tables, views, or UDFs created in the specified schema in the future should be added to the datashare. This parameter applies only to OUTBOUND datashares, and specifically to schema-type objects in the datashare.

GitHub Link — DBT: pg_catalog_svv_datashare_objects_source

AWS Link: SVV_DATASHARE_OBJECTS
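
As an illustration, a simple filter on this table lists everything currently exposed to consumer accounts:

```sql
-- Illustrative query: objects currently shared through OUTBOUND datashares
select
    share_name,
    object_type,
    object_name,
    include_new
from svv_datashare_objects
where share_type = 'OUTBOUND'
order by share_name, object_type, object_name;
```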

svv_table_info

A Redshift system table that provides detailed information about the tables present in the cluster. We use it primarily to monitor environment storage and to keep track of table sizes and their distribution.

  • database: Database name.
  • schema: Schema name.
  • table_id: Table ID.
  • table: Table name.
  • encoded: Whether any column has compression encoding defined.
  • diststyle: Distribution style or distribution key column, if the distribution key is defined.
  • sortkey1: First column in the sort key, if the sort key is defined.
  • max_varchar: Size of the largest column using the VARCHAR data type.
  • sortkey1_enc: Compression encoding of the first column in the sort key, if the sort key is defined.
  • sortkey_num: Number of columns defined as sort keys.
  • size: Table size, in 1 MB data blocks.
  • pct_used: Percentage of available space used by the table.
  • unsorted: Percentage of unsorted rows in the table.
  • stats_off: Number indicating how stale the table statistics are; 0 is up to date, 100 is out of date.
  • tbl_rows: Total number of rows in the table. This value includes rows marked for deletion but not yet vacuumed.
  • skew_sortkey1: Ratio between the size of the largest non-sort key column and the size of the first column in the sort key, if the sort key is defined. Use this value to assess the effectiveness of the sort key.
  • skew_rows: Ratio between the number of rows in the slice with the most rows and the number of rows in the slice with the fewest rows.
  • estimated_visible_rows: Estimated rows in the table. This value doesn’t include rows marked for deletion.
  • vacuum_sort_benefit: The maximum estimated percentage improvement in query performance from running VACUUM SORT.

GitHub Link — DBT: pg_catalog_svv_table_info_source

AWS Link: SVV_TABLE_INFO
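
As an example, the query below, an illustrative sketch, surfaces the largest tables together with basic maintenance indicators:

```sql
-- Illustrative query: the 20 largest tables with maintenance indicators
select
    "schema",
    "table",
    size       as size_mb,   -- size in 1 MB blocks
    tbl_rows,
    unsorted,                -- percentage of unsorted rows
    stats_off,               -- 0 = fresh statistics, 100 = stale
    skew_rows
from svv_table_info
order by size desc
limit 20;
```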

>> Incompatible Tables

The tables highlighted below do not work with conventional SQL commands, such as CREATE TABLE and INSERT INTO, because they expose types or functions those commands do not support. We therefore created procedures that take a daily snapshot of the source data, and we run them via the pre_hook of a conventional source model in DBT.

The following examples are presented in a standard format: the table name, a brief description, the columns used by the team, the GitHub link to the table's DDL, the GitHub links to the procedure and the DBT source code, and the link to the official AWS documentation.

Note: It is important to use the DDL to create the table in Redshift before executing the procedure for the first time.
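
To make the pattern concrete, here is a minimal, illustrative version of the workflow using pg_tables: a target table, a procedure that rebuilds the snapshot through a row-by-row cursor loop (a common workaround when a direct INSERT INTO ... SELECT from these catalog tables fails), and a DBT model that calls the procedure via pre_hook. All object names here are assumptions, not the team's actual code; the real implementation is in the linked repository.

```sql
-- 1. DDL, run once before the first procedure call (names are illustrative).
create table if not exists admin.tb_pg_tables (
    schemaname    varchar(128),
    tablename     varchar(128),
    tableowner    varchar(128),
    snapshot_date date
);

-- 2. Procedure: rebuild the snapshot. The row-by-row loop avoids the
--    "incompatible types or functions" error raised by a direct
--    INSERT INTO ... SELECT from this catalog table.
create or replace procedure admin.update_tb_pg_tables()
as $$
declare
    rec record;
begin
    truncate table admin.tb_pg_tables;
    for rec in select schemaname, tablename, tableowner from pg_tables
    loop
        insert into admin.tb_pg_tables
        values (rec.schemaname, rec.tablename, rec.tableowner, current_date);
    end loop;
end;
$$ language plpgsql;
```

The DBT source model then refreshes the snapshot before reading it:

```sql
-- models/catalog/admin_tb_pg_tables.sql — illustrative sketch
{{ config(
    materialized='table',
    pre_hook='call admin.update_tb_pg_tables()'
) }}

select * from {{ source('admin', 'tb_pg_tables') }}
```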

pg_tables

A system table that stores information about the tables in the database. We use it to track the table owner, helping ensure that production objects are owned by a default service user rather than a specific team member.

  • schemaname: Name of the schema containing the table.
  • tablename: Table name.
  • tableowner: Name of the table owner.

GitHub Link — DDL: tb_pg_tables

GitHub Link — Procedure: update_tb_pg_tables

GitHub Link — DBT: admin_tb_pg_tables_source

AWS Link: PG_TABLES
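
With the snapshot in place, an ownership audit becomes a simple filter. The sketch below is illustrative; the "prod%" schema prefix and the dbt_user owner name are assumptions:

```sql
-- Illustrative query: production tables not owned by the default service user
select schemaname, tablename, tableowner
from admin.tb_pg_tables
where schemaname like 'prod%'
  and tableowner <> 'dbt_user'
order by schemaname, tablename;
```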

pg_views

A system table that stores information about the views in the database. We use it to track the view owner, helping ensure that production objects are owned by a default service user rather than a specific team member.

  • schemaname: Name of the schema containing the view.
  • viewname: View name.
  • viewowner: Name of the view owner.

GitHub Link — DDL: tb_pg_views

GitHub Link — Procedure: update_tb_pg_views

GitHub Link — DBT: admin_tb_pg_views_source

AWS Link: PG_VIEWS

svv_redshift_columns

A system table that lists all available columns in the database. We use this source to monitor column encoding, optimizing data compression, and to check the distkey and sortkey configuration to ensure efficient distribution. For a deeper understanding, I recommend reading the article “Best practices with Amazon Redshift: Architecture, organization and performance optimization”.

Additionally, it is crucial that all tables fed by DBT are documented, and the “remarks” column makes that easy to track. When the project's schema.yml is filled out, DBT automatically applies comments to the Redshift tables; for this to take effect, the persist_docs field must be enabled in your project's dbt_project.yml. More details about this field can be found in the article “Practical Guide to DBT: Organizing the project folders”.

  • database_name: The name of the database where the table containing the columns exists.
  • schema_name: The name of the schema of the table.
  • table_name: The name of the table.
  • column_name: The name of a column.
  • ordinal_position: The position of the column in the table.
  • data_type: The data type of the column.
  • is_nullable: A value that defines if a column is nullable.
  • encoding: The encoding type of the column.
  • distkey: A value that is true if this column is the distribution key for the table and false otherwise.
  • sortkey: A value that specifies the order of the column in the sort key.
  • remarks: Documentation for the column.

GitHub Link — DDL: tb_svv_redshift_columns

GitHub Link — Procedure: update_tb_svv_redshift_columns

GitHub Link — DBT: admin_tb_svv_redshift_columns_source

AWS Link: SVV_REDSHIFT_COLUMNS
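
With this snapshot available, finding undocumented columns is a simple filter. The query below is an illustrative sketch; the "analytics" schema filter is an assumption:

```sql
-- Illustrative query: columns still missing documentation
select
    table_name,
    column_name,
    data_type
from admin.tb_svv_redshift_columns
where schema_name = 'analytics'
  and (remarks is null or remarks = '')
order by table_name, ordinal_position;
```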

svv_redshift_tables

A system table that lists all available tables in the database. We use it both to catalog the object type, whether view or table, and to check whether it is documented in the “remarks” field, as mentioned in the previous topic.

  • database_name: The name of the database where the specified table exists.
  • schema_name: The name of the schema for the table.
  • table_name: The name of the table.
  • table_type: The type of table. Possible values are views and tables.
  • remarks: Documentation for the table.

GitHub Link — DDL: tb_svv_redshift_tables

GitHub Link — Procedure: update_tb_svv_redshift_tables

GitHub Link — DBT: admin_tb_svv_redshift_tables_source

AWS Link: SVV_REDSHIFT_TABLES

>> Catalog Model

In this project's folder of the GitHub repository, I will add an example of a DBT model that integrates the data from all the tables mentioned above, aiming to build effective cataloging and control of the Redshift environment.

GitHub Link — DBT: dc_data_catalog
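
For reference, here is a simplified sketch of what such a model can look like, joining object types, owners, and sizes. The source names below are assumptions; the real version is in dc_data_catalog:

```sql
-- Illustrative catalog model combining the sources above
select
    t.schema_name,
    t.table_name,
    t.table_type,
    t.remarks      as table_documentation,
    o.tableowner   as table_owner,
    i.size         as size_mb,
    i.tbl_rows
from {{ source('admin', 'tb_svv_redshift_tables') }} t
left join {{ source('pg_catalog', 'svv_table_info') }} i
  on i."schema" = t.schema_name
 and i."table"  = t.table_name
left join {{ source('admin', 'tb_pg_tables') }} o
  on o.schemaname = t.schema_name
 and o.tablename  = t.table_name
```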

Developing a monitoring process for your environment is crucial to ensuring effective governance of the stored data. The further an environment expands without proper control, the harder it becomes for the team to deal with ungoverned legacy data. It is therefore vital to maintain active controls that catch discrepancies along the way.

When working with the standard Redshift catalog tables, we faced a significant challenge with the ones that are incompatible with conventional SQL commands, and we found no documentation or support to guide us through the task.

Although we received recommendations to migrate to compatible tables, certain information was only available in the incompatible ones. Through my team's joint efforts, we developed the logic as a procedure that met our needs. I hope this solution can help other teams avoid facing the same issue.

If you have suggestions or comments about the content, please feel free to reach out to me on LinkedIn.

Portuguese version: https://medium.com/@alice_thomaz/1e43eeb0e0bc

GitHub Project: Project 1: Redshift Catalog
