Creating a Table from a Schema defined in YAML

Filipa
3 min read · Feb 7, 2024


There is a lot we can do with YAML files. In this article, I’ll give you some ready-to-use methods to unleash the power of YAML when creating tables.

---
name: customer_feedback
description: Customer feedback and comments
columns:
- name: customer_id
  type: VARCHAR(3)
  description: The unique identifier associated with each customer
- name: comment
  type: VARCHAR(50)
  description: Free text input
- name: created_at
  type: TIMESTAMP WITH TIME ZONE
  description: The timestamp noting when the customer created the feedback entry

Neat, right? Now let’s generate a table out of it. I’m using Amazon Redshift, and let’s assume I have some data stored as Parquet in an S3 bucket.

Me when my senior engineer showed me how to define a schema in YAML (photo by Sare on Flickr)

We first create the TableMeta and ColumnMeta classes. We are essentially just loading the YAML into a dict, but with some good Python OOP on top.
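A minimal sketch of those two classes could look like this; the dataclass layout is just one way to do it, with the fields mirroring the YAML keys above.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnMeta:
    # Mirrors one entry under `columns` in the YAML file.
    name: str
    type: str
    description: str


@dataclass
class TableMeta:
    # Mirrors the top-level keys of the YAML file.
    name: str
    description: str
    columns: List[ColumnMeta] = field(default_factory=list)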

Let’s wrap the YAML loading in a factory method. This is not just fancy Python: we can now easily access the properties and functionality of the TableMeta object.
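A from_file factory could look roughly like this; a sketch assuming PyYAML is installed, with error handling left out and the fields repeated from the dataclass above for completeness.

import yaml
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableMeta:
    name: str
    description: str
    columns: List[ColumnMeta] = field(default_factory=list)

    @classmethod
    def from_file(cls, path: str) -> "TableMeta":
        # Load the YAML schema file and build a TableMeta from its contents.
        with open(path) as f:
            raw = yaml.safe_load(f)
        return cls(
            name=raw["name"],
            description=raw["description"],
            columns=[ColumnMeta(**col) for col in raw["columns"]],
        )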

path = "customer_feedback.yaml"
table_meta = TableMeta.from_file(path)

for column in table_meta.columns:
    print(column.name, column.type)

# Output:
# customer_id VARCHAR(3)
# comment VARCHAR(50)
# created_at TIMESTAMP WITH TIME ZONE

print(table_meta.description)
# Output: Customer feedback and comments

We start to see a SQL command gaining shape. Let’s create our table. We will need database credentials to establish a connection to Redshift. “Always close your database connections afterwards,” my wise teammate once said.
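Here is one possible sketch of that part, assuming psycopg2 is used to talk to Redshift and credentials come from environment variables; the names redshift_connection and create_table_sql are mine.

import os
from contextlib import contextmanager

import psycopg2


@contextmanager
def redshift_connection():
    # Credentials are read from environment variables here; adapt to your secret store.
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=os.environ.get("REDSHIFT_PORT", "5439"),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        yield conn
    finally:
        conn.close()  # always close your database connections afterwards


def create_table_sql(table_meta: TableMeta, schema: str = "public") -> str:
    # Build a CREATE TABLE IF NOT EXISTS statement from the table metadata.
    columns = ",\n    ".join(f"{col.name} {col.type}" for col in table_meta.columns)
    return f"CREATE TABLE IF NOT EXISTS {schema}.{table_meta.name} (\n    {columns}\n)"


with redshift_connection() as conn:
    with conn.cursor() as cursor:
        cursor.execute(create_table_sql(table_meta))
    conn.commit()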

With the snippet above, we initialise our database connection and run the SQL command below in Redshift.

CREATE TABLE IF NOT EXISTS public.customer_feedback (
    customer_id VARCHAR(3),
    comment VARCHAR(50),
    created_at TIMESTAMP WITH TIME ZONE
)

There we are: we’ve materialised our YAML as a table in our database. Let’s populate the public.customer_feedback table with the data we have stored in S3. The IAM role should have at least read access to the S3 bucket (check with your DevOps team which policy to attach).
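A rough sketch of two helpers for this could look like the following; they are written here as plain functions taking a cursor, the names copy_from_s3 and comment_columns are mine, and the descriptions are interpolated naively (quotes in them would need escaping).

def copy_from_s3(cursor, table_meta: TableMeta, s3_path: str, iam_role: str, schema: str = "public") -> None:
    # Empty the table and reload it from the Parquet data in S3.
    cursor.execute(f"TRUNCATE {schema}.{table_meta.name}")
    cursor.execute(
        f"COPY {schema}.{table_meta.name} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET SERIALIZETOJSON"
    )


def comment_columns(cursor, table_meta: TableMeta, schema: str = "public") -> None:
    # Store the YAML descriptions as comments on the table and each of its columns.
    cursor.execute(f"COMMENT ON TABLE {schema}.{table_meta.name} IS '{table_meta.description}'")
    for col in table_meta.columns:
        cursor.execute(
            f"COMMENT ON COLUMN {schema}.{table_meta.name}.{col.name} IS '{col.description}'"
        )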

In addition, we want to ‘comment’ the columns with the provided descriptions, documenting the table in Redshift as well. That’s what the second helper above is for. It’s extremely convenient to have this documentation stored in our database:

  • In database tools like DBeaver and DataGrip (JetBrains), simply hovering the cursor over a column will show you its description; your data analysts will thank you, as this facilitates their exploratory analysis.
  • If you’re using a data catalog tool like DataHub, the columns’ descriptions will naturally be picked up when the software crawls your database.

Our last commands in the database:

TRUNCATE public.customer_feedback
COPY public.customer_feedback FROM 's3://my-bucket/data/filename.parquet' IAM_ROLE 'my_iam_role' FORMAT AS PARQUET SERIALIZETOJSON

COMMENT ON TABLE customer_feedback IS 'Customer feedback and comments'
COMMENT ON COLUMN customer_feedback.customer_id IS 'The unique identifier associated with each customer'
COMMENT ON COLUMN customer_feedback.comment IS 'Free text input'
COMMENT ON COLUMN customer_feedback.created_at IS 'The timestamp noting when the customer created the feedback entry'

Finally, we can chain all the above operations and call them, for instance, in an Airflow task.
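Putting it together, a single callable that chains the sketches above could look like this; the Airflow wiring in the trailing comment is only illustrative.

def load_customer_feedback() -> None:
    # Read the YAML schema, create the table, load the S3 data and document the columns.
    table_meta = TableMeta.from_file("customer_feedback.yaml")
    with redshift_connection() as conn:
        with conn.cursor() as cursor:
            cursor.execute(create_table_sql(table_meta))
            copy_from_s3(
                cursor,
                table_meta,
                s3_path="s3://my-bucket/data/filename.parquet",
                iam_role="my_iam_role",
            )
            comment_columns(cursor, table_meta)
        conn.commit()


# In an Airflow DAG this could be wired up as, for example:
# PythonOperator(task_id="load_customer_feedback", python_callable=load_customer_feedback)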

There is so much more we can do with schema definition in YAML. We can define value expressions to unpack nested data into columns, convert Unix timestamps to dates, define distribution keys, and dynamically trim incoming data if it exceeds what is specified in the schema. Abstract this complexity away in the YAML file and keep your data application code lean, so it can handle thousands of tables generically.
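To give a flavour, a column entry in such an extended schema might carry extra keys like these; every key beyond name, type and description is hypothetical and only meant to illustrate the idea.

- name: created_date
  type: DATE
  description: The date the feedback was created
  expression: payload.created_at   # hypothetical: value expression to unpack nested data
  convert: unix_to_date            # hypothetical: convert a Unix timestamp to a date
  distkey: true                    # hypothetical: use this column as the distribution key
- name: comment
  type: VARCHAR(50)
  description: Free text input
  trim_to_fit: true                # hypothetical: trim incoming data that exceeds VARCHAR(50)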


Filipa

Bite-sized posts documenting my journey as a Python Engineer in the startup realm.