Build a data catalog in 383 lines of Python

Ashish Singal
Pycob
6 min read · Mar 10, 2023


Data analysts, data scientists, engineers, and even business users frequently need to dive into the details of their organization’s data. Vendors in the space make data catalogs sound really fancy — for example, Atlan says “a data catalog is a collaborative workspace for diverse data users to navigate through the data ecosystem in an enterprise.” What? More simply, a data catalog is a list of my datasets and a description of what’s in them.

Here are some ways a data catalog can help people:

  1. Finding the right tables & columns (data discovery)
  2. Debugging data quality issues
  3. Writing SQL queries

There are many data catalogs available already — both open source (Amundsen, DataHub) and commercial (Atlan, Alation, Collibra). But they are often heavyweight and difficult or impossible to customize for your own organization. And some of them cost upwards of $100k per year!

What if, instead, you could take a fully functional (albeit basic) data catalog written in less than 400 lines of pure Python and customize it for your organization? Let’s dive in to see how.

Links: Live App | Github | Youtube | Pycob | Google Slides


App demo

For our demo, we’ll use the Northwind database. According to the linked GitHub repo —

The Northwind sample database was provided with Microsoft Access as a tutorial schema for managing small business customers, orders, inventory, purchasing, suppliers, shipping, and employees. Northwind is an excellent tutorial schema for a small-business ERP, with customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting.

The tables and relationships are visualized below.

Our data catalog app has two main pages — one that shows a list of tables, and one that shows the contents of each table.

The first page is a list of tables, along with some basic info, such as the table description, the row count, and when the table was last updated. It also has a link to see details about each table.

If we click on the Orders table, it brings us to the table detail page. We can see some basic info, like the row count (830).

Below that, we see a list of the columns in this table, along with some useful column metadata, such as the name, the data type, the min value, max value, number of null values, etc.

Finally, we have a few rows of sample data. This gives users a sense of what’s in the table.

Lastly, there’s a link at the bottom to “Edit Table Metadata.” The first part of the edit page allows users to edit table-level information.

The second part allows users to edit each of the columns in a single screen. This is useful for a workflow where a data analyst wants to add information for multiple columns at once.

Note also the “Refresh” button at the top. It refreshes the catalog by re-reading the database and updating the tables, columns, and statistics.

Architecture

Let’s look at how this all comes together from a technical perspective.

We have four main data structures — Datasets, Dataset, Table, and Column.

Additionally, we have five pages — tables, refresh, edit, table_detail, and update.
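The Dataset dataclass is shown in the next section. The Table and Column shapes below are a sketch inferred from the fields the app displays (names, descriptions, row counts, column stats) — not the exact definitions from the repo.

from dataclasses import dataclass, field
import datetime
import pandas as pd

@dataclass
class Column:
    # Field names are assumptions based on the stats shown on the detail page
    name: str
    readable_name: str
    description: str
    data_type: str
    min_value: str | None = None
    max_value: str | None = None
    null_count: int = 0

@dataclass
class Table:
    name: str
    readable_name: str
    description: str
    type: str                      # "Fact" or "Dimension", per the edit form
    row_count: int
    last_updated: datetime.datetime
    columns: list[Column] = field(default_factory=list)
    sample: pd.DataFrame | None = None  # first few rows, shown as sample data
    # Table also exposes to_dataframe(), which flattens the columns for display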

Code

We’ll step through some important pieces of code next.

First, let’s look at the dataclass for Dataset. We have four fields, including the list of tables. We also have a couple of functions — the one shown below turns the list of tables into a DataFrame.

@dataclass
class Dataset:
    name: str
    readable_name: str
    description: str
    tables: list[Table]

    def to_dataframe(self) -> pd.DataFrame:
        records = []

        for table in self.tables:
            row = {}
            row['dataset_name'] = self.name
            row['dataset_readable_name'] = self.readable_name
            row['dataset_description'] = self.description
            row['table_name'] = table.name
            row['table_readable_name'] = table.readable_name
            row['table_description'] = table.description
            row['row_count'] = table.row_count
            row['table_last_updated'] = table.last_updated.strftime("%Y-%m-%d %H:%M")
            row['table_type'] = table.type

            records.append(row)

        return pd.DataFrame(records)
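The tables page later calls to_dataframe on the top-level Datasets object too; a plausible sketch (the aggregation logic is an assumption) simply stacks the per-dataset frames:

@dataclass
class Datasets:
    datasets: list[Dataset]

    def to_dataframe(self) -> pd.DataFrame:
        # Assumption: one row per table, across all datasets
        return pd.concat([d.to_dataframe() for d in self.datasets], ignore_index=True)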

We then load our cloud pickle, leveraging Pycob functionality that makes it easy to save and read pickles from the cloud —

# PyCob has a built-in cloud pickle feature that allows you to save and load objects to the cloud.
dsets: Datasets
try:
    print("Loading from cloud pickle")
    dsets = app.from_cloud_pickle("northwind.pkl")
except Exception:
    # If the pickle doesn't exist yet, fall back to None so the tables page
    # can prompt for a refresh. (This fallback is inferred from the
    # "Refresh required" check later in the code, not shown in the excerpt.)
    dsets = None
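On the refresh path, the rebuilt object presumably gets written back with the matching save call. A sketch of the save side, assuming the counterpart method is named to_cloud_pickle:

# After rebuilding dsets from the database during a refresh, persist it so
# the other pages can load it. (The method name and call site are assumptions.)
app.to_cloud_pickle(dsets, "northwind.pkl")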

Let’s look at the guts of some of the refresh code. We use SQLite-specific SQL commands to grab info about the tables and columns.

for table_name in tables['name']:
    # Get table metadata

    # Get columns
    columns = pd.read_sql_query(f"PRAGMA table_info('{table_name}')", conn)

    # Remove the "Picture" or "Photo" column from the sample data
    columns = columns[columns['name'] != 'Picture']
    columns = columns[columns['name'] != 'Photo']

    page.add_pandastable(columns)

    sample = pd.read_sql_query(f"SELECT * FROM \"{table_name}\" LIMIT 10", conn)
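The tables DataFrame driving the loop above comes from SQLite’s system catalog; a minimal way to build it (the repo’s exact query may differ):

import sqlite3
import pandas as pd

conn = sqlite3.connect("northwind.db")  # path is illustrative

# sqlite_master lists every table in a SQLite database
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type = 'table'", conn)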

Now that the data is stored in a cloud pickle, and we’ve read that pickle back, let’s display the tables. add_datagrid does the dirty work and renders a beautiful table widget on the app; the {dataset_name} and {table_name} placeholders in the row-action URL below are filled in from each row’s values.

def tables(server_request: cob.Request) -> cob.Page:
    page = cob.Page("Tables")
    page.add_header("Tables")

    if dsets is None:
        page.add_alert("Refresh required", "Error", "red")
        return page

    action_buttons = [
        cob.Rowaction(label="View", url="/table_detail?dataset_name={dataset_name}&table_name={table_name}", open_in_new_window=False),
    ]

    page.add_datagrid(dsets.to_dataframe(), action_buttons=action_buttons)

    return page

Next, let’s look at the meat of the table_detail page. Here, we make three calls to add_pandastable and we’ve got our table metadata, list of columns, and sample data rendered on the page.
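The snippet below assumes dset and table_index have already been resolved from the query string; a plausible sketch of that lookup (the params accessor and the Datasets traversal are assumptions, not the repo’s exact code):

# Hypothetical resolution of the ?dataset_name=...&table_name=... parameters
dataset_name = server_request.params("dataset_name")
table_name = server_request.params("table_name")

dset = next(d for d in dsets.datasets if d.name == dataset_name)
table_index = [t.name for t in dset.tables].index(table_name)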

table_df = dset.to_dataframe().iloc[table_index, :].reset_index()
table = dset.tables[table_index]

table_df.columns = ["Field", "Value"]
# Make the field names more readable
table_df['Field'] = table_df['Field'].map(lambda x: x.replace("_", " ").title())

page.add_pandastable(table_df)

page.add_header("Columns", size=2)
page.add_pandastable(table.to_dataframe())

page.add_header("Sample Data", size=2)
page.add_pandastable(table.sample)

page.add_link("Edit Table Metadata", f"/edit?dataset_name={dset.name}&table_name={table_name}")

And finally, the edit page. Here, we have the table-specific fields as well as a loop that renders editable fields for each column.

with page.add_card() as card:
    card.add_header("Edit")
    card.add_header(f"{dset.readable_name} - {table.readable_name}", size=2)

    with card.add_form(action="/update") as form:
        form.add_formhidden("dataset_name", value=dataset_name)
        form.add_formhidden("table_name", value=table_name)
        form.add_formtext("Dataset Readable Name", "dataset_readable_name", value=dset.readable_name)
        form.add_formtext("Table Readable Name", "table_readable_name", value=table.readable_name)

        form.add_formselect("Table Type", "table_type", options=["Fact", "Dimension"], value=table.type)
        form.add_formtextarea("Table Description", "table_description", placeholder="Table Description", value=table.description)

        for column in table.columns:
            form.add_text(f"Column <code>{column.name}</code>")
            with form.add_container(grid_columns=2) as container:
                container.add_formtext("Readable Name", f"column_{column.name}_readable_name", value=column.readable_name)
                container.add_formtextarea("Description", f"column_{column.name}_description", value=column.description, placeholder="Column Description")

        form.add_formsubmit("Save")
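The update page that this form posts to isn’t shown in the excerpt. A hypothetical sketch of how it might read the submitted fields back — the params accessor, the dsets traversal, and the save call are all assumptions:

def update(server_request: cob.Request) -> cob.Page:
    # Hypothetical handler: look up the table being edited, then copy form values back
    dataset_name = server_request.params("dataset_name")
    table_name = server_request.params("table_name")

    dset = next(d for d in dsets.datasets if d.name == dataset_name)
    table = next(t for t in dset.tables if t.name == table_name)

    dset.readable_name = server_request.params("dataset_readable_name")
    table.readable_name = server_request.params("table_readable_name")
    table.type = server_request.params("table_type")
    table.description = server_request.params("table_description")

    for column in table.columns:
        column.readable_name = server_request.params(f"column_{column.name}_readable_name")
        column.description = server_request.params(f"column_{column.name}_description")

    # Persist the edit so the other pages see it (save call is an assumption)
    app.to_cloud_pickle(dsets, "northwind.pkl")

    page = cob.Page("Saved")
    page.add_text("Metadata updated.")
    return page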

Deployment

Now that we have the app running on our local machine, let’s deploy it to Pycob’s cloud hosting so other people in our organization can access it. Deploying is super simple: once you have your API key, it takes just one step —

python3 -m pycob.deploy

Wait about five minutes, and the app is live on the server and ready to go!

Using the Data Catalog app at your organization

This entire app is 100% free and open source, and it can be run locally within your environment or hosted externally through Pycob. There are a few modifications and enhancements you may want to make to adapt it to your organization —

  1. Customize the connection string. This one is mandatory — you don’t want Northwind’s data :). See the sketch after this list.
  2. Add more calculations to the columns and tables.
  3. Schedule the updates to run automatically.
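For the first item, pointing the catalog at your own database might look like the sketch below. SQLAlchemy, the Postgres URL, and the 'orders' table are illustrative assumptions — the demo uses sqlite3 directly.

from sqlalchemy import create_engine
import pandas as pd

# Hypothetical: connect to your own warehouse instead of the Northwind SQLite file
engine = create_engine("postgresql://user:password@host:5432/analytics")  # illustrative URL
conn = engine.connect()

# Note: PRAGMA table_info is SQLite-specific; on Postgres you'd read
# information_schema.columns instead
columns = pd.read_sql_query(
    "SELECT column_name, data_type FROM information_schema.columns "
    "WHERE table_name = 'orders'",
    conn,
)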
