How we use knowledge graphs to manage Snowflake RBAC at data.world

The basics

At data.world, we talk a lot about knowledge graphs and the difference they make in the data catalog space. They allow us to model data and knowledge in ways that other catalogs can’t, and they’ve helped us solve many complex problems, including those around Data Cloud Migration.

One such problem was managing RBAC for our internal use of Snowflake. Although it started as an experiment to see if we could, the use case quickly turned into something that has paid dividends many times over, and future plans have us excited about taking it even further. This blog provides step-by-step instructions on how we did it.

Managing Granular Access Control

It should be no surprise that we’re pretty big fans of Snowflake here at data.world.

It’s allowed us to scale in ways that other database solutions just aren’t able to match. And while we’ve had no issues on the storage side, there was one problem we kept running into as we grew as a company — access control.

That’s not to say Snowflake doesn’t have good access control. Quite the opposite: with Snowflake’s RBAC model, we’re able to finely tune roles and privileges to match exactly how we want our access control to work.

If you’re unfamiliar with RBAC (Role-Based Access Control), here’s a quick explanation:

Snowflake’s RBAC allows you to control who has access to different parts of your data platform. You can create roles that define what users are allowed to do on a granular level. For example, you can create a “data analyst” role that allows users to query and analyze data, but not make changes to the data itself. Then, you can assign users to these roles based on their job responsibilities. So, if you have a data analyst on your team, you can assign them to the “data analyst” role. This ensures that they have the necessary permissions to do their job, while also maintaining the security and integrity of your data.
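
To make that concrete, here’s a minimal sketch of what a role like that could look like in Snowflake SQL. The database, schema, role, and user names are purely illustrative:

    -- Create a read-only analyst role (name is illustrative)
    CREATE ROLE IF NOT EXISTS DATA_ANALYST;

    -- Let the role see and query a hypothetical ANALYTICS database
    GRANT USAGE ON DATABASE ANALYTICS TO ROLE DATA_ANALYST;
    GRANT USAGE ON SCHEMA ANALYTICS.PUBLIC TO ROLE DATA_ANALYST;
    GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.PUBLIC TO ROLE DATA_ANALYST;

    -- Assign the role to a user based on their job responsibilities
    GRANT ROLE DATA_ANALYST TO USER JANE_DOE;

Because the role only has USAGE and SELECT grants, a user assigned to it can query the data but can’t modify it, which matches the “data analyst” behavior described above.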

With how fine-grained we can get with RBAC, we’re able to make sure only the right people have access to the right things. Mapping all the roles and privileges properly can be pretty complex, but the process is still fairly easy to handle in Snowflake.

The problem we ran into was around viewing and maintaining these wide-ranging relationships.

The granularity that makes it so useful can also make it difficult to keep track of who has access to what data and, most importantly, how.

Understanding a complex RBAC model can be a challenge even for technical admins, let alone for the rest of a large organization. To illustrate a few of the challenges:

  • Complexity: Imagine a company with multiple departments and thousands of employees. Each employee has a specific role and requires different levels of access to various systems and data. The RBAC model for this company could be very elaborate, with many roles, permissions, and access levels to keep track of.
  • Multiple layers of roles and permissions: Consider a healthcare organization with different departments (e.g. cardiology, pediatrics, etc.) and multiple teams or groups within each department. There could be a top-level role that grants access to all patient records, but then there are sub-roles within each department that grant access to specific patient records or types of data (e.g. lab results, imaging reports, etc.).
  • Lack of consistency: RBAC models can use different terminology and concepts, which can lead to confusion and inconsistencies. For example, older roles and privileges may use a different naming scheme than newer ones.
  • Lack of documentation: RBAC models are not always well-documented, making it difficult for users to understand how the model is structured and how to properly assign roles and permissions. An organization might have an RBAC model that was developed years ago and has since been updated many times, but there is no documentation to explain how it has changed over time or what the current roles and permissions are.

We ran into each of these issues and needed a way to solve them. So we created a list of essentials that we wanted:

First: We need to automate generating and maintaining documentation.

  • If we have to write the documentation by hand or keep it updated ourselves, it’s going to get outdated pretty quickly. So we need an automated process to write and update it.

Second: There has to be a way to visualize our access control to better understand how things are connected.

  • When things are written out in tables, it’s hard to understand exactly how something is related to something else. A visual would allow us to understand things from a high-level perspective.

Third: Given any user and/or any object, we need to know what path of roles and privileges connects them.

  • Because the paths work in both directions, we can run audits on database objects as well as users.

Lastly: Given a role or privilege, we need to know all the objects and/or users connected to it (even when multiple hops/paths are involved).

  • This way we can audit roles/privileges to find unused or inconsistent definitions.

So we turned to the knowledge graph to solve our problem.

A brief interlude; the knowledge graph

Feel free to skip this section if you’re already familiar with knowledge graphs, but it seems worth giving a short intro for readers who may not be.

You can think of a knowledge graph as a mind map or a spider web that connects pieces of information in a meaningful way. Just as a mind map helps you organize your thoughts and see how different ideas are related, a knowledge graph helps computers organize and understand information. By linking related information in a structured way, a knowledge graph helps computers better understand the relationships between different data points, enabling them to provide more accurate and insightful responses to queries.

Thanks to that graph structure, a knowledge graph is a natural model for the advanced permission structures that get created in Snowflake.

Interesting note: you can apply the following process at a general level to build almost anything into a knowledge graph, no matter how complex the topic.

Building the graph

Step 1: Defining the model

You can think of the model as a sort of blueprint. We use this blueprint to define the structure of our data so we can follow that pattern when bringing it in. data.world’s built-in model already has definitions for the common database objects, so we just needed to add definitions for the RBAC concepts.

Our objects are Users, Roles, and Privileges. We also explicitly defined the interactions between them.
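
In RDF terms, a minimal sketch of that blueprint could look something like the following. The namespace and property names are illustrative, not our exact ontology:

    @prefix rbac: <https://example.org/rbac/> .   # illustrative namespace
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # The three RBAC object types
    rbac:User      a owl:Class .
    rbac:Role      a owl:Class .
    rbac:Privilege a owl:Class .

    # A role is granted to a user or to another role
    rbac:grantedTo    a owl:ObjectProperty ; rdfs:domain rbac:Role .
    # A role holds one or more privileges
    rbac:hasPrivilege a owl:ObjectProperty ; rdfs:domain rbac:Role ; rdfs:range rbac:Privilege .
    # A privilege applies to a database object (database, schema, table, ...)
    rbac:appliesTo    a owl:ObjectProperty ; rdfs:domain rbac:Privilege .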

Here’s what this looks like in Gra.fo, a modeling tool that makes this process really easy.

Snowflake RBAC model in Gra.fo

We extracted this model and added it to our data.world catalog.

Step 2: Grabbing the data

Thankfully, Snowflake makes it easy to grab information about roles and granted privileges; it can all be accomplished through SQL queries. And while you could use those queries to pull one-off data extracts, we decided to take advantage of data.world’s advanced virtualization capabilities here. By virtualizing the RBAC queries, we get live views of the data, which means the information is always up to date and no one has to maintain or refresh extracts.
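
For example, two of the views we virtualize could look roughly like the queries below, which pull from Snowflake’s ACCOUNT_USAGE share (a sketch; the real views include a few more columns):

    -- Which roles are granted to which users
    SELECT grantee_name AS user_name,
           role         AS role_name,
           granted_by,
           created_on
    FROM snowflake.account_usage.grants_to_users
    WHERE deleted_on IS NULL;

    -- Which privileges are granted to which roles, and on which objects
    SELECT grantee_name AS role_name,
           privilege,
           granted_on   AS object_type,
           name         AS object_name,
           table_catalog,
           table_schema
    FROM snowflake.account_usage.grants_to_roles
    WHERE deleted_on IS NULL;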

RBAC views virtualized in data.world from Snowflake

Step 3: Adding the data to the knowledge graph

Now that we had the model defined and the data gathered, we used SPARQL CONSTRUCT queries to add the data to the knowledge graph. For those new to the term: what SQL is to relational databases, SPARQL is to the knowledge graph. (Check out this blog for more information.) The queries let us transform the RBAC data into catalog resources representing each role, grant, and privilege. While it’s a bit of work to set up the query the first time, once it’s in place it automatically maintains and updates the catalog for us.
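
Here’s a simplified sketch of what one of those CONSTRUCT queries could look like. The rbac: namespace, the ?row bindings, and the property names are illustrative; the exact predicates depend on how the virtualized views are mapped into the graph:

    PREFIX rbac: <https://example.org/rbac/>   # same illustrative namespace as the model above

    CONSTRUCT {
      ?role a rbac:Role ;
            rbac:hasPrivilege ?priv .
      ?priv a rbac:Privilege ;
            rbac:privilegeName ?privName ;
            rbac:appliesTo ?object .
    }
    WHERE {
      # One ?row per grant in the virtualized "grants to roles" view (simplified)
      ?row rbac:role_name   ?roleName ;
           rbac:privilege   ?privName ;
           rbac:object_name ?objectName .

      # Mint stable IRIs for each role, privilege, and object
      BIND(IRI(CONCAT(STR(rbac:), "role/", ENCODE_FOR_URI(?roleName)))      AS ?role)
      BIND(IRI(CONCAT(STR(rbac:), "privilege/", ENCODE_FOR_URI(?privName))) AS ?priv)
      BIND(IRI(CONCAT(STR(rbac:), "object/", ENCODE_FOR_URI(?objectName)))  AS ?object)
    }

A similar query over the “grants to users” view constructs the rbac:grantedTo edges between roles and users (and between roles and other roles).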

Looking at the results

Now that we had all the data on data.world, we got to the fun part — looking through to see what we had.

Overview of the metadata collected in data.world (exact numbers changed for security reasons)

Starting with the basics, we immediately had catalog resources in the UI that showed us all the accounts, roles, and privileges we had defined.

We actually spotted quite a few old roles and unneeded privileges here that we were able to clean up immediately.

It was pretty wild to look at the number of privileges we had already accumulated. But the best part of this setup is knowing that the numbers will automatically update and maintain themselves without us having to do anything.

Using Eureka Explorer to see the data

Next, we wanted to see how the data showed up in data.world’s Eureka Explorer tool. We typically use the Eureka Explorer to show data lineage, but by making a few updates to our knowledge graph model, we were able to display the lineage of Access Control as well. Here’s an example of us answering the question “who has the ability to create a database on our server?”

Eureka Explorer depicting the lineage of the CREATE DATABASE privilege with a focus on the SYSADMIN role

Notice how, even in a simple use case like this, we start to see the complexity that exists in access control. While some users might have the CREATE DATABASE privilege through the SYSADMIN role, other users have it indirectly through the ACCOUNTADMIN role, which is granted a USAGE privilege on the SYSADMIN role.
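
In graph terms, that question becomes a path query. Using the same illustrative rbac: vocabulary as above, a rough sketch of how you could ask it in SPARQL:

    PREFIX rbac: <https://example.org/rbac/>   # illustrative namespace from the model above

    SELECT DISTINCT ?user ?directRole
    WHERE {
      # Roles that directly hold the CREATE DATABASE privilege (e.g. SYSADMIN)
      ?directRole rbac:hasPrivilege ?priv .
      ?priv rbac:privilegeName "CREATE DATABASE" .

      # Follow role-to-role grants for any number of hops until we reach a user
      # (this is how ACCOUNTADMIN users show up, via USAGE on SYSADMIN)
      ?directRole rbac:grantedTo+ ?user .
      ?user a rbac:User .
    }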

By just looking at the tables in Snowflake it would have been difficult to quickly understand how things are related, but with this view, even non-technical users can get an overview of how things are connected.

Let’s look at another, more complex example. What if we wanted to know, given a particular role, all the users and privileges attached to it? While I could try to show you the Explorer view here, it would probably be too overwhelming, so let’s take a look at the role’s resource page in the catalog instead.

Catalog resource page for the FOOBAR role showing the summary of connected roles, privileges, and accounts

With 1678 privileges branching from this one role, we can guess that any account connected to it can do quite a bit. And with 4 roles attached on either side of it, there is likely a series of hops to get from a Snowflake user to what they can actually do.
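
For reference, the same kind of summary could be pulled with a query along these lines (FOOBAR and rbac:roleName are placeholders; in practice the catalog generates this resource page for us):

    PREFIX rbac: <https://example.org/rbac/>   # illustrative namespace from the model above

    SELECT (COUNT(DISTINCT ?priv) AS ?privileges)
           (COUNT(DISTINCT ?user) AS ?accounts)
    WHERE {
      ?role rbac:roleName "FOOBAR" .

      # Privileges the role holds directly, plus those it inherits from
      # roles granted to it (one or more hops down the hierarchy)
      { ?role rbac:hasPrivilege ?priv . }
      UNION
      { ?granted rbac:grantedTo+ ?role ; rbac:hasPrivilege ?priv . }

      # Accounts that end up with this role, directly or through other roles
      OPTIONAL { ?role rbac:grantedTo+ ?user . ?user a rbac:User . }
    }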

Bonus: Integrating into the larger knowledge graph

Before we finish the post, we wanted to show off just one more thing that we’ve been experimenting with internally. While it’s great that you can see an overview of Snowflake RBAC, we wanted to go a little bit further. What if we could integrate the information above into our larger data.world catalog?

What if we could weave in more pieces of knowledge to really see how everything fits together?

Well, here’s what that can start to look like:

data.world Eureka Explorer summary connecting RBAC with Snowflake and Tableau metadata

Here we’re able to see the catalog linking Tableau objects like Workbooks and Views to the Snowflake columns that power them, and to the RBAC data that controls access to those columns.

Why is this so powerful?

Because now the catalog can answer questions like:

I’m an admin for a Tableau Dashboard and the table that’s powering my view needs to be updated. Who has write access to be able to update the source table?

Or also:

As a Snowflake admin, I need to remove a user from the account (not server). What roles, privileges, and subsequent Snowflake objects did they have access to? And what downstream BI dashboards may be affected or become outdated?
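
To make the first question concrete: once Tableau lineage and RBAC live in the same graph, it could be asked with something along these lines. The catalog: lineage vocabulary and class names here are hypothetical, sketched only to show the shape of the traversal:

    PREFIX rbac:    <https://example.org/rbac/>      # illustrative, as above
    PREFIX catalog: <https://example.org/catalog/>   # hypothetical lineage vocabulary

    SELECT DISTINCT ?user ?role ?table
    WHERE {
      # Walk lineage from the dashboard's workbook back to its Snowflake sources
      ?workbook a catalog:TableauWorkbook ;
                catalog:dependsOn+ ?table .
      ?table a catalog:SnowflakeTable .

      # Find roles with a write privilege on those tables...
      ?role rbac:hasPrivilege ?priv .
      ?priv rbac:appliesTo ?table ;
            rbac:privilegeName ?p .
      FILTER(?p IN ("INSERT", "UPDATE", "DELETE", "TRUNCATE"))

      # ...and the users who ultimately hold those roles
      ?role rbac:grantedTo+ ?user .
      ?user a rbac:User .
    }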

This is just us connecting the RBAC data with data from Snowflake and Tableau.

If we can keep applying the same concept to build larger and more interconnected knowledge graphs, the questions we can answer are endless.

Conclusion & future plans

As you can see above, RBAC can get really complicated as a Snowflake implementation scales up. But we were able to use the power of the knowledge graph to make the model easy to understand and to derive answers from. And while we showed off a specific use case, the general process of model, build, and load can apply to almost any knowledge area.

Hopefully this gave you a sense of what knowledge graphs can do. We’ve solved quite a few other problems using this same approach, and we’ll be following this article with more examples. This is one piece of our Agile Data Governance suite; if you’re interested in learning more, check out how we handle it on a broader scale.
