How we democratized data access with Streamlit and Microsoft-powered automation

Pieter Huycke · Published in datamindedbe · Sep 25, 2024

Within the community of data professionals, the term “data governance” often conjures up an image of a large, stuffy conference room filled with higher-level managers talking for hours about what the term “data” really means for their company. These discussions typically result in a 47-page-long discourse on how data should be handled within said company. This compiled governance document is then released to the company’s data teams where, in the worst case, it will be avoided like the plague and called names such as “impractical”, “out of touch”, or “ivory tower advice”.

One significant reason why such “conference room” attempts at data governance are doomed to fail is the lack of practicality: boots-on-the-ground data professionals don’t want to read 9 pages of definitions for each different data type the company has to offer, or ponder whether “structured” versus “unstructured” data needs an entirely separate set of rules. Instead, what data analysts, engineers and other data consumers need is a practical, actionable approach to data governance. If governance efforts are too theoretical, detached, or overly complex, they simply won’t be adopted.

A data governance solution, at its core, should be an enabler designed to allow frictionless, easy-to-understand adoption of the governance rules detailed in the 47-page governance document.

In this blogpost, we will discuss how our team enabled a large-scale governmental organisation to perform self-serviced data access using Streamlit, an open-source Python app framework, and Power Automate, a SaaS offered by Microsoft. Before jumping into this story, we will first introduce some concepts needed to understand our data governance journey.

DoD: The Definition of “Data”

We will first introduce the concept of a “data product”, a concept derived from the data mesh principles:

A data product represents a collection of data bundled together for a specific reason, a purpose.

To make it more tangible, consider an analyst creating a data product titled “dashboard_expenses_q_4_2024”. The analyst might first identify several tables containing relevant information and subsequently bundle them together in this new data product. Even though each individual table might vary significantly in the type of information it contains, they belong together in this data product for the purpose of creating an informative dashboard about the governmental expenses of the last quarter of 2024.
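
As a mental model, a data product is then little more than a named bundle of datasets with a purpose. The sketch below is purely illustrative (the class and table names are made up, not the organisation’s actual representation):

from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A purpose-driven bundle of datasets (illustrative sketch)."""
    name: str
    purpose: str
    tables: list[str] = field(default_factory=list)

# The analyst's dashboard product from the example above
dashboard_expenses = DataProduct(
    name="dashboard_expenses_q_4_2024",
    purpose="Dashboard on governmental expenses in Q4 2024",
    tables=["invoices_2024", "departments", "budget_allocations"],
)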

Apart from the notion of a data product, we also defined who can interact with a data product by defining data product roles. These roles are again defined by their purpose, which in turn determines their abilities. For instance, the “data engineer” role might be assumed by data consumers who wish to transform source data into intermediate and final output. For this purpose, it makes sense to grant the data engineer role read and write access to a data product. The purpose of a “validator” role, on the other hand, is to inspect data for validation; read access suffices there.
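
This purpose-driven reasoning can be sketched as a role-to-permission mapping. The permission sets below are illustrative assumptions, not the organisation’s definitive model:

from enum import Flag, auto

class Permission(Flag):
    READ = auto()
    WRITE = auto()

# Illustrative mapping: each role's abilities follow from its purpose.
ROLE_PERMISSIONS = {
    "data_engineer": Permission.READ | Permission.WRITE,  # transforms source data
    "validator": Permission.READ,                         # inspects data, read-only
    "reader": Permission.READ,                            # consumes final output
}

def can_write(role: str) -> bool:
    """Check whether a role's purpose warrants write access."""
    return Permission.WRITE in ROLE_PERMISSIONS[role]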

A visual representation of a data product. The data product separates source from output data, and has a dedicated space where intermediate data is located. The lower part represents agents having access to the product in order to fulfill a specific purpose.

But how do we apply this governance concept in practice? Since roles are key attributes of a data product, we started by consolidating the roles for each data product in YAML files, which allow us to structure and manage this information effectively. An example of such a YAML file is shown below:

name: expenses_q_4_2024
description: Data product detailing the Q4 quarterly expenses of the year 2024
people:
  managers:
    - email: t.swift@dep.gov.lc
  data_engineers:
    - email: w.wonka@dep.gov.lc
  readers:
    - email: j.austen@dep.gov.lc
schema_version: 1

This YAML-based approach to data access has several advantages:

  • Human-friendly due to its readable format
  • Versionable, as YAML files can be stored in an online code repository
  • Searchable

These YAML files form the basis for actually governing data access, as they provide a clear and easy-to-understand definition of who can do what within a specific data product. When a user wants to change data access for a specific data product, this translates into editing the corresponding YAML file to reflect the required change. The edited file is then uploaded to our organisation’s code repository, where it is used to update the data access model for that product.
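
Because the files are plain YAML, downstream automation can consume them directly. A minimal sketch of how such a file might be parsed with PyYAML (the file name matches the example above; this is not the organisation’s actual pipeline code):

import yaml

# Load a data product definition (schema as shown earlier)
with open("expenses_q_4_2024.yaml") as f:
    product = yaml.safe_load(f)

# Derive a flat access list: who holds which role on this product?
for role, members in product["people"].items():
    for member in members:
        print(f"{member['email']} -> {role} on {product['name']}")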

Is YAML the best possible solution to manage data access? It depends. YAML’s simplicity is both its strength and its weakness. For straightforward tasks, such as granting access to a specific individual, editing a YAML file works well. However, when answering more complex questions like “Who worked as a data engineer on data product X this year?”, YAML can become limiting, as it relies on the version control system to track changes.
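
To make this limitation concrete: answering that question from YAML in git means replaying the file’s history and scraping diffs. A deliberately rough sketch (the file path and cut-off date are placeholders) shows how awkward this gets:

import subprocess

# Who appeared as a data engineer on this product at any point this year?
# With YAML in git, the only option is to replay the file's change history.
log = subprocess.run(
    ["git", "log", "--since=2024-01-01", "-p", "--", "expenses_q_4_2024.yaml"],
    capture_output=True, text=True, check=True,
).stdout

engineers, in_block = set(), False
for line in log.splitlines():
    stripped = line.lstrip("+- ")  # drop diff markers and indentation
    if stripped.startswith(("managers:", "readers:")):
        in_block = False
    elif stripped.startswith("data_engineers:"):
        in_block = True
    elif in_block and "email:" in stripped:
        engineers.add(stripped.split("email:")[1].strip())

print(engineers)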

In contrast, a proper database could resolve most technical limitations. For instance, if data access were managed in a database where attributes like “data engineer” are treated as slowly changing dimensions, answering the aforementioned question would be significantly easier. Yet, managing a database introduces a significant increase in complexity compared to editing YAML files, which may not be worth the trade-off depending on your needs.
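
For contrast, here is a sketch of the same question when role assignments live in a database modelled as a type-2 slowly changing dimension. This is an illustrative sqlite example, not the organisation’s actual setup:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Each row is valid from valid_from until valid_to (NULL = still active).
    CREATE TABLE role_assignments (
        product TEXT, email TEXT, role TEXT,
        valid_from TEXT, valid_to TEXT
    );
    INSERT INTO role_assignments VALUES
        ('expenses_q_4_2024', 'w.wonka@dep.gov.lc', 'data_engineer', '2024-02-01', NULL),
        ('expenses_q_4_2024', 'j.austen@dep.gov.lc', 'data_engineer', '2023-05-01', '2024-03-15');
""")

# "Who worked as a data engineer on product X this year?" becomes one query.
rows = con.execute("""
    SELECT DISTINCT email FROM role_assignments
    WHERE product = 'expenses_q_4_2024' AND role = 'data_engineer'
      AND valid_from <= '2024-12-31'
      AND (valid_to IS NULL OR valid_to >= '2024-01-01')
""").fetchall()
print(rows)  # both engineers were active at some point in 2024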

Importantly, while both YAML and database tech are well-known tools for developers, they are not for the average business user. For example, the business does not know how to upload a modified YAML file to a code repository, or how to handle the merge conflicts occurring when two users edit the same file simultaneously.

Charon, the Underworld’s boatman

Recognising these challenges, we developed a user-friendly online application that abstracted away the underlying YAML editing process, allowing business users to manage data access without interacting with the YAML files directly. Since we wanted a solution that would be easy to understand and maintain and would allow for quick iteration, we built our online app using Streamlit, an open-source Python-based library.

Screenshot of our user-facing application

The resulting application, called Charon after the ghostly boatman handling access to the Greek underworld, provides many features to data consumers. Among other functionalities, Charon users can check who currently has which kind of access rights to an existing data product, create new data products, edit existing ones, and look up existing users and their roles.

Since Charon was built to answer the organisation’s specific data governance needs, we developed it so that users are actively nudged towards data governance best practices and conventions. The key factor here is that this increases user compliance with the governance framework without expecting users to read and internalize the entire data governance document. For example, it is straightforward to show an informative error message in Charon when a user provides input that is not in line with the governance rules. This kind of immediate and user-friendly feedback is not possible in a purely YAML-based setting.
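
As a simplified illustration of such a nudge (this is not Charon’s actual code, and the domain rule is a made-up example of a governance convention):

import streamlit as st

ALLOWED_DOMAIN = "dep.gov.lc"  # hypothetical governance convention

st.title("Charon - edit data product")
email = st.text_input("Add a reader (e-mail address)")

if st.button("Request change"):
    if not email.endswith("@" + ALLOWED_DOMAIN):
        # Immediate, human-readable feedback instead of a failing pipeline later on
        st.error(f"Only {ALLOWED_DOMAIN} addresses can be granted access.")
    else:
        st.success("Change request submitted for manager approval.")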

In summary, switching from a code-first approach to a user-facing web application led to

  • a significant increase in user-friendliness
  • the ability for users to shape their own governance experience while complying with the defined governance framework
  • increased possibility for automation
  • a well-known, centralised landing zone for anything data governance related

You shall not pass!

- Gandalf, facing the Balrog

Now that Charon allows every authenticated user within our organisation to make changes to existing products and create new ones, other challenges arise.

Does that mean that anyone can seize control of my data product?

What happens if someone is reading from my data product, but I don’t want that anymore?

These questions indicate a rising need for control over data products. In line with the hierarchical structure typical of governmental organisations, it was quickly agreed that a higher-level official should function as the data product’s manager and protect it from unwarranted access or misuse. This “manager” role was subsequently incorporated into our data product role model. While defining the manager role is a step in the right direction, it does not answer how managers will exert control.

Data Access as a Notification

In essence, we want to enable managers to approve or reject changes to their data products requested by users. Additionally, we want to provide easy-to-understand and immediate feedback on the manager’s decision to the requester. We decided to embrace the strong Microsoft culture in our organisation and opted for Microsoft Power Automate, a SaaS platform offering automation and orchestration of workflows and business processes.

In our organisation, we mainly leveraged two managed services offered by Power Automate: approval flows and instant notifications. When a user requests an edit to an existing data product, Power Automate now makes it possible to first request approval from the manager of that product before implementing the change. This request for approval appears as a Microsoft Teams notification, where the manager can press an “Approve” or “Reject” button to decide on the fate of the request.

Edited example of an approved request sent to a manager with the purpose of creating a new data product. Such a message is sent automatically via a Power Automate flow and is delivered as a Microsoft Teams notification. After the manager approves, the subsequent steps in the flow are executed.
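
Although the flows themselves are built in Power Automate’s low-code designer, a plausible way to wire an application like Charon to such a flow is Power Automate’s “When an HTTP request is received” trigger, to which the app can POST the request details. A sketch with a placeholder URL:

import requests

# URL generated by a Power Automate flow with the
# "When an HTTP request is received" trigger (placeholder below).
FLOW_URL = "https://prod-XX.westeurope.logic.azure.com/workflows/.../invoke"

payload = {
    "product": "expenses_q_4_2024",
    "requester": "j.austen@dep.gov.lc",
    "change_summary": "Add j.austen@dep.gov.lc as reader",
}

# The flow takes over from here: it sends the manager a Teams approval
# card and, on approval, triggers the follow-up steps.
response = requests.post(FLOW_URL, json=payload, timeout=10)
response.raise_for_status()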

When a manager approves, the change (which is essentially an edited YAML file, remember?) is propagated to our code repository, where it is implemented. Once that has happened, the requester receives a “Power-Automated” e-mail informing them that their change has been approved and processed. In summary, by leveraging Power Automate we provide a managed and automated workflow that offers both control and clarity with respect to data governance.

Even though Charon is built from “simple” technologies like YAML, Streamlit and Power Automate, we see broad adoption of its services and hear users applaud its user-friendliness. Our Charon experience shows that it’s never the technology alone that solves everything; it’s the people who are leveraging it.

Key learnings

If you have reached this point, it is evident to you that our data governance journey was a long and twisty one. To wrap up our story, we will distill our experience into a few key learnings:

  • Occam’s razor. Streamlit provides a ton of functionality out of the box. Developers can start building right away, which results in fast feedback cycles and, ultimately, end users who are happy to see their feedback implemented quickly. Streamlit of course lacks the versatility and customisation options of a dedicated front-end framework such as React, but for an internal application serving a dedicated purpose it is an excellent solution.
  • Technology educates. When managing their data access affairs, users would frequently encounter strategically placed hints and nudges gently pointing them to the best practices of data governance. After a while, users adopt these practices automatically even though they never read the actual governance document. Hence, technology allows for incremental learning and internalization of best practices as usage picks up.
  • Good design is hard. What makes sense for you as a designer or developer does not necessarily make sense for an end user. Think hard about how your users will perceive the feature you are currently developing. Does it make sense for the feature to live in the sidebar? Is the help tooltip clear enough for someone with only top-level knowledge?

Of course, we learned a ton more (when you see me somewhere, don’t ask me about application testing) but these are a few learnings we wanted to share with you.

Our data governance approach was heavily based on the concept of a data product. Would you be interested in adopting a similar approach in your company, or learning more about building data products at scale? If so, the blog post on the Data Product Portal could be a great starting point.

We hope you liked this blogpost, and would be delighted to hear any questions or remarks you have on our journey to better data governance!
