How to Clean Your Startup’s Data — Part III of IV

Carlos Kemeny, PhD
Published in Weave Lab
Feb 4, 2020 · 5 min read
Source: https://www.ellackcleaning.co.uk/warehouse-and-factory-cleaning

I believe we have all heard or said these familiar words: “I don’t trust the quality of our data.” Too many high-growth startups allow these words to be stated over and over like a corrupted audio file on repeat. If you find yourself in this unfortunate situation, where do you start?

The following is the third part of a four-part mini-series on how to achieve a gold standard of data quality at your startup, with commentary on how following these tips has helped us at Weave.

Part III: Data Governance

Suppose your data warehouse is filthy and you decide to launch a campaign to clean it up. You might congregate the masses for a bounded period of time, perhaps calling it a sprint. You solidify ownership and assign tasks. Everyone does their job and at the end of this project, you stand together in awe, admiring the beautiful, pristine warehouse.

The major challenge then is figuring out how you will maintain the cleanliness of the warehouse. While you may have cleaned everything up during the sprint, the root causes that led to the poor data quality were probably not solved.

This is where data governance comes in.

What is good governance?

In my experience, good governance that leads to data integrity starts with purpose, people, and process.

1. Purpose

How do you ensure that data going into your warehouse has a purpose and a reason for being there?

It can be tempting to put everything into your data warehouse because you don’t know what you might want in the future. Just to be on the safe side, it is probably better to put all the data in, correct? Wrong.

The “all data in” strategy can have serious unintended consequences for data quality. When you put everything in your warehouse, the result is data overload, and pretty soon no one knows what any given dataset contains or whether it can be trusted. And what happens when people start creating and sharing transformations built on uncertified data? What a mess!

You should do it right the first time. Establish a purpose for what you are trying to achieve through your business intelligence operations, and then prioritize the data you will need to achieve those goals. If the warehouse is already dirty, do the same, and then plan on purging bad data over time.

The following is our purpose at Weave:

“We believe in the power of data to increase revenue, improve productivity, decrease cost, and improve the lives of our people. We recognize that data is merely a means to an end unless coupled with positive impact. We are deliberate about prioritizing data projects that have the most impact on the bottom line.”

We are tackling the most important challenges first (meaning those projects that have the greatest potential to increase revenue, improve productivity, decrease cost, and improve the lives of our people) and making sure that the data we bring into our warehouse is relevant, trusted, and aligned with that purpose.

2. People

Who are the people that have access to your warehouse? What permissions do they have by default? Who are the owners of raw data, transformations, and visualizations?

People governance doesn’t need to be hard or complex, but it does have to be clear. Data source owners, transformation and visualization creators, as well as consumers may all be assigned different levels of data permissions. Function-owned data, such as data from finance and human resources, will need to be protected.

People governance strategy is different for each company, so stakeholders, including information security and compliance, should be involved in determining the best path forward. Stakeholders will also provide requirements on an ongoing basis, which means that your people governance strategy might change every so often.

Ownership is particularly important because there needs to be accountability for preserving the quality of your data. At Weave, we determined that data source owners are also owners of the raw data in our warehouse, function leads own the transformations and visualizations for their respective functions, and executive and board transformations and visualizations are my responsibility.
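Ownership rules like these are easier to keep clear when they are written down in one place. The following is a minimal sketch, not our actual implementation, of how such rules could be encoded. Every name in it (the sources, functions, owner labels, and the `resolve_owner` helper) is a hypothetical placeholder.

```python
from enum import Enum


class AssetKind(Enum):
    RAW = "raw"
    TRANSFORMATION = "transformation"
    VISUALIZATION = "visualization"


# Hypothetical lookup tables; replace with your own sources, functions, and people.
SOURCE_OWNERS = {"billing_system": "billing_admin", "crm": "sales_ops_admin"}
FUNCTION_LEADS = {"finance": "finance_lead", "sales": "sales_lead"}
EXEC_OWNER = "head_of_data"  # placeholder for whoever owns executive and board assets


def resolve_owner(kind: AssetKind, source: str = "", function: str = "",
                  audience: str = "") -> str:
    """Return the person accountable for a warehouse asset.

    A KeyError means the asset has no registered owner, which is itself
    a governance gap worth surfacing.
    """
    if kind is AssetKind.RAW:
        # Raw data is owned by whoever owns the source system.
        return SOURCE_OWNERS[source]
    if audience in {"executive", "board"}:
        # Executive and board transformations and visualizations have one owner.
        return EXEC_OWNER
    # Everything else belongs to the lead of the function it serves.
    return FUNCTION_LEADS[function]
```

For example, `resolve_owner(AssetKind.VISUALIZATION, function="finance")` returns the finance lead, while the same call with `audience="board"` returns the executive owner.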

3. Process

What best practices and guidelines need to be followed to preserve data integrity? How will you train and monitor usage to protect against data quality erosion? How can you best enlist others to evangelize and defend your data quality objectives?

At Weave, we created best practices and guidelines across every part of the BI journey, and we continue to look for weak spots where we can improve. Examples of best-practice and guideline topics include the following:

1. Standardization of dataset and data flow naming conventions

2. Auditing and certification checklists, such as a list of to-dos when merging data across two functions

3. Transformation options and preferences

4. Visualization standardization rules, such as chart colors and axis labeling

5. Visualization ownership

6. Setting up alerts and publications

We created training and certification exams to ensure all data users understand and agree to the rules. We also set up dashboards to monitor compliance. For example, users are alerted when they have failed to include a dataset description or to follow the naming convention. This feedback loop is critical to creating a virtuous data quality cycle.
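To make the feedback loop concrete, here is a minimal sketch of the kind of check that could sit behind such an alert. It is an illustration under stated assumptions, not the tooling we use: the naming convention, the `Dataset` fields, and the function names are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: <function>_<subject>_<stage>, e.g. "finance_invoices_raw".
NAME_PATTERN = re.compile(r"^[a-z]+_[a-z0-9]+(?:_[a-z0-9]+)*_(raw|clean|certified)$")


@dataclass
class Dataset:
    name: str
    description: str
    owner: str


def compliance_issues(dataset: Dataset) -> list[str]:
    """Return the governance rules this dataset breaks (empty list if none)."""
    issues = []
    if not dataset.description.strip():
        issues.append("missing description")
    if not NAME_PATTERN.fullmatch(dataset.name):
        issues.append("name does not follow convention")
    return issues


def build_alerts(datasets: list[Dataset]) -> dict[str, list[str]]:
    """Group issues by owner so each person is alerted only about their own datasets."""
    alerts: dict[str, list[str]] = {}
    for ds in datasets:
        for issue in compliance_issues(ds):
            alerts.setdefault(ds.owner, []).append(f"{ds.name}: {issue}")
    return alerts
```

Run on a schedule against the warehouse catalog, the output of `build_alerts` could feed both the alerts sent to users and a compliance dashboard. Owners with fully compliant datasets do not appear in the result at all.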

Because process monitoring and enforcement can feel like policing and administrative overreach, it is important to enlist your team in defining, defending, and enforcing these processes. Peer-to-peer accountability is much more effective than accountability imposed by a centralized administration. If everyone wants to enjoy a clean data warehouse, they all have to make sure that others aren’t polluting it.

Conclusion

A clear and stakeholder-influenced governance strategy is important to preserving your data quality. Good governance starts with purpose, people, and process. A plan tailored to your company can have an immediate impact on data quality. After all, what good is cleaning your warehouse if it gets dirty immediately after a sprint?
