How to Clean Your Startup’s Data — Part II of IV

Carlos Kemeny, PhD
Published in Weave Lab
Jan 28, 2020
Source: Alex Welsh

I believe we have all heard or said these familiar words: “I don’t trust the quality of our data.” Too many high-growth startups allow these words to be stated over and over like a corrupted audio file on repeat. If you find yourself in this unfortunate situation, where do you start?

The following is the second part of a four-part mini-series on how to achieve a gold standard of data quality at your startup, with commentary on how following these tips has helped us at Weave.

Part II: The Data Catalog

In my experience, one of the most helpful drivers of data quality success is the creation and usage of a data catalog. As a Gartner Research article published in 2017, titled Data Catalogs are the New Black in Data Management and Analytics, puts it: “A data catalog maintains an inventory of data assets through the discovery, description, and organization of datasets. The catalog provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand a relevant dataset for the purpose of extracting business value.”

While each data catalog will be unique and tailored to the needs of your business, creating one typically follows a similar trajectory.

Step 1: Document all of your core data sources

Core data sources are those data sources that are critical to the success of your business. Sure, every data source under the sun may seem critical, but you have got to start somewhere. Let’s say you are at a software company like Weave that is uber-focused on growth. In that case, Salesforce objects (such as the Account, Opportunity, and Lead objects), marketing sources (such as Google Analytics or Adobe Analytics), and finance data sources would be a good place to start.

Step 2: Ensure that there is ownership over each core data source

Unless there is an owner for each data source, it will be very challenging to make any progress on the data catalog. If the people who know the most about each data source, and who have admin privileges to it, do not report through the data org, this effort will be an additional lift for them and will need to be prioritized in coordination with their managers. Keeping all stakeholders engaged throughout the process also helps ensure long-term success, as data cataloging is not a one-time event but a continuous commitment. Good engagement and communication will also better inform managers about future time commitments and resourcing requirements.

Step 3: Establish connectivity to all core data sources

By establishing connectivity to your data sources and creating data flows that output metadata, you can create dynamic datasets, visualizations, and alerts that help you govern your data catalog. At Weave, we have created data flows, datasets, and visualizations that list out every dataset, field, field type, and data definition in our data warehouse. Because we have established connectivity to our data source metadata, we don’t have to do one-time data dumps that are irrelevant by the end of the day.
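As a rough illustration of a data flow that outputs metadata, the sketch below lists every dataset, field, and field type in a database. It uses an in-memory SQLite database purely as a stand-in for a data warehouse; the table names, the columns, and the `list_field_metadata` helper are assumptions made for this example, not a description of Weave's actual setup.

```python
import sqlite3


def list_field_metadata(conn):
    """Return one record per (dataset, field, declared type) in the database."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, field, field_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append(
                {"dataset": table, "field": field, "field_type": field_type}
            )
    return rows


# Hypothetical tables standing in for real warehouse datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT, arr REAL)")
conn.execute("CREATE TABLE opportunity (id INTEGER PRIMARY KEY, stage TEXT)")

metadata = list_field_metadata(conn)
for row in metadata:
    print(row)
```

Because the listing is queried from the live schema each time it runs, it stays current as fields are added or removed, which is the advantage over a one-time dump. Most warehouses expose similar metadata through their own system tables or views.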

Step 4: Record information about each field across core data sources, starting with criticalness, usability and field definitions

Now, information needs to be collected about each data source and field. Who is responsible for recording this information? The data source owner, of course. Depending on the number of fields per source, and especially if this is a new exercise for your data source owners, it will be important to limit the scope of this step so that it is not overwhelming. What should be collected? While the breadth and depth can vary, a minimum viable catalog includes the following for each field: data source, field name, description, criticalness, and usability.

Data Source: Provide the name of the data source.

Field Name: The exact field name as contained in the data source.

Description: Include an appropriate description of the field.

Criticalness: Mark “Yes” for fields that are critical; “No” otherwise.

Usability: A score from 0 to 10, with 0 being not usable and 10 being fully usable.
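The five attributes above could be captured in a small record type like the one sketched here. The `CatalogEntry` class, its attribute names, and the sample values are hypothetical and only illustrate the shape of one catalog row; a shared spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a minimum viable data catalog (illustrative names)."""

    data_source: str   # name of the data source
    field_name: str    # exact field name as contained in the data source
    description: str   # an appropriate description of the field
    critical: bool     # True for "Yes", False for "No"
    usability: int     # 0 (not usable) through 10 (fully usable)

    def __post_init__(self):
        # Reject scores outside the 0-10 scale described above.
        if not 0 <= self.usability <= 10:
            raise ValueError(
                f"usability must be between 0 and 10, got {self.usability}"
            )


# Sample values are made up for illustration.
entry = CatalogEntry(
    data_source="Salesforce Account",
    field_name="AnnualRevenue",
    description="Estimated annual revenue of the account.",
    critical=True,
    usability=6,
)
print(entry)
```

Validating the score when the record is created keeps obviously wrong entries out of the catalog before anyone builds a report on top of them.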

A few comments about criticalness and usability: these are subjective fields that should be informed by current business needs. For example, you could first target those fields that are believed to power top-level company KPIs. The usability score is meant to be a proxy for three categories: low-, medium-, and high-confidence usability. At Weave, we translate the score to each category in the following way: 0–3 is low, 4–6 is medium, and 7–10 is high. It is important to have a general understanding of the perceived truthfulness of the source data and of what work will need to be done to instill confidence in it.
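The score-to-category translation is simple enough to express in a few lines. This is a minimal sketch using the 0–3, 4–6, and 7–10 bands described above; the function name `usability_category` is an assumption made for the example.

```python
def usability_category(score: int) -> str:
    """Map a 0-10 usability score to a low/medium/high confidence category."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score <= 3:
        return "low"
    if score <= 6:
        return "medium"
    return "high"


for score in (0, 3, 4, 6, 7, 10):
    print(score, usability_category(score))
```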

Our Experience with this Approach

At Weave, we followed these steps to create a data catalog across various datasets, which included reviewing thousands of fields in about a month’s time. After confirming the data source owners that would be responsible for each data source, we distributed the following template and set a timeline.

I should note that each data source owner had other important priorities that they were working on at the time of launch, so data cataloging was not their primary activity. Time spent per person for the most intensive datasets was 2–3 hours per week, with a total of about 10 hours per person throughout the entire month. To make the work more collaborative and social, we organized four 2-hour working sessions. In these sessions, we got to know each other better, further cementing the type of data culture we want to have at Weave. Btw — Weave people are incredible — hungry, creative, and caring — so it was especially enjoyable!

The result of following this process: we now have a data dictionary. We found that over 75% of fields were not critical, which was just as valuable as discovering and documenting information about our critical fields. Additionally, we have embarked on a three-phase journey to achieve world-class, enterprise-grade data quality. Phase I is nailing down all of our critical, highly usable data. Phase II is working with our critical data that is not highly usable to resolve its usability issues. Finally, Phase III addresses all of our non-critical data by encouraging the deletion of every field that is not highly usable. We will also rigorously evaluate whether non-critical, highly usable fields are worth keeping.
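The three phases amount to a decision rule over criticalness and usability, which might be sketched as follows. The function name, the returned action wording, and the choice to treat a score of 7 or above as "highly usable" (taken from the high band of the usability scale) are assumptions made for this illustration.

```python
HIGH_USABILITY_MIN = 7  # assumed cutoff: the 7-10 band counts as highly usable


def quality_phase(critical: bool, usability: int) -> tuple:
    """Return (phase, suggested action) for one cataloged field."""
    highly_usable = usability >= HIGH_USABILITY_MIN
    if critical and highly_usable:
        return ("Phase I", "nail down")
    if critical:
        return ("Phase II", "resolve usability issues")
    if highly_usable:
        return ("Phase III", "evaluate whether worth keeping")
    return ("Phase III", "encourage deletion")


# Made-up (critical, usability) pairs covering each branch.
for critical, usability in [(True, 9), (True, 4), (False, 8), (False, 2)]:
    print(critical, usability, quality_phase(critical, usability))
```

Run over the whole catalog, a rule like this gives each data source owner a sorted worklist instead of an undifferentiated list of thousands of fields.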

Conclusion

If you don’t already have a data catalog at your startup, you should start building one. The steps above were integral in building a data catalog at Weave. This process also led to many additional benefits: it has helped us 1) clarify data roles, responsibilities, and ownership, 2) strengthen team relationships, and 3) draft our strategic roadmap to achieve world-class, enterprise-grade data quality.
