How to Clean Your Startup’s Data — Part IV of IV

Carlos Kemeny, PhD
Published in Weave Lab · Jul 9, 2021

I believe we have all heard or said these familiar words: “I don’t trust the quality of our data.” Too many high-growth startups allow these words to be stated over and over like a corrupted audio file on repeat. If you find yourself in this unfortunate situation, where do you start?

The following is the final part of a four-part mini-series on how to achieve a gold standard of data quality at your startup, with commentary on how following these tips has helped us at Weave.

Part IV: Auditing and Certification

To ensure data integrity, data should be audited and certified every time it 1) comes into your warehouse, 2) is used in a transformation, or 3) is consumed in a visualization. Does this seem somewhat cumbersome? Absolutely, but it is necessary. This is why it is important to nail down your governance strategy (purpose, people, and process) before engaging in this step; without it, auditing and certification won't matter.

To illustrate where things can break down, I will use three examples.

1. Data source to warehouse: you establish connectivity from a data source to your data warehouse but do not audit or certify the data after it arrives in the warehouse. How do you know that the connection was set up correctly and that the warehouse data matches what is contained in the source?

2. Raw datasets in the warehouse to transformed dataset: you make a number of transformations to a dataset to automate a process that is currently done manually, or to simplify a process that is performed across platforms. How do you validate that the transformations have been set up correctly?

3. Visualizations: you create a visualization that requires business logic to be operationalized in a series of filters. How do you ensure that the filters have been set up correctly?

And this is just the start. While raw inputs can be relatively straightforward, the transformation step can get messy and difficult to manage, depending on the complexity of the transformations and the lineage of your data flows.
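To make the first example concrete, here is a minimal sketch of what an ingestion audit can look like in Python with pandas. The file and column names (crm_report_export.csv, amount) are hypothetical placeholders rather than Weave's actual sources; the point is simply to compare the same slice of data on both sides of the connection.

```python
import pandas as pd

# Hypothetical extracts: a report pulled from the source system and the
# same slice queried from the warehouse after ingestion.
source = pd.read_csv("crm_report_export.csv")
warehouse = pd.read_csv("warehouse_raw_crm.csv")

# Coarse checks that surface most connector misconfigurations early.
checks = {
    "row_count": (len(source), len(warehouse)),
    "amount_sum": (round(source["amount"].sum(), 2),
                   round(warehouse["amount"].sum(), 2)),
}

for name, (src, wh) in checks.items():
    status = "OK" if src == wh else "DISCREPANCY"
    print(f"{name}: source={src}, warehouse={wh} -> {status}")
```

Even a couple of coarse checks like these can catch a misconfigured connector before it quietly corrupts downstream reporting.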

Ownership

Ownership was already discussed in a previous post, but I can't overstate its importance. Who owns what is one of the most important pieces of solving this puzzle. When there is clear ownership of data auditing at each step of the data's journey to visualization, it is just a matter of following the process. When ownership is lacking, everyone and no one is responsible for data quality. Finger pointing and “not my problem” syndrome are bound to follow.

At Weave, raw datasets are owned by data source owners. Transformations and visualizations are owned by function leads. As we continue to grow, the ownership strategy will change, but for where we are as a company and within business intelligence, this model works well. Eventually, function leads will provide more oversight, and data engineers and visualization specialists/analysts will participate in ownership.

Process

After ownership comes process. Auditing and certification are not difficult, but they require attention to detail and a commitment to following the process.

We created a process that everyone can follow. Below are the steps for auditing raw data in the data warehouse against source data:

1. Create a report within the data source. This report will include critical filters and date ranges, as well as all fields that are required for reporting.

2. Evaluate all raw data fields against data source reporting fields for total count, null count, maximum, minimum, mean, median, etc. Numeric fields might also include tests on counts of cells greater than 0, less than 0, and equal to 0 (see the sketch after this list).

3. Spot check random rows for data completeness.

4. Once all critical data has been reconciled, automate reconciliation by connecting reports to the data warehouse directly and monitoring for any discrepancies between raw data and reporting data.
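Step 2 lends itself to a small, repeatable script. The sketch below assumes both the source report and the raw warehouse table can be exported with the same fields and data types; the file names are placeholders.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summary statistics used to reconcile a raw table against a source report."""
    rows = []
    for col in df.columns:
        s = df[col]
        row = {"field": col, "total_count": len(s), "null_count": int(s.isna().sum())}
        if pd.api.types.is_numeric_dtype(s):
            row.update(
                minimum=s.min(), maximum=s.max(), mean=s.mean(), median=s.median(),
                gt_zero=int((s > 0).sum()), lt_zero=int((s < 0).sum()),
                eq_zero=int((s == 0).sum()),
            )
        rows.append(row)
    # Rounding dampens floating-point noise between the two systems.
    return pd.DataFrame(rows).set_index("field").round(6)

source_profile = profile(pd.read_csv("source_report.csv"))
raw_profile = profile(pd.read_csv("warehouse_raw.csv"))

# Any non-empty result is a discrepancy to resolve before certification.
diff = source_profile.compare(raw_profile)
print(diff if not diff.empty else "All profiled fields reconcile.")
```

Once this reconciles cleanly, the same script is a natural candidate for step 4: schedule it against the live report and warehouse connections and alert on any non-empty diff.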

Once datasets have been audited and verified, certification requires two approvers, with one approver typically being the function lead. In addition to the two approvers, business owners can also be added.
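If you want the approval rule to be machine-checkable rather than tribal knowledge, it can be encoded in your tooling. The sketch below is a hypothetical illustration of such a record, not Weave's internal system:

```python
from dataclasses import dataclass, field

@dataclass
class Certification:
    """Hypothetical certification record for an audited dataset."""
    dataset: str
    function_lead: str
    approvers: list[str] = field(default_factory=list)
    business_owners: list[str] = field(default_factory=list)  # optional additions

    def is_certified(self) -> bool:
        # Two approvers required, one of whom is typically the function lead.
        return len(self.approvers) >= 2 and self.function_lead in self.approvers

cert = Certification(dataset="core_revenue", function_lead="jane",
                     approvers=["jane", "carlos"])
print(cert.is_certified())  # True
```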

Certification is meant to instill confidence and as such, requires rigor. While not every dataset needs to be certified, the core datasets that provide information to executives and the board always need to be certified.

Now, a word about transformations. Even though reports can help reconcile independent datasets, oftentimes transformations join disparate datasets, which requires additional process steps. A good rule for auditing and certifying datasets that are created from data flows is to require all affected data source owners to participate in the auditing and certification process. This might not always be possible due to competing priorities and timelines, but at Weave, we require it.

Given that raw datasets should already be audited by the time they are used in transformations, you should also consider pre- and post-transformation verifications. For complex joins, it is a good idea to create temporary audit datasets to reconcile data pre- and post-join. For simpler transformations, you might consider creating fields that check for nulls, data types, and values. Each dataset is different, so your verification steps will be unique, but they should follow similar principles.
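Here is a minimal sketch of a pre- and post-join reconciliation in that spirit, assuming two already-audited raw tables; orders.csv, accounts.csv, and the column names are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical, already-audited raw datasets.
orders = pd.read_csv("orders.csv")
accounts = pd.read_csv("accounts.csv")

# Temporary audit figures captured before the join.
pre = {"order_rows": len(orders),
       "order_total": round(orders["amount"].sum(), 2)}

joined = orders.merge(accounts, on="account_id", how="left")

post = {"order_rows": len(joined),
        "order_total": round(joined["amount"].sum(), 2)}

# A left join should neither drop nor duplicate orders; fan-out from
# duplicate account_ids shows up as a row-count or sum discrepancy.
assert pre == post, f"Join changed the data: pre={pre}, post={post}"

# Simple post-transformation field checks: nulls, data types, values.
assert joined["account_name"].notna().all(), "orders with no matching account"
assert pd.api.types.is_numeric_dtype(joined["amount"]), "amount lost numeric type"
```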

Our recent experience with auditing and certification

Taking a use case from just a couple of days ago, our team once again experienced the immense value that comes from auditing and certification. We met with the finance team about an impact project they wanted to do. After a one-hour meeting to review the purpose and objective, the finance team audited the raw data in the warehouse and verified its accuracy. Over the span of a day, we created two transformations that reduced a multi-day process to just over 3 minutes. Prior to sharing any content, we spent an additional 6–10 person-hours auditing transformation steps and outputs in order to certify the final dataset. Oh, how glorious it was to create a reconciliation dataset and visualization that showed 0 discrepancies! So, in less than 1.5 days, we went from zero to certified on an impact project where data quality was paramount, and the data can now be fully trusted. Ownership and process were clear, and as a result, the objective was achieved swiftly and effectively.

Conclusion

Auditing and certifying data will change the way that people trust your data. When ownership and process are defined and clear, your BI team will be empowered to provide an astounding level of data quality and impact.
