Data Governance & Pipelines

Analytic Alchemy
4 min read · Apr 17, 2024

Companies and clients are always looking for the next big thing, and while AI is booming in the market, this is a reminder that the quality of the data used to build the next big machine learning model matters. Don’t get two steps ahead with machine learning models only to trip over bad data quality.

Data is used throughout organizations and has its own pipeline network: where it’s housed, how it’s pulled, who has access to it, and where it goes to be used in whatever capacity. It’s used to make decisions about business growth, investments, and even employment. If data is driving these major decisions, it should be correct, complete, accurate, and timely, or there could be serious consequences.

Governance Defined

Governance on its own is defined as the process of acting on, controlling, or strongly influencing something, and/or exercising continuous sovereign authority over it.

A data governance team oversees these actions and processes, but it’s essentially everyone’s job to ensure the data is correct. The data governance team is there to put these processes into practice and to make decisions about the process and the data, but it’s everyone’s job to implement them and to be aware of how the data needs to be governed and formatted. There are a few different roles within the governance team. These roles, and their titles, can vary from company to company, but the ideas and processes are usually the same. Here are some key components of governance:

  • Data Quality Management: Ensuring that data is accurate, consistent, and up-to-date.
  • Data Security & Compliance: Implementing measures to protect data from unauthorized access, breaches, and cyber threats.
  • Data Privacy: Ensuring that data is collected, used, and stored in compliance with relevant privacy laws and regulations, such as GDPR, HIPAA and CCPA.
  • Data Stewardship: Assigning responsibility for data management to individuals or teams within the organization.
  • Data Lifecycle Management: Managing data from creation to deletion, ensuring that it is retained for the appropriate duration and disposed of securely.
  • Risk Management: Mitigating risks associated with data, such as data breaches, data loss, or incorrect data analysis. By establishing controls and processes, governance helps identify and address potential risks.

So data governance is just that: the processes and practices that ensure data is managed according to best practices and maintained throughout its lifecycle, so that the data is accurate, reliable, timely, consistent, and complete.

Data Governance Lifecycle

As noted earlier, one component of data governance is its lifecycle. Each lifecycle can differ depending on needs. Some things within that lifecycle/pipeline stay constant, though: throughout its lifecycle, the data should remain:

  • Accurate: Data governance ensures that data is accurate by implementing processes and controls to verify its correctness. This includes validation checks, data profiling, and data quality monitoring.
  • Reliable: Data governance aims to make data reliable by establishing standards and best practices for data collection, storage, and processing. This ensures that data can be trusted for decision-making purposes.
  • Timely: Data governance ensures that data is timely by setting guidelines for data capture and updating processes. This ensures that data is available when needed and reflects the most current information.
  • Consistent: Data governance ensures data consistency by defining standards for data formatting, naming conventions, and data integration. This ensures that data is uniform across different systems and sources.
  • Complete: Data governance ensures that data is complete by defining rules and processes for data collection and validation. This ensures that all necessary data is captured and that there are no gaps in the data.
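As a rough illustration of what such checks can look like in practice, here is a minimal Python sketch that tests a single record for completeness, accuracy, consistency, and timeliness. The field names, the 30-day freshness window, and the `check_record` function are all hypothetical and chosen only for this example; real rules would come from your own governance team.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rules for illustration only; real rules come from the governance team.
REQUIRED_FIELDS = ("customer_id", "email", "country", "updated_at")
MAX_AGE = timedelta(days=30)


def check_record(record, now):
    """Return a list of data quality issues found in one record (a dict)."""
    issues = []

    # Complete: every required field is present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")

    # Accurate: a very rough validity check on the email value.
    email = record.get("email")
    if email and "@" not in email:
        issues.append("invalid email")

    # Consistent: country codes follow one agreed format (two uppercase letters).
    country = record.get("country")
    if country and not (len(country) == 2 and country.isalpha() and country.isupper()):
        issues.append("inconsistent country format")

    # Timely: the record was updated within the allowed window.
    updated_at = record.get("updated_at")
    if updated_at and now - updated_at > MAX_AGE:
        issues.append("stale record")

    return issues
```

A record that passes every rule returns an empty list, while a problem record returns one entry per failed rule, which makes the results easy to log or route to whoever owns the data.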

Data Pipeline vs. Data Lifecycle

The data lifecycle covers the entirety of the data’s existence, from creation to deletion and everything in between. The in-between is the data pipeline. The data pipeline is normally built by the engineering team on the back end, though some ETL/ELT processes are done by analysts. Many analysts are probably more familiar with data pipelines, but the data governance team is the guardian or overseer of how the data is managed, stored, and improved upon. Again, this can be everyone’s job: those of us who work more closely with the data can spot things that were missed and bring them to the data governance team or engineering team to fix.

You can think of data pipelines as the interconnected web of data usage and flow within this structured process of governance. It’s just as important for data engineers and analysts to follow what the data lifecycle requires.
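To make the relationship concrete, here is a small Python sketch of a pipeline step with a governance check built in: rows that fail validation are set aside for review instead of being loaded. The `run_pipeline`, `validate_order`, and `normalize_order` names and the order fields are assumptions made up for this example, not part of any particular tool.

```python
def run_pipeline(rows, validate, transform):
    """Run a tiny validate-transform-load flow over in-memory rows.

    Rows that fail validation are quarantined with their problems so that
    the governance or engineering team can review and fix them.
    """
    loaded, quarantined = [], []
    for row in rows:
        problems = validate(row)
        if problems:
            quarantined.append({"row": row, "problems": problems})
        else:
            loaded.append(transform(row))
    return loaded, quarantined


def validate_order(row):
    """Hypothetical validation rules for an order row."""
    problems = []
    if not row.get("order_id"):
        problems.append("missing order_id")
    if not isinstance(row.get("amount"), (int, float)) or row["amount"] < 0:
        problems.append("invalid amount")
    return problems


def normalize_order(row):
    """Hypothetical transform: trim the ID and standardize the amount."""
    return {
        "order_id": str(row["order_id"]).strip(),
        "amount": round(float(row["amount"]), 2),
    }
```

Calling `run_pipeline(rows, validate_order, normalize_order)` returns the cleaned rows alongside the quarantined ones, so bad data is surfaced to the people who can fix it rather than silently flowing downstream.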

Conclusion

Overall, data governance gives the organization and its clients assurance that data is trustworthy and reliable. It standardizes practices so that data remains consistent, promotes transparency in data operations, and supports accurate results from machine learning models and better-informed decisions. This allows up-to-date data to drive better strategic and operational decisions.
