Unlock your data: data sourcing step

Trefoil Analytics
5 min read · Jan 8, 2024


Data sourcing process

In the previous article, we started to unravel the complex web that is the data management process: the supply chain connecting data sources and data consumption. In the forthcoming articles, we’re going to delve deeper into four pivotal topics that orbit this process.
For each topic, we’ll dissect and elaborate on the sub-steps involved in every stage, giving you a comprehensive understanding of the tasks at hand. Then, we’ll spotlight the key players responsible for executing these steps, using the RACI (Responsible, Accountable, Consulted, and Informed) model. In addition, we’ll examine the data management capabilities needed to carry out the process effectively, highlighting the essential skills and resources required. Finally, we’ll look at how innovative Data & AI solutions can boost these capabilities and make the process smarter and more data-informed.

So, let’s start with the data sourcing process.

The data sourcing process

Data sourcing process

Describing the sourcing process

The data sourcing process is the step in the data value chain that involves identifying and collecting data from various sources to make it available for further processing and consumption. This process can be quite complex, depending on the number of sources, the types of data, and the specific requirements of the data platform.

Here is a general overview of the data sourcing process:

1. Identify data needs: Begin by determining the data needs of the organization, project, use case, or portfolio of use cases. This includes understanding the types of data required, the level of detail, and the specific attributes needed for analysis.

2. Identify potential data sources: Next, identify the possible sources of data that can meet your requirements. This may include internal sources such as databases, applications, and spreadsheets, as well as external sources such as public data sets, partner data, or purchased data.

3. Evaluate and select the golden source: The golden source is the most accurate, reliable, and complete source of data for a particular data set. Evaluate each potential source based on factors such as data quality, timeliness, and coverage, and select the best one(s) to serve as the golden source.

4. Establish data ownership: Identify the data owner(s) responsible for maintaining and managing the selected data sources. Data owners are critical for ensuring the quality and security of the data throughout its lifecycle.

5. Define metadata: Metadata is the information that describes the data, such as data definitions, data types, and relationships between data elements. Documenting metadata is essential for understanding, integrating, and managing the data once it is unlocked from the source. It is also essential to map the business metadata to the technical metadata, coupling each data element to the corresponding data attribute (column) in the actual databases (see the first sketch after this list).

6. Register and classify the data: This involves creating a structured data catalog that organizes and categorizes the collected information, making it easier for users to discover, understand, and access data. It includes metadata documentation, data categorization, data lineage tracking, access control, and version control, all of which contribute to maintaining data quality, reliability, and security. Managed this way, data supports informed decision-making and stays accessible and useful for the organization’s specific needs.

7. Determine DQ requirements: This refers to identifying and defining the criteria and standards your data must meet so that it can be used effectively for decision-making, reporting, and analysis (a declarative example follows this list).

8. Establish data extraction methods: Determine the appropriate methods for extracting data from the source applications. This may involve using APIs, direct database connections, or file-based extraction processes (the last sketch below illustrates all three options).
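To make steps 5 and 6 more tangible, here is a minimal Python sketch of how business metadata could be mapped to technical metadata and registered in a catalog. The DataElement structure and the register function are our own illustration, not the API of any specific catalog product:

    from dataclasses import dataclass

    @dataclass
    class DataElement:
        """Couples a business term (business metadata) to its technical attribute (step 5)."""
        business_term: str                 # e.g. "Customer birth date"
        definition: str                    # business definition of the term
        domain: str                        # data domain the element belongs to
        table: str                         # physical table in the golden source
        column: str                        # physical column (technical metadata)
        data_type: str                     # technical data type
        classification: str = "internal"   # e.g. public / internal / confidential

    # A tiny in-memory "catalog" standing in for a real catalog tool (step 6).
    catalog: dict[str, DataElement] = {}

    def register(element: DataElement) -> None:
        """Register and classify a data element so users can discover it."""
        catalog[f"{element.table}.{element.column}"] = element

    register(DataElement(
        business_term="Customer birth date",
        definition="Date of birth as stated on the customer's ID document",
        domain="Customer",
        table="crm.customers",
        column="birth_dt",
        data_type="DATE",
        classification="confidential",
    ))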
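Step 7 can be captured declaratively in the same spirit. The rule format below is again illustrative; in practice you would express these requirements in whatever DQ tooling your platform provides:

    # Illustrative DQ requirements for the element registered above.
    dq_requirements = {
        "crm.customers.birth_dt": [
            {"rule": "completeness", "threshold": 0.99},        # at most 1% missing
            {"rule": "validity", "check": "birth_dt <= CURRENT_DATE"},
            {"rule": "timeliness", "max_age_days": 1},          # refreshed daily
        ],
    }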
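Finally, for step 8, the extraction method depends on the source. The snippet below sketches the three common options named above, using the requests and sqlalchemy libraries plus a plain CSV file; the URL, connection string, and paths are placeholders:

    import csv
    import requests
    from sqlalchemy import create_engine, text

    # Option 1: API-based extraction (placeholder URL).
    rows = requests.get("https://source.example.com/api/customers").json()

    # Option 2: direct database connection (placeholder DSN).
    engine = create_engine("postgresql://user:pass@crm-host/crm")
    with engine.connect() as conn:
        result = conn.execute(text("SELECT * FROM customers")).fetchall()

    # Option 3: file-based extraction from a delivered extract.
    with open("/landing/crm/customers.csv", newline="") as f:
        records = list(csv.DictReader(f))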

Key players involved in the sourcing process

RACI Matrix for the sourcing process

The sourcing process is the very first step in the data value chain; its purpose is to liberate the data from the source. It involves several key players: the data user, the data owner, data stewards, the engineering team, and the data quality team.

Mapping the Cloud Data Management Capabilities to the sourcing process

To successfully execute the sourcing step in the data management process, it is essential to possess specific data management capabilities. These capabilities are the abilities a data management team or function needs in order to fulfil a specific purpose.

Starting with building capabilities as per the Cloud Data Management Capabilities (CDMC) framework, or any other data management framework, can be beneficial. However, we believe these frameworks are primarily designed to provide control mechanisms to ensure compliance with internal and external policies, and they offer limited enablement capabilities. Enablement capabilities are crucial for automating and speeding up the data management process, unlocking data’s full potential, and promoting a data-driven culture. Let’s look at how the steps in the sourcing process map to the potential capabilities in the CDMC framework:

  • Identifying data needs: Capabilities not present!
  • Identifying potential data sources: Capabilities not present!
  • Evaluating and selecting the golden source: Capabilities not present!
  • Establishing data ownership: Addressed by CDMC 1.0 Governance and Accountability.
  • Defining metadata: Capabilities not present!
  • Registering and classifying data: Addressed by CDMC 2.0 Cataloguing and Classification.
  • Determining DQ requirements: Capabilities not present!
  • Establishing data extraction methods: Capabilities not present!

Enablement capabilities are central to fully leveraging the potential of data and fostering a data-centric culture. Unlike control, enablement focuses on developing capabilities that encourage strategic and efficient use of data to swiftly achieve business goals. This includes helping to find the right data for a specific use case, identifying suitable data sources, suggesting metadata when it’s lacking, proposing Data Quality (DQ) rules when they’re not present, and more.

We maintain that Artificial Intelligence (AI) can be used to both enhance existing capabilities and expedite the data management process.

Potential AI solutions

  • Identifying data needs: Using metadata descriptions of available datasets and previous data use cases, Large Language Models (LLMs) can analyze this information and recommend suitable datasets for the current use case (a sketch follows this list). See our Medium post here
  • Identifying potential data sources: Same as identifying data needs!
  • Evaluating and selecting the golden source: LLMs can help identify golden sources by analyzing metadata and data lineage, determining whether an application is the first storage point for the data, which improves accuracy when onboarding data to platforms or marketplaces.
  • Defining metadata: LLMs can help extract business terms from technical terms and infer the data domain to which the data belongs.
  • Registering and classifying data: Already addressed by CDMC 2.0 Cataloguing and Classification.
  • Determining DQ requirements: From (multi-column) profiling to DQ rules: profiling results can be used to propose candidate rules automatically (see the second sketch below).
  • Establishing data extraction methods: This remains an engineering capability!
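To illustrate the first two bullets, the sketch below asks an LLM to match a use case to datasets described in the catalog. We use the OpenAI Python client purely as an example; any LLM provider would do, and both the model name and the prompt are simplified assumptions:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def recommend_datasets(use_case: str, catalog_entries: list[str]) -> str:
        """Ask an LLM to match a use case to datasets described in the catalog."""
        prompt = (
            f"Use case: {use_case}\n"
            "Available datasets (name: description):\n"
            + "\n".join(catalog_entries)
            + "\nWhich datasets best fit this use case, and why?"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    print(recommend_datasets(
        "Predict customer churn for retail banking",
        ["crm.customers: customer master data incl. demographics",
         "core.transactions: daily account transactions"],
    ))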
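For the DQ bullet, profiling results can seed candidate rules automatically. A minimal pandas-based illustration of the idea:

    import pandas as pd

    def propose_dq_rules(df: pd.DataFrame) -> dict[str, list[str]]:
        """Derive candidate DQ rules per column from simple profiling stats."""
        rules: dict[str, list[str]] = {}
        for col in df.columns:
            proposals = []
            null_ratio = df[col].isna().mean()
            proposals.append(f"completeness >= {1 - null_ratio:.2f}")
            if pd.api.types.is_numeric_dtype(df[col]):
                proposals.append(f"range [{df[col].min()}, {df[col].max()}]")
            if df[col].is_unique:
                proposals.append("uniqueness (candidate key)")
            rules[col] = proposals
        return rules

    sample = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, None]})
    print(propose_dq_rules(sample))

In practice an LLM or a rules engine would refine these candidates, but even simple profiling like this removes much of the manual effort of writing DQ rules from scratch.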

The data sourcing process concludes when the data is transferred to the raw data zone, also known as the landing zone. This zone contains unprocessed copies of data “as-is” from the various source systems. From there, the next sub-process begins: the data preparation process. Most modern data architecture approaches are currently based on what is known as the data lake medallion architecture, and in the following article we will take a deep dive into this step.
