LLMs, Columnar Lineage & Data Residency
--
LLMs are rewriting the playbook for computing problems that used to be challenging or even intractable. In this post, we’ll cover an example that should resonate with data teams everywhere — columnar lineage.
Before getting into columnar lineage, here’s a quick primer on the broader topic of data lineage. Data lineage is essentially the task of tracking the dependency closure (i.e., the set of all dependencies, direct or indirect) of every asset in a data operation. Typical data sources range from snapshots or change feeds from system-of-record stores (for example, the CRM store), telemetry of different types including usage clickstream from apps, and data feeds from third-party providers. Data processing pipelines tend to be highly derivative processes that create layers of derivative assets from such original sources, producing data that is suitably refined, aggregated, and shaped to be exploited for different purposes. The format of these assets differs depending on the type of systems used for such derivative processing, ranging from SQL tables to file formats like Delta and Parquet. Despite recent paradigms like Zero ETL, the reality of derivative pipeline processing is going to be around for a long time, especially in large data operations that ingest data from a multitude of multi-generational systems that provide the bulk of source data.
Data lineage is important for two broad areas of concern:
- Privacy Concerns: Most privacy controls in the data space are based upon the origins of data, which dictate everything from how it must be secured and retained to the purposes it may be put to, and by whom.
- Operational Concerns: Data lineage is pivotal for effective data operations. It enables proactive management of upstream changes impacting downstream derivatives, informs cost control and process efficiency, and aids in designing robust SLAs and data quality strategies. Furthermore, it expedites root-cause analysis of data incidents and helps determine which downstream datasets need updates when upstream issues are resolved.
At its heart, the problem of tracking data lineage at the asset level lends itself well to static analysis techniques. Modern data processing engines (batch or streaming) have declarative APIs or language support that should be sufficient to extract the necessary information to build asset-level lineage graphs. Additionally, these engines also typically emit run-time metadata (e.g., Spark DAGs) that can be analyzed for the same purpose.
For these reasons, tracking data lineage on an asset basis is not necessarily a hard computing problem, although in real life, it presents substantial challenges due to the complex nature of running data pipelines especially in large organizations. This is often the case when lineage spans areas of team ownership & heterogeneous engineering infrastructure. Large data lineages experience constant evolution. These changes occur upstream with the evolution of sources, downstream with changing use cases, and in-between due to operational reasons. Therefore, effective data lineage tracking requires sophisticated tooling support in authoring, deploying, and orchestrating data processing pipelines. With a well-designed data engineering stack that is commonly adopted in the organization, such tracking can be reliably implemented. In IDEAs, for instance, lineage tracking is a first-class concern in the design of our data engineering fabric — Nitro. This allows us to automatically maintain and update full asset lineage for every produced data asset upon each deployment made by IDEAs crews or by any one of our partners who use the same platform.
Columnar Lineage
There are situations where simply knowing the lineage on an asset basis is not enough and we need a more granular understanding of the transformation being applied. A primary example is tracking the processing of personal data — i.e., data that can be associated with an individual, which is subject to considerably more stringent privacy controls and compliance audits. Such data can be transformed in different ways. It can be aggregated to a point where the output can no longer be associated with a specific individual, or it can be used to create derivative datasets where an individual can still be identified. Depending on the case, the controls that apply on the outputs can be significantly different.
Keeping track of this is especially important for meeting compliance obligations such as the General Data Protection Regulation (GDPR), a European Union regulation that governs how organizations handle personal data. Under GDPR, individuals have the right to have their personal data corrected or deleted. To fulfill these requests, organizations need to know all the places where an individual’s data is stored or used, which translates to tracking which columns or attributes in a dataset contain personal data.
By far the most common way to identify individuals in datasets produced in a data operation is by using an identifier that can be linked uniquely to the individual. Unfortunately, identifiers are a universal concept, applicable to just about any kind of entity (like an item in a catalog or a type of license). Identifiers can also be transformed to fit formatting requirements or mapped to other identifiers via key rings. So simply looking at a table with a column of identifiers does not tell you whether the data in the table can be related to an individual. It therefore becomes vital to understand how data is transformed at every processing step from sources that may contain personal data.
Columnar lineage is lineage tracked at the level of individual columns: we track the transformation of each data column within a data processing job, as opposed to simply tracking the input and output assets of the job. Compared to asset-based lineage, tracking columnar lineage is (much) harder for the following reasons:
- There is a lot more to track. Every dataset has many attributes. Columns may be added, removed, or changed more frequently than new assets are introduced, which means there’s simply more change to stay on top of in columnar lineage. The problem is simply bigger.
- Columns in a dataset can be subjected to different transformations. A single operation, such as a join, could affect different columns in different ways. A column can be one of several inputs to another column. This makes tracking columnar lineage more complex unless one can reason about the precise semantics of each step of transformation.
- Most popular batch and streaming compute engines offer a hybrid approach of declarative languages (SQL) or APIs (Spark) combined with procedural languages (like Java or C#) that allow developers to express custom transformation behaviors. This makes inferring the semantics of a transformation at the column level an intractable problem in the general case. This is the same reason the efficacy of static analysis tools for procedural languages is restricted to certain types of problems. By contrast, as mentioned above, most of these systems offer job metadata that is rich enough to identify all the inputs and outputs of a job, which is sufficient to track asset lineage.
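To make the second point concrete, here is a toy column-level lineage map in Python for a hypothetical join; the table and column names are invented for illustration and are not from any real pipeline:

```python
# Hypothetical column-level lineage for the transformation:
#   SELECT u.user_id, u.region, u.minutes * r.rate AS billed
#   FROM usage u JOIN rates r ON u.sku = r.sku
# One statement, three very different per-column behaviors.
COLUMN_LINEAGE = {
    "out.user_id": {"usage.user_id"},               # passed through
    "out.region":  {"usage.region"},                # passed through
    "out.billed":  {"usage.minutes", "rates.rate"}  # derived from two inputs
}

def personal_outputs(lineage, personal_inputs):
    # An output column is tainted if any of its inputs carry personal data.
    return {out for out, ins in lineage.items() if ins & personal_inputs}

flagged = personal_outputs(COLUMN_LINEAGE, {"usage.user_id"})
```

Asset-level lineage would only record that `usage` and `rates` feed `out`; the per-column map is what lets us conclude that only `out.user_id` carries personal data here.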
For these reasons, despite being an area of active academic research, programmatically inferred columnar lineage is not considered a sufficiently trusted source that can be relied on exclusively in real-world settings. The typical practice for handling GDPR obligations is to rely on manual annotation, i.e., the developer declares which columns in the output of a transformation (including identifiers) are associated with personal data, backed by heuristic and ML-based controls to detect the presence of personal data. This situation seemed set to continue, until late last year when OpenAI ushered in a new era of LLM-based computing.
A new dawn
The arrival of LLMs into the mainstream has changed the ground reality when it comes to tackling previously intractable problems. Static analysis for programming languages is yet another area where LLMs are rewriting the books on what is possible in a real-world setting. It turns out that models like GPT4 are more than capable of tracking columnar lineage, including when embedded procedural code is in use. And not only can they make the inferences; with the right prompt, they can also explain why they made them. When applied to the GDPR problem space, this makes for a viable tool, usable both preventatively and as an audit support aid, that supports columnar lineage better than past efforts using traditional static or dynamic analysis.
It can also make a game-changing difference to the way data teams execute under challenging circumstances. In previous blog posts (see here), we talked about the EUDB (EU Data Boundary) challenge, where Microsoft has committed to ensuring that personal data associated with Microsoft customers based in Europe never leaves the boundary. To respect this obligation, a data transformation job (for an approved purpose) needs to be refactored into component jobs that stage the necessary aggregation within the EU data boundary, ensuring that only fully aggregated data is allowed to be copied out. Doing this for production pipelines is expensive but can be absorbed as a one-time migration expense. However, consider what happens if the jobs are written as one-offs for data exploration purposes. A data scientist would have to develop these scripts in one boundary and then refactor them to execute across data boundaries. In these situations, the cost and friction associated with refactoring is considerably more painful. An even more basic problem is simply determining whether a job written for ad-hoc exploration purposes is generating personal data in one or more of its many outputs, and accordingly whether its data can be moved outside the boundary of origin.
These problems were the target of a recent LLM-based prototyping effort. The objective was to leverage GPT4 to programmatically rewrite and refactor a set of production scripts that consumed personal data and emitted aggregated data. The targeted scripts were all authored by our Data Science team for computing macro-output metrics for business reporting, a core use case that is critical for decision making in a company like Microsoft. The refactoring was intended to translate each of these scripts from its pre-EUDB configuration into a set of three components that can be used together in a distributed pipeline that coordinates their execution within and across the EU DB (in fact, not just the EU DB, but any number of future data boundaries where outward movement of personal data may be restricted), while also ensuring that the constraints are respected. The three components that had to be extracted from the original script were:
- A (parameterized) script that is meant to execute within each data boundary which handles the processing steps involving personal grain data up to its aggregation (such as a COUNT) that results in a data shape with a non-personal grain, which is written out as intermediate outputs. If the aggregation involved non-distributive operators like DISTINCT, then the code would have to be rewritten to use appropriate sketches (for example HLLs in place of a distinct count).
- Schema and metadata of the intermediate outputs, sufficient both to safely copy these assets out of each sovereign data boundary and to correctly merge the intermediate outputs from each data boundary. This will differ based on how the aggregation in #1 is accomplished for the intermediate outputs.
- A script that is meant to consume the merged intermediate outputs from each data boundary and any other non-personal data sources from the original script, to produce the final set of outputs.
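For a distributive aggregation like COUNT, the split into components #1 and #3 can be sketched in a few lines of Python; the table shapes and names here are illustrative, not from the actual prototype:

```python
from collections import Counter

# Component #1, run inside each boundary: aggregate personal-grain rows
# (user_id, license) down to a non-personal grain (row count per license).
def boundary_aggregate(rows):
    return Counter(lic for _user_id, lic in rows)

# Component #3, run after the intermediates are copied out and merged:
# COUNT is distributive, so merging is just summation per group.
def merge(intermediates):
    total = Counter()
    for counts in intermediates:
        total += counts
    return dict(total)

eu = [("u1", "E5"), ("u2", "E5"), ("u3", "E3")]
us = [("u4", "E5"), ("u5", "E3")]
merge([boundary_aggregate(eu), boundary_aggregate(us)])
# -> {"E5": 3, "E3": 2}
```

The intermediate (a count per license) no longer has a personal grain, which is what makes it safe to move across the boundary; component #2 in this case would simply record the group-by keys and that the merge operation is a sum.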
Interestingly, the selected collection of scripts was written in SCOPE (Structured Computations Optimized for Parallel Execution), a scripting language developed by Microsoft for handling large-scale data analysis. SCOPE was originally designed as a hybrid of SQL and C#. It is specifically designed to manage and analyze data stored in Cosmos, an exabyte-scale data storage and analysis system used internally by Microsoft that has been in existence for over two decades and continues to evolve and innovate. Imagine a combination of Hive + Hadoop but with a whole bunch of extras. For many reasons that we will not get into here, this platform service is used internally by Microsoft teams and is not available to the public, although, as can be expected, its technology continues to mesh with that of external offerings and more popular open-source analogs. For this discussion, however, the most pertinent takeaway is that GPT does not natively understand SCOPE as it does other popular languages (such as Python, C# & SQL). An educated guess would place GPT’s understanding as limited to resources in the public domain, which includes papers in computing journals over the past two decades. Here is a recently published VLDB paper that also gives a nice overview of the evolution of Cosmos: [Link]
Under normal circumstances, attempting to update a static analysis tool to support a new language would have been a significant undertaking. Fortunately, with LLMs like GPT4, this is just a minor speed bump. With prompts, we simply need to pass in context in much the same way we would explain things to a developer who is new to Microsoft, who happens to have an excellent understanding of systems like Hive & Spark, declarative languages like SQL and procedural languages like C# or Java. Call out the pertinent differences and edge cases, include a few pertinent examples and the base model handles the rest.
A Recipe built to leverage GPT4
The prototype followed these steps:
- Extract the catalog metadata of the personal data attributes on all the input data sets of a target script. These scripts consume data from production assets, and all our assets have metadata around data taxonomy classification (manually annotated). Specifically, the metadata identifies which of the columns on the input data sets have the personal data classification, including identifiers. Such identifiers are referred to in our taxonomy as EUPI (End User Pseudonymous Identifier). For the purposes of this prototype, we assume that this metadata can be trusted to be correct and complete.
- Data transformation in SCOPE is done through a series of row set transformation steps. So, in the next step, the following information is prepared as context for the prompt:
– Relevant details around the nuances of the SCOPE language including the basics of how row set transformation is done including the type of statements.
– Details around data classification including the use of EUPI identifiers & how they could potentially be modified in the script.
– Details around how C# code is embedded in a script.
– Finally, the body of the script itself as context for the prompt.
- With the above context, we design a prompt instructing the model to infer, for each row set transformation in the body of the script:
– The schema
– The subset of attributes which contain personal data or identifiers. This could be empty.
– The source row sets. There can be more than one (for example, if there is a Join)
– The actual statements associated with the transformation (which becomes useful as seen below). This is usually one statement, but SCOPE allows for the use of conditional pre-processing directives which are evaluated before the execution plan is generated. So, we can have more than one implementation of the transformation step in the script.
- Once the GPT response is processed, the rest is straightforward. The output from GPT allows us to construct a directed graph of all the row set transformation steps and the associated schema and personal data produced by each step.
- This directed graph can be analyzed to identify those statements that reduce personal data to non-personal data. These transformations become candidates for how the original script can be broken up into #1 and #3 (& the basis of producing intermediate outputs for #2). For brevity, let’s call them target row sets.
- Once the set of target row sets are identified we compute two closures. Closure in this context simply means the complete set of direct & indirect dependencies:
– The closure of all upstream transformation steps in the DAG that led to the target row sets. Refer to this as the target closure. The target closure is effectively the basis of the script component that runs within each data boundary (i.e., the #1 output)
– The closure of all the output steps from the original job that includes the target row sets, but not their respective closures. Refer to this as the output closure. The output closure effectively becomes the basis of the script that runs on the merged output of the above (i.e., the #3 output).
- For each target row set, determine the specific attributes that are the result of the aggregation, and whether the aggregation involves a non-distributive operator like DISTINCT COUNT. If so, rewrite the statement to use sketches instead. This is again handled with a simple prompt. The information is encoded into component #2.
- Enumerate the statements in the closure sets to build the two scripts #1 and #3. When writing out target row sets, replace them with their updated version (using sketches) if needed.
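The graph and closure computations above can be sketched in Python. The step names, the personal-data flags, and the target-selection rule below are simplified assumptions rather than the prototype’s actual logic:

```python
# Simplified model: each row set step records its source row sets and
# whether it was inferred to still carry personal data.
STEPS = {
    "raw":    {"sources": [], "personal": True},
    "joined": {"sources": ["raw"], "personal": True},
    "agg":    {"sources": ["joined"], "personal": False},  # aggregation step
    "final":  {"sources": ["agg"], "personal": False},     # script output
}

def upstream_closure(step, steps):
    # All direct and indirect dependencies of a step.
    seen, stack = set(), [step]
    while stack:
        for src in steps[stack.pop()]["sources"]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

# Target row sets: steps that reduce personal data to a non-personal grain.
targets = {name for name, meta in STEPS.items()
           if not meta["personal"]
           and any(STEPS[src]["personal"] for src in meta["sources"])}

target = next(iter(targets))
target_closure = upstream_closure(target, STEPS) | {target}   # basis of #1
# Output closure: everything downstream, including the target row set
# itself, which #3 reads back as a merged intermediate.
output_closure = set(STEPS) - upstream_closure(target, STEPS)
```

Note that the target row set appears in both closures: component #1 produces it as an intermediate output, and component #3 consumes it again after the per-boundary intermediates are merged.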
The recipe is illustrated with a simple example below where the original script is computing the distinct count of users on different license types:
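A minimal Python sketch of that example follows, using a plain set as a stand-in for an HLL sketch. (Unlike a set, a real HLL does not retain the identifiers themselves, which is what makes the intermediate safe to move across a boundary; the set is used here only to make the merge logic concrete.)

```python
# Original script computes COUNT(DISTINCT user_id) per license type.
# DISTINCT COUNT is not distributive, so component #1 emits a mergeable
# sketch per group instead of a final count.

def boundary_sketches(rows):
    # Component #1: per-boundary, per-license sketch of user ids.
    out = {}
    for user_id, lic in rows:
        out.setdefault(lic, set()).add(user_id)
    return out

def merge_sketches(per_boundary):
    # Component #3: union the sketches from each boundary, then take the
    # cardinality of the merged sketch per license.
    merged = {}
    for sketches in per_boundary:
        for lic, sk in sketches.items():
            merged.setdefault(lic, set()).update(sk)
    return {lic: len(sk) for lic, sk in merged.items()}

eu = [("u1", "E5"), ("u2", "E5"), ("u1", "E3")]
us = [("u2", "E5"), ("u3", "E3")]
merge_sketches([boundary_sketches(eu), boundary_sketches(us)])
# -> {"E5": 2, "E3": 2}
```

Note that naively summing per-boundary distinct counts would double-count user "u2", which is exactly why the rewrite to sketches is required.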
As mentioned in previous posts on this blog, prompt engineering with LLMs, while having high potential, is not a silver bullet. Below are some of the problem areas that need further hardening and are left for future exploration:
- The body of the script can be much bigger than the max token limit for GPT4. The workaround is conceptually simple: chunk up the script and feed the chunks in sequence, with context from the previous chunks included in the prompt for the current chunk. This works reasonably well. It does, however, require accounting for a couple of details to ensure that each chunk is self-contained enough for GPT to draw the right inferences. For example, any embedded procedural code relevant to a chunk of script may not be co-located with it and will therefore need to be explicitly attached to the chunk.
- References to procedural code in libraries are also problematic in the same vein as above.
- Beyond large scripts, we can have single statements which can exceed the max token limit. This typically happens in UNION operations which merge row sets. The workaround here is fortunately relatively simple — a single UNION statement can be refactored into multiple statements without changing the logic or introducing a performance regression.
- A script can have more than one target row set. It is also possible that a target row set is in the target closure of another target row set. Such cases cannot be addressed by breaking up a script into a 3-part pipeline described above. It requires multiple stages of intermediate outputs and data movement of aggregated data from and to data boundaries. Such cases were identified, but not handled by the prototype, although there is no technical reason why it cannot be remedied in a future release.
- Conditional pre-processing directives can be tricky. The few cases where the inference was incomplete were on account of the use of such directives. This just goes to underscore that leveraging LLMs is not foolproof, and the implementation must be hedged with safeguards. Incidentally, simple invariant checks, such as verifying that every statement was accounted for in one of the component scripts (#1, #3), were quite effective at detecting anomalies.
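One such statement-accounting invariant can be sketched as a simple partition check; the function and statement names are illustrative, not the prototype’s actual implementation:

```python
# Invariant: every statement of the original script must appear in exactly
# one component script. Target row set statements are exempt, since they are
# produced by #1 and consumed again by #3.
def check_partition(original, script1, script3, targets):
    anomalies = []
    for stmt in original:
        if stmt in targets:
            continue
        in1, in3 = stmt in script1, stmt in script3
        if in1 == in3:  # dropped from both, or duplicated into both
            anomalies.append(stmt)
    return anomalies

# A dropped statement ("s3") is flagged as an anomaly.
check_partition(["s1", "s2", "s3", "s4"], ["s1", "s2"], ["s4"], set())
# -> ["s3"]
```

Cheap checks like this are what turn an LLM-based rewriter from a best-effort tool into something that fails loudly when an inference is incomplete.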
Conclusion
LLMs have significantly improved our ability to tackle hard static analysis problems like extracting columnar lineage. By integrating prompt engineering with more traditional approaches, it is now viable to get good results in real-world settings with just the base models. To reinforce this with a data point: the prototype, which successfully refactored almost the entire set of target scripts (over 50), took less than 2,000 lines of code. As of this writing, the prototype is queued up for implementation as a feature on our self-serve data preparation service, Pharos. This should come online even as teams across Microsoft race to meet our commitments around the EU Data Boundary and transition into an operating environment that would significantly stress everyday productivity without the type of capabilities that GPT unlocks.
On a related note, I wanted to specifically acknowledge our friends from the Microsoft Prague Development Center, who joined us over a year ago as we were racing to upgrade our tooling for EUDB compliance. This team folded seamlessly into our execution and swept up some of the gnarliest service redesign challenges associated with getting Pharos ready for cross-boundary orchestration. With thousands of production assets that underpin numerous business-critical data workloads, Pharos provides a simple UX-driven interface for users to shape and push data so that it can be used in their analysis tool of choice in a compliant way. It was imperative that this continued to work for end users as we bifurcated and homed assets into their respective data boundaries. The crew hit it out of the park and added a whole bunch of related improvements along the way. This was exemplary cross-organizational execution from Tomasz Szostak, Louis Mackenzie-Smith, Dana Ibraimova, Evren Ergen & Mikita Morau, and their efforts will not only help us meet our compliance obligations but also positively impact the productivity of data consumers throughout our ecosystem.
Until next time.