Improving LLMs: ETL to “ECL” (Extract-Contextualize-Load)

Chia Jeng Yang
WhyHow.AI
Mar 16, 2024


One huge implication of LLMs transforming data processing into semantic processing is the transformation of Extract-Transform-Load (ETL) data processes to Extract-Contextualize-Load (ECL) semantic processes.

In the new age of LLMs, we’re extracting unstructured data from various sources, contextualizing it by extracting semantically meaningful and contextually relevant data from the raw document, then loading it into a structured knowledge graph, which we believe is the best way to store semantically contextualized data. RAG pipelines and AI apps then retrieve data so that they can give their LLMs the most relevant data.
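The ECL flow described above can be sketched end to end. This is a minimal, illustrative pipeline, not a real implementation: the contextualize step stands in for an LLM call with a naive "X is Y" rule so the example stays self-contained, and the knowledge graph is a plain adjacency dict.

```python
# Minimal Extract-Contextualize-Load (ECL) sketch.
# The triple extraction is a toy stand-in for an LLM call.

def extract(source: str) -> list[str]:
    """Extract: split a raw document into chunks (here, sentences)."""
    return [s.strip() for s in source.split(".") if s.strip()]

def contextualize(chunks: list[str]) -> list[tuple[str, str, str]]:
    """Contextualize: turn chunks into (subject, relation, object) triples.
    A real pipeline would use an LLM here; a naive 'X is Y' rule stands in."""
    triples = []
    for chunk in chunks:
        if " is " in chunk:
            subj, obj = chunk.split(" is ", 1)
            triples.append((subj.strip(), "is", obj.strip()))
    return triples

def load(triples: list[tuple[str, str, str]]) -> dict[str, list[tuple[str, str]]]:
    """Load: store the triples as a simple adjacency-list knowledge graph."""
    graph: dict[str, list[tuple[str, str]]] = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
    return graph

doc = "A knowledge graph is a structured store of entities. RAG is retrieval-augmented generation."
graph = load(contextualize(extract(doc)))
print(graph)
```

A RAG pipeline would then query this graph at retrieval time, rather than handing raw chunks to the LLM.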

In the traditional sense, adding context means layering extra meaning onto words at different points in time. Context is not added all at once, nor arbitrarily. Contextualization is the process of adding the right type and amount of information at the right moment, so as to steer a conversation in a specific direction and to a specific depth.

For decades, ETL and ELT tools have helped developers move data from disparate data sources and mismatched schemas into a single place where data can be analyzed, shared, explored, etc. In ETL processes, developers extract structured data from various sources with many different schemas, transform it by cleaning, changing structure, etc., and then load it into a data warehouse. They can then pull that data into their active application data stores, explore it with BI tools, etc.

The age of accessible LLMs is ushering in a new set of requirements and workflows for data engineering. Here are some of the differences between then and now:

  • LLMs are now the form factor for data interaction, using natural language to query unstructured natural-language documents.
  • Data is unstructured — we are no longer working with well-defined tables, so LLMs need structure imposed on raw prose to navigate it meaningfully.
  • There is far more data in play now that unstructured sources count, and this under-tapped data is qualitative and mostly textual.

This is not to say that ETL/ELT and data analytics are any less important, but given the changes in the type of data, the amount of data, and what we need from our data, it’s time we rethink processes like these. Indeed, words are likely the data that teaches an LLM how to reason about the quantitative data we are more accustomed to manipulating. Refining the underlying meaning of the words we use to communicate is precisely the process of adding context.

Information retrieval is just the first step to enterprise adoption of LLMs. As more complex LLM processes mature in the future, like human-to-agent interactions, or agent-to-agent interactions, the underlying contextual logic and infrastructure will need to be built upon.

What does “context” mean in the context of LLM systems?

Improved Retrieval Accuracy: Enhancing the retrieval mechanism to more accurately identify and fetch relevant pieces of information from the knowledge base. This involves optimizing the query mechanism to better understand the context of the user’s request and match it with the most appropriate data.

Incorporation of Expert Knowledge as deterministic rules: Leveraging feedback from domain experts to define or refine rules and heuristics for more effective chunk and information retrieval and understanding.

Memory & Personalization: Incorporating memory and personalization into contextual optimization involves leveraging user-specific data and interactions to refine and adapt information retrieval and generation, ensuring responses are not only contextually accurate but also tailored to individual user preferences and conversational history.
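The "expert knowledge as deterministic rules" idea above can be made concrete with a small sketch: hypothetical heuristics that re-rank and filter retrieved chunks before they reach the LLM. The rule choices, source labels, and score boost are illustrative assumptions, not a real API.

```python
# Sketch: deterministic expert rules applied on top of vector-store scores.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    score: float  # similarity score from the vector store

def apply_expert_rules(chunks: list[Chunk], query: str) -> list[Chunk]:
    ranked = []
    for c in chunks:
        score = c.score
        # Rule 1 (hypothetical): a domain expert trusts filings over blog posts.
        if c.source == "sec_filing":
            score += 0.2
        # Rule 2 (hypothetical): drop chunks that never mention a query term.
        if not any(term.lower() in c.text.lower() for term in query.split()):
            continue
        ranked.append(Chunk(c.text, c.source, score))
    return sorted(ranked, key=lambda c: c.score, reverse=True)

chunks = [
    Chunk("Revenue grew 12% year over year.", "sec_filing", 0.70),
    Chunk("Our blog on revenue trends.", "blog", 0.75),
    Chunk("Office party photos.", "blog", 0.60),
]
top = apply_expert_rules(chunks, "revenue growth")
print([c.source for c in top])  # the filing now outranks the blog post
```

The point is that these rules are deterministic and auditable: an expert can read, adjust, and verify them, unlike a similarity score alone.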

Enter ECL — Extract, Contextualize, and Load.

ETL -> Extract-Contextualize-Load (ECL)

Two emergent frameworks are taking shape within ECL, and they map neatly onto the ETL and ELT variants. An example of an ECL process is the use of document hierarchies: data is extracted from the underlying documents and chunked, contextualized into a hierarchical layer of metadata, and loaded into a knowledge graph. The knowledge graph serves as a semantic layer for more accurate information retrieval before any information is ever loaded into the LLM.
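A document hierarchy can be sketched as a nested structure that retrieval walks top-down instead of scanning a flat chunk list. The document name, section names, and field layout below are illustrative assumptions.

```python
# Sketch: chunks contextualized with a document -> section hierarchy.
hierarchy = {
    "10-K_2023": {                       # document node (hypothetical filing)
        "Risk Factors": [                # section node
            {"id": "c1", "text": "Supply chain disruptions may..."},
            {"id": "c2", "text": "Currency fluctuations could..."},
        ],
        "Financials": [
            {"id": "c3", "text": "Revenue grew 12% year over year."},
        ],
    }
}

def retrieve(path: list[str]) -> list[str]:
    """Walk the hierarchy top-down rather than scanning every chunk."""
    node = hierarchy
    for key in path:
        node = node[key]
    return [chunk["text"] for chunk in node]

print(retrieve(["10-K_2023", "Financials"]))
```

Narrowing retrieval to a section first keeps irrelevant chunks (e.g. risk disclosures) out of a financial query entirely.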

ELT -> Extract-Load-Contextualize (ELC)

Similar to how there is an ELT variant of ETL, there is also an ELC variant of ECL. The ELC process covers some of the more innovative work going on with just-in-time and iterative knowledge graphs. The main use case here is recursive retrieval, in which core concepts and ideas are fixed within a knowledge graph and information from various pages and documents is iteratively fed into the graph over time.

In the ELC process, the knowledge graphs have no predefined structure; instead, they are built on a just-in-time basis, with a schema contextualized to the specific question at hand. This is similar in principle to the ELT process, whereby transformation and schematization occur after raw data has been loaded into a data warehouse.
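The ELC pattern can be sketched as follows: raw chunks are loaded first, and a question-specific graph is contextualized on demand. The entity-detection heuristic (capitalized words) is a deliberately naive stand-in for an LLM call, and the chunk texts are invented for illustration.

```python
# Sketch: Extract-Load-Contextualize with a just-in-time, question-scoped graph.
raw_store = [  # Load: unprocessed chunks land here first, schema-free
    "Acme acquired Widgets Inc in 2021.",
    "Acme's CEO is Jane Doe.",
    "Widgets Inc makes industrial sensors.",
]

def contextualize_for(question: str) -> dict[str, list[str]]:
    """Build a graph scoped to the question's entities, not a global schema.
    Capitalized-word entity detection is a toy stand-in for an LLM."""
    entities = [w for w in question.replace("?", "").split() if w[0].isupper()]
    graph: dict[str, list[str]] = {e: [] for e in entities}
    for chunk in raw_store:
        for entity in entities:
            if entity in chunk:
                graph[entity].append(chunk)  # attach supporting evidence
    return graph

g = contextualize_for("What did Acme acquire?")
print(g["Acme"])
```

Each new question can produce a differently shaped graph over the same raw store, which is what distinguishes ELC from the fixed-schema ECL flow.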

A visualization of building structured knowledge representations on top of an unstructured knowledge base

What does the history of ETL tell us about the future of ECL?

Let’s review a brief history of ETL.

1. Early Development (1970s-1980s)

  • Businesses began to recognize the value of leveraging data across different systems, but they lacked the technology to do this at scale.
  • There was heavy reliance on manual data extraction from various sources, and data was cleaned using batch scripts. This was all very time-consuming, error-prone, and difficult to scale.

2. Maturation & Tool Development (1990s)

  • Businesses began to leverage data warehousing solutions for storing historical data and performing analysis. This necessitated more sophisticated ETL tools.
  • ETL tools offered GUIs for building workflows and reducing complexity, and processes became more automated and reliable as features for data transformation, error handling, and logging became standard.

3. Expansion & Integration (2000s)

  • ETL tools began to support real-time data processing.
  • Tools expanded to support more sophisticated capabilities like metadata management, data quality, data reliability improvements, etc.
  • ETL tools also integrated more closely with data management and analytics platforms, streamlining the data lifecycle management process.

4. Big Data & Cloud Computing (2010s-Present)

  • ETL tools are scaled to support larger amounts of data and work with distributed computing frameworks like Spark and Hadoop.
  • Cloud providers and various data platform providers now offer fully-managed ETL platforms like AWS Glue, Google Cloud Dataflow, and Azure Data Factory.
  • Given the accessibility and cost effectiveness of cloud-based data warehousing solutions, ELT processes have become much more standard.

If we look at how ETL and its tooling have evolved over time, we can draw parallels with where ECL might go. Today, we are likely between Stages 1 and 2 of ECL tooling: the tools for developing contextual frameworks (i.e., knowledge graphs) have historically been extremely manual and cumbersome, and emerging technologies like LLMs can now streamline and accelerate knowledge graph creation.

Comparing the roadmap between ETL and ECL, we can speculate that we will likely see the following:

Stage 2 — Maturation & Tool Development:

  • The need for standardized features for data transformation, error handling and logging will see the rise of LLM-assisted, self-correcting knowledge graphs of underlying unstructured data

Stage 3 — Expansion & Integration:

  • We can expect to see real-time context optimization for agent actions, enabling agents to act with the most relevant context at all times. We can also see deeper integrations between context optimization workflows and existing real-time data movement applications.

Stage 4 — Distributed Agent to Agent Context Exchange:

  • As different domain-specific agents begin to interact with each other, managing context across multiple agents will require precise just-in-time context injection.

The transformation of ETL into ECL, powered by advancements in LLMs, marks a pivotal shift in data processing from a focus on structured data integration to engaging with unstructured data through semantic understanding.

We are on the cusp of realizing the full potential of Extract-Contextualize-Load processes, where the depth of data contextualization opens new frontiers for knowledge discovery and interaction. This excitement is underpinned by the prospects of streamlined knowledge graph creation, real-time contextual optimization, and personalized data interactions, promising a future where data is not just processed but conversed with, enhancing decision-making and innovation across industries.

WhyHow.AI is building tools to help developers bring more determinism and control to their RAG pipelines using graph structures. If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG, we’d love to chat at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism and knowledge graphs in RAG on our newly-created Discord.
