Diary of an Architect Series — (2) Data Strategies for Gen. AI era

Filip Rais
Sep 18, 2023


Generated by Midjourney (abstract depiction of data patterns)

This article is re-published from the same article series on LinkedIn.

Once upon a time, there was a lot of data in the kingdom. That data was used in all kinds of different ways until the new data seer ascended to serve the people.

The seer was wise and seemingly knew about a great many things. People were grateful for such a strong entity to help them, but they also started to be unsure about the use of their data. They knew data was important because they cared about the data very much and they knew it had a value greater than gold.

Over time, the people of the kingdom noticed that the new seer did not know about events that had happened in recent days (past September 2021 :-)). Also, the seer, as wise as he was, did not know much about their kingdom. It was intriguing, and so they asked him about it.

The seer responded, “If there are tales or truths I’m unaware of, present me with documents, and I shall read them.”

So the people gave him documents, but the seer always forgot everything they gave him the previous day. And that wasn’t all: the seer got exhausted after reading just a few pages of documents, yet there were so many more documents that needed reading.

Starting over every day and tiring quickly was no way to absorb the kingdom’s vast knowledge. The kingdom pondered, what could they possibly do next?

Chapter 2: LLM Data Strategies — Key Concepts

The short story above was an unusual start to this chapter, but I feel like such a simple story captures a key essence of our present situation. Let’s break it down more in terms of concepts that are mentioned in the story.

We have a few concepts to consider when it comes to data enablement for LLMs:

  1. Specific vs. General knowledge
  2. Context
  3. Memory

All three above concepts play a significant role in understanding and ultimately formulating working strategies.

Specific vs. General Knowledge

Current LLMs do have a considerable body of knowledge, which makes using the models quite magical for many tasks that rely on general, easily inferred context, such as language translation or education scenarios (e.g., explaining a well-known mathematical concept) and other general factual areas.

Translation from one language to another is a general task whose guidelines and grammar rules are well-defined and easy to infer. In other words, the model can infer much more context than what is given in the prompt.

That is quite helpful, but in tasks that require specific context or knowledge, there is a sudden drop in usability and accuracy when a significant body of information cannot be inferred from general knowledge. A simple email that needs to take previous conversations into account becomes difficult to get right without additional specific information. And that is just the tip of the iceberg. So, clearly, general knowledge is not enough.

A popular idea is that the solution is to train the model with this specific information. I will get back to this point later in the article when I talk about available patterns, but for now, let’s just say that training the model is only desirable in specific scenarios, and particularly training the model with our own personal interactions is probably not a good idea.

Those interactions are too dynamic (changing quickly) and may include a lot of inaccuracies or even wrong assumptions and information. It is the same for much of our structured and unstructured data. Direct training of the model with this type of data would probably cause other issues in the way the model responds and what biases it may introduce.

For those reasons, other concepts are available to help with this situation. Let’s move to the second concept which is LLM Context.

LLM Context

This part is referenced in the story as the elegant solution of getting the seer to catch up by reading recent documents, but it also touches on the limitations by mentioning the “exhaustion” he experiences during the reading. He can seemingly only process a limited amount of information. Processing current documents represents a very powerful way to get up to speed, but it has its limits. That is quite a fitting metaphor in relation to LLMs.

I find the context discussion sort of forgotten in the mainstream of the LLM discourse today; however, it may be the most influential concept for data strategies related to LLMs.

The influence of this concept comes from two parts:

  1. Context size of current LLMs is the single most limiting element for data handling by the model
  2. The potency and flexibility that it brings to influence the model results

LLMs today have a context size limit measured in tokens. Tokens are usually parts of words or full words, averaging roughly 4 characters each.

This token limit influences how much information the model can process and generate for each prompt-response interaction. It should be noted that the response will also consume some of the context limit.

A model with a 4,000-token context limit (e.g. GPT-3.5) will be able to process approximately 3,000 words (~5–6 pages of text)
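
To make that budgeting concrete, here is a minimal sketch of checking whether a piece of text fits within a model's context limit. It assumes the tiktoken tokenizer; the input file is a hypothetical placeholder.

```python
# Minimal sketch: estimate whether a document fits a model's context window.
# The 4-characters-per-token figure above is only a rough rule of thumb, so a
# real tokenizer (here tiktoken) gives a more reliable count.
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

document = open("kingdom_report.txt").read()  # hypothetical document
used = count_tokens(document)
context_limit = 4000           # e.g. GPT-3.5
reserved_for_response = 500    # leave room for the model's answer

if used > context_limit - reserved_for_response:
    print(f"Document uses {used} tokens and will not fit; it needs chunking.")
```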

The reason LLM context is not a widely known property is that it only starts to matter in intermediate or advanced scenarios, which only a small percentage of users encounter at the moment.

For our discussion here, it will be the most significant concept we will rely on and leverage while working around its limits to form relevant patterns.

Memory

The accompanying concept which naturally extends the LLM context is memory. Memory helps to retain longer contexts, but it also takes space in the context limit.

Short-term LLM memory spans a single conversation. A barebones LLM is focused on a single prompt-response exchange, so it is not automatic that the LLM keeps previous prompts and responses as part of each individual interaction. Such behavior is application logic, aka “memory”, that is specifically built into applications such as ChatGPT, Bing Chat, and other tools.

In other words, memory stores all user queries and responses, and automatically includes them in all subsequent LLM interactions.

Memory is a common application concept that LLM-focused frameworks (e.g. LangChain) provide as a native capability
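
As an illustration, here is a minimal sketch of application-side short-term memory; `call_llm` is a hypothetical stand-in for whichever chat-completion API is used.

```python
# Minimal sketch of short-term memory: every user turn and model response is
# appended to a running message list and re-sent with each new prompt.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    reply = call_llm(messages=history)          # hypothetical LLM call
    history.append({"role": "assistant", "content": reply})
    return reply
```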

A fairly uncommon concept is long-term LLM memory. Such memory represents context retained over multiple conversations. Imagine a situation where the context of one or more previous conversations is needed to inform the current session. A good example is when an analysis result reached in the last conversation is needed as part of a follow-up.

I believe long-term memory, especially, is a very potent strategy that will bring contextual and interaction-aware information to otherwise more fact-focused data retrieval strategies, particularly for scenarios where it is important to know “how” the data was used and what actions were taken with it.
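
A hedged sketch of the long-term variant could look like the following, with a hypothetical `summarize` helper persisting key results between sessions.

```python
# Sketch of long-term memory: at the end of a conversation the application
# stores a short summary (or key results), and the next session loads it back
# into the context. `summarize` and the file name are illustrative assumptions.
import json
import os

MEMORY_FILE = "long_term_memory.json"

def save_session(history: list) -> None:
    summary = summarize(history)                 # e.g. an LLM-generated recap
    with open(MEMORY_FILE, "w") as f:
        json.dump({"summary": summary}, f)

def start_session() -> list:
    context = []
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            memory = json.load(f)
        context.append({"role": "system",
                        "content": f"Earlier findings: {memory['summary']}"})
    return context
```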

Chapter 2: LLM Data Patterns in Practice

Now let’s jump to the meat of this chapter — the practical examples, patterns, and strategies.

It is important to mention that this area in particular is evolving and will need revisions as the industry progresses.

A quick disclaimer. We are not trying to describe interaction patterns with the LLM (like prompting and prompt engineering methods); we will focus specifically on patterns that enable external data for the LLM. Keeping it separate is a good way to give both areas appropriate space.

Core Data Augmentation Techniques

  1. Prompt & Context based techniques
  2. Training and Fine-tuning techniques

I will focus on both #1 and #2 in this chapter, but, conscious of the length of the article, I’ll formulate only basic principles and considerations for #2 and keep the Training & Fine-Tuning topic for its own article in the series.

In terms of practical usability, #1 techniques are going to fit a lot of immediate scenarios and probably be more realistically viable short term.

Prompt & Context Patterns

All patterns in this group rely on LLM Context as the main data augmentation strategy, often in combination with retrieval augmentation methods.

These methods can reach significant levels of complexity, especially with the growing amount of data that needs to be processed, but the core principle remains the same: bring all specific and task-relevant data into the prompt context. The complexity is therefore only in the way data are selected and processed before reaching the LLM prompt.

In other words, all these patterns inject data directly into the prompt and do it in more or less complex ways.

Sometimes people refer to techniques like zero-shot or few-shot learning; however, zero-shot and few-shot are concepts describing how the model processes or interacts with the data. In contrast, we will focus solely on how the data are retrieved and integrated into the prompt.

It looks like a level of separation between data enablement/augmentation and model data interaction will be beneficial, given how massive both of these areas are. Considerations for concepts like prompt engineering or chain-of-thought strategies will definitely be a great topic for another chapter of the series.

Patterns:

  1. No Retrieval
  2. Retrieval Augmentation (RAG — Search and Embedding)
  3. Query & Code Processing Retrieval (CodeRAG)

Reader note: I am famously bad at naming things :-). Please leave a comment to suggest options and ideas on how to name the patterns above.

No Retrieval pattern

Fig. (1): No Retrieval pattern interaction diagram

This pattern is included to complete the list and to represent the starting point which will be familiar to all users.

A typical representative here is “simple” prompting, which relies only on information provided directly in the prompt. In other words, the user has to prepare and provide all the necessary information as part of the prompt text. It will often be paired with short-term memory so that the data augmentation does not need to be repeated for every prompt.

This approach is easy to start with through ChatGPT or many other chat experiences out there. The beauty here is that it can yield the same results as any other more complex solution, as long as the user is able to prepare the data correctly. This can be effectively leveraged for prototyping or one-off interactions.
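
For illustration, a minimal sketch of the pattern could look like this, with the email thread file and `call_llm` as hypothetical placeholders.

```python
# Minimal sketch of the No Retrieval pattern: the user (or application) pastes
# all task-relevant data directly into the prompt text.
previous_thread = open("email_thread.txt").read()   # prepared by the user

prompt = f"""You are drafting a reply email.
Previous conversation:
{previous_thread}

Write a short, polite reply confirming the proposed meeting date."""

reply = call_llm(prompt)   # hypothetical single prompt-response call
```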

However, preparing the data becomes a major hurdle and is unsustainable for users to manage on their own before every request. It would significantly slow them down, and that leads us to the next pattern.

Retrieval Augmentation pattern

Fig. (2): Retrieval Augmentation pattern interaction diagram

Retrieval augmentation suggests a specific process used to retrieve the data during the user interaction. That typically requires low response times and good retrieval performance to make the strategy viable in direct user interaction.

There are several variants of retrieval augmentation that share the same basic principle: outsourcing the relevant information search to an external system. Core to these strategies is also the ability to retrieve search results based on a natural language search query.

Search-augmentation strategy

This strategy relies on the integration of existing search systems like Azure Cognitive Search, Elastic, AWS Kendra, and other toolsets out there.

These solutions represent ready-made search strategies and in many cases combine several document search techniques.

In this case, an application relies on the LLM being able to formulate a natural language search query for the search system, which returns either a specific part of the content or full documents as a result.
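
A hedged sketch of that flow follows, with `call_llm` and `search_client` as hypothetical stand-ins for the LLM call and the external search system.

```python
# Sketch of the search-augmentation flow: the LLM turns the user's question
# into a search query, an external search system (Azure Cognitive Search,
# Elastic, etc.) returns matching snippets, and the final prompt is grounded
# in those snippets.
def answer_with_search(question: str) -> str:
    query = call_llm(f"Formulate a short search query for: {question}")
    results = search_client.search(query, top=3)      # hypothetical search call
    snippets = "\n".join(doc["content"] for doc in results)
    grounded_prompt = (f"Answer the question using only the sources below.\n"
                       f"Sources:\n{snippets}\n\nQuestion: {question}")
    return call_llm(grounded_prompt)
```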

A significant benefit of this approach is built-in advanced security that is typically able to filter data based on user identity. That comes in handy in large and complex ecosystems with potentially sensitive data.

There is a direct reference to this approach from Microsoft in their Azure Search OpenAI demo. I highly recommend exploring Microsoft’s demo and solution to gain a deeper understanding of the approach in general.

Fig. (3): Search Augmentation with Azure Cognitive Search — Credit: Microsoft

Vector embedding augmentation strategy

This approach focuses on representing the external data in a vector-encoded format, which can be natively connected with LLM-generated requests in natural language.

This approach can work in various scenarios and provide a significant level of control over the data, encoding, and retrieval, but will also require application-bound data and embedding lifecycle management.

The embedding approach is strong due to its ability to represent semantic meaning, consider additional context within a sentence or a body of text, and compare it to the retrieval query and its semantic representation. A simple example: if your query uses the term “monarch woman”, the embedding can match it with the term “Queen” and references to it.

One often forgotten aspect of embedding retrieval is that it is a “match”-first, not content-first, strategy. That means that if you embed your data and run a query against the embedding index, you’ll get a vector and a match confidence (a similarity score), not the content behind the match. Below is a simple representation of the embedding process which may be a good reference.

Fig. (4): Simple embedding creation and querying process diagram

As shown in the diagram above, the responses from the embedding space (2a, 2b) are vectors and their similarity scores, with additional metadata. The metadata is usually used to retrieve the actual content based on the original document reference id.

A strategy for simplifying content retrieval in the embedding system is to store the actual content as part of the index metadata. That is very effective for quick access to the content, and it gives the application the ability to evaluate the response context or directly return snippets of matched text to the user. However, it also brings additional challenges (size of the index, redundancy, maintenance, and security of the data).
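
To make the “match-first” behavior concrete, here is a minimal sketch assuming a hypothetical `embed` function and a small in-memory index.

```python
# Sketch of embedding retrieval: the vector index returns ids and similarity
# scores; the content is looked up separately (or stored as index metadata).
import numpy as np

documents = {"doc1": "The queen addressed the kingdom...",
             "doc2": "Quarterly grain production figures..."}
index = {doc_id: embed(text) for doc_id, text in documents.items()}  # embed() is hypothetical

def retrieve(query: str, top_k: int = 1):
    q = embed(query)
    scores = {doc_id: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
              for doc_id, v in index.items()}          # cosine similarity
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # the match gives ids and scores; content comes from the separate store
    return [(doc_id, scores[doc_id], documents[doc_id]) for doc_id in best]

print(retrieve("monarch woman"))  # semantically matches the "queen" document
```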

Both approaches, search and embedding retrieval, focus on unstructured data and rely on an external retrieval system component as part of the application design.

I would like to also clarify that from a pattern perspective, the search and embedding strategies are similar, but both will have slightly different capabilities, so the choice needs to consider fit for the use case or scenario. To help with those considerations, see the section “Use Considerations and Limits” below for more details.

LLM Interaction Orchestration

As suggested in Fig. (2), the retrieval decision and the subsequent retrieval query leverage LLM capabilities, represented by the “Retrieval Decision” and “Request data” steps.

Commonly these steps are managed by an orchestration module as shown in Fig. (2). Orchestration is needed to manage intermediate steps together with the final LLM response to the user.
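
A simplified, hedged sketch of those orchestration steps follows, again with hypothetical `call_llm` and `retrieve` helpers.

```python
# Sketch of the orchestration steps from Fig. (2): decide whether retrieval is
# needed, request the data, then compose the final answer for the user.
def orchestrate(user_prompt: str) -> str:
    decision = call_llm(
        f"Does answering this require looking up external documents? "
        f"Reply YES or NO.\n\n{user_prompt}")
    context = ""
    if decision.strip().upper().startswith("YES"):
        search_query = call_llm(f"Write a search query for: {user_prompt}")
        context = retrieve(search_query)             # retrieval step
    return call_llm(f"Context:\n{context}\n\nUser request: {user_prompt}")
```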

Orchestration can be done using custom strategies or by leveraging an LLM-focused framework, e.g. LangChain, which already bundles the core part of the retrieval decision and data request into a single call. When I experimented with this approach using LangChain, I had mixed results. I ended up preferring the tool/agent strategy, which provides more control over the retrieval, but it was clear that framework support is beneficial.

This space is evolving fast, which means it will pay off to keep an eye on emerging strategies and on progress in simplifying application logic through ready-made components. If I were a betting person, I would bet on frameworks driving commoditization and encapsulating large parts of LLM interaction orchestration into reusable elements. So relying long-term on fully custom retrieval might not be desirable in many situations.

Use Considerations and Limits

Both search augmentation and vector embedding are potent tools in the area of LLM information enhancement. Especially for large sets of unstructured data. However, their effectiveness is contingent on the specific requirements of the application and the nature of the data in question. Below are some considerations for the use of these approaches.

Search augmentation focuses on using a ready-made system with a set of advanced features (e.g. identity management and security) and usually a combination of search strategies that can help overcome some limitations of individual approaches. An example is exact matching, which may be difficult to get from a barebones embedding approach. On the other hand, search systems may have a narrower focus compared to possibly more versatile embeddings.

The vector embedding approach may be used for various things. The same set of vectors can be used for factual information retrieval, sentiment analysis, or summarization-type tasks. However, creating vectors at the appropriate rate to account for changes in data can be performance- and maintenance-prohibitive. Another consideration is data security. A basic vector embedding response will not respect user access limits and will return data even if the user is not entitled to see it.

General limitations and considerations for both approaches include:

  • Data Quality
  • LLM Context size
  • Real-time data changes

Data quality is one of the key aspects of data augmentation that significantly impacts model outcomes. Introducing inaccuracies, stale data, or wrong data will likely steer the model in the wrong direction.

Controlling and optimizing retrieval for use within the LLM context will be a key application consideration for many implementations because context token slots are a scarce resource.
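
A minimal sketch of such budgeting, using the rough 4-characters-per-token estimate mentioned earlier, could look like this.

```python
# Sketch of budgeting scarce context tokens: keep adding retrieved chunks
# (highest-scoring first) until an estimated token budget is reached.
def fit_to_budget(chunks: list[str], budget_tokens: int = 3000) -> list[str]:
    selected, used = [], 0
    for chunk in chunks:                      # assumed sorted by relevance
        estimated = len(chunk) // 4           # rough token estimate
        if used + estimated > budget_tokens:
            break
        selected.append(chunk)
        used += estimated
    return selected
```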

Velocity and frequency of data changes significantly impact the choice of data augmentation strategy. Neither search nor embedding strategies are good candidates for fast-changing and dynamic data, due to the encoding or indexing overhead. For high-velocity data, it’ll be key to avoid post-processing overhead as much as possible. Code-based patterns such as the one suggested in the next section may be a good candidate for dynamic data.

Retrieval augmentation strategies are currently the best way to get large unstructured datasets available for LLM interaction.

An interesting observation is that in many situations, it will make sense to use both search and embedding approaches in parallel to combine their potential. Core data can be served through the more security-aware search augmentation, while the embedding approach can be plugged in for versatile augmentation of additional or specific types of data.

Query and Code Processing Retrieval

Fig. (5): Query & Code-based retrieval pattern interaction diagram

You’ve seen it here first :-) This is an emerging pattern for large structured datasets, which is not being discussed very actively at the moment, but I believe it will dominate the structured data domain very soon.

Interestingly, a solid structured data retrieval pattern is hard to come by in the current paradigm that favors unstructured strategies — which is a real turning of the tables in the Data & Analytics domain.

To be fully transparent, this idea comes from fusing two existing approaches. One has been out there for some time as a concept called a “Query Engine”. This strategy existed before LLMs and was used very effectively in, e.g., the conversational analytics area. It is now becoming more natively integrated with the LLM paradigm within frameworks like LlamaIndex.

That concept revolves around constructing (SQL) queries for structured data processing and retrieval, which makes a lot of sense in combination with an LLM — SQL is also a type of language.

Simply put, query retrieval is asking the LLM to translate a user’s natural language request into (SQL) code “language”

The second inspiration comes from ChatGPT Advanced Data Analysis (formerly known as Code Interpreter). That tool has shown how large datasets can be effectively and semi-autonomously processed.

The important addition was the proxy environment used to manage the code interpretation. It is such a clever idea, originally meant, I believe, for a slightly different purpose, but one that can be very effectively leveraged for data augmentation strategies.

Putting these two together, we arrive at a pattern that generates its own code or query based on the user request and is also able to execute it and further curate the results as needed; that is why the independent execution part is quite critical.
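
A hedged sketch of the core idea follows, with a hypothetical `call_llm` helper, an illustrative schema, and SQLite as the execution environment.

```python
# Sketch of the Query & Code retrieval pattern: the LLM translates a natural-
# language request into SQL against a known schema, and the application
# executes it in a controlled environment. Schema and database are illustrative.
import sqlite3

SCHEMA = "sales(region TEXT, year INTEGER, revenue REAL)"

def query_retrieval(question: str, db_path: str = "analytics.db"):
    sql = call_llm(
        f"Given the table {SCHEMA}, write a single SQLite SELECT statement "
        f"answering: {question}. Return only the SQL.")
    if not sql.strip().lower().startswith("select"):   # basic safety check
        raise ValueError("Only SELECT statements are executed.")
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

rows = query_retrieval("What was total revenue per region in 2022?")
```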

An important prerequisite is something to the effect of a dataset index — in the structured world often referred to as a data catalog. That will be key to being able to find data and understand the data landscape.

An additional data and analytics concept enabling this pattern is the data access strategy. Consistent data access will have a significant positive effect on this pattern and its application.

Use Considerations and Limits

The Query & Code Retrieval approach will work for a broad range of structured data sources — Databases, SQL-based layers (e.g. virtualization), Flat files, Semi-structured files (e.g. JSON), and more.

Code and language generation are key abilities of the current LLMs which makes this quite a natural direction to go towards.

However, there are quite a few hurdles to overcome as well. Query or code construction is a tricky business in terms of reliability and accuracy. In my own research focusing on ChatGPT Advanced Data Analysis, I have encountered quite a few inaccuracies in data representation that were not very apparent at first glance. That is why additional control mechanisms and verification may need to be included as part of the pattern evolution.

Another hurdle may be speed. Coding a small data curation pipeline in the background can create a responsiveness challenge for the user.

Finally, the success factor for this approach will be the availability of high-quality metadata and data description. To make it a viable solution it may need to be supported by an AI-enhanced data profiling and description system.

Training and Fine-tuning Patterns

Training and fine-tuning come up in discussions a lot, driven by the primal instinct that for the model to work with specific data, it needs to be taught that data. There seem to be a lot of misconceptions about model training that typically enter the debate.

Does the LLM need to internalize all the factual knowledge? Or do we want the model to be able to effectively reason about a range of additional inputs?

If there is any indication, it must be the still emerging and sometimes unexpected capabilities that LLMs exhibit. Models show the ability to do tasks seemingly outside of their primary training intent; one of those tasks is processing external data. In the same way, human experts do not memorize all external data for a given task, yet they are still able to process and reason about various external content sources.

It is clear that neural network training is not only about factual knowledge. If that were true, any search engine would do. No, the reason for model training is also to make the model understand patterns, be able to spot those patterns, and process external information; to be able to reason and create connections between different terms and concepts.

I’ve mentioned the above to highlight that Training and Fine-tuning techniques are about improving the model’s ability to perform a specific task by giving it guidance, good-quality task-related information, and relevant contextual information. In fact, there are data and contexts that are better to avoid introducing to the model during training, because they may introduce factual inconsistencies, biases, and other negative effects.

So, contrary to popular belief, teaching models all relevant company data could be a big mistake. Not to mention the significant issues it would introduce for protecting the data distribution.

An LLM is quite the gossip; it will spill the beans on the data it has available in a second. :-)

Fine-Tuning

Fine-tuning seems to be the second-fastest way to achieve model data customization or specialization, right after prompt and context techniques. It usually starts with a pre-trained model and provides further specialization and steering.

In terms of approaches and strategies, there are several promising ones, many of which are based on transfer learning. See Fig. (6) below for the hierarchy and some context.

Fig. (6): Fine-Tuning context and hierarchy

It helps to see the larger context like the hierarchy above. I have also highlighted a few interesting industry concepts or larger areas that represent good candidates; however, it should also be noted that there is much more going on in the research community, and the concepts and candidates are still evolving rapidly.

I will not describe all the methods at this point; as previously mentioned, let’s dive deeper into this topic in a separate article.

Use Considerations and Limits

Many AI use case evaluations I am aware of have converged toward prompt & context-based patterns, with the rest still TBD on what exact strategies to use, or on whether it is a good use case for an LLM in the first place.

I would also recommend checking out Rachel Woods and her work in this space. In her recent posts on Twitter (X), she lays out good arguments about the hurdles of fine-tuning models and considerations as to why to wait or avoid tuning altogether.

It is hard to find production-ready fine-tuning strategies that have fully progressed from the research field into practical use.

The progress in PEFT (Parameter-Efficient Fine-Tuning) strategies like LoRA (Low-Rank Adaptation) is a good indication of the current direction toward precision fine-tuning. It is likely to continue and keep improving toward even more targeted and focused approaches.
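
For a flavor of what that looks like in practice, here is a minimal, hedged sketch using the Hugging Face peft library with GPT-2 as a stand-in base model; a real fine-tuning run would also need a dataset and a training loop.

```python
# Sketch of a LoRA (PEFT) setup: only small low-rank adapter matrices are
# trained on top of the frozen pre-trained model, keeping the trainable
# parameter count tiny compared to full fine-tuning.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],    # GPT-2's attention projection layer
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are trained
```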

In any case, I believe we still have a few more steps to go until things reach the productized, repeatable, and fully deployable stage.

Chapter 2: Takeaways

The combination of the Data & Gen. AI paradigms is one of the most consequential topics in the industry at the moment, and one that I am sure will keep evolving over the next several years. The overview of data augmentation and enablement strategies above is hopefully a good initial step.

Takeaway #1: We have just started

There is still so much that needs to be developed and defined in this space, so I am 100% positive this is not the last time for us to delve into the topic.

The dynamic nature of this area also means that we should strive to formulate concepts and patterns with a larger intent to better understand what comes next.

Takeaway #2: Structured and Unstructured Data

An interesting observation is that unstructured data are more native to the LLM paradigm than structured data. That is important to understand because it may change many assumptions and expectations.

Few examples:

  • Leveraging typical DB-bound structured data together with LLM is currently harder than working with unstructured data (free text).
  • Mixing structured and unstructured data is possible and I would even say encouraged.
  • LLMs are content-focused, not metadata-based — the structured world is metadata-based.

The above examples significantly influence the status quo, and it feels like the tables have turned, but understanding this shift opens up exciting new opportunities.

Takeaway #3: Training & Fine-Tuning for the right reasons

Understanding the “goals of” and “intentions for” Training and Fine-Tuning will be critical. There may still be misconceptions about this space in the extended community.

The prompt and context techniques will often yield the same results, be more flexible, and be faster to start with than fine-tuning the model and then spending a considerable amount of time validating its proper behavior.

Thank you for tuning in to this chapter; I hope it was as helpful for you as it was for me. It is a tremendous learning and momentum driver to walk this path out in the open together with you.
