Why Your RAG Application Needs ETL
RAG (Retrieval-Augmented Generation) has proven to be an invaluable technique for businesses leveraging Generative AI models (like OpenAI’s GPT models and Anthropic’s Claude) to generate personalized answers and responses from their own data. While it’s great to ask ChatGPT a question and have it generate a response using billions of trained parameters, it’s an even better experience to provide these models with additional context from your company’s internal data and have them produce deep, contextual answers to questions and research that would have taken your team days, weeks, or months to put together.
Vector Databases in the mix
The popularity of RAG has driven a corresponding rise in the use of vector databases, as organizations building RAG applications at scale depend on them to store data in a format optimized for highly efficient search. Vector databases are great at storing, and more importantly searching for, the key information needed to make decisions. In a RAG scenario, vector databases shine because a combination of vector embeddings and fast similarity-search algorithms (typically approximate nearest-neighbor search) allows them to return critical information very quickly.
As a simplified analogy, vector databases can be likened to data warehouses (like Snowflake or BigQuery), and RAG models to advanced BI tools (like Looker or Tableau), albeit with much better speed.
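To make the retrieval mechanics concrete, below is a minimal sketch of similarity search over embeddings using plain NumPy. The random vectors are stand-ins for real embeddings, and a production vector database replaces this brute-force scan with approximate nearest-neighbor indexes (e.g., HNSW):

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k documents most similar to the query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]

# Stand-in embeddings: 1,000 documents, 384 dimensions each.
doc_vecs = np.random.rand(1000, 384)
query_vec = np.random.rand(384)
print(cosine_top_k(query_vec, doc_vecs))  # indices of the 3 closest documents
```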
The Power of Contextual Data
Imagine you ask your RAG model about a specific customer. Instead of a generic response, it delves into their purchase history, support tickets, and possibly even previous email threads, crafting a personalized reply that surpasses expectations. This is made possible by feeding RAG models information like:
- Customer data: records from Salesforce and other CRMs, support tickets from Zendesk, chat interactions from Intercom
- Behavioral data: clickstream and event data stored in Elasticsearch
- Operational data: sales records and inventory levels from MySQL
As an aside, this closely mirrors a scenario from a client we worked with.
Most organizations have their data stored in multiple locations across the enterprise: research files and critical documents in S3/Dropbox/OneDrive, operational data in Postgres/MySQL, records in external tools (Mailchimp, Sendgrid, Zendesk, Shopify, etc.), and customer data in Salesforce and other CRM tools.
For a company like this to get full customer context, a considerable amount of ETL (Extract-Transform-Load) is required, where all of this data is logically molded together before being written to its final destination (in this case, the vector database).
In keeping with the data warehouse comparison, this myriad of data needs to be joined and transformed into fully contextual data before being loaded into the selected vector database in order for organizations to get optimal results when passing it to their RAG models.
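As an illustration, here is a minimal sketch of that join-and-transform step using pandas. The tables and columns are hypothetical stand-ins for the CRM, support, and operational sources listed above; the point is that disparate records get merged into one contextual document per customer before embedding:

```python
import pandas as pd

# Hypothetical extracts from the sources above (stand-in data).
customers = pd.DataFrame({"customer_id": [1], "name": ["Acme Corp"], "tier": ["enterprise"]})
tickets = pd.DataFrame({"customer_id": [1, 1], "subject": ["Login issue", "Billing question"]})
orders = pd.DataFrame({"customer_id": [1], "total_spend": [48200.00]})

# Summarize support tickets, then join everything into one row per customer.
ticket_summary = (
    tickets.groupby("customer_id")["subject"]
    .apply("; ".join)
    .rename("recent_tickets")
    .reset_index()
)
context = customers.merge(ticket_summary, on="customer_id").merge(orders, on="customer_id")

# Flatten each row into a text document ready for chunking and embedding.
def to_document(row: pd.Series) -> str:
    return (
        f"Customer: {row['name']} (tier: {row['tier']}). "
        f"Total spend: ${row['total_spend']:,.2f}. "
        f"Recent tickets: {row['recent_tickets']}."
    )

documents = context.apply(to_document, axis=1).tolist()
print(documents[0])
```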
With that being said, there are some considerations the organization has to account for when building a RAG application instead of a BI tool:
1). A scalable Extract-Transform-Embed-Load (ETEL) framework
2). An orchestration and scheduling tool
The image below illustrates how an end-to-end flow would look using the example above; it is the exact logical flow we deployed for a customer.
1). The ETEL Framework: Just like a traditional ETL workflow, the logical design and physical implementation require deep exploration, discovery, and analysis of all data and their respective sources. The extraction workflows may need to capture data at different times and intervals depending on each source and how quickly its data changes. For example, customer and transactional data might be extracted once a day at the end of the day, while event data is extracted much more frequently (every 10 minutes).
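As a sketch of what staggered extraction can look like, the snippet below pulls only the rows changed since a high-water mark; a daily source passes yesterday’s watermark, while an event source would call the same function every 10 minutes with a much more recent one. sqlite3 stands in for MySQL here so the example is self-contained, and the table name is hypothetical:

```python
import sqlite3

# Stand-in source: sqlite3 in place of MySQL, with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 99.50, '2024-06-01T10:00:00')")

def extract_incremental(conn, table: str, last_run: str) -> list:
    """Pull only the rows that changed since the previous extraction run."""
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE updated_at > ?", (last_run,)
    )
    return cursor.fetchall()

# A daily job passes yesterday's watermark; an event job, one from 10 minutes ago.
print(extract_incremental(conn, "orders", last_run="2024-05-31T00:00:00"))
```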
In the transformation step, data is sectioned, joined, and aggregated one or more times within and across all sources before it is loaded to the final location, which is optimized for querying, analysis, visualization and, in our case, RAG.
A critical additional step in a RAG scenario is embedding the transformed data: it has to go through a chunking and embedding phase before it can be loaded into the final vector database collection or index.
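Here is a minimal sketch of that chunk-then-embed phase. Fixed-size chunks with overlap are just one simple strategy, sentence-transformers is one embedding model among many, and the final upsert is left as a placeholder for whichever vector database client you use:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so context
    isn't lost at chunk boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # one embedding model among many

document = "Customer: Acme Corp (tier: enterprise). Total spend: $48,200.00. ..."
chunks = chunk_text(document)
embeddings = model.encode(chunks)  # shape: (num_chunks, 384)

for i, (chunk, vector) in enumerate(zip(chunks, embeddings)):
    record = {"id": f"acme-{i}", "text": chunk, "vector": vector.tolist()}
    # vector_db.upsert(record)  # hypothetical call; swap in your database's client API
    print(record["id"], len(record["vector"]))
```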
2). An orchestration and scheduling tool: After considerable time has been spent building out the ETL logic to combine data from multiple sources, enterprise companies then have to make sure that every ETL process is triggered when it’s supposed to run, and in the order it’s supposed to run. Additionally, they have to build fault tolerance and “circuit breakers” into the pipeline for process continuity and to ensure that only correct and complete data makes it into their vector database.
A scheduling tool like Airflow works just fine and even has native integrations for some of the more popular vector databases.
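For instance, a minimal Airflow DAG for the daily ETEL flow could look like the sketch below (assuming Airflow 2.4+ and its TaskFlow API). The task bodies are placeholders for the extraction, transformation, and embedding logic above, and a second DAG scheduled every 10 minutes would handle the event data:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def customer_context_etel():

    @task
    def extract() -> list:
        return [{"customer_id": 1, "name": "Acme Corp"}]  # placeholder extract

    @task
    def transform(rows: list) -> list:
        return [f"Customer: {r['name']}" for r in rows]  # placeholder join/flatten

    @task
    def embed_and_load(documents: list) -> None:
        print(f"embedding and loading {len(documents)} documents")  # placeholder

    embed_and_load(transform(extract()))

customer_context_etel()
```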
Introducing Context Data
Creating and maintaining an efficient ETL pipeline can be complex. That’s where Context Data steps in. Our platform takes the hassle out of building and deploying data platforms and ETL processes specifically designed for Generative AI applications. We eliminate the need for manual coding and configuration, allowing you to focus on what matters most: enriching your data and unlocking the full potential of RAG models.
With Context Data, we’ve created a hybrid data platform which allows users to:
- Connect to multiple external sources (MySQL, Postgres, S3, Salesforce)
- Connect to multiple vector databases
- Perform cross-platform ETL and transformations
- Schedule recurring ETL jobs to your vector databases for up-to-date data
- Run RAG and search queries on your data
Reach out and schedule a demo so we can explore building an ET[E]L platform and framework for your organization.
You can also watch our quick demo video to see how our platform works.