Incrementally Indexing Documents with Azure AI Search Integrated Vectorisation
RAG pipelines need “living indexes”: new data usually has to be integrated into a RAG app dynamically, e.g. via a blob trigger. This requires the vector database index to be incrementally updated with embeddings of the new data.
There are multiple ways to deploy such a RAG pipeline on Azure.
1. Azure ML prompt flow pipelines — You can create vector indexes with Azure ML prompt flow. Vector index creation launches an automated RAG pipeline, which you can see under Azure ML pipelines. Rerunning the pipeline incrementally updates the AI Search vector index with new data.
2. Azure AI Studio playground indexing — This is the “bring your own data” no-code implementation of a RAG app. You can continuously add more documents to the app.
3. Azure AI Search integrated vectorisation (IV) — The newest tool in the box is the integrated vectorisation support in Azure AI Search. Triggered by the “Import and vectorize data” wizard, IV creates a data source, an index, skillsets for data chunking and vectorisation, and finally an indexer.
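As a rough sketch of what IV generates under the hood, the skillset typically chains a text-split skill into an Azure OpenAI embedding skill. The names, chunk size, endpoint and deployment below are illustrative assumptions, not the wizard's exact output:

```python
# Illustrative skillset definition: a split skill that chunks document
# content into "pages", feeding an Azure OpenAI embedding skill that
# vectorises each chunk. All names/values are placeholder assumptions.
skillset = {
    "name": "my-skillset",  # placeholder
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "textSplitMode": "pages",
            "maximumPageLength": 2000,  # chunk size, illustrative
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "textItems", "targetName": "pages"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "context": "/document/pages/*",
            "resourceUri": "https://my-aoai.openai.azure.com",  # placeholder
            "deploymentId": "text-embedding-ada-002",           # placeholder
            "inputs": [{"name": "text", "source": "/document/pages/*"}],
            "outputs": [{"name": "embedding"}],
        },
    ],
}
```

The key point is the chaining: the split skill's output path (`/document/pages/*`) becomes the embedding skill's input context, so every chunk gets its own vector.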
The indexer can then be rerun periodically to incrementally update the index, or invoked on demand via a REST API call, e.g. when a new document uploaded to a blob container triggers an Azure Function that calls the REST API.
Once created, the indexer can recognise the delta in the defined data source (e.g. new documents in the predefined blob container folder), index only that delta, and update the index with it.
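As a minimal sketch of the on-demand path, an Azure Function's blob-trigger handler could call the Run Indexer REST operation (`POST /indexers/{name}/run`). The service name, indexer name, API version and key below are placeholder assumptions; the helper only builds the request so it can be shown without a live service:

```python
import urllib.request


def build_run_indexer_request(service: str, indexer: str, api_key: str,
                              api_version: str = "2023-11-01") -> urllib.request.Request:
    """Build (but do not send) a Run Indexer request for Azure AI Search."""
    url = (f"https://{service}.search.windows.net"
           f"/indexers/{indexer}/run?api-version={api_version}")
    return urllib.request.Request(
        url,
        method="POST",
        headers={"api-key": api_key, "Content-Length": "0"},
    )


# Inside a blob-triggered Azure Function you would send it, e.g.:
#   urllib.request.urlopen(build_run_indexer_request("my-search", "my-indexer", key))
# which returns 202 Accepted when the run is queued.
req = build_run_indexer_request("my-search", "my-indexer", "placeholder-key")
print(req.full_url)
```

Because the indexer itself tracks what has already been processed, the function does not need to tell it which blob changed; rerunning it is enough to pick up the delta.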
Relevant documentation
- Quickstart: Integrated vectorisation (preview)
- Incremental enrichment and caching in Azure AI Search
- Integrated data chunking and embedding in Azure AI Search
- Skillset concepts in Azure AI Search
- Text split cognitive skill
- Azure OpenAI Embedding skill
- Azure Blob storage trigger for Azure Functions