Generate vector embeddings from image text (optical characters)
Python utilities to extract text from images and load it into a vector DB.
With exceptional possibilities and unparalleled functionality, LLMs are taking the world by storm, built on the transformer architecture and supported by vector databases.
Data on the internet is getting scarce, so organisations are turning to LLM-generated synthetic data to train their AI models.
Only time will tell how effective and useful these models are going to be.
Various data extraction methods have emerged to address the exploding data needs of large language model training.
One of these methods is extracting text from images. For example, working out how much expense was incurred by analysing invoices, bills and snapshots.
This was originally published on Databracket’s Substack page. If you don’t have a Medium subscription, please check there.
Extracting the information is not enough. We need to chunk the data, generate embeddings for the chunks, and upsert those embeddings into a vector store.
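To make the chunk, embed and upsert flow concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: `chunk_text`, `toy_embedding`, `InMemoryVectorStore` and `load_text` are hypothetical names, the hashed bag-of-words vector is a stand-in for a real embedding model, and the dictionary-backed store is a stand-in for a real vector database.

```python
import hashlib
import math


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def toy_embedding(chunk, dim=8):
    """Placeholder embedding: hashed bag-of-words, L2-normalised.

    This is NOT a real embedding model; swap in a real one in practice.
    """
    vec = [0.0] * dim
    for word in chunk.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Stand-in for a vector database, keyed by record id."""

    def __init__(self):
        self.records = {}

    def upsert(self, record_id, vector, metadata):
        # Insert a new record or overwrite the one with the same id.
        self.records[record_id] = {"vector": vector, "metadata": metadata}


def load_text(store, doc_id, text, chunk_size=200, overlap=50):
    """Chunk the text, embed each chunk, and upsert it into the store."""
    chunks = chunk_text(text, chunk_size, overlap)
    for i, chunk in enumerate(chunks):
        store.upsert(f"{doc_id}-{i}", toy_embedding(chunk), {"text": chunk})
    return len(chunks)


# Example: text as it might come out of an OCR step on an invoice image.
store = InMemoryVectorStore()
ocr_text = "Invoice 1042 total due 250.00 USD " * 20
count = load_text(store, "invoice-1042", ocr_text)
print(f"upserted {count} chunks")
```

Because the record ids are derived from the document id and the chunk index, loading the same document again overwrites the existing records instead of duplicating them, which is the behaviour "upsert" implies.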
Libraries and tools ranging from basic to highly advanced exist to fulfil this requirement. But my focus here is to keep the process as simple as possible while leveraging…