
Generate vector embeddings from image text (optical characters)

Jay Reddy
Published in Databracket
4 min read · Jan 21, 2025

Python utilities to extract text from images and load it into a vector DB.


With exceptional possibilities and unparalleled functionality, LLMs are taking the world by storm, thanks to vector databases and the transformer architecture.

Data on the internet is getting scarce, so organisations are focusing on generating synthetic data with LLMs to train their AI models.

Only time will tell how effective and useful these models are going to be.

Various data extraction methods have emerged to address the exploding data needs of large language model training.

One of these methods is to extract text from images. For example, working out how much expense was incurred by analysing invoices, bills and snapshots.
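As a minimal sketch of the invoice example, the snippet below runs OCR on an image and pulls out anything that looks like a monetary amount. It assumes Pillow and pytesseract (a Python wrapper around the Tesseract OCR engine) are installed; that library choice is an assumption for illustration, and `extract_text`, `find_amounts` and the amount pattern are hypothetical names, not something taken from a specific tool.

```python
import re
from decimal import Decimal


def extract_text(image_path: str) -> str:
    """Run OCR on an image file and return the recognised text.

    Assumes Pillow and pytesseract are installed and that the
    Tesseract binary is available on PATH.
    """
    # Imported lazily so the parsing helper below works without OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


# Hypothetical pattern: amounts with two decimals, such as "1,250.00" or "89.99".
AMOUNT_RE = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2}")


def find_amounts(text: str) -> list[Decimal]:
    """Return every amount-like number found in OCR output, in order."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(text)]
```

Typical usage would be `find_amounts(extract_text("invoice.png"))`, where `invoice.png` is a placeholder path. Real OCR output is noisy, so a pattern this simple will also pick up things like partial dates and should be treated as a starting point.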

This was originally published on Databracket’s Substack page. If you don’t have a Medium subscription, please check there.

Extracting the information is not enough: we need to chunk the data, generate embeddings for the chunks, and upsert the embeddings into a vector store.

Libraries and tools, from basic to highly advanced, exist to fulfil this requirement. But my focus here is to keep the process as simple as possible while leveraging…
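The chunk, embed and upsert flow can be sketched with the standard library alone. Everything here is a stand-in for illustration: `toy_embed` is a hashed bag-of-words vector rather than a real embedding model, `InMemoryVectorStore` is a dictionary rather than a real vector database, and the chunk sizes are arbitrary. In practice you would swap in an embedding model and a vector store client of your choice.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def toy_embed(chunk: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: hash each token into a bucket, then L2-normalise."""
    vec = [0.0] * dim
    for token in chunk.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB, keyed by record id."""

    def __init__(self) -> None:
        self.records: dict[str, tuple[list[float], str]] = {}

    def upsert(self, record_id: str, vector: list[float], text: str) -> None:
        # Insert a new record or overwrite the existing one with the same id.
        self.records[record_id] = (vector, text)

    def query(self, vector: list[float], top_k: int = 3) -> list[tuple[float, str, str]]:
        # Dot product equals cosine similarity for unit-length vectors.
        scored = [
            (sum(a * b for a, b in zip(vector, vec)), rid, text)
            for rid, (vec, text) in self.records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


def index_text(store: InMemoryVectorStore, doc_id: str, text: str,
               chunk_size: int = 200, overlap: int = 50) -> int:
    """Chunk the text, embed each chunk, upsert it, and return the chunk count."""
    chunks = chunk_text(text, chunk_size, overlap)
    for i, chunk in enumerate(chunks):
        store.upsert(f"{doc_id}-{i}", toy_embed(chunk), chunk)
    return len(chunks)
```

Because record ids are derived from the document id and chunk index, re-indexing the same document overwrites its earlier records instead of duplicating them, which is the point of an upsert.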
