Image generated by Adobe Firefly (not for commercial use)

A comprehensive guide to building a custom Generative AI enterprise app with your data

Madhukar Kumar
May 23, 2023


I have fond memories of my college days.

A land before mobile phones, India of the 90s was a place where loose change could buy a landline phone call from roadside shops that also doled out an assortment of cigarettes, hot cups of chai, and samosas around the clock.

A single Rupee, equivalent to a hopeful wish in a fountain, was all it took to shout expletives across telephone lines, asking your compadres when they were coming to college or demanding their presence at the coffee house, sometimes, guitar in tow.

My friends and I were regulars at a local haunt near our college, where we spent a lot of time sitting around.

Ah, sweet memories.

Last year, when I went back to India, I met up with one of my old friends and decided to pay a visit to our college and the chai shop and re-live some of those memories.

The college was still there. So was the shop.

But they looked nothing like the picture I had in my memory from the 90s.

I felt a sudden pang of loss and a realization that memory, fickle as it is, is the marrow of our identity.

Our memories drive our actions.

Our memories make us who we are.

Our memories are what makes us — us.

This has been on my mind quite a lot lately as my interactions with GPT-4 and other LLMs have surged in the last few days.

AI without memory is just a smart database. But AI with custom memory? Now that is a tool that can magnify your capabilities by an order of magnitude.

Let me explain.

Image generated by Midjourney — Imagine the child to be your LLM and the books to be your context data

Let’s say you wake up one morning to realize you have a prodigy of a kid who is a genius when it comes to reading books, comprehending them, and responding intelligently based on their knowledge (let’s not worry about how the kid got there for now). But the kid has no memory or knowledge of who you or other family members are. As smart as the kid is, there is no chance the kid can answer questions about you or your family.

But what if you compiled all your family information into very well-curated books, asked the kid to read and internalize the information, and now asked it to answer questions related to you, your family, or even your ancestors?

This is effectively how ChatGPT and other LLMs can be used to build enterprise generative AI apps on custom data.

And believe it or not, there is a way to do this without putting your custom data's privacy and security at risk.

Let’s see how.

Effectively, there are three steps to build a custom AI app, or in other words, make your own AI child. You first ask the mom out for dinner, then you….. sorry, got carried away.

  1. Get an LLM model and run it within your network.
  2. Curate your data and store it effectively in a database.
  3. Search and grab the right data/book and give it to the LLM as context and get back a response.

If you want to act on the information you get back from the LLM, you can now do that as well (making your apps agentic), and I will cover some of that below.

Let’s get started.

Step 1 — Running your LLMs within your network

As of writing this article, there are broadly two options. The diagram below represents how this can sit entirely within your network so that at no time is your data going out to the Internet.

Proposed high-level architecture for an enterprise Gen AI app

Option 1 — Get OpenAI models through Microsoft Azure and run them within your network. Since this is not generally available to everyone at the time of writing this article, you have to sign up by filling out a form and wait to get access. Once you get access, you can set up your own Resource and then create an OpenAI service.

In this model, you can either re-train the models on your data or get access to fine-tuned models. IMO, both of these are overkill. Keep in mind that pricing is still by the number of tokens you use, very similar to how users currently pay for the cloud version of the OpenAI models.

Option 2 — Run open-source LLMs from Hugging Face on one of the servers in your network. Note that AWS also offers running Hugging Face models on its cloud through SageMaker.

Hugging Face is an open-source community where different people, teams, and organizations regularly post their own LLM models. You can download these directly through code and then run them on your virtual machines. Here are some high-level steps.

  1. Install the Transformers Library: Hugging Face’s transformers library provides thousands of pre-trained models to perform tasks on texts. You can install it using pip:

pip3 install transformers

  2. Next, download the models:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

  3. Once the models have been downloaded, you can use them for various tasks, including generating texts.

input_text = "Once upon a time"

inputs = tokenizer.encode(input_text, return_tensors="pt")

outputs = model.generate(inputs, max_length=500, do_sample=True, temperature=0.7)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note that you can use the popular LangChain library to chain multiple LLM models so that the output from one becomes the input for the next model in the chain. This is also a good way to make your apps agentic, i.e., have your AI apps perform actions based on responses from the Gen AI models.
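To make the chaining idea concrete, here is a minimal pure-Python sketch. The `call_llm` function is a hypothetical stand-in for a real model call (for example, a LangChain chain or a `transformers` pipeline); the point is only that each step's output becomes the next step's input:

```python
# Minimal sketch of chaining LLM calls: the output of one step
# becomes the input of the next. call_llm is a placeholder for a
# real call to your in-network LLM.
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would invoke your LLM here.
    return f"RESPONSE({prompt})"

def run_chain(initial_input: str, steps: list) -> str:
    """Run a list of prompt templates in sequence.

    Each template contains {input}, which is filled with the
    previous step's output.
    """
    text = initial_input
    for template in steps:
        text = call_llm(template.format(input=text))
    return text

result = run_chain(
    "Summarize our Q3 sales numbers",
    ["Rewrite this request precisely: {input}",
     "Answer the following: {input}"],
)
```

Swapping `call_llm` for a real endpoint (Azure OpenAI, a Hugging Face model, etc.) turns this toy loop into the chaining pattern LangChain packages up for you.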

Here is a list of some companies that are offering their own commercial Gen AI models that you can run within your network:

  1. Cohere
  2. MosaicML
  3. Anthropic (not generally available yet, but a good one to keep an eye on)
  4. AWS Bedrock and Titan

Step 2 — Organizing your data with Vectors

Once you have your models running within your network, you now need to think about how to add memory to it. As we saw, this is key to making any Gen AI app truly custom to your requirements.

In general, there are two ways to make your LLM custom to you.

  1. Re-train your LLM — This article will not get into the details of how to do this, but I should add that this is a very expensive and time-consuming process as of the time this article was written. You not only need access to your dataset, you also need expensive GPUs to use libraries like PyTorch. I have no doubt that in the near future we may be able to incrementally re-train LLMs, but till that happens, personally, I am going to spare my time and money and do what the rest of the world is doing, which is the second option.
  2. Search and Prompt — Industry analysts have already started giving this a fancy name, Retrieval Augmented Generation (RAG), but put simply, this is the strategy of being the librarian between you and your prodigy child. In slightly more technical terms, this strategy can also be referred to as Search and Prompt (SAP, anyone?). The application takes the user prompt, does a quick search within the company database (typically semantic only, but ideally a hybrid search), then sends the results along with the original prompt to the prodigy child (your LLM) to answer.
Search and Prompt methodology to provide context to your LLM models
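Stripped of any particular database or model, the Search and Prompt loop fits in a few lines of plain Python. The documents, embeddings, and `ask_llm` below are toy stand-ins: in a real app the embeddings come from an embedding model, the store is your database, and `ask_llm` calls your in-network LLM.

```python
import math

# Toy document store: (text, embedding) pairs.
DOCS = [
    ("Our refund policy allows returns within 30 days.", [0.9, 0.1, 0.0]),
    ("The cafeteria opens at 8 am.",                     [0.0, 0.2, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_embedding, k=1):
    """Semantic search: return the k most similar documents."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_embedding, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def ask_llm(prompt):
    # Hypothetical stand-in for a call to your in-network LLM.
    return f"LLM answer based on: {prompt}"

def search_and_prompt(question, query_embedding):
    """The RAG loop: retrieve context, then prompt the model with it."""
    context = "\n".join(search(query_embedding))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)

# A query whose embedding is close to the refund-policy document
# retrieves that document as context.
answer = search_and_prompt("What is the refund window?", [0.8, 0.2, 0.1])
```

The retrieval step is where the database choice discussed below matters; the prompt-assembly step is the same regardless of what you search against.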

I should add here that I have skipped the muddiness of adding a vector-only database to the stack because, in my opinion, you need hybrid search (exact keyword plus semantic) to really generate highly curated, well-matched data for your LLM. If you choose to use only semantic search, say for a prototype or your own personal projects, there are two broad categories of tools for vector embeddings and searches.

Vector-only databases — These are purpose-built only for vectors, like a chef that only knows how to make one dish. There are some downsides to this approach: no support for SQL, limited support for metadata, and, the biggest one, they cannot join against other kinds of data within your organization. Some examples of vector-only databases:

  1. Milvus
  2. Weaviate
  3. Chroma
  4. Qdrant
  5. Pinecone

Vector libraries — These are Python libraries that help you create and store vector embeddings (typically in memory) and run semantic searches. They are mostly open source and have similar characteristics to vector-only databases. If you are interested in using libraries, I would suggest one from the list below.

  1. FAISS
  2. ANNOY
  3. NMSLIB

Enterprise-grade databases — This would be my recommendation for anyone starting fresh, because if we are talking about enterprise-grade, it is important to take into account the following things, the most important being the first one:

  1. Is the database able to handle Vectors, SQL, and NoSQL data and do hybrid searches?
  2. Can you run the DB in the cloud, on-prem, or in a hybrid fashion?
  3. Does it have Disaster Recovery (DR), and can it scale horizontally?
  4. Does the DB have connectors/pipelines to bring data into it from different diverse sources?
  5. Does the DB have millisecond response times?

Based on all of these requirements, I personally would recommend SingleStore, but I am biased because I currently work at SingleStore. There are numerous options out there, and I suggest doing your own due diligence to pick the one that works for you.

Step 3 — Putting it all together with Search and Prompt

After you have successfully installed your own LLM and curated your relevant data into vectors and/or other data formats, you are now ready to start building the app. The diagram below describes how to do this using different libraries. I have personally become a fan of OpenAI as a product and use it to generate embeddings for all my projects. You can choose to use Transformers from Hugging Face as well.

Let’s look at a simple example of taking a PDF doc, converting the text into embeddings, and storing them in a database.

Complete picture of the vector embeddings process
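A hedged sketch of that pipeline follows. The PDF text extraction and the embedding call are stubbed out (in a real app you might use a library such as PyPDF2 for extraction and OpenAI's embeddings endpoint for vectors) so that the chunk-embed-store flow is visible on its own:

```python
def chunk_text(text, max_words=100):
    """Split extracted text into word-bounded chunks for embedding."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def embed(chunk):
    # Stand-in for a real embedding call (e.g., OpenAI's embeddings API).
    # Here: a toy 2-dim "embedding" from chunk statistics.
    return [len(chunk), len(chunk.split())]

def ingest(text, store):
    """Chunk a document, embed each chunk, store (chunk, vector) rows."""
    for chunk in chunk_text(text):
        store.append((chunk, embed(chunk)))

store = []
ingest("word " * 250, store)   # a toy 250-word "document"
# 250 words at 100 words per chunk -> 3 chunks in the store
```

In production, `store.append` would become an INSERT into your database and `embed` a batched API call, but the chunking discipline (word- or token-bounded pieces small enough for the embedding model) stays the same.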

I created two apps using this methodology, one to read from Wikipedia and then pass this information to OpenAI to get back the response, and the second to ingest a PDF book and use OpenAI to answer questions about the book’s content.

Note that since this is publicly available, I am using the cloud version of OpenAI in these examples, but you can use the code and change the endpoints to talk to your LLMs in your private network.

You can check these out on GitHub here:

  1. Wikipedia as a memory generator → semantic_search/OpenAI_wikipedia semantic_search.ipynb
  2. PDF/Book reader →

If you have stuck with me this far, I am going to assume this is relevant to you and that you are interested in learning more. My goal in writing these long articles is to help others cut down the time it took me to learn and run these experiments. So, if this article was helpful to you, drop me a direct message or follow me on LinkedIn and Twitter, and feel free to engage, suggest, or ask.



Madhukar Kumar

CMO @SingleStore, tech buff, ind developer, hacker, distance runner ex @redislabs ex @zuora ex @oracle. My views are my own