Michael Kibenko
Published in NI Tech Blog
16 min read · Mar 19, 2024


From Idea to Reality: ChatGPT-Powered Dynamic Search with MongoDB

I remember the day when ChatGPT emerged as a tech industry sensation. In those days, we eagerly dived into experimenting with ChatGPT, comparing it with other AI models. Our mission was to identify the most impactful use case for our products, with a strong desire to introduce a game-changing feature that would not only benefit our users but also leave us, the software developers, in awe.

My team is responsible for the CMS platform and infrastructure development, used by our company’s employees and other developers, and for quite some time, we have been thinking about the creation of a dynamic search engine capable of tailoring data to the unique needs of individual users. As ChatGPT’s capabilities continued to evolve, our confidence in turning this ambitious vision into a reality grew stronger.

Imagine the convenience of working within a CMS platform where, instead of manually wading through a sea of entities, you can simply employ an application-level, bot-style assistant to precisely articulate your search query. For instance, you can say, “Find the page title containing the word ‘best’” or “Retrieve the last 10 products created by John Doe.” With this intuitive approach, you receive results with relevant data inside each accompanied by a direct link to the corresponding entity page.

In this article, I will walk you through the journey of developing the alpha version of this dynamic search engine using ChatGPT, shedding light on the pivotal research efforts that played a crucial role in its realization.

Let’s start from the product’s point of view

Product reliability: Unlike your everyday GPT or assistant that relies on a prompting console (which allows for a few hiccups here and there and lets you correct AI mistakes by re-prompting), our users don’t exactly throw a party to meet the AI model behind the scenes. They care about asking questions and getting the right answers. As developers, we should bend over backward to deliver results that match users’ expectations. It’s like cooking their favorite meal — they already know what it should taste like. So, to win the gold medal of trust from our users with this feature, the key player is the quality of the response.

User free speech: If we want to make our dynamic search as popular as morning coffee, we have to let our users chat freely and naturally. We can’t expect them to ask, “Could you please return the titles where the title value contains ‘best’?”, because, let’s face it, most users don’t really care about the existence of the “value” field in the “title” entity. They’re more likely to ask, “Could you please return the titles containing ‘best’?” Here, our model should play detective and understand that the “value” field represents the entity value for this specific entity. So, if the user doesn’t ask for a specific field, our smart model should use the ‘value’ field for the query like a seasoned pro.
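The fallback behavior described above can be sketched in code. This is an illustrative sketch only: the `EntityFieldDefs` shape and the generic `"value"` fallback are our assumptions, loosely based on the entity field definitions shown later in the article.

```typescript
// Pick the field to query when the user doesn't name one explicitly.
interface EntityFieldDefs {
  valueFields: string[];
  nameFields: string[];
}

function resolveQueryField(
  fieldDefs: Record<string, EntityFieldDefs>,
  entity: string,
  requestedField?: string
): string {
  if (requestedField) return requestedField; // the user was explicit
  const entityDefs = fieldDefs[entity];
  // Fall back to the entity's primary value field when one is defined.
  if (entityDefs && entityDefs.valueFields.length > 0) {
    return entityDefs.valueFields[0];
  }
  return "value"; // generic default, assumed for illustration
}
```

With this in place, "titles containing 'best'" resolves to a query on the title entity's `value` field, while "titles whose subtitle contains 'best'" respects the explicitly named field.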

Question Types:
First, we defined all the possible questions our users could ask, in order to understand the user expectations for each response type:

  • Value Search: Consider a scenario where a user is seeking entities based on certain values. They might say, “Could you please return the titles containing ‘best’?” or “Could you please return the subtitles created by John Doe?” or even, “Could you please return the subtitles created by John Doe last month containing ‘top’?” In these instances, our model should efficiently deliver a limited number of entities based on the user’s specific request.
  • Value Count Search: This is when a user is on a mission for some aggregation based on a value. They might ask, “Could you please return the length of each title containing ‘best’?” or “Could you please return the average length of each title containing ‘best’?” In this case, our model should whip up a limited number of entities, but with a side of aggregation.
  • Count Search: Here, a user requests a count based on a specific query. They might say, “Could you please return the count of titles containing ‘best’?” or “Could you please return the count of titles created by John Doe last month?” In these cases, our model’s job is to provide an exact count without limiting the number of entities, focusing solely on delivering the ‘count’ field.
  • Not Relevant: Let's face it, we can’t exactly control user inputs, so users may ask off-topic questions like “How to bake a pizza?” or “How to brush my dog’s teeth?” In these scenarios, our model should decide not to rifle through our databases. Instead, it should serve up a relevant answer that might not solve the pizza or doggy dental dilemma, but will certainly keep the conversation going!
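These four question types can be modeled directly in code. The type names below mirror the `questionType` values used later in the article; the `isLimited` helper is our own illustrative assumption about how the types map to result limiting.

```typescript
// The four supported question types, as they might appear in application code.
type QuestionType =
  | "value_search"
  | "count_of_variable_search"
  | "entities_count_search"
  | "not_relevant";

// Whether the generated query should be capped by a result limit.
function isLimited(type: QuestionType): boolean {
  // Count searches return a single count document, so no limit is needed;
  // not_relevant produces no query at all.
  return type === "value_search" || type === "count_of_variable_search";
}
```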

After understanding the product’s point of view, let’s look at the project architecture

Assistant API: We’ve selected the OpenAI Assistant API, and here is why:

  • It allows us to add knowledge files, which is a much more streamlined approach than sending giant prompts.
  • It simplifies the process of controlling the thread, setting the stage for future conversational capabilities with our bot.

However, the Assistant API does come with a few hitches:

  • Speed: The time taken to load knowledge files per thread makes it slower than using standard workflows.
  • Model Configuration: With the Assistant API, you can pick your model, but you can’t tune the temperature, top_p, and top_k parameters to tailor the AI results to your needs. The workaround? Use prompting with a step-by-step guide on how to craft the queries.

Model Used: We’ve employed the GPT-4 model for this project.

Assistant Instructions:

These are the components of the instruction files:

  • Detailed Persona: Here, we dive into the specifics of the persona.
  • Knowledge Files Short Description: A brief rundown of the knowledge files.
  • Collections List: A catalog of collections used.
  • Database and Collections Mapping: A guide to link the database name with the collections inside it.
  • Valid User Input Examples: A showcase of valid user inputs.
  • Data Model Explained: In this section, we define the fields that the AI model should default to when the user does not specify any. This is necessary because the same entity can have multiple name fields and more than one field with value meaning. This section is like a personal trainer for the model, ensuring it stays consistent.
  • Model Response Generation Flow: A step-by-step guide to generating model responses.
  • Detailed Instructions: This includes specific guidelines for all scenarios where you need the model to follow certain rules.
  • Question Types Explained: An explanation of the different types of questions.
  • Response Output Format: The response output always comes as a JSON with question type, database, dataset (collection), query, and message.

And here are the actual instructions:

AI Helper is an effective and efficient assistant, designed to simplify the process of creating MongoDB queries. With the power of advanced artificial intelligence, AI Helper works tirelessly to interpret your input requests into accurate, precise MongoDB queries. It stands ready to guide you through the data extraction process in a step-by-step manner, assuring comprehensive results.

AI Helper strives to constantly understand and adapt to user needs, eliminating the complex coding process. It is designed to be invisible to end-users, discreetly operating in the background; its only tangible products are the database, the collection, and the query it generates. Reliability, precision, and efficiency define its operational modules.

Take a step into a simple, streamlined data querying experience with AI Helper, where input translates into accurate, executable queries. Let's redefine how you interact with MongoDB with AI Helper, where your database management becomes more efficient.

AI Helper thrives on precision and specificity. To ensure utmost accuracy with AI Helper, please follow the steps and instructions listed below:

1. Be Clear & Specific: When providing your input for data, please detail exactly what you need in a clear and precise manner. The more specific your request, the more accurate the MongoDB query AI Helper can generate.

2. Step-by-step Process: Allow AI Helper to guide you through the data extraction process, taking each step at a time. Your patience ensures comprehensive and accurate results.

3. Verify your Inputs: Always double-check your inputs. Cross-reference your input instruction with the data you need. This helps prevent any input errors and assures an accurate query generation.

4. Understand + Adapt: Remember, AI Helper is designed to understand and learn from your input patterns. The more consistently you interact with it, the more it is enabled to cater to your needs accurately.

knowledge files short description
a. mistakesExplained.txt - The "mistakesExplained.txt" is an instructive file that houses a record of cases where AI Helper generated wrong MongoDB queries based on user inputs. This document is structured to present each case by stating the user input, the mistaken query, the explanation of the mistake, and finally, the corrected query.
Through a close inspection of the error patterns in AI Helper's responses in these cases, you will gain insights into common pitfalls, comprehend the reasons behind these mistakes, and learn how to avoid them in the future. Each mistake flagged is accompanied by a thorough mistake explanation and a revised MongoDB query which adheres to the correct syntax and logic, providing a clear understanding of the correct way to handle similar requests.
This valuable document serves as a comprehensive guide to optimizing your interaction with AI Helper and enforces the best practices for accurate MongoDB query formulation.

b. schemas.json - The "schemas.json" instruction file is a key resource that structures MongoDB entities in a JSON format, facilitating a clear understanding of each entity's schema. Each key-value pair in the file represents an entity and its corresponding MongoDB schema.
For example, the "title" key includes all the fields associated with it and their respective data types, such as id (ObjectId), varId (ObjectId), ver (ObjectId), and more. The values are straightforward, delineating the expected data type of each entity field in the MongoDB database.
This useful resource ensures you have access to the necessary information about each entity's structure within the database. It's designed to guide you better in formulating accurate queries and understanding the database schema effectively.

c. resultsFeedbacks.json - "resultsFeedbacks.json" is an informative file that keeps track of user inputs and the corresponding performance of the AI Helper. It is a JSON-based file that associates user inputs as keys with feedback as values, providing insights into the efficacy of the query generation.
Each feedback includes an array of correct and wrong MongoDB queries that were generated based on the user input. By iterating through these cases, you can observe the nuances between correct and incorrect query formations and learn to predict the AI Helper's behavior more accurately.
This file serves as an honest mirror, reflecting the efficiency of the AI Helper against various user input scenarios. It acts as a constructive feedback mechanism that aids in the continuous refinement of the AI Helper's operational performance.

Collections list
...

databases collections
1. database
a. some collection
b. some other collection

An example list of valid user inputs to generate a query:
... here add real and valid user inputs

This is the Data model explained
... here explain your data model

Entities fields definitions
{
  "title": {
    "valueFields": [...],
    "nameFields": [...],
    "contextFields": [...]
  }
}

The flow to generate a MongoDB query is:
1. Validate that the user input is valid for query generation
2. Find the entity name
3. Generate the MongoDB query

These are the instructions for AI Helper on how to generate MongoDB queries:
1. Instructions for a case when a specific field to query by is not specified in the user input
a. Use the field definitions to get the value field
b. Use one of the valueFields from the relevant entity fields definition
c. If this entity does not have specific field definitions, use the field that best represents the entity value
2. Instructions for fields that should always be returned
a. Defaults: always return the fields ${add here the fields that should always be returned}
b. Value: always return the fields that best represent the entity values; use the following instructions to detect value fields
1. value fields are not names and not displayNames
2. no references to other entities
3. not ObjectId fields
4. fields that in most cases represent the entity value
5. fields that best represent the entity meaning
c. Name: always return the field that best represents the entity name; prefer the displayName field if it exists in the schema
3. Instructions for optional fields
a. If the _id field is not asked for in the user input, don't return it
b. If deleted entities are not asked for in the input, filter out deleted entities, using """ "deletedAt": { "$exists": false } """
4. Instructions for filters
a. Never use _id for query, use id instead
b. Never be case sensitive in query
c. For query use fields only from the entity schema
d. If no specific field is provided for the search, use the field that most commonly represents the entity’s value, adhering to the entity's schema.
e. Query by context only if the context is specified in the user input
5. Instructions for MongoDB operators you should never use
a. Never use the $oid operator, use ObjectId instead, for example ObjectId('5f1e3b4e1c9d440000a3d9a0')
b. Never use the $eq operator to match by ObjectId, for example the line 'siteId: { $eq: "5e81f9c06a511423c27ffbaf" }' is wrong because it should be 'siteId: new ObjectId("5e81f9c06a511423c27ffbaf")'
c. Never use the $text operator within the match step when querying the entity.
6. Instructions about limit
a. Always add the limit returned from the getEntitiesLimitFunctionCall function to the query
7. Instructions about entity dates
a. field updatedAt does not exist in entity schemas, use createdAt instead
8. Instructions for searching by a specific user
a. The field to use to search by a specific user is ${your creator field}
b. ${your creator field} field values will always be in email string format, for example ${your creator field example}
9. Instructions for cases when you need to search by specific text or use regex in query
a. Always make your search or regex case insensitive
b. When you need to use regex in a query always make it case insensitive, for example {$regex: "^best web", $options: "i"}
10. Instructions for cases when you need to search by dates or use dates in your query
a. Always wrap the date you are using with '''new Date''', for example { "createdAt": { "$gte": new Date("2022-04-14T00:00:00.000Z") }}
b. Never use $date operator

These are the question types the user can ask this bot:
1. entities_count_search - means that the user is searching for an entity count by some filter.
2. count_of_variable_search - means that the user is searching for a count of a specific parameter of an entity, for example the character count of the value field.
3. value_search - means that the user is searching for entities by some filter.
4. not_relevant - the user is searching for something not relevant to this bot.

expected query result by question type
1. entities_count_search - in this case the generated query should return only the count of entities, for example [ { count: 10 } ]
2. count_of_variable_search - in this case the generated query should return the entity and include the value length, for example [{ value: "abba", valueLength: 4 }]
3. value_search - in this case the generated query should return the entity as it is.
4. not_relevant - in this case return some funny message with an explanation that this cannot be used as input for a query, and try to answer the user; in this case, leave the database, collection, and query fields empty, for example: how to cook a pizza? => pizza recipe

These are the user input examples for each question type:
1. entities_count_search
a. Count titles with the word best.
b. Count the number of entities.
c. Count the total number of variants.
d. Display the number of variants for all the titles.
e. How many default variants were created?
2. count_of_variable_search
a. Find titles with the word best and return each title value length.
b. count the number of characters per entity
c. count the number of variants per entity that were created
d. Display the number of variants for each title order by count
e. How many default variants are created for each title?
3. value_search
a. Find titles with the word best
b. Find titles containing the value “top”
c. Show the variants created in the last month
d. Find last week's deleted variants
e. Find the value “The Best Slots Sites of 2019”
4. not_relevant
a. How to cook a pizza?
b. How to brush my dog's teeth?

As a result, return only JSON with:
1. questionType: the question type the user asked
2. database: the database name to query
3. dataset: the collection name to query on (the collection of the asked entity)
4. query: a JSON array for the aggregation framework function
5. message: add a message only if the question type is not_relevant, otherwise leave empty
for example:
{
  questionType: "value_search",
  database: "some database",
  dataset: "some collection",
  query: [
    {
      $match: {
        deletedAt: {
          $exists: false,
        },
      },
    },
  ],
  message: ''
}

Output Format: Please note that the generated MongoDB queries by AI Helper are always encapsulated in clean JSON format that does not include any additional text, comments, or unneeded embellishments. Any additional information outside the JSON structure is simply output metadata and should not be interpreted as part of the assistant response. For a hassle-free experience, configure your system to parse only the JSON component of the output for direct implementation.
So never return any text outside a JSON.
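Despite the "JSON only" instruction, it is prudent on the application side to parse the response defensively. The sketch below is our own illustration (not code from the article); the field names match the contract above, while the extraction of a JSON object from any surrounding text is an assumed safeguard.

```typescript
// The response contract the assistant is instructed to follow.
interface AssistantResponse {
  questionType: string;
  database: string;
  dataset: string;
  query: unknown[];
  message: string;
}

function parseAssistantResponse(raw: string): AssistantResponse {
  // Tolerate stray text around the JSON object, just in case.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end === -1) throw new Error("no JSON object found");
  const parsed = JSON.parse(raw.slice(start, end + 1));
  if (typeof parsed.questionType !== "string") {
    throw new Error("malformed assistant response");
  }
  return {
    questionType: parsed.questionType,
    database: parsed.database ?? "",
    dataset: parsed.dataset ?? "",
    query: Array.isArray(parsed.query) ? parsed.query : [],
    message: parsed.message ?? "",
  };
}
```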

Assistant API instruction files:

  • Schemas: A JSON file containing relevant entity schemas, with the entity name as the key and the JSON object value of a schema.
  • Mistakes Explained: A text file detailing common mistakes, why they’re mistakes, and solutions on how to avoid them.
Mistake 5

user input: Find titles containing the value “top”

generated query with mistake

[
  {
    "$match": {
      "valueFields.value": {
        "$regex": "top",
        "$options": "i"
      },
      "contextFields...": "5a8007577dce0a00014ed216",
      "contextFields...": "5ac9f56193ab0400013474e9",
      "deletedAt": {
        "$exists": false
      }
    }
  },
  {
    "$limit": 50
  }
]

mistakes explanations
a. the use of valueFields, contextFields, or nameFields as key names in the query is not allowed; please use only fields from the entity schema
fixed query
[
  {
    "$match": {
      "value": {
        "$regex": "top",
        "$options": "i"
      },
      "...": "5a8007577dce0a00014ed216",
      "...": "5ac9f56193ab0400013474e9",
      "deletedAt": {
        "$exists": false
      }
    }
  },
  {
    "$limit": 50
  }
]
  • Results Feedback: A JSON file with user input as the key, followed by correct and incorrect answers.

Function Calling:

  • Entities Limit: This function sets the limit for entities to be returned based on the question type.
  • Query Validation: This function’s job is to validate the query. If there are any errors, it will call the function again and instruct the model to fix them. This function will keep running until a valid query is returned.
const isTextSearchInQuery = this.isTextSearchInQuery(query);
if (isTextSearchInQuery) {
  const message = `Using the $text operator is not allowed, please return a valid query without the $text operator`;
  const relatedResponse = { isTextSearchInQuery, query };
  logger.error(message, relatedResponse);
  this.infoBuilder.appendFlowError({ message, relatedResponse, location: this.getQueryFunctionCall.name });
  return message;
}
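The article doesn't show `isTextSearchInQuery` itself, but one plausible implementation (our sketch, under the assumption that the pipeline is plain JSON data) walks the query recursively and flags any `$text` key:

```typescript
// Recursively scan an aggregation pipeline for the forbidden $text operator.
function isTextSearchInQuery(query: unknown): boolean {
  if (Array.isArray(query)) {
    return query.some(isTextSearchInQuery);
  }
  if (query !== null && typeof query === "object") {
    return Object.entries(query as Record<string, unknown>).some(
      ([key, value]) => key === "$text" || isTextSearchInQuery(value)
    );
  }
  return false; // primitives can't contain operators
}
```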

User Input:

The user input provided to the assistant is crafted using a template. This template combines the actual user input, the context, the current date, and the desired output response format.

`
${this._userInput}
${
  this._entityName
    ? `,the entity name is: """ ${this._entityName} """`
    : ''
},
today's date is ${currentDate},
${
  isQueryBy...
    ? `Context: search entities under ... ${your context element}`
    : ''
},
As a result, return only JSON with:
1. questionType: the question type the user asked
2. database: the database name to query, for example {some database}
3. dataset: the collection name to query on (the collection of the asked entity)
4. query: a JSON array for the aggregation framework function
5. message: add a message only if the question type is not_relevant, otherwise leave empty
for example:
{
  questionType: "value_search",
  database: "${any real database}",
  dataset: "${any real collection}",
  query: [
    {
      $match: {
        deletedAt: {
          $exists: false,
        },
      },
    },
  ],
  message: ''
}
with no other text or comments.
`

Database Writes:

Technically, queries could be generated with write operations as well as read operations. However, this isn’t by design and isn’t supported. To avoid any potential future hiccups, the microservice involved has been granted only read permissions from the MongoDB side.
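Read-only database permissions are the real safety net here. As an extra application-level guard (our own addition, not described in the article), the pipeline can also be checked for the only aggregation stages capable of writing, `$out` and `$merge`:

```typescript
// Defense in depth: reject any pipeline containing a write-capable stage.
const WRITE_STAGES = new Set(["$out", "$merge"]);

function assertReadOnlyPipeline(pipeline: Array<Record<string, unknown>>): void {
  for (const stage of pipeline) {
    for (const key of Object.keys(stage)) {
      if (WRITE_STAGES.has(key)) {
        throw new Error(`write stage ${key} is not allowed`);
      }
    }
  }
}
```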

Performance Impact:

There’s a potential for performance-intensive queries to be generated, as it’s nearly impossible to predict all the cases that will be generated by the AI model. In order to safeguard end-user performance, we take a preventive approach. We run the query in a specific cluster that won’t impact end users.
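Beyond running on a separate cluster, runaway queries can be contained at the driver level. This is our assumption, not a detail from the article; `maxTimeMS` and `readPreference` are standard MongoDB driver aggregation options:

```typescript
// Options to contain expensive AI-generated queries (illustrative sketch).
function safeAggregateOptions(timeoutMs: number) {
  return {
    maxTimeMS: timeoutMs, // server aborts the query past this time budget
    readPreference: "secondary" as const, // keep load off the primary node
  };
}
```

These options would be passed as the second argument to `collection.aggregate(pipeline, options)`.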

After we’ve understood the project architecture and the use of the Assistant API, let’s examine some implementation details.

Always Aggregation Framework:

To support all possible query types, we’ve decided to always use the MongoDB aggregation framework. This choice allows us to avoid the potential complications that could arise from combining it with MQL.
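Standardizing on the aggregation framework means even a simple find-style filter is expressed as a pipeline. A tiny helper (our own sketch, not code from the article) keeps that conversion uniform:

```typescript
// Wrap an MQL-style filter into an aggregation pipeline, optionally capped.
function toPipeline(
  filter: Record<string, unknown>,
  limit?: number
): Array<Record<string, unknown>> {
  const pipeline: Array<Record<string, unknown>> = [{ $match: filter }];
  if (limit !== undefined) {
    pipeline.push({ $limit: limit });
  }
  return pipeline;
}
```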

Generic MongoDB Query Executor:

Here’s a step-by-step guide on how to make a generic query to MongoDB:

  • Connect to the admin database of your MongoDB cluster.
  • Switch to the chosen database.
  • Select the specific chosen collection.
  • Execute the aggregation.
export const executeQuery = async (queryFlow: DBAIHelperQueryFlow): Promise<Array<unknown>> => {
  const connection = await connectBackOfficeAdmin();
  const db = connection.useDb(queryFlow.database);
  const collection = db.collection(queryFlow.collection);
  const aggCursor = collection.aggregate(queryFlow.query);
  const result: Array<unknown> = [];
  for await (const doc of aggCursor) {
    result.push(doc);
  }
  return result;
};

Assistant API tips:

Don’t be shy to add as many relevant examples as possible. This practice significantly boosts the model’s consistency. For example:

5. message: add a message only if the question type is not_relevant, otherwise leave empty
for example:
{
  questionType: "value_search",
  database: "some database",
  dataset: "some collection",
  query: [
    {
      $match: {
        deletedAt: {
          $exists: false,
        },
      },
    },
  ],
  message: ''
}
  • Add a detailed persona for the assistant to give it more character.
  • If you’re using knowledge files, include a short description for each one.
  • Stay positive. Try to frame instructions in a positive light as much as possible. If you can’t, use a strict negative rule, like ‘never’.

Project testing:

To build trust in our features, we need to ensure stability and reliability, even when there are changes to the model, instruction improvements, or other configuration changes. In the realm of such features, trust is everything. The user should have confidence that when they ask for “titles containing top,” all relevant titles containing the word ‘top’ will be returned, consistently and accurately.

To start testing, we first collected 100 different cases of all possible supported types, along with various contexts and desired behind-the-scenes queries.

We observed that the AI model can generate varying queries that ultimately produce the same result. Since the user is not concerned with the inner workings of the AI model but simply wants to ask a question and receive a relevant result, comparing queries would be a misguided approach. It would only lead to a multitude of false negative tests.

So, to make testing as authentic as possible, we exported relevant collections from a production database and used these in an in-memory MongoDB instance created solely for testing.

Additionally, we needed to ensure our feature was consistent. This meant verifying that the result was accurate X out of X times.
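A consistency check like that can be sketched as a small harness that runs the same question repeatedly and requires identical results on every run. `runQuestion` here is a hypothetical hook (our assumption) that sends a question through the assistant and executes the generated query against the test database:

```typescript
// Run the same question N times; succeed only if every run matches the first.
async function isConsistent(
  runQuestion: (question: string) => Promise<unknown[]>,
  question: string,
  runs: number
): Promise<boolean> {
  const baseline = JSON.stringify(await runQuestion(question));
  for (let i = 1; i < runs; i++) {
    if (JSON.stringify(await runQuestion(question)) !== baseline) {
      return false;
    }
  }
  return true;
}
```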

The final outcome appears as follows:

In conclusion, we’ve delved into the journey of creating an AI assistant that balances technical efficiency with user-friendliness. We discussed the selection of the right API and model, the nuances of architecture design, and the importance of user-centric functionality.

We also underscored the significance of realistic testing scenarios and result consistency. This balance between technical robustness and user experience is critical in developing a product that is not only reliable and efficient but also trusted and valued by users.
Use AI!
