AI 101 : Retrieval & Knowledge Bases for Beginners Part 2 — Vectors

Alozie Igbokwe
8 min read · Jul 30, 2024


From Basics to Technical Depth: How AI Knowledge Bases Really Work

In our previous article “AI 101: Retrieval & Knowledge Bases for Beginners Part 1,” we introduced the concept of knowledge bases and their use cases. Now, we’re diving deeper into the technical aspects of how knowledge bases work and how AI systems utilize them. While understanding basic concepts is valuable, grasping the underlying technical details can elevate your comprehension to the next level.

This article will explore the intricate mechanisms behind AI-powered knowledge retrieval, covering topics such as indexing, vector databases, similarity search, and context injection. Let’s delve into the inner workings of these powerful AI systems.

1. Indexing and Vector Database Creation

The first crucial step in making knowledge bases accessible to AI is the process of indexing and creating vector databases. This step transforms raw text data into a format that AI can efficiently process and understand.

1a. Indexing

Indexing involves breaking large amounts of text (which would be your Knowledge Base) into smaller, manageable pieces. This step is vital because it:

  1. Allows for precise information retrieval, enabling AI to match specific paragraphs or sentences rather than entire documents.
  2. Improves efficiency by making it faster to search through smaller pieces of text.
  3. Enables nuanced understanding, as different parts of a document might be relevant for different queries.

For example, instead of treating a 300-page book about dogs as one unit, we break it down into individual paragraphs. This way, when an AI is asked about “dog grooming,” it can quickly locate and retrieve the specific paragraphs about grooming.
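
To make this concrete, here is a minimal sketch of what that chunking step might look like, assuming a hypothetical dog_book.txt file and simple paragraph-based splitting; real systems often chunk by token count with overlapping windows instead.

```python
# A minimal sketch of indexing: split a large document into paragraph-sized chunks.
# The file name and paragraph-based splitting are illustrative assumptions.

def chunk_document(text: str, min_length: int = 40) -> list[str]:
    """Split text on blank lines and keep paragraphs long enough to be useful."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if len(p) >= min_length]

book_text = open("dog_book.txt", encoding="utf-8").read()  # hypothetical file
chunks = chunk_document(book_text)

print(f"{len(chunks)} chunks ready to be embedded and indexed")
print(chunks[0][:100])  # preview the first chunk
```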

1b. Vector Databases

After indexing, the next step is to create vector databases. This process involves converting text into mathematical vectors, which are essentially lists of numbers. Each vector represents a piece of text, with the numbers in the vector capturing various characteristics of that text.

As mentioned earlier, these vectors are then stored in specialized databases called vector databases.

Using vectors instead of plain text offers several advantages:

  1. Efficiency: Computers process numbers faster than text

Imagine you’re at an international conference where everyone speaks a different language. There’s a universal translation system that converts all languages into a series of hand gestures.

Text processing is like having to listen to each person speak their native language, then mentally translating it to understand.
Number processing is like directly observing the hand gestures.

Computers, at their core, operate on binary (0s and 1s). Numbers are closer to this “native language” of computers. Processing text requires additional steps of conversion and interpretation, much like how you’d need to translate spoken words in various languages. Observing hand gestures (numbers) is immediate and requires no translation.

2. Similarity Comparison: Easier to compare vectors mathematically

Think of comparing fruits. If you’re comparing apples and oranges based on text descriptions, you’d have to read and interpret each description. But what if each fruit was represented by three numbers: [sweetness, size, juiciness]?

Apple: [7, 5, 6]
Orange: [6, 6, 8]

Now, comparing them is as simple as comparing these numbers. You can instantly see the orange is juicier and slightly larger, while the apple is a bit sweeter. This numerical comparison is much faster and more precise than comparing text descriptions.
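
If you like seeing things in code, here is a tiny sketch of that comparison using the made-up sweetness/size/juiciness numbers above:

```python
# Each fruit is a vector of [sweetness, size, juiciness] (illustrative numbers from above).
apple = [7, 5, 6]
orange = [6, 6, 8]

traits = ["sweetness", "size", "juiciness"]

# Comparing vectors is just comparing numbers, position by position.
for trait, a, o in zip(traits, apple, orange):
    if a > o:
        print(f"The apple wins on {trait} ({a} vs {o})")
    elif o > a:
        print(f"The orange wins on {trait} ({o} vs {a})")
    else:
        print(f"They tie on {trait} ({a})")
```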

3. Dimensionality: Vectors capture many aspects in a compact form

Imagine describing a person. In text, you might write: “John is tall, friendly, intelligent, hardworking, and good at sports.”

That’s a lot of words.

Now, represent John as a vector: [Height, Friendliness, Intelligence, Work Ethic, Athletic Ability]

John: [9, 8, 7, 9, 8]

This compact representation captures all those qualities in just five numbers. It’s not only more space-efficient but also allows for quick, precise comparisons with other people represented in the same way.

By using these vector representations, computers can efficiently store, process, and compare vast amounts of information, making complex tasks like searching through massive knowledge bases much faster and more effective.

1c. Embeddings: How Text Becomes Vectors

The process of turning text into vectors like [4, 3, 6] is called “embedding”. It’s a complex process typically done by neural networks trained on vast amounts of text.

In a vector embedding system, each dimension (or number in the vector) represents a distinct characteristic or feature of the word. These characteristics are not predetermined by humans; rather, they emerge from the AI’s analysis of vast amounts of text data during training. Here’s a more detailed explanation:

  1. Training Process:
  • The AI is fed millions of sentences and documents.
  • It analyzes how words are used in various contexts, their relationships to other words, and patterns in their usage.
  • Through this analysis, the AI develops a multidimensional understanding of each word.

2. Emerging Characteristics:

  • As the AI processes more data, it starts to recognize patterns and associations.
  • These patterns become the basis for the different dimensions in the vector.
  • Each dimension/number captures a different aspect of the word’s usage and meaning.

3. Vector Representation:

  • In our example [4, 3, 6] for “dog”, each number represents a different characteristic:
  • 4: Concreteness — This might emerge from the AI noticing that “dog” is often used in tangible, physical contexts.
  • 3: Sentiment — This could arise from analyzing how “dog” is frequently used in positive contexts (like “loyal friend”) but sometimes in negative ones (“it’s raining cats and dogs”).
  • 6: Frequency — This likely comes from the AI recognizing how often “dog” appears in the training data compared to other words.

4. Interpretation:

  • It’s important to note that in real-world applications, these vectors often have hundreds or thousands of dimensions/numbers (e.g., [4, 5, 2, 4, 5, …]), making them much more nuanced and complex.

5. Comparison and Relationships:

These vector representations allow the AI to mathematically compare words.

For example:

Dog: [4, 3, 6]
Cat: [4, 2, 5]

In this system:

  • 4: Concreteness — Both are equally concrete, often used in tangible, physical contexts.
  • 3 vs 2: Sentiment — Dogs are slightly more positive, possibly due to phrases like “loyal friend”.
  • 6 vs 5: Frequency — “Dog” appears more often in the training data than “cat”.

Another example shows how words with multiple meanings are represented:

Bank (financial institution): [5, 3, 4, 1, 4, 7, 6]
Bank (river bank): [4, 2, 2, 5, 1, 7, 6]

These vectors show similarities:

  • Concreteness: Both refer to solid things you can touch (5 vs 4)
  • Spelling: Exactly the same (7 vs 7)
  • Part of speech: Both used as nouns (6 vs 6)

And differences:

  • Frequency: The financial bank is discussed more often (4 vs 2)
  • Nature-relatedness: The river bank is more associated with nature (1 vs 5)
  • Economic-relatedness: The financial bank is more about money (4 vs 1)

This allows a computer to distinguish between different meanings of “bank” based on their vectors, even though they have some very similar features.
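
To make that concrete, here is a small sketch that lines the two “bank” vectors up dimension by dimension and flags where they agree and where they diverge. The dimension labels follow the example above; the “sentiment” label for the second slot is an assumption by analogy with the earlier dog example.

```python
# Illustrative dimension labels, following the examples above
# (the "sentiment" label for the second slot is an assumption).
dims = ["concreteness", "sentiment", "frequency",
        "nature-relatedness", "economic-relatedness", "spelling", "part of speech"]

bank_financial = [5, 3, 4, 1, 4, 7, 6]
bank_river = [4, 2, 2, 5, 1, 7, 6]

# Where do the two senses agree, and where do they diverge?
for name, f, r in zip(dims, bank_financial, bank_river):
    status = "same" if f == r else f"differs by {abs(f - r)}"
    print(f"{name:22s} financial={f}  river={r}  -> {status}")
```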

6. Continuous Learning:

  • As the AI is exposed to more data, these vector representations can be refined and updated, capturing evolving language usage and meanings.
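
In practice, you wouldn’t hand-pick these numbers: you’d ask a pretrained embedding model for them. Here is a minimal sketch using the open-source sentence-transformers library; the model name is just one common choice, and note that real embeddings have hundreds of dimensions rather than the three we’ve been using for illustration.

```python
# A minimal sketch of real embeddings, assuming the sentence-transformers
# package is installed (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is one common general-purpose embedding model;
# any embedding model is used the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

vectors = model.encode(["dog", "cat", "bank"])

print(vectors.shape)   # e.g. (3, 384): 384 numbers per word, not 3
print(vectors[0][:5])  # the first few numbers of the "dog" vector
```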

2. Vector Comparison: How AI Finds the Right Information

Once AI has transformed words into vectors through embedding, it uses mathematical operations to determine which information in its knowledge base is most relevant to your question.

Here’s how it works:

Imagine you’re asking an AI about different types of pets. Here’s how it might use similarity search to find the most relevant information:

  1. Your Question Becomes a Vector

You ask: “What are some good furry pets for apartments?”

AI converts this to a vector: [0.8, 0.6, 0.3]

In this simple example, these numbers might roughly represent:

  • 0.8 — how much it’s about pets
  • 0.6 — how much it’s about fur
  • 0.3 — how much it’s about living spaces

2. Comparing to the Knowledge Base

Let’s say the AI’s knowledge base has information about different animals, each represented by a vector:

  • Dog: [0.9, 0.7, 0.2]
  • Cat: [0.7, 0.3, 0.8]
  • Fish: [0.1, 0.2, 0.9]
  • Hamster: [0.6, 0.8, 0.7]

3. Calculating Similarity

The AI uses cosine similarity to compare your question vector to each animal vector.

To do this, it applies a mathematical formula that calculates the cosine of the angle between two vectors. This formula depends only on the direction of the vectors, not their magnitude, and produces a value between -1 and 1. A value closer to 1 indicates higher similarity.

The goal of this formula is to quantify how “alike” two vectors are in terms of their orientation in multi-dimensional space, regardless of their magnitude. This allows the AI to compare the conceptual similarity of your question to each item in its knowledge base.

Without diving into the math, here’s what the results might look like:

  • Similarity to Dog: 0.97
  • Similarity to Cat: 0.88
  • Similarity to Fish: 0.41
  • Similarity to Hamster: 0.95

4. Finding the Best Matches

The AI ranks these results from highest to lowest:

  1. Dog (0.97)
  2. Hamster (0.95)
  3. Cat (0.88)
  4. Fish (0.41)
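
For readers who do want a peek at the math, here is a small sketch of steps 3 and 4: it computes cosine similarity between the question vector and each animal vector, then ranks the results. Because the vectors are toy numbers, the exact scores it prints differ slightly from the illustrative ones above, but the ranking (Dog, Hamster, Cat, Fish) comes out the same.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.8, 0.6, 0.3]  # "What are some good furry pets for apartments?"

knowledge_base = {
    "Dog":     [0.9, 0.7, 0.2],
    "Cat":     [0.7, 0.3, 0.8],
    "Fish":    [0.1, 0.2, 0.9],
    "Hamster": [0.6, 0.8, 0.7],
}

# Score every item and rank from most to least similar to the query.
ranked = sorted(knowledge_base.items(),
                key=lambda item: cosine_similarity(query, item[1]),
                reverse=True)

for animal, vector in ranked:
    print(f"{animal}: {cosine_similarity(query, vector):.2f}")
```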

5. Retrieving Relevant Information

Based on these scores, the AI determines that dogs, hamsters, and cats are the most relevant to your question about furry pets for apartments. Fish, having a low similarity score, would be considered irrelevant.
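
One simple way to decide what counts as “irrelevant” is a similarity threshold, as in this tiny sketch (the 0.6 cutoff is an arbitrary assumption, not a standard value):

```python
# Keep only results above an (assumed) similarity threshold.
scores = {"Dog": 0.97, "Hamster": 0.95, "Cat": 0.88, "Fish": 0.41}
threshold = 0.6

relevant = [animal for animal, score in scores.items() if score >= threshold]
print(relevant)  # ['Dog', 'Hamster', 'Cat'] -- Fish is filtered out
```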

6. Context Injection

Before crafting its response, the AI injects additional context into the user’s query.

This process involves:

  • Identifying key concepts in the query (e.g., “furry pets”, “apartments”)
  • Retrieving related information from its knowledge base, prioritizing animals that match the “furry” description (e.g., dogs, cats, hamsters) and their specific care requirements in apartment settings
  • Integrating this information with the original query

The AI then processes this enriched query, allowing for a more comprehensive and relevant response.
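
Here is a minimal sketch of what that context injection could look like in code: the retrieved chunks are pasted into the prompt ahead of the user’s question before it goes to the language model. The snippets and prompt wording are made up for illustration.

```python
# A minimal sketch of context injection: prepend retrieved knowledge-base
# chunks to the user's question before sending it to a language model.
# The retrieved snippets and prompt wording below are illustrative assumptions.

user_query = "What are some good furry pets for apartments?"

retrieved_chunks = [
    "Dogs: smaller breeds adapt well to limited space but need daily walks.",
    "Hamsters: small, quiet, and easy to care for in compact cages.",
    "Cats: generally well-suited to indoor living; need a litter box and scratching post.",
]

enriched_prompt = (
    "Answer the question using the context below.\n\n"
    "Context:\n"
    + "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    + f"\n\nQuestion: {user_query}"
)

print(enriched_prompt)  # this enriched prompt is what gets sent to the model
```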

7. Crafting a Response

The AI might then formulate a response like: “For furry pets well-suited to apartment living, you might consider:

  1. Dogs, especially smaller breeds that don’t need as much space.
  2. Hamsters, which are small and easy to care for in limited spaces.
  3. Cats, which are generally well-adapted to indoor living.

Each of these animals is furry and can make a great apartment pet, though they have different care requirements.”

This process allows the AI to quickly identify and prioritize the most relevant information from its knowledge base, even when the query doesn’t use exact matching words. It can understand that ‘furry pets’ and ‘apartments’ are key concepts in the question and find information that best aligns with these ideas.

Conclusion

To wrap up, we’ve covered how AI systems represent and process language using vectors. Next, we’ll take a closer look under the hood. We’ll explore the technical side, including embedding models and vector databases, to see how these systems are actually built and set up. This deeper dive will shed light on the inner workings of AI language processing and retrieval mechanisms.

Continue our AI 101 series to master the essential knowledge for building AI tools and agents, explained simply for beginners.



Alozie Igbokwe

I write articles on AI Agents. Here is my YT Channel if you want to see walkthroughs on AI Agents you can build for your business: https://shorturl.at/5s1tN