Vector Search: Unlocking the Power of Unstructured Data

Ark Mahata
Tech Padawan Chronicles
12 min read · Jun 25, 2023

I must admit upfront that I’m not an expert in AI/ML. However, like many others, I’ve always been curious about the visual search feature on Pinterest. I stumbled upon MongoDB’s vector search launch, and it caught my attention. Seeing someone effortlessly find a green t-shirt using vector search got me thinking. So here I am, with a basic understanding, eager to share my insights with you in this article.

Introduction

As we begin, it’s essential to recognize the sheer scale of data being generated. From social media posts and emails to images and documents, this vast sea of information holds tremendous potential for valuable insights and knowledge. However, most of this data is unstructured, and that presents a significant challenge. In this article, we will explore the concept of vector search, an incredibly powerful tool for making sense of unstructured data.

From 2020 to 2025, International Data Corporation (IDC) forecasts new data creation to grow at a compound annual growth rate (CAGR) of 23%, resulting in a staggering 175ZB of data creation by 2025. Furthermore, an estimated 80% of this data will be unstructured, highlighting the need for effective methods to unlock its value.

Within this context, vector search emerges as a transformative solution. By representing unstructured data using vectors and utilizing specialized vector databases, we can efficiently organize, retrieve, and analyze this data, allowing us to uncover meaningful patterns and insights from unstructured data sources.

So, let’s dive into the wonderful world of unstructured data and vector search.

Understanding Unstructured Data

Unstructured data presents a wide range of complexities and diversities, setting it apart from structured or semi-structured data. Unlike structured data, unstructured data can take any form, be it text, images, audio, or video, and it can vary in size from tiny snippets to extremely large files. This poses a significant challenge when it comes to transforming and indexing such data efficiently.

Imagine you come across a unique and captivating bird during a hiking trip. You admire its vibrant colors, distinctive beak, and melodious chirping. Eager to learn more about it, you attempt to search for information online. However, since you don’t know the bird’s name, your search queries like “colorful bird with unique beak and beautiful song” or “bird found on hiking trail with vibrant feathers” might not yield accurate results.

In such instances, the limitations of traditional keyword-based searches become evident. They heavily rely on predefined terms and precise naming conventions. But what if you could search using the visual features or descriptions that you remember? What if you could find that fascinating bird by uploading a picture or describing its characteristics?

Representing Unstructured Data

Representation of unstructured data plays a crucial role in making sense of its inherent complexity. While unstructured data lacks a predefined structure, it is essential to find ways to represent it in a structured form to enable efficient processing and analysis. This is where the concept of vector embeddings comes into play.

Vector embeddings, also known as feature embeddings, are mathematical representations of objects or entities in a vector space. They capture the essential features and characteristics of the object, allowing for meaningful comparisons and analysis.

In the case of our bird — “bird found on hiking trail with vibrant feathers” (Rainbow Lorikeet), a vector embedding can be created to represent its unique attributes. These attributes may include vibrant colors, distinctive beak shape, and other relevant features that define the bird. By transforming these features into numerical values, we can create a vector representation that captures the essence of the Rainbow Lorikeet.

Rainbow lorikeet

A crash course on embeddings:
For example, let’s consider a simplified vector representation for the Rainbow Lorikeet with three dimensions: colorfulness, beak curvature, and melodiousness. The values in each dimension can be scaled or normalized to fit within a certain range.

A possible vector embedding for a Rainbow Lorikeet could be [0.8, 0.9, 0.7], where a higher value in the “colorfulness” dimension indicates more vibrant colors, a higher value in the “beak curvature” dimension represents a more distinctive beak shape, and a higher value in the “melodiousness” dimension suggests a more melodious chirping.

Vector embeddings provide a powerful way to represent and compare objects in a structured and numerical form, enabling various applications such as similarity search, recommendation systems, and clustering analysis. They allow us to leverage mathematical techniques to gain insights and make meaningful connections between different entities, like the Rainbow Lorikeet and other birds, based on their shared characteristics.
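
To make this concrete, here is a minimal Java sketch that compares the toy Rainbow Lorikeet vector from above against a second, invented bird vector using cosine similarity, one of the standard similarity measures for embeddings. The second bird’s values are made up purely for illustration.

// A minimal sketch: comparing two toy bird embeddings with cosine similarity.
// The second bird's values are invented purely for illustration.
public class BirdSimilarity {

    // Cosine similarity: dot(a, b) / (|a| * |b|); values near 1.0 mean
    // the two vectors point in nearly the same direction.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Dimensions: colorfulness, beak curvature, melodiousness
        double[] rainbowLorikeet = {0.8, 0.9, 0.7};
        double[] commonSparrow   = {0.2, 0.3, 0.4}; // hypothetical comparison bird
        System.out.printf("Similarity: %.3f%n",
                cosineSimilarity(rainbowLorikeet, commonSparrow));
    }
}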

Overall, by transforming different types of unstructured data into vectors using techniques like word embeddings for text and image embeddings for images, we can bridge the gap between unstructured and structured data. This structured representation enables powerful mathematical operations, similarity comparisons, and advanced algorithms to be applied, facilitating efficient search, analysis, and retrieval of insights from unstructured data.

Unstructured data processing

Let’s explore how we handle and analyze unstructured data before we discuss vector databases. When dealing with structured or semi-structured data, finding or filtering specific items in the database is relatively straightforward. For instance, querying MongoDB for the bird named Rainbow lorikeet can be achieved with a code snippet like this in Java:

// Requires: import static com.mongodb.client.model.Filters.eq;
Document document = collection.find(eq("Bird", "Rainbow lorikeet")).first();

This querying approach is similar to that of traditional relational databases, which rely on SQL statements to filter and retrieve data. The underlying principle remains consistent: databases for structured or semi-structured data use mathematical or logical operators to filter and query information based on numerical values or strings. However, traditional databases are deterministic systems that provide exact matches for a given set of filters.

Image source: https://newsroom.pinterest.com/en/post/our-crazy-fun-new-visual-search-tool

In contrast, vector databases operate differently. Instead of using SQL statements or data filters, queries are performed by specifying an input query vector: the embedding of the unstructured item you want to match. In Java, such a query might look like this (the exact client API varies by database):

// Illustrative client call: retrieve the 10 vectors nearest to `embedding`
List<SearchResult> results = collection.search(embedding, "embedding", params, 10);

Internally, large-scale queries on unstructured data collections utilize a suite of algorithms known as approximate nearest neighbor search (ANN search). This optimization technique aims to find the closest point or set of points to a given query vector.

Vector database

A vector database is a specialized database management system (DBMS) that is specifically designed to store and handle vector embeddings. It employs innovative techniques for efficient storage, indexing, and query processing of high-dimensional vectors. Vector databases provide essential data management capabilities, such as CRUD operations, and offer bindings to popular data science languages like Python, SQL, and Java, as well as to frameworks like TensorFlow. They also incorporate advanced features like high-speed ingestion, sharding, and replication.

The primary purpose of vector databases is to address critical query and algorithmic styles that are commonly encountered in applications such as similarity search, anomaly detection, observability, fraud detection, and IoT sensor analytics. These emerging styles have become increasingly important due to the digital transformation and the rise of generative AI.

For example, Amazon utilizes similarity search to suggest personalized content to users, such as music or movies based on their interests and the preferences of others. Similarity search extends beyond Amazon and finds applications in recommendation systems, fraud detection by identifying access patterns, image similarity analysis, and even detecting data quality issues.

Similarity Search

The foundational use case for vector databases is similarity search. In traditional data stores, manual tagging of items with metadata is required to identify common characteristics. However, vector databases offer a superior solution by utilizing vector embeddings that encode item representations.

For example, let’s consider a movie recommendation system. Instead of relying on manual tagging, vector embeddings capture the characteristics of movies such as style, story narrative, tone, topic, and demographics. These embeddings enable efficient comparison and measurement of similarity between movies based on their positions in a high-dimensional space.

By analyzing the distance or similarity between entities in this space, vector databases can quickly search for similar items and provide personalized recommendations. This approach allows for more accurate and effective recommendation systems, even when dealing with vast datasets and diverse user preferences.
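
As a toy illustration of this idea, the following Java sketch ranks a handful of hand-made movie vectors against a user’s taste vector and returns the top matches. The titles, dimensions, and values are all invented; a real system would use learned embeddings and an ANN index rather than a linear scan.

import java.util.*;

// A minimal sketch of brute-force top-k recommendation over toy movie vectors.
// Titles, dimensions (action, romance, humor), and values are invented.
public class MovieRecommender {

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static void main(String[] args) {
        Map<String, double[]> movies = Map.of(
                "Space Battle",   new double[]{0.9, 0.1, 0.3},
                "Love in Paris",  new double[]{0.1, 0.9, 0.4},
                "Robot Uprising", new double[]{0.8, 0.2, 0.2},
                "Romcom Weekend", new double[]{0.2, 0.8, 0.9});

        double[] userTaste = {0.85, 0.15, 0.25}; // a user who prefers action films

        // Rank every movie by similarity to the user's taste vector; keep top 2.
        movies.entrySet().stream()
              .sorted(Comparator.comparingDouble(
                      (Map.Entry<String, double[]> e) -> -cosine(userTaste, e.getValue())))
              .limit(2)
              .forEach(e -> System.out.printf("%s (%.3f)%n",
                      e.getKey(), cosine(userTaste, e.getValue())));
    }
}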

The ability of vector databases to simplify the encoding of multiple dimensions and facilitate swift similarity search is invaluable for various applications, including customer experience enhancement, fraud detection, and product recommendation engines. By leveraging the power of vector embeddings and similarity search, businesses can deliver tailored and relevant content to their users, improve fraud detection capabilities, and enhance the overall customer experience.

Automating vector embeddings

Neural networks play a crucial role in automating the process of generating vector embeddings. By leveraging deep learning techniques, neural networks can learn intricate patterns and representations from raw data, transforming them into meaningful vector embeddings.

Here’s how the process typically works:

  1. Data Preparation: The input data, such as images, text, or audio, is preprocessed and transformed into a suitable format for neural network training.
  2. Neural Network Architecture: A neural network model is designed with layers of interconnected nodes, also known as neurons. Each neuron performs calculations on the input data and passes the output to the next layer. The network architecture can vary depending on the specific task and data type.
  3. Training: The neural network is trained on a labeled dataset, where the input data and their corresponding target outputs are provided. Through an iterative process called backpropagation, the network adjusts its internal parameters to minimize the difference between predicted outputs and actual outputs.
  4. Feature Extraction: During the training process, the neural network learns to extract relevant features from the input data. These features capture important patterns and representations that are relevant to the task at hand.
  5. Vector Embedding Generation: The output of one of the intermediate layers of the neural network, often referred to as the embedding layer, forms the vector representation of the input data. This layer captures the essential characteristics and learned features of the data in a compact vector form.
  6. Utilizing Vector Embeddings: The generated vector embeddings can be used for various purposes, such as similarity search, recommendation systems, clustering, or any other task that benefits from measuring similarities or distances between data points.

By automating the generation of vector embeddings through neural networks, one can efficiently extract rich representations from complex data types like images, text, or audio. This automation enables more accurate and scalable solutions for tasks that require vector-based computations and analysis.
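
To make step 5 above a little less abstract, here is a toy Java sketch in which the activations of a small intermediate layer serve as the embedding. The weights here are random stand-ins for learned parameters; in practice you would obtain embeddings from a trained model through a deep learning framework.

import java.util.Arrays;
import java.util.Random;

// Toy illustration of step 5: the activations of an intermediate
// (embedding) layer serve as the vector representation of the input.
// Weights are random stand-ins; a trained network would have learned them.
public class EmbeddingLayerSketch {

    public static void main(String[] args) {
        int inputDim = 8, embeddingDim = 3;
        Random rng = new Random(42);

        // Stand-in for the learned weights of the embedding layer.
        double[][] weights = new double[embeddingDim][inputDim];
        for (double[] row : weights)
            for (int j = 0; j < inputDim; j++) row[j] = rng.nextGaussian();

        // A preprocessed input (step 1), e.g. binary presence of 8 features.
        double[] input = {1, 0, 0, 1, 0, 1, 1, 0};

        // Forward pass through the embedding layer: embedding = tanh(W * input).
        double[] embedding = new double[embeddingDim];
        for (int i = 0; i < embeddingDim; i++) {
            double sum = 0;
            for (int j = 0; j < inputDim; j++) sum += weights[i][j] * input[j];
            embedding[i] = Math.tanh(sum);
        }

        System.out.println(Arrays.toString(embedding)); // a compact 3-d embedding
    }
}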

Inverted multi-index

The inverted multi-index is a data structure used in similarity search to efficiently index and retrieve vectors based on their similarity to a query vector. It combines ideas from inverted files and vector quantization to achieve fast approximate nearest neighbor search.

Here is a simplified explanation of how the inverted multi-index works, followed by a concrete example:

  1. Indexing:
    - Given a set of high-dimensional vectors, the inverted multi-index starts by partitioning the vectors into clusters using a vector quantization technique like k-means.
    - Each cluster is represented by a centroid vector, which serves as a reference for the vectors belonging to that cluster.
    - For each centroid, an inverted list is created. The inverted list contains pointers to the vectors that are closest to that centroid.
  2. Querying:
    - When a query vector is provided, it is compared to the centroids to identify the closest centroid(s) using a distance metric like Euclidean distance.
    - The inverted lists associated with the closest centroids are then retrieved.
    - Within each inverted list, the vectors are ranked based on their distance or similarity to the query vector. The top-k most similar vectors are returned as the query results.

Let’s consider a dataset of bird images represented as high-dimensional feature vectors. The inverted multi-index is built as follows:

  1. Indexing:
    - The image vectors are clustered using k-means into, say, 100 clusters. Each cluster has a centroid vector.
    - For each centroid, an inverted list is created, containing references to the images closest to that centroid.
  2. Querying:
    - A query image vector is provided.
    - The query image is compared to the centroid vectors to determine the closest centroid(s).
    - The inverted lists associated with the closest centroids are retrieved.
    - Within each inverted list, the images are ranked by similarity to the query image.
    - The top-k most similar images are returned as the query results.

By leveraging the inverted multi-index structure, the search process is accelerated as it allows for efficient pruning of candidate vectors and focuses the search on the most relevant clusters and vectors.
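
Below is a minimal Java sketch of the core idea, simplified to a single flat inverted index: cluster the vectors, keep one inverted (posting) list per centroid, and at query time scan only the list of the closest centroid. The data, the number of clusters, and the centroid seeding are all illustrative.

import java.util.*;

// A minimal sketch of inverted-list indexing (the idea behind the inverted
// multi-index, simplified to a single flat index): one posting list per
// centroid, and queries scan only the closest centroid's list.
public class InvertedIndexSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static int nearestCentroid(double[] v, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (dist(v, centroids[c]) < dist(v, centroids[best])) best = c;
        return best;
    }

    public static void main(String[] args) {
        double[][] data = {
                {0.9, 0.1}, {0.8, 0.2}, {0.85, 0.15},  // cluster near (0.85, 0.15)
                {0.1, 0.9}, {0.2, 0.8}, {0.15, 0.85}}; // cluster near (0.15, 0.85)

        // Indexing: centroids seeded from representative data points (a real
        // index would run k-means), then one inverted list per centroid.
        double[][] centroids = {data[0].clone(), data[3].clone()};
        Map<Integer, List<Integer>> invertedLists = new HashMap<>();
        for (int id = 0; id < data.length; id++)
            invertedLists.computeIfAbsent(
                    nearestCentroid(data[id], centroids), c -> new ArrayList<>()).add(id);

        // Querying: find the closest centroid, then rank only that list.
        double[] query = {0.82, 0.18};
        int closest = nearestCentroid(query, centroids);
        invertedLists.get(closest).stream()
                .sorted(Comparator.comparingDouble(id -> dist(query, data[id])))
                .limit(2)
                .forEach(id -> System.out.printf("vector %d at distance %.3f%n",
                        id, dist(query, data[id])));
    }
}

A production index would use real k-means with many more clusters, and would typically probe several of the nearest lists rather than just one, trading a little extra work for better recall.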

Approximate Nearest Neighbor (ANN)

Approximate Nearest Neighbor (ANN) is a technique used in vector search to efficiently find approximate nearest neighbors to a given query vector in a large dataset. The working principle of ANN involves constructing index structures that allow for fast retrieval of vectors that are likely to be close to the query vector.

Let’s say we have a dataset of high-dimensional vectors representing images. Each image is represented as a vector in a vector space. Our goal is to find the nearest neighbors of a given query image efficiently.

  1. Indexing:
    - In the ANN approach, an index structure is built using techniques like locality-sensitive hashing or random projection.
    - This index structure organizes the vectors in a way that similar vectors are grouped together, enabling quick retrieval.
  2. Querying:
    - When a query for a specific image is made, the ANN algorithm uses the index structure to narrow down the search space.
    - It identifies a set of candidate vectors that are likely to be close to the query vector based on the index.
    - These candidate vectors are retrieved from the dataset for further processing.
  3. Distance Calculation:
    - Once the candidate vectors are identified, the algorithm calculates the distances between the query vector and each candidate vector.
    - The distances can be measured using various metrics like Euclidean distance or cosine similarity.
  4. Nearest Neighbor Selection:
    - Based on the calculated distances, the algorithm selects the vectors that are the closest to the query vector.
    - These vectors are considered as the approximate nearest neighbors to the query vector.

The key idea behind ANN is to trade off some accuracy for faster query times. Instead of exhaustively searching the entire dataset, ANN algorithms use the index structure to quickly identify a subset of vectors that are likely to be close to the query vector. This approximation allows for significant speed improvements in large-scale vector search tasks.
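
As one concrete (and deliberately tiny) example of such an index structure, the following Java sketch uses random-hyperplane locality-sensitive hashing: vectors that fall on the same side of every random hyperplane share a bucket, and only the query’s bucket is scanned. The hyperplanes, data, and parameters are illustrative.

import java.util.*;

// A minimal sketch of ANN candidate retrieval with random-hyperplane LSH.
// Vectors whose signs agree with the query on every hyperplane land in the
// same bucket, so only that bucket is scanned instead of the whole dataset.
public class LshSketch {

    static int signature(double[] v, double[][] hyperplanes) {
        int bits = 0;
        for (int h = 0; h < hyperplanes.length; h++) {
            double dot = 0;
            for (int i = 0; i < v.length; i++) dot += v[i] * hyperplanes[h][i];
            if (dot >= 0) bits |= (1 << h); // one bit per hyperplane side
        }
        return bits;
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        double[][] hyperplanes = new double[4][2]; // 4 random hyperplanes in 2-D
        for (double[] hp : hyperplanes)
            for (int i = 0; i < hp.length; i++) hp[i] = rng.nextGaussian();

        double[][] data = {{0.9, 0.1}, {0.8, 0.2}, {0.1, 0.9}, {0.2, 0.8}};

        // Indexing: bucket each vector by its bit signature.
        Map<Integer, List<Integer>> buckets = new HashMap<>();
        for (int id = 0; id < data.length; id++)
            buckets.computeIfAbsent(signature(data[id], hyperplanes),
                    k -> new ArrayList<>()).add(id);

        // Querying: hash the query and scan only its bucket. This may miss
        // some true neighbors; that is the accuracy/speed trade-off of ANN.
        double[] query = {0.85, 0.15};
        List<Integer> candidates =
                buckets.getOrDefault(signature(query, hyperplanes), List.of());
        System.out.println("Candidate ids: " + candidates);
    }
}

Real LSH deployments use several independent hash tables and union their buckets, which sharply reduces the chance of missing a true nearest neighbor.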

Vector Databases Use Cases

By querying with vectors directly, the overhead of translating between the data’s native representation and the query language is greatly reduced, improving both speed and efficiency compared to non-vector approaches. Vendors have reported speed improvements of up to 100x and efficiency gains approaching 90%, reflecting how naturally vector representations align with the way this data is stored and used.

These performance enhancements are particularly beneficial in AI-powered applications such as:

  • Recommendation Engines
  • Natural Language Understanding
  • Fraud detection
  • Fault Detection
  • IoT-based Automation
  • System Observability
  • Cybersecurity
  • Algorithmic Trading
  • Surveillance
  • Security

Vector databases empower businesses to leverage the power of vectors and unlock valuable insights from their data, leading to improved decision-making, personalized experiences, and more efficient operations.

Conclusion

In conclusion, the world of unstructured data is vast and complex, presenting unique challenges for organizations and industries seeking to extract valuable insights from this wealth of information. We explored the concept of unstructured data and its diverse forms, ranging from images to text and beyond. Traditional structured and NoSQL databases, designed for structured and semi-structured data, have limitations when it comes to handling the complexity and variability of unstructured data.

Enter vector search, a powerful tool that offers a structured representation of unstructured data. By leveraging mathematical vectors and embeddings, we can transform unstructured data into a format that enables efficient storage, search, and analysis. Techniques like word embeddings for text and image embeddings for visual content provide meaningful representations of unstructured data, capturing semantic similarities and relationships.

We discussed how vector databases, such as Milvus, operate by performing queries based on input query vectors rather than traditional SQL statements or data filters. This approach, combined with approximate nearest neighbor search algorithms, allows for efficient and probabilistic processing of unstructured data. The tradeoff between accuracy and performance in vector databases offers flexibility in delivering relevant results at varying search runtimes.

As we continue to unlock the potential of unstructured data, the field of vector search and embeddings opens up new possibilities for applications such as visual search, recommendation systems, and more. By harnessing the power of machine learning and deep neural networks, we can unravel the insights hidden within unstructured data, enabling businesses to make informed decisions, enhance user experiences, and drive innovation.

In the ever-evolving landscape of data-driven applications, vector search stands as a promising approach to navigate the complexities of unstructured data. Embracing this technology unlocks a world of opportunities for organizations, researchers, and developers to explore, analyze, and leverage the untapped potential of unstructured data. So, let’s embark on this exciting journey and unleash the power of vector search to uncover the hidden gems within unstructured data.
