Shrimper — A Small Search Engine Crafted in Rust

Iva @ Tesla Institute
Published in Artificialis
5 min read · Feb 12, 2024

INTRODUCTION

Creating a search engine in Rust is an excellent way to start exploring the language’s strengths in performance and safety.

This project carries familiar indexing and searching concepts over into Rust’s ecosystem. It can be challenging because of Rust’s distinctive syntax and paradigms, but it is rewarding.

We’ll start by setting up the Rust environment, including essential tools and dependencies. Then we’ll define a data model using a struct and bring in two Rust crates: tantivy for indexing and searching, and serde for serialization. By implementing a basic search engine, you’ll learn to manage indexing and execute search queries!

OUTLINE & REQUIREMENTS:

1. Setup Environment

  • Make sure Rust and Cargo (Rust’s package manager and build system) are installed. If not, you can install them from the official Rust website.

2. Project Setup

Create a new Rust project:

cargo new shrimp_engine
cd shrimp_engine

3. Dependencies

You might need a few crates (Rust libraries) to help with parsing and data handling. For example:

  • tantivy for indexing and searching text (similar to Lucene in the Java world).
  • serde and serde_json for JSON parsing if your data is in JSON format.

Add these to your Cargo.toml file:

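[package]
name = "shrimp_engine"
version = "0.1.0"
edition = "2021"

[dependencies]
tantivy = "0.17"
# The "derive" feature enables #[derive(Serialize, Deserialize)]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"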

4. Define Data Structure

Decide on the structure of the documents you’ll be indexing. For a basic example, consider a simple struct representing documents with a title and body.

use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize, Debug)]
struct Document {
    title: String,
    body: String,
}

5. The Index

Using tantivy, create an index schema based on your data structure, and then add documents to the index.

use tantivy::{schema::*, Index, doc};

// Note: this first version returns (), so the in-memory index is dropped when
// the function returns. The full program further down returns the Index instead.
fn create_index() -> tantivy::Result<()> {
    // Define the schema: both fields are tokenized and indexed (TEXT);
    // title is also STORED so it can be returned with search results
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // Create the index in memory (nothing is written to disk)
    let index = Index::create_in_ram(schema.clone());

    // Get the index writer with a 50 MB memory budget
    let mut index_writer = index.writer(50_000_000)?;

    // Look up the field handles declared in the schema
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();

    // Add an example document
    let doc = doc!(title => "Example Title", body => "This is the body of the document.");
    index_writer.add_document(doc)?;

    // Commit the documents to the index so they become searchable
    index_writer.commit()?;
    Ok(())
}

6. Searching

Implement a function to search the index.

You’ll need to create a searcher and query parser.

use tantivy::query::QueryParser;
use tantivy::collector::TopDocs;

fn search_index(index: &Index, query_str: &str) -> tantivy::Result<()> {
    // A reader gives access to the committed state of the index
    let reader = index.reader()?;
    let searcher = reader.searcher();

    // Search both fields by default
    let schema = index.schema();
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();
    let query_parser = QueryParser::for_index(index, vec![title, body]);

    // Parse the query string and collect the 10 best matches
    let query = query_parser.parse_query(query_str)?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;

    // Each hit is a (score, address) pair; only STORED fields are returned
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{:?}", retrieved_doc);
    }
    Ok(())
}

Putting it all together 🍤

Now, let’s combine the indexing and searching into a main function, where we can modify the documents, the index, and queries:

use serde::{Serialize, Deserialize};
use tantivy::{schema::*, Index, doc, query::QueryParser, collector::TopDocs, TantivyError};

// Not yet used by the index below; kept as the data model for loading documents.
#[allow(dead_code)]
#[derive(Serialize, Deserialize, Debug)]
struct Document {
    title: String,
    body: String,
}

fn create_index() -> Result<Index, TantivyError> {
    // Both fields are indexed as text; only the title is stored
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // Create the index in memory
    let index = Index::create_in_ram(schema.clone());

    // Add one example document and commit it
    let mut index_writer = index.writer(50_000_000)?;
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();
    let doc = doc!(title => "Example Title", body => "the body of the document.");
    index_writer.add_document(doc)?;
    index_writer.commit()?;

    // Return the index so the caller can search it
    Ok(index)
}

fn search_index(index: &Index, query_str: &str) -> Result<(), TantivyError> {
    let reader = index.reader()?;
    let searcher = reader.searcher();

    // Search both fields by default
    let schema = index.schema();
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();
    let query_parser = QueryParser::for_index(index, vec![title, body]);

    // Parse the query string and collect the 10 best matches
    let query = query_parser.parse_query(query_str)?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;

    // Print the stored fields of each hit
    for (_, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{:?}", retrieved_doc);
    }

    Ok(())
}

fn main() -> Result<(), TantivyError> {
    println!("Hello, Shrimp!");

    // Create the index and keep it in memory
    let index = create_index()?;

    // Search within the created index
    search_index(&index, "Example")?;

    Ok(())
}

Let’s break down the crucial components and their roles in the system:

Serde

  • serde::{Serialize, Deserialize}: Deriving these traits lets a Rust struct be converted to and from a storage or exchange format such as JSON. That is useful when the documents to be indexed arrive as JSON, although this minimal example does not serialize anything yet.

Tantivy

  • tantivy::{schema::*, Index, doc, query::QueryParser, collector::TopDocs, TantivyError}:

The components from the tantivy crate are used for building the search engine's core functionality, from creating an index to querying it.

Document Struct

  • Document Struct: Represents the data structure for documents to be indexed. Each document has a title and a body, mimicking a simple webpage or document in a real-world search engine.
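
The program above never shows how a Document value would reach the index. As a rough sketch in plain Rust, one way to bridge that gap is to flatten the struct into field name and value pairs that an indexing step can consume. The to_fields method is a hypothetical helper invented for this illustration, and the serde derives are left out so the snippet has no dependencies.

```rust
/// Same shape as the article's `Document` struct (serde derives omitted here).
#[derive(Debug, Clone)]
struct Document {
    title: String,
    body: String,
}

impl Document {
    /// Hypothetical helper: flatten the struct into (field name, value)
    /// pairs, in the same order the schema declares the fields.
    fn to_fields(&self) -> Vec<(&'static str, &str)> {
        vec![("title", self.title.as_str()), ("body", self.body.as_str())]
    }
}
```

Each pair lines up with one field declared in the schema, so an indexing loop could walk the pairs instead of hard-coding the field values.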

The Schema

The schema defines the structure of the index, specifying which fields (here, title and body) are indexed and how: both are tokenized for full-text search (TEXT), and title is additionally STORED so its original value can be returned with search results. An in-memory index is created, and a document is added to it through an index writer. In this example the document is built directly with tantivy's doc! macro; the Document struct is not yet wired into indexing. Committing the writer makes the added document searchable.
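
To make the difference between an indexed field and a stored field more concrete, here is a minimal sketch in plain Rust with no tantivy involved. ToyIndex, tokenize, add, and lookup are hypothetical names made up for this illustration, the tokenizer is a deliberate simplification, and tantivy's real data structures are far more elaborate.

```rust
use std::collections::HashMap;

/// Toy stand-in for an index with two text fields:
/// `title` is indexed and stored, `body` is indexed only.
#[derive(Default)]
struct ToyIndex {
    /// term -> ids of the documents that contain it (the inverted index)
    postings: HashMap<String, Vec<usize>>,
    /// doc id -> original title (the only stored field)
    stored_titles: Vec<String>,
}

/// Lowercase the text and split it on anything that is not alphanumeric.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

impl ToyIndex {
    /// Index the terms of both fields, but keep only the title for retrieval.
    fn add(&mut self, title: &str, body: &str) -> usize {
        let doc_id = self.stored_titles.len();
        self.stored_titles.push(title.to_string());
        for term in tokenize(title).into_iter().chain(tokenize(body)) {
            let ids = self.postings.entry(term).or_default();
            // Skip duplicates when a term repeats within one document.
            if ids.last() != Some(&doc_id) {
                ids.push(doc_id);
            }
        }
        doc_id
    }

    /// Return the stored titles of the documents that contain `term`.
    fn lookup(&self, term: &str) -> Vec<&str> {
        self.postings
            .get(&term.to_lowercase())
            .map(|ids| {
                ids.iter()
                    .map(|&id| self.stored_titles[id].as_str())
                    .collect::<Vec<_>>()
            })
            .unwrap_or_default()
    }
}
```

A word that appears only in a body still finds its document, but the lookup can hand back nothing except the title, because the body text itself was never kept.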

The Shrimp’s Core Mechanism:

1- Index Reader and Searcher

To search the index, an index reader is created from it; the reader then provides a searcher that executes queries against the index.
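
The reason commit comes before searching is that a searcher works on a snapshot of what has been committed. The following toy model in plain Rust illustrates that idea only. ToyWriter and ToySearcher are hypothetical types invented here, and the model folds the reader into the writer for brevity, which is not how tantivy is organized.

```rust
use std::sync::Arc;

/// Toy writer: documents are buffered until `commit` publishes them.
struct ToyWriter {
    pending: Vec<String>,
    committed: Arc<Vec<String>>,
}

/// Toy searcher: an immutable snapshot of the committed documents.
struct ToySearcher {
    snapshot: Arc<Vec<String>>,
}

impl ToyWriter {
    fn new() -> Self {
        ToyWriter {
            pending: Vec::new(),
            committed: Arc::new(Vec::new()),
        }
    }

    /// Buffer a document; it is not searchable yet.
    fn add_document(&mut self, text: &str) {
        self.pending.push(text.to_string());
    }

    /// Publish all buffered documents as a new snapshot.
    fn commit(&mut self) {
        let mut next = (*self.committed).clone();
        next.append(&mut self.pending);
        self.committed = Arc::new(next);
    }

    /// Hand out a searcher over the snapshot that is current right now.
    fn searcher(&self) -> ToySearcher {
        ToySearcher {
            snapshot: Arc::clone(&self.committed),
        }
    }
}

impl ToySearcher {
    /// Number of documents visible to this searcher.
    fn num_docs(&self) -> usize {
        self.snapshot.len()
    }
}
```

A searcher obtained before the commit keeps seeing the old state, while one obtained afterwards sees the new document.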

2- Query Parsing and Execution

A query parser interprets a query string, transforming it into a query object based on the defined schema. The searcher then uses this query to find and rank relevant documents.
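
As a rough illustration of this parse, score, and rank flow, here is a small sketch in plain Rust. The functions parse_query, score, and search are hypothetical, and the score is a raw count of matching words, a stand-in chosen for brevity rather than the BM25 ranking that tantivy applies by default.

```rust
/// Split a query string into lowercase terms.
fn parse_query(query: &str) -> Vec<String> {
    query
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score = number of words in `text` that match any query term.
fn score(terms: &[String], text: &str) -> usize {
    text.split(|c: char| !c.is_alphanumeric())
        .map(|w| w.to_lowercase())
        .filter(|w| terms.contains(w))
        .count()
}

/// Return (score, doc index) pairs for matching documents,
/// best first, capped at `limit`.
fn search(docs: &[&str], query: &str, limit: usize) -> Vec<(usize, usize)> {
    let terms = parse_query(query);
    let mut hits: Vec<(usize, usize)> = docs
        .iter()
        .enumerate()
        .map(|(id, text)| (score(&terms, text), id))
        .filter(|&(s, _)| s > 0)
        .collect();
    // Highest score first; ties are broken by the lower doc index.
    hits.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
    hits.truncate(limit);
    hits
}
```

The (score, doc index) pairs mirror the shape of the tuples that the article's loop over top_docs destructures, where the score is ignored and only the document address is used.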

3- Retrieving and Displaying Results

The top matching documents (up to a limit of 10 here) are retrieved and printed, which lets you review indexed content that matches a search query. Because only title is marked STORED, the printed documents contain the title but not the body.

Main Function

The main function ties everything together, first creating an index with at least one document and then performing a search within this index.

Despite its simplicity, this setup is a working search engine that can index and search text 🍤

Key Takeaways

  • tantivy provides indexing and searching as a native Rust library, bringing Rust's performance and safety to text search.
  • serde makes it straightforward to serialize and deserialize Rust data structures, such as documents loaded from JSON.
  • This example serves as a foundational framework, illustrating how Rust can be used to build search solutions 🍤

Conclusion

This example is intended to give you a starting point in search engine construction. Rust’s ownership and concurrency model, along with its type system, provide a robust foundation for building more complex and high-performance search engines.

We can expand this project by adding features like real-time indexing, advanced text processing, and custom scoring algorithms. Expect those features in the series of articles dedicated to search engines and information retrieval — all in Rust 🍤
