Creating your first search engine from scratch

Published in Data Science at Microsoft · 8 min read · Jan 17, 2023

End-to-end guide: Create a searchable Azure Cognitive Search Index in less than an hour (Azure Search Index with Cognitive Search from scratch code repository)

Recently at work, I was tasked with creating my very first Azure Cognitive Search Index from scratch. I had experience with Azure but using it for information retrieval was a totally new challenge.

I found multiple manuals and examples, but even the most comprehensive documentation left me with a lot of questions. Creating an index loaded with my documents still felt impossible, especially on my tight timelines.

In the end, I was able to make it work! But I wish there had been a single source with simple yet comprehensive instructions for getting a minimum viable product up and running. So, I’m making my own!

This article details a plug-and-play set of actions that anyone can follow to make their first search engine from scratch. It is for data scientists, academics, or just the search-engine-curious. The aim is to make the first iteration of your search engine as smooth as possible.

You can find the code and more detailed technical instructions for all the steps described in this post (and more) in this accompanying repository.

What is the goal?

Imagine that we have a set of JSON documents. We want to be able to perform various searches over them with the help of Microsoft Azure Cognitive Search.

This was the task that I was given six months ago. I had a lot of questions, which can be split into three groups:

Preparation:

Which resources do I need to set up?
Which secrets are necessary to access from my Python code?

Search functionality:

How do I set up the index?
How do I index my data?

Search application:

How do I access the data in the index?
Are there ways to improve the search?

This article is going to walk us through all these steps. This text follows the contents of this Jupyter notebook.

In the example that we will go through, the documents will be plain JSON files (i.e., containing no nested structures).
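To make "plain JSON" concrete, here is a minimal sketch of a flatness check. The sample document and its field names (`id`, `title`, `description`, `rating`) are made up for illustration and are not taken from the repository.

```python
import json


def is_flat(document: dict) -> bool:
    """Return True if the document has no nested objects or nested lists."""
    for value in document.values():
        if isinstance(value, dict):
            return False
        if isinstance(value, list) and any(
            isinstance(item, (dict, list)) for item in value
        ):
            return False
    return True


# Hypothetical documents, for illustration only.
flat_doc = json.loads(
    '{"id": "1", "title": "Corner Bistro", "description": "Cozy place", "rating": 4.5}'
)
nested_doc = {"id": "2", "address": {"city": "Seattle"}}

print(is_flat(flat_doc), is_flat(nested_doc))
```

Running a check like this over the files before uploading them may help catch documents that would need a more elaborate index schema than the one discussed here.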

Preparation: Necessary resources and secrets to access them

Given a functioning MS Azure subscription, what services do we need to create an Azure Cognitive Search Index? For the steps within this section, the easiest and fastest approach is to use the Azure portal’s visual interface.

First, we need to create an Azure Search Service (which can host one or multiple search indexes). We can find a detailed description of how to create one here.

In the next section, we will need only the name and admin key to access this service. To obtain the latter, select “Keys” on the left panel and then copy one of the provided admin keys. (Remember, for more detailed technical instructions, you can see the accompanying repository.)

Second, we will need an Azure storage account. Here are detailed instructions on how to create one. An Azure storage account can host several types of data storage; however, in this guide, for simplicity we will just use a blob storage. We will upload our JSON files there.

Lastly, write down a connection string for your storage account (along with the name of the blob storage). To get the connection string, click “Access keys” in the storage account’s left pane, then copy one of the connection strings.
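One way to keep these secrets out of the code itself is to read them from environment variables. The sketch below assumes that approach; the variable names (`SEARCH_SERVICE_NAME` and so on) and the `SearchSettings` class are hypothetical and not part of the repository.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSettings:
    """The four values collected in this section."""

    service_name: str
    admin_key: str
    storage_connection_string: str
    container_name: str

    @property
    def endpoint(self) -> str:
        # Azure search services are reachable under this host pattern.
        return f"https://{self.service_name}.search.windows.net"

    def headers(self) -> dict:
        # The admin key is sent in the "api-key" header of each REST call.
        return {"Content-Type": "application/json", "api-key": self.admin_key}


def load_settings(env=os.environ) -> SearchSettings:
    """Read the settings from a mapping (by default the process environment)."""
    return SearchSettings(
        service_name=env["SEARCH_SERVICE_NAME"],
        admin_key=env["SEARCH_ADMIN_KEY"],
        storage_connection_string=env["STORAGE_CONNECTION_STRING"],
        container_name=env["STORAGE_CONTAINER_NAME"],
    )
```

Calling `load_settings()` raises a `KeyError` if a variable is missing, which surfaces a forgotten secret before any request is sent.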

Once we have performed these steps, we have everything we need set up to start creating our first search engine! Wahoo!

Search functionality: Setting up the index and loading the data

In this section, we are going to use Python with Search REST API. You can also find instructions on how to do this using the portal’s interface in the accompanying repository.

To make a search engine work, we need the following components:

Data Source: the data over which to search.
Azure Cognitive Search Index: what defines the data structure over which to search and performs the search.
Azure Cognitive Search Indexer: what indexes the data.

If it looks complicated for now, that’s fine. Let’s understand the purpose of each of these components.

Data Source

The Data Source refers to the source of the data to include in the index. Practically, it is a link to some data storage; it does not create new data. This means, for instance, that we can have multiple data sources in our Azure Search service that link to the same data.

In this walkthrough, we are going to link to a blob storage (for more details on how to create it, see the repo).
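As a rough sketch of what linking to blob storage can look like through the REST API, the helper below builds a "create data source" request. The function name and arguments are hypothetical (the repository may structure this differently), and the stable API version `2020-06-30` is an assumption.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # assumed API version


def build_data_source_request(
    service_name, admin_key, data_source_name, connection_string, container_name
):
    """Build (but do not send) the PUT request that registers a blob data source."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/datasources/{data_source_name}?api-version={API_VERSION}"
    )
    body = {
        "name": data_source_name,
        "type": "azureblob",  # the data lives in Azure Blob Storage
        "credentials": {"connectionString": connection_string},
        "container": {"name": container_name},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# To actually send it: urllib.request.urlopen(build_data_source_request(...))
```

Note that the body only points at the container; no documents are copied at this stage, which matches the idea that a data source is just a link.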

Azure Cognitive Search Index

The Azure Search Index is the key part of our search engine. It is what we will access to perform our search. What may seem complicated for an inexperienced user is that its initialization consists of two steps:

Creation of an empty index according to an index schema (we will talk about it in a bit!). At this point it is like an empty table that has column headings but no data yet.
Filling it in with the data. To do this, you need to create an Azure Search Indexer first.

Azure Cognitive Search Indexer

Azure Cognitive Search Indexer is the tool that populates an Azure Search Index with the data from a Data Source. Once we have successfully run it, our index will have the data over which to search.

Putting these components together

To make our search engine work, first, we have to link to the blob storage (see the repository for how to do this) and define the fields for the Azure Cognitive Search Index.

If we create an index manually on the portal, the field list will already contain the automatically generated id field.

From there, we can add further fields manually.

However, adding fields manually may be tedious. If we use the REST API, it is easier to specify the fields in a JSON schema and pass that schema as a parameter to the REST API function.

One possible complication is that the documents must be in a particular format containing all of the fields listed in the schema. In addition to the fields, the schema may also specify additional features of the Azure index. The repository contains a template for a simple index schema and the schema that matches the sample files.
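For orientation, a simple index schema could look like the sketch below. The field names are hypothetical and do not reproduce the repository's template; the `missing_fields` helper is likewise an illustrative addition for checking a document against the schema before indexing.

```python
def build_index_schema(index_name):
    """Return a minimal index schema with one key field and three data fields."""
    return {
        "name": index_name,
        "fields": [
            # Exactly one field must be the key; it has to be a string.
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "description", "type": "Edm.String", "searchable": True},
            {"name": "rating", "type": "Edm.Double", "filterable": True, "sortable": True},
        ],
    }


def missing_fields(document, schema):
    """List the schema fields that a document does not contain."""
    expected = [field["name"] for field in schema["fields"]]
    return [name for name in expected if name not in document]


schema = build_index_schema("restaurants")
print(missing_fields({"id": "1", "title": "Corner Bistro"}, schema))
```

The schema dictionary would be sent as the JSON body of a `PUT` request to `/indexes/{index name}` on the search service, in the same way as the other components.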

Once we have created the data source and the index, we are ready to populate the index with the data. To do that, we need to create the Azure Cognitive Search Indexer. As the repository shows, this can be done with the same function used in the previous steps.
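A minimal sketch of an indexer definition, assuming each blob holds exactly one JSON document. The helper names are hypothetical, and the API version is an assumption; the repository's own function may differ.

```python
def build_indexer_payload(indexer_name, data_source_name, index_name):
    """Describe an indexer that copies JSON blobs from a data source into an index."""
    return {
        "name": indexer_name,
        "dataSourceName": data_source_name,  # where the documents come from
        "targetIndexName": index_name,       # where the documents go
        "parameters": {
            # "json" parsing mode treats each blob as a single JSON document.
            "configuration": {"parsingMode": "json"}
        },
    }


def indexer_url(service_name, indexer_name, api_version="2020-06-30"):
    """URL to PUT the payload to; creating an indexer also triggers its first run."""
    return (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{indexer_name}?api-version={api_version}"
    )
```

Because the indexer refers to the data source and the index by name, both of them have to exist before this request is sent.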

At this point, we have set up all of the components to search over our data.

Search application: Ways to search over the data and improve the quality of the search

Now that we have fulfilled the goal of creating our search index, what are the ways to access it? There are at least four different ways:

Web interface on the Azure Portal
REST API
Search client (SearchClient from azure.search.documents)
Standalone demo app

For demonstration, we will use the web interface on the Azure Portal because it provides the easiest way to search. We simply click the name of our index under ‘Indexes’, which opens its search page.

In the “query string” field we need to type the text of what we would like to search.

But, as data scientists, we often need to automate the search routine. In that case, the REST API is the best option because it provides a lot of flexibility. Conveniently, the repo provides the definition of a function (post_query(.)) that uses the REST API.

For the search request “the best restaurant”, the function returns the results ranked by the relevance scores returned by the REST API call.
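The repository's function is not reproduced here. As a hedged sketch of what a REST-based query can look like, the code below defines its own `post_query`; its signature, the `build_query_request` and `rank_hits` helpers, and the API version are assumptions rather than the repository's implementation.

```python
import json
import urllib.request


def build_query_request(service_name, api_key, index_name, search_text,
                        top=5, api_version="2020-06-30"):
    """Build (but do not send) a POST request against the index's search endpoint."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/search?api-version={api_version}"
    )
    body = {"search": search_text, "top": top, "count": True}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def rank_hits(hits):
    """Sort hits by the relevance score the service attaches to each document."""
    return sorted(hits, key=lambda hit: hit["@search.score"], reverse=True)


def post_query(service_name, api_key, index_name, search_text, top=5):
    """Send the query and return the matching documents, best match first."""
    request = build_query_request(service_name, api_key, index_name, search_text, top)
    with urllib.request.urlopen(request) as response:
        return rank_hits(json.load(response)["value"])
```

A call such as `post_query("my-service", key, "restaurants", "the best restaurant")` would then return a list of dictionaries, each carrying the document fields plus its `@search.score`.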

The related repo supplies descriptions of all four approaches to accessing an index. One crucial thing to remember is that if we use a preview version of the API (for instance, the latest one, ‘2021-04-30-Preview’), only the REST API and the web interface provide access to its full new functionality.

How to improve our search engine

Now that we have our search minimum viable product, we may want to improve it.

One way to do that is to use semantic search instead of the default search, which relies on exact matches. Semantic search strengthens our search with better language understanding by adding contextual relevance to the query process. But beware: semantic search is a premium feature.

The associated repository provides instructions for adding a semantic configuration to your index. There we can find more information on how to create semantic queries or how to obtain a semantic answer.
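To give a feel for how a semantic query differs from a plain one, here is a sketch of a request body. The parameter names follow my understanding of the ‘2021-04-30-Preview’ REST API and should be checked against the repository and the official documentation; the function name and the configuration name passed to it are hypothetical.

```python
def build_semantic_query(search_text, semantic_configuration, top=5):
    """Return a search request body that asks for semantic ranking and answers."""
    return {
        "search": search_text,
        "queryType": "semantic",      # switch from the default keyword ranking
        "queryLanguage": "en-us",     # language of the query text
        "semanticConfiguration": semantic_configuration,  # defined on the index
        "answers": "extractive|count-3",  # request up to three semantic answers
        "captions": "extractive",
        "top": top,
    }


body = build_semantic_query("the best restaurant", "my-semantic-config")
```

This body would be posted to the same `/docs/search` endpoint as a regular query, with the preview API version in the URL.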

Conclusion

Congratulations! We have built a search index from scratch, loaded it with data, and, most importantly, used it to obtain information.

Of course, the topic has a lot more to explore, but even a journey of a thousand miles begins with a single step, and we have made it!

If you would like to learn more about search, the Data Science at Microsoft publication on Medium has a few more recently published articles on the topic:

Elena Labzina, Ph.D., is on LinkedIn.