【Technology Sharing】Search Technology in Document Management

Deepin
8 min readApr 29, 2022

Hello everyone, today I will talk to you about the things behind full-text search.

The search function in the file manager can find files matching the file name according to the words entered by the user, and return the results to the search list. If you want to match the content of the file instead of the file name, open the method as follows:

Open the “Configuration” interface, find and check the “Full Text Search” option.

Global search

Full-text search means that the computer indexing program establishes an index for each word in the article by scanning, and indicates the number and position of the word in the article. When the user queries, the retrieval program searches according to the pre-established index. Feedback the search results to the user.

The core technology of full-text search is document indexing, that is, how to record all the basic element information in the source document into the index database in a suitable form. Here, the process is drawn as follows:

The above picture should be divided into three parts. The first is the index library of the files in the middle blue part, which is the output of “create index” and the input of “query index”.

Among them, “create index” is the green part on the left side of the index library. Its function is to index the original document content that needs to be searched, and finally build an index library on the user’s computer; “query index” is the red part on the right side of the index library. Search keywords get search results from the index library.

Through the above flow chart, it is not difficult to find that the core of search is its “index library”, so how is this index library implemented? This is where the inverted index technique is used.

Inverted index

Full-text search technology has a long history, and most of them are based on inverted indexes. So what is an inverted index? It is an index structure that stores a mapping between words and where the word itself is located in one or more documents, and consists of two parts — a dictionary and an inverted list.

For example, let’s number the following sentences:

0:“How are you”

1:“How old are you”

2:“How do you do”

If you use the word as the index and the position of the sentence as the indexed element, then the index is inverted:

“How”:{0, 1, 2}

”are“:{0, 1}

”you“:{0, 1, 2}

“old”:{1}

“do”:{2}

If you want to retrieve “How old are you”, you can calculate it like this:

$$

\{0,1,2\}∩\{1\}∩\{0,1\}∩\{0,1,2\}

$$

In this article, the keyword is agreed as Term (such as a word in English, a word in Chinese), and the dictionary is a collection of Term. The following figure is the underlying storage structure of the index. The left side is the dictionary, and the right side stores the document encoding list containing the left string, which we call the inverted list.

When searching, the query keywords are first segmented and divided into multiple keywords, then the inverted list is returned according to the keywords, and finally the inverted list is integrated according to logical conditions (and, or, not), and the final search result can be obtained.

Dictionary

In the case of a large amount of data, there will be a lot of data in the dictionary. In order to achieve superior performance, the data in the dictionary must be loaded into the memory for easy use at any time. At this time, the structure of the dictionary is particularly important.

At present, there are many dictionary structures. The simplest one is sorting arrays, searching for data by binary, faster with hash tables, and disk search with B-tree and B+ numbers. An inverted index structure that can support terabytes of data requires time and There is a balance in space. Here are some common dictionary advantages and disadvantages:

data structureAdvantages and disadvantagesSort listSimple to implement, but poor performancehash tableHigh performance, but high memory consumptionjump tableIt occupies small memory and is adjustable, but does not support fuzzy query wellB-treeDisk index, easy to update, but the retrieval speed is slow, and there are many database applicationsdictionary treeThe query efficiency is only related to the length of the string, but it is only suitable for English dictionariesDouble array dictionary treeIt can be used as a Chinese dictionary, occupying memory, and there are many word segmentation tools usedFinite State Transducers (FST)Shared prefix, low memory consumption, but requires input in order, update is not easy

Now let’s take a look at a representative FST index structure. Its principle is to construct a minimum directed acyclic graph by inputting ordered strings, which can make the index memory usage low (the compression rate is generally between 3 times and 20 times), the fuzzy query support is good, and the query is fast, but it The disadvantages are also more obvious, the structure is complex, the input is required to be ordered, and it is not easy to update.

In order to understand FST more intuitively, here is a simple example to illustrate, we assume that a set of mappings are created: key (word) : value (word frequency is the frequency of words appearing in the inverted list)

$$

cat : 5, deep : 10, do : 15, dog : 2, dogs : 8

$$

Based on the input of this example, we can build the FST shown in the following figure, which is the minimum directed acyclic graph:

According to the above principle, the final index looks as follows. The Term dict index is cached in the memory in the structure of FST. After finding the block location of the term dict corresponding to the keyword from the Term dict index, go to the disk to find the term, which can greatly Reduce disk IO times.

After getting the index, let’s take a look at how the index data is stored.

index storage

The search process has a hierarchical structure when storing its full-text index structure, involving five levels: Index, Segment, Document, Field, Term, and their relationship As shown below:

Index (Index): The index consists of many files, which are placed in the same directory.

Segment: An index consists of multiple segments, and the segments are independent. New segments can be generated as new documents are added, and different segments can be merged when a threshold is reached.

Document (document): A document is the basic unit of an index, and a segment can contain multiple documents. Newly added documents are saved separately in a newly generated segment, and as segments are merged, different documents are merged into the same segment.

Field (Field): A document can be composed of multiple fields, such as a file, with multiple attributes such as path, modification time, content, etc. These attributes can be regarded as the field of the document. Different domains specify different indexing methods, such as specifying different word segmentation methods, whether to build an index, whether to store it, etc.

Term: Term is the smallest unit of the index, which is a string after word segmentation and language processing.

The above structures are subdivided from segments, so what is the storage mechanism of the segment structure? This involves two parts of “segment storage” and “segment merge”.

Segmented storage

The concept of segments is introduced in the search. Each segment is an independent data set that can be searched, and the segment is immutable. Once the indexed data is written to the hard disk, it cannot be modified. Under the idea of segmentation, the process of writing data is as follows:

Added. When new data needs to be created, a new segment is chosen to store the new data due to the immutability of the segment.

delete. When data needs to be deleted, since the segment where the data is located is only readable but not writable, a .del file needs to be added under the index file to specifically store the deleted data id.

renew. The update operation is actually a combination of deletion and addition. First, the old data is recorded in the .del file, and then an updated data is added to the new segment.

Segment merge

Although segmentation is more efficient than creating an index in full each time, since a new segment is added every time data is added, a large number of segments will exist in the index after a long period of accumulation. If it is too large, it will not only seriously consume the resources of the system, but also affect the performance of retrieval.

Query the data that meets the query conditions in all segments, and then merge the result sets of the query in each segment. In order to control the number of segments in the index, we must perform segment merge operations regularly, but if all segments are merged each time, it will cause A huge waste of resources, especially for “large segments” merging.

Then the idea of ​​segment merging should be to first group segments according to their size, and then merge segments that belong to the same group. Since the merging of super large segments requires more resources, the merging will not be performed when the size of the segment reaches a certain scale, or when the amount of data in the segment reaches a certain number.

Therefore, the merging of segments is mainly focused on the merging of small and medium segments, which can not only avoid excessive resource consumption when merging large segments, but also control the number of middle segments in the index. The schematic diagram is as follows:

Summarize

This article mainly introduces the process and some principles behind the search, hoping to help you better understand the search function in the document management.

about us

deepin is a Linux distribution dedicated to providing beautiful, easy-to-use, safe and stable services for global users, and it has always been the highest-ranked distribution developed by a Chinese team. It supports 33 languages around the world, has accumulated more than 80 million downloads, and has 135 mirror sites in 42 countries on 6 continents.

--

--