TRIE based Encrypted data structure to support Partial Search over Encrypted data
Introduction
There are various threats surfacing against information privacy and security in data storing. Due to the sensitive nature of data, maintaining confidentiality is the foremost requirement in data management. It will help the data owners make the data content unavailable to unauthorized users. In order to implement confidentiality, the data owners need to encrypt data before storing sensitive data in the data storage and executing queries on them. Application-level data encryption is a well-known solution to protect data confidentiality and deal with security threats. It is a process in which data is encrypted using a secret key before being sent to the data storage. In symmetric key based encryption mechanism, the user encrypts a message so that only the owner of the corresponding secret key can decrypt it. Without the secret key no one can retrieve any information from the encrypted data. However, as cryptography transforms data, it becomes quite difficult to execute a search query on the encrypted data. On the other hand, many a use case demands the client to issue queries on the encrypted data stored in the server to retrieve some important information. For instance, username is a PII (Personal Identifiable Information) and hence it cannot be stored in cleartext in a publicly accessible data storage. In a Web based user interface there is a requirement to show list of matching names when the logged-in user will specify the username partially as part of the search query. In earlier days, the respective server-side implementation would have decrypted all the usernames stored in the data storage first as the mandatory step prior to selection of the matching names. That approach was extremely costly and could not scale easily. Now-a-days numerous studies are being carried out to find ways in order to implement a searchable encrypted data storage where the data owner will only need to encrypt and decrypt data whereas the evaluation of certain classes of queries on the data will be possible without decrypting the original ciphertext. Lots of cloud storage are also emerging to offer the solution in this space.
Here we will see a simple approach which has used an efficient Queryable Encryption algorithm to build a “thread-safe” data structure for the encrypted data. Using the data structure, the aforementioned requirement of partial search over encrypted entities can be very easily implemented without compromising with the confidentiality of the PII or any other sensitive data. The entire scheme for substring queries has been constructed using a TRIE data structure. Thus, the search complexities have been brought to optimum limit which is equal to the length of the unencrypted substring query [ O(n), n → length of the substring ]. The same data structure can be used for complete string search purpose as well.
Implementation Approach
Before we build the scheme, let us first define encryption and decryption routines as per your requirement. We may use the following code to implement the methods.
We can now construct the TRIE node as shown below:
The TRIE node class represents a node that consists of multiple branches. Each branch represents a possible character of input phrases/strings. The node has markers to mark the starting node and the last node of every string. The “count” field in a node is used to track the total number of strings where the node is the last node.
The “getoraddchildnode()” method in the TRIE node class is used to insert the characters of the input string into TRIE. It inserts every character as an individual TRIE node. The TRIE node class and the Wrapper class (explained later) ensure that each character in the input string is mandatorily encrypted before being inserted as the TRIE node. Not a single character in the entire data structure is in form of cleartext. Before insertion, the class checks for the existence of the same character in the data structure. If the character is not found, a new TRIE node gets created and added in the “children” map against the hash value of the character as the key. Otherwise, the existing node is used. If the character is the last node of the input string, “endofword” is set to “True” and the value of “count” gets increased by “1”. If the input string is a prefix of the existing character sequences in TRIE, the last node of the string is marked accordingly as the end of the word.
Next, we will use the below class as the Wrapper class of the TRIE node to accept input strings from the client applications as well as search for a string or substring in the TRIE based data storage.
Here you may use normal “threading.Lock” object too in place of Reentrant Lock or “threading.RLock” object. Similarly, as per your requirement you may specify the password and create the secret key. The password can also be retrieved on the fly from database or any secrets management service.
During TRIE node creation phase, the “add()” method encrypts each character in the input key along with cumulative hash computation as the prerequisite. It is completely thread safe. It takes a replica of the original TRIE data structure before insertion of the key. It has implemented this pattern to prevent dirty read of data stored inside the TRIE during update operation. Performance might get impacted due to “deepcopy” based replica creation in case the TRIE is large. A custom and efficient cloning algorithm can be used in lieu of default “deepcopy” given the “add” flow is needed to be faster. Otherwise, this method should suffice.
Methods like “countwordsstarwith()” and “getwordsstartwith()” have implemented both Breadth-First-Search and Depth-First-Search algorithms to return the count and list of matching words respectively corresponding to a partial or complete string search request.
“dumpcontent()” gives a complete dump of the entire TRIE data structure.
Validation
The following code snippet has captured a sample usage of the Encrypted TRIE data structure. You may try it using multiple threads also.
Output:
Note:
For sensitive data storage (like PII) if disk based encryption turns out to be a costly option, the proposed Encrypted TRIE data structure can be used as a viable cost-effective alternative. The storage can be serialized and stored in a Cache like REDIS and used as and when required like the following:
Summary
This is just a first cut solution of Queryable Encryption scheme for the partial or complete search of encrypted sensitive data. The code can be enhanced in various ways or ported into other languages like Java, Go etc. Also, more rigorous testing can be done by adding lots of keys and searching for various substrings in a multithreaded environment. I hope this might have helped you in understanding the overall concept of Searchable Encryption using TRIE data structure. Instead of TRIE, a Suffix Tree i.e., a compressed TRIE could also be used to implement the functionality.
References
1. https://sciendo.com/de/article/10.1515/popets-2015-0014
2. https://www.koreascience.or.kr/article/JAKO202032265179388.page
Blog Reviewer: Vibhor Jain
Thanks for reading!