Semantic search in risk management using NLP pipeline

Mojtaba Farmanbar · Published in ING Blog · Nov 30, 2021 · 9 min read

“At ING, we use AI to help financial experts utilize existing data to identify risks, and ensure better information for future planning.”

Many industry-grade search engines (e.g. Elasticsearch) exist as general-purpose systems. While they have a general understanding of language and include search algorithms for document retrieval, their accuracy can be sub-optimal: it ultimately depends on domain specificity and on the search terms provided by the users.

This issue is particularly apparent in domains such as risk management in Fintech. ING, as one of the biggest financial institutions, continuously monitors risk issues in order to take timely action.

A critical function of risk officers is knowing the relevant set of search terms to retrieve documents related to a specific topic. However, constructing search criteria with all the appropriate search terms/phrases and their logical relations in the search engine remains complex, and can be prone to false negatives and false positives.

In this blog post, we describe an end-to-end information retrieval (IR) system with custom embeddings that automatically suggests highly relevant similar keywords, easing the burden of requiring users to build complex search queries. Moreover, we show that custom word embeddings outperform general-purpose embeddings.

Text mining pipeline

Figure 1: IR system overview. Orange arrows indicate the preprocessing steps. Gray arrows indicate processing during real-time user queries.

We develop an end-to-end IR system, as depicted in Figure 1. The system, implemented in Python, consists of 1) a text mining pipeline that preprocesses the records for indexing and for building language models, and 2) a front-end where the user queries the database with one or more search terms. The search terms are processed by the same text mining pipeline as the records, and are further expanded to include similar words based on the trained domain-specific language models.

The text mining pipeline consists of several key components (Figure 1). First, we integrate records from different data sources and unify them. Then we perform the following steps:

Language detection: As the records can be in different languages, we automatically detect each record's language based on specific patterns and with spaCy, so that the appropriate language models can be used.
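As a minimal illustration (our production setup combines custom patterns with spaCy and is not shown here), language detection can be done with an off-the-shelf detector such as the langdetect package:

```python
# Minimal language-detection sketch using the langdetect package
# (illustrative only, not the exact production setup).
from langdetect import detect

records = [
    "The counterparty failed to meet its obligations.",
    "De klant heeft de betaling niet ontvangen.",
    "Le risque de crédit a augmenté ce trimestre.",
]

for text in records:
    lang = detect(text)  # returns an ISO 639-1 code, e.g. "en", "nl", "fr"
    print(lang, "->", text[:40])
```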

Tokenization: Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords. Word tokenization is the most commonly used approach: it splits a piece of text into individual words based on a certain delimiter.
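A minimal word-tokenization sketch with spaCy's rule-based tokenizer (the example sentence is our own):

```python
# Word tokenization with spaCy; a blank pipeline provides the
# tokenizer without requiring a model download.
import spacy

nlp = spacy.blank("en")
doc = nlp("ING monitors risk issues continuously to take timely actions.")
tokens = [token.text for token in doc]
print(tokens)
# ['ING', 'monitors', 'risk', 'issues', 'continuously', 'to', 'take',
#  'timely', 'actions', '.']
```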

Data anonymization: Sensitive and identifiable information (e.g. names, phone numbers, employee identification numbers, etc.) is detected and removed from the records.
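The exact anonymization rules are not public; as a hedged sketch, named entities can be masked with spaCy's NER combined with simple regular expressions (the patterns below are illustrative placeholders, not the ones used in production):

```python
# Hedged anonymization sketch: mask person names via spaCy NER and
# phone-number-like digit runs via a crude regex.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model with NER

PHONE_RE = re.compile(r"\+?\d[\d\s\-]{7,}\d")  # illustrative phone pattern

def anonymize(text: str) -> str:
    doc = nlp(text)
    # Replace person names detected by NER; iterate in reverse so the
    # character offsets of earlier entities remain valid.
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            text = text[:ent.start_char] + "[NAME]" + text[ent.end_char:]
    # Mask phone-number-like digit sequences.
    return PHONE_RE.sub("[PHONE]", text)

print(anonymize("John Smith reported the issue; call +31 6 1234 5678."))
# -> "[NAME] reported the issue; call [PHONE]."
```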

Abbreviation detection: Fintech documents are heavily polluted by abbreviations. Therefore, abbreviations are extracted using the Schwartz-Hearst method. The text is then cleaned using a combination of custom regular expressions and the removal of stopwords and punctuation.
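The Schwartz-Hearst algorithm pairs a short form with its long form from patterns like "long form (SF)". One open-source implementation is scispacy's AbbreviationDetector, used below for illustration (we do not claim this is the library used in production):

```python
# Schwartz-Hearst abbreviation detection via scispacy's component.
import spacy
from scispacy.abbreviation import AbbreviationDetector  # registers the factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("abbreviation_detector")

doc = nlp("Know Your Customer (KYC) checks reduce financial crime. "
          "KYC reviews are performed annually.")
for abrv in doc._.abbreviations:
    print(f"{abrv.text} -> {abrv._.long_form}")
# KYC -> Know Your Customer  (each occurrence is reported)
```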

Stopword removal: Stopwords are the most common words in any natural language. For the purposes of analyzing text data and building Natural Language Processing (NLP) models, these words often add little to the meaning of a document. Typical examples are “the”, “is”, “in”, “for”, “where”, “when”, “to”, and “at”.
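A minimal sketch of dropping stopwords (and punctuation) with spaCy's built-in stopword list:

```python
# Stopword removal using spaCy's language defaults.
import spacy

nlp = spacy.blank("en")
doc = nlp("The risk is reported to the regulator in the quarterly review.")
content_tokens = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_tokens)
# ['risk', 'reported', 'regulator', 'quarterly', 'review']
```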

n-grams detection: An n-gram is a sequence of n words: for example, “financial crime” is a 2-gram (a bigram) and “know your customer” is a 3-gram (a trigram). We derive n-grams using Gensim, based on a PMI (Pointwise Mutual Information)-like scoring method. From the processed data, uni-, bi-, and trigrams are jointly used to train our custom word embeddings.
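A sketch of phrase detection with Gensim's Phrases model, using NPMI scoring as one PMI-like option (the toy corpus and threshold are illustrative):

```python
# Phrase detection with Gensim's Phrases model; scoring="npmi" uses
# normalized pointwise mutual information, so threshold lies in [-1, 1].
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentences = [
    ["financial", "crime", "is", "monitored"],
    ["know", "your", "customer", "checks"],
    ["financial", "crime", "units", "run", "know", "your", "customer", "reviews"],
]

bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.3, scoring="npmi"))
trigram = Phraser(Phrases(bigram[sentences], min_count=1, threshold=0.3,
                          scoring="npmi"))

# Tokens that co-occur more often than chance are joined with "_",
# e.g. "know_your_customer"; exact merges depend on corpus statistics.
print(trigram[bigram[["know", "your", "customer", "checks"]]])
```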

Word embedding: Word embedding is a learned representation for text in which words with the same meaning have a similar representation. It is therefore capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc. Word embeddings are in fact a class of techniques in which individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles a neural network, hence the technique is often lumped into the field of deep learning. Previous work suggests that including multi-word phrases (n-grams) can improve the quality of the obtained word embeddings.
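A minimal sketch of training a custom skip-gram Word2Vec model with Gensim; the corpus and hyperparameters below are illustrative stand-ins, not our production settings:

```python
# Training a custom skip-gram Word2Vec model with Gensim.
from gensim.models import Word2Vec

# Tiny stand-in corpus; in the pipeline this is the preprocessed,
# phrase-merged text (uni-, bi- and tri-grams joined with "_").
corpus = [
    ["financial_crime", "units", "monitor", "transactions"],
    ["know_your_customer", "checks", "reduce", "financial_crime"],
    ["privacy", "rules", "apply", "to", "customer", "data"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # dimensionality of the word vectors
    sg=1,             # 1 = skip-gram, 0 = CBOW
    window=5,
    min_count=1,      # raise on a real corpus to drop rare tokens
    workers=4,
    epochs=10,
)

print(model.wv["financial_crime"].shape)        # (300,)
print(model.wv.most_similar("privacy", topn=3))
```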

Sentence/Document embedding: Document embedding is an extension of word embedding: whereas Word2Vec learns to project words into a latent d-dimensional space, document embedding aims at learning how to project an entire document into a latent d-dimensional space. The embeddings are indexed with FAISS (Facebook AI Similarity Search) for later ranking of the returned documents by relevance.
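A minimal FAISS indexing sketch; vectors are L2-normalized so that inner-product search corresponds to cosine similarity (the random vectors stand in for real document embeddings):

```python
# Indexing document embeddings with FAISS for fast similarity search.
import faiss
import numpy as np

d = 300                                                   # embedding dim
doc_vectors = np.random.rand(1000, d).astype("float32")   # stand-in vectors
faiss.normalize_L2(doc_vectors)

index = faiss.IndexFlatIP(d)          # exact inner-product index
index.add(doc_vectors)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```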

The outputs of this pipeline are the structured/processed data (along with the indexed version) and the trained word embeddings, which are used by our search engine in the front-end for query expansion. In this blog, we mainly focus on the word embeddings, as they differentiate our search engine from conventional search engines.
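As a sketch of the query expansion step, reusing the Word2Vec model trained in the example above (the expand_query helper and its parameters are our own illustrative choices):

```python
# Query expansion sketch: augment the user's search terms with their
# nearest neighbours in the custom embedding space.
def expand_query(terms, model, topn=5, min_sim=0.5):
    """Return the terms plus their close neighbours in embedding space."""
    expanded = set(terms)
    for term in terms:
        if term in model.wv:                      # skip out-of-vocabulary terms
            for word, sim in model.wv.most_similar(term, topn=topn):
                if sim >= min_sim:                # keep only close neighbours
                    expanded.add(word)
    return expanded

print(expand_query(["privacy"], model))
# e.g. {'privacy', 'privacy_statement', 'personal_data', ...} on a real corpus
```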

Word embedding

Word embedding algorithms aim to create vector representations of words that capture both semantic and syntactic meaning, learned from a large unlabeled corpus. An ideal evaluation should be able to analyze word embedding models from different perspectives, see this paper. In the next section we discuss in more detail how we evaluate our word embeddings.

Word embedding evaluation

Evaluating word embeddings falls into two categories: intrinsic and extrinsic evaluation. Intrinsic evaluations directly test for syntactic or semantic relationships between words; in our case, we compare our custom domain-specific word embeddings with the most widely used pre-trained models and evaluate the relatedness scores for pairs of words.

We collect around 160K records in the financial sector. To compare across the different word embeddings, we use both pre-trained and custom-trained models. Pre-trained models are trained on a large corpus such as Wikipedia, Common Crawl, or Google News; popular models include those based on Word2Vec or GloVe. While using these general-purpose language models saves training time and the need for custom preprocessing of data, they potentially lack domain-specific word semantics. Therefore, we also train custom Word2Vec skip-gram word embeddings on our datasets for the English, French, and Dutch languages.

Custom vs. pre-trained word embeddings

We first want to evaluate whether our custom domain-specific language models embed different word semantics than the pre-trained general-purpose models. When we examine, for example, the word privacy, we observe that the GoogleNews- and Wikipedia-based embeddings associate privacy with many words that tend to have different semantics (Table 1). The custom word embeddings, in contrast, give a noticeably more consistent set of similar words, with higher similarity scores.

Table 1: Pre-trained and custom language models with similar terms retrieved for the word “privacy”. Cosine similarity values are shown in parentheses.

Furthermore, the similar terms retrieved contain word phrases (bigrams) instead of just unigrams, and in many cases contain the word bank. As privacy is much more prevalent as a topic in our dataset related to the banking sector, these observations support a more relevant retrieval of similar words.

To assess whether the word relations are globally different between these models, we sample a large number of word pairs common to both the custom and GoogleNews Word2Vec models and evaluate their cosine similarity values under either model. While we observe that the custom models maintain high concordance among the word relations from run to run (Figure 2A), they are vastly different from those based on the GoogleNews model (Figure 2B).

Figure 2: Sampling 15,000 words from different embeddings shows that the custom models retrieve word relations that are vastly different from those of the GoogleNews model.

Extrinsic word embedding evaluation

In extrinsic evaluation, we use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. We perform extrinsic evaluations on the embeddings by building classifiers to assess whether the most common risk types can be predicted.

For the extrinsic word embedding evaluations, we build a gated recurrent unit (GRU)-based model with an embedding layer, followed by a 64-unit GRU, a 16-unit fully connected layer with ReLU activation, and an output layer with sigmoid activation. The model is trained with the Adam optimizer and a binary cross-entropy loss function.
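A hedged Keras sketch of this architecture; the vocabulary size, sequence length, and embedding dimension are placeholders, and the single sigmoid unit stands in for one output per risk type in a multi-label setup:

```python
# GRU-based classifier sketch matching the description above.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim = 20000, 200, 300  # illustrative sizes

model = tf.keras.Sequential([
    layers.Input(shape=(seq_len,), dtype="int32"),
    layers.Embedding(vocab_size, embed_dim),   # embedding layer
    layers.GRU(64),                            # 64-unit GRU
    layers.Dense(16, activation="relu"),       # 16-unit fully connected layer
    layers.Dense(1, activation="sigmoid"),     # sigmoid output (per risk type)
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```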

Based on the GRU-based neural network models, we observe that our custom embedding performs just as well as the much larger, time-intensive pre-trained GoogleNews embedding (weighted F1 scores of 0.8 and 0.78, respectively). Therefore, our custom embedding encodes language semantics on par with pre-existing models, while retaining domain specificity for similar-word extraction.

Word embeddings in production

Data can change over time. As a result, predictions of models trained in the past may become less accurate as time passes, see this paper. This problem of changing underlying relationships in the data is called concept drift in the field of machine learning. In our case, concept drift, or diachronic semantic shift, indicates how the meaning of a concept or a word changes over time.

We intend to monitor the stability of the word embeddings in production over time (Figure 3). We capitalize on the intrinsic run-to-run variability of the model (on the same data) to estimate its background variability. We can then determine whether the extra variability between an old and a new model differs from this background variability.

Figure 3: Statistical method for monitoring word embeddings

More specifically, for two given embeddings A and B, we first sample M words common to both embeddings. We derive M-1 cosine similarity values (e.g. between the 1st and 2nd word, the 2nd and 3rd word, etc.) based on A and again based on B. We calculate Spearman's correlation (ρ) of the cosine similarity values between A and B, binned into 10 bins (from -1 to 1) based on the values under A. This procedure is repeated P times to generate a bootstrapped distribution of ρ values per bin. A Gaussian kernel density estimation is then fitted to each distribution. We perform this bootstrapping method on several pairs of embeddings. For embeddings generated from different runs on the same data, the resulting distribution constitutes the null distribution. For embeddings generated from old and new input data (during monitoring), the resulting distribution constitutes the test distribution.
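A hedged sketch of this bootstrap procedure; the helper names are ours, emb_a and emb_b are assumed to be dict-like word-to-vector mappings (e.g. Gensim KeyedVectors), and common_words must contain at least M entries:

```python
# Bootstrap of per-bin Spearman correlations between two embeddings.
import numpy as np
from scipy.stats import spearmanr

def consecutive_cosines(vectors):
    """Cosine similarity between word i and word i+1 of the sample."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.sum(v[:-1] * v[1:], axis=1)

def bootstrap_rho(emb_a, emb_b, common_words, M=1000, P=200, n_bins=10):
    rng = np.random.default_rng(0)
    edges = np.linspace(-1, 1, n_bins + 1)
    rhos = [[] for _ in range(n_bins)]
    for _ in range(P):
        words = rng.choice(common_words, size=M, replace=False)
        sims_a = consecutive_cosines(np.array([emb_a[w] for w in words]))
        sims_b = consecutive_cosines(np.array([emb_b[w] for w in words]))
        # Bin the pairs by their similarity under A; correlate within bins.
        which = np.clip(np.digitize(sims_a, edges) - 1, 0, n_bins - 1)
        for b in range(n_bins):
            mask = which == b
            if mask.sum() > 2:
                rho, _ = spearmanr(sims_a[mask], sims_b[mask])
                rhos[b].append(rho)
    return rhos  # per-bin bootstrap distributions of Spearman's rho
```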

A comparison between the two distributions, as assessed by the Jensen-Shannon divergence or the Kolmogorov-Smirnov test statistic against pre-defined thresholds, enables monitoring of any substantial changes to the embeddings.
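A sketch of this comparison with SciPy; note that scipy.spatial.distance.jensenshannon returns the Jensen-Shannon distance (the square root of the divergence), so we square it:

```python
# Comparing a null and a test distribution of rho values.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import gaussian_kde, ks_2samp

def compare_distributions(null_rhos, test_rhos):
    grid = np.linspace(-1, 1, 512)
    p = gaussian_kde(null_rhos)(grid)       # KDE of the null distribution
    q = gaussian_kde(test_rhos)(grid)       # KDE of the test distribution
    jsd = jensenshannon(p, q) ** 2          # Jensen-Shannon divergence
    ks = ks_2samp(null_rhos, test_rhos).statistic
    return jsd, ks

# Trigger an alert when jsd exceeds an empirically set threshold
# (cf. the 0.22 stable vs. 0.70 drifted values reported below).
```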

Our custom Word2Vec embeddings are rebuilt periodically with updated records to ensure the model stays up-to-date and continues to capture the relevant vocabulary and semantics. For monitoring in production, we apply a statistical approach that measures word relations and their distributional differences between a given embedding and a reference embedding.

Figure 4: Monitoring word embeddings in production, illustrating that there are no substantial changes to the custom models being monitored.

We observe that the variations in the updated embeddings are not substantially different from the null distribution based on the intrinsic stochasticity on the same dataset (Figure 4). When we compare to a substantially different word embedding (i.e. Word2Vec based on GoogleNews), we observe a dramatic change and shift in the distribution. Quantitatively, relative to the null distribution, the updated custom Word2Vec embeddings and the GoogleNews Word2Vec embedding have Jensen-Shannon divergences of 0.22 and 0.70, respectively. We evaluate this for the embeddings of all languages and n-grams, and set empirical thresholds for triggers in production.

Conclusion

We present an end-to-end IR system used in production as a semantic search engine with intelligent keyword expansion and continual model monitoring. It addresses several challenges in handling multilingual, domain-specific unstructured text and privacy compliance, and simplifies the derivation of keywords within a semantic class. Our experiments show that the custom word embeddings are distinct from general-purpose models, present more relevant search terms, and can be monitored in production with a novel statistical approach. Overall, our proposed solution is applicable in other domains.

Thanks to my colleagues Nikki van Ommeren and Boyang Zhao for their contributions.
