Don’t ignore the legacy, improve it!

Thaynara Santos
BTG Pactual Developers
8 min read · May 26, 2020


Improving processes with Elasticsearch.

Overview

I currently work in the IT Regulatory area of BTG Pactual, which is responsible for the prevention of money laundering and the risk classification of clients. The validations range from checking a client's registration status with the federal revenue service and their involvement in the media or in national and international restrictive lists, to identifying divergences between customers' behavior and their movements and positions.

The main objective of the team is to develop systems that automate existing processes, looking for better ways to make them more efficient and to help compliance analysts protect the bank against possible involvement in illegal activity.

(If you're a Harry Potter fan, think of this area as Defense Against the Dark Arts.)

Now that we have an overview of the area, let’s go to the problem that gave rise to this post.

Problem

There is a process that must validate, for each transaction carried out by the bank, such as transfers, account openings, or other operations, whether the name present in it appears on an international restrictive list (this list may contain names of terrorists, internationally wanted people, and so on).

It may seem like a simple process: we get a name and check whether it is on the list; if it is, the transaction is blocked, and if not, it is released. And in fact it is simple. The problem was not the process but the volume of data and the time it took to validate each name, and with the arrival of banking the volume would triple in a short time.

The validation of each name took about 10 seconds to process. Besides the process being entirely synchronous, each name still had to be checked against the N names present in the international restrictive list, after some treatments, because names could come in different orders, with special characters, or with insignificant words. It was also necessary to apply a similarity algorithm (Jaro-Winkler) to handle misspellings. On top of that, all the data was stored in our relational database (SQL Server), which added even more time to the process, because we had latency and constant disk access to load a huge amount of data.

Finding the right approach

We needed something fast that could handle a large volume of data, so we built a new application from scratch. First, we made the process asynchronous and scalable using some resources from .NET Core and Docker, and improved its resilience with a retry policy. We then ran tests with other similarity algorithms, but did not get good results, and that is when the idea came of doing a proof of concept with Elasticsearch, which in general met all the requirements: the search is fast because results are cached, document mappings can be shaped in a customized way so that the index is more performant, and there are other resources such as word treatments.

The only problem was that the similarity algorithm used by Elasticsearch is Levenshtein, which, unlike Jaro-Winkler, only counts the edits and does not give weight to whether the typo is closer to the end of the string than to the beginning. That matters here, because variations usually appear more in surnames than in first names. So we decided to use Elasticsearch as a pre-filter: all the names on the international restrictive list were loaded into it and, when a search was required, it did the hard work of returning the 10 most relevant candidates. On that result we applied the Jaro-Winkler algorithm to find the final matches compatible with the name.
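To make the idea concrete, here is a rough sketch of what that pre-filter step can look like as a Query DSL request (the index and field names are illustrative, not our production mapping): a fuzzy match query that asks Elasticsearch for only the 10 most relevant candidates.

GET {{base_url}}/restrictive_list/_search
{
  "size": 10,
  "query": {
    "match": {
      "name": {
        "query": "maria silva cardoso",
        "fuzziness": "AUTO"
      }
    }
  }
}

The 10 hits returned by this query are then re-scored in the application with Jaro-Winkler, and only the names it considers compatible are treated as final matches.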

With this solution we were able to reduce the validation of each name from 10 seconds to 55 milliseconds on average, while maintaining the same quality of matches, which we guaranteed by testing against the history of checks already made.

So, what is it?

Elasticsearch is a distributed and open source data search and analysis engine for all types of data, including textual, numeric, geospatial, structured and unstructured. Elasticsearch is built on Apache Lucene and was first launched in 2010 by Elasticsearch N.V. (now known as Elastic). Known for its simple REST APIs and its distributed nature, speed and scalability, Elasticsearch is the core component of Elastic Stack, a set of open source tools for ingesting, enriching, storing, analyzing and visualizing data. (description taken from the official website)

How can I join this magical world?

Set up the environment

To make a simple proof of concept, you can run Elasticsearch with Docker on your local machine using the following command in a terminal:

docker run -d -p 9200:9200 -e "discovery.type=single-node" -v esdata:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:6.4.2
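If the container comes up correctly, a quick sanity check is to call the root endpoint, which should return a small JSON with the cluster name and version:

curl http://localhost:9200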

If you want to bring Elasticsearch up together with your application, you can use the following docker-compose file:

version: '2'
services:
  your-app:
    ports:
      - "5883:5883"
    environment:
      - ElasticSearchBaseUrl=http://elasticsearch:9200
      - others_environment_variables...
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.4.2
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - bootstrap.memory_lock=true
      - discovery.type=single-node
      - http.cors.allow-credentials=true
      - http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization
      - http.cors.allow-origin=http://your-app:5883
      - http.cors.enabled=true
      - http.port=9200
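Saving this as docker-compose.yml and running docker-compose up -d should bring up both containers, with your application reaching Elasticsearch through the ElasticSearchBaseUrl variable (the hostname elasticsearch resolves only from inside the compose network).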

(It is good to keep in mind that, depending on the criticality and needs of your application, you may need a dedicated cluster or a managed cloud service when using Elasticsearch in production; in our case we use the AWS service.)

Requests can be made in two ways: using a client library for the language your application is written in, or making requests directly to the Elasticsearch endpoints. We chose the latter, because the library ends up being a wrapper around the endpoints, and some of our more specific queries needed the flexibility that the endpoints provide.

Create the index

The index name needs to follow some specifications: lowercase letters only; it cannot contain \, /, *, ?, ", <, >, |, :, commas, or spaces; it cannot start with -, _ or +; it cannot be . or ..; and it cannot be longer than 255 bytes.

Below is an example of how to create an index. But hold on: before creating it, we need to see how to define the settings and mappings.

PUT {{base_url}}/your_index_name

Define settings

  • number_of_shards — Default 1 (subdivides the index into shards so that each one can be stored on a different node in the cluster)
  • number_of_replicas — Default 1 (defines the number of replicas, where each one will be on a different node, to increase resilience if any node fails)
  • analysis — Where filters or analyzers are declared.
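As a minimal sketch (the values are illustrative, not the ones we used), these settings go in the body of the index creation request; a fuller example including the analysis section appears at the end of the mappings discussion below.

PUT {{base_url}}/your_index_name
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}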

Define mappings

Below are some of the mapping options; those marked in bold are the ones used in the solution presented here, and you can find out more about each of them in the official Elasticsearch documentation.

(If they are not defined manually, Elasticsearch will define them for you, but it is advisable to make this definition explicitly so that the index better meets the requirements of your application and performs better.)

  • Data types: core (text or keyword, long, integer, short, byte, double, float, half_float, scaled_float, date, date_nanos, boolean, binary, integer_range, float_range, long_range, double_range, date_range), complex, geo, specialized.
  • Analyzers: custom, standard, simple, whitespace, stop, keyword, pattern, language, fingerprint.
  • Tokenizers: word-oriented (standard, letter, lowercase, whitespace, UAX URL email, classic, thai), partial word, structured.

(In our case, tokenizing was very useful because we had the problem of names not arriving in the expected order for the similarity comparison; for example, a name could come as “Maria Silva Cardoso”, “Cardoso Silva, Maria”, “Maria Cardoso”, and so on. These combinations used to be generated in code after the records were retrieved; now Elasticsearch already performs the search taking into account the name broken into smaller parts.)

  • Filters (more than 45 types): lowercase, asciifolding, unique, pattern_replace, stop.

(In our case, filters were important for ignoring things that did not have much significance for the similarity calculation, such as “ltda” in the names of legal entities, special characters, accents, or words with little relevance in the context.)
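Putting settings and mappings together, a sketch of an index creation request could look like the one below. The analyzer, filter and field names are hypothetical, the stop-word list is just an example, and the syntax follows Elasticsearch 6.x (where mappings still carry the _doc type):

PUT {{base_url}}/your_index_name
{
  "settings": {
    "analysis": {
      "filter": {
        "irrelevant_words": {
          "type": "stop",
          "stopwords": ["ltda", "sa", "me"]
        }
      },
      "analyzer": {
        "name_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "irrelevant_words"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "name_analyzer"
        }
      }
    }
  }
}

With this analyzer, “Cardoso Silva, Maria” is indexed as the lowercase, accent-free tokens cardoso, silva and maria, which is what allows the out-of-order comparisons described above.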

Create your first document

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node. (for more)

Creating your first document is very simple: you just need to do a POST to your index, passing a JSON with your properties, as below:

POST {{base_url}}/your_index_name/_doc
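For example, indexing a name from the restrictive list could look like this (the field name must match what was defined in the mappings):

POST {{base_url}}/your_index_name/_doc
{
  "name": "Maria Silva Cardoso"
}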

Do your own search

Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries. (for more)

The following examples are types of simple searches that can be made using the Query DSL:

GET {{base_url}}/your_index_name/_search
  • Types: match, match_all, match_phrase, multi_match, dis_max with match_phrase_prefix.
  • With or without operators.
  • Fuzziness (Levenshtein distance), see the sketch after this list:
    0, 1, 2: a fixed maximum edit distance.
    AUTO: generates the allowed distance based on the length of the term:
      0..2 characters: the term must match exactly
      3..5 characters: one edit is allowed
      > 5 characters: two edits are allowed
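As an illustration, a sketch of a query combining these options might look like the request below; the second field (alias) is hypothetical, just to show multi_match across more than one field:

GET {{base_url}}/your_index_name/_search
{
  "query": {
    "multi_match": {
      "query": "maria cardoso",
      "fields": ["name", "alias"],
      "operator": "and",
      "fuzziness": "AUTO"
    }
  }
}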

Final considerations

Pay attention to the size of the nodes your application will need: Elasticsearch accepts only a limited number of concurrent requests per node, and once that limit is reached it may start to reject requests or behave unexpectedly.

Use it when the search is broader, like keywords in a sentence or similar words; for exact key-value lookups, something like Redis makes more sense.

In general, we see Elasticsearch used on e-commerce sites for auto-complete in search bars or to find related products. It can also be used to improve the way logs are handled and stored with Logstash, or as a BI tool to get insights from data through Kibana. I did not cover the topic in this post, but it is also possible to make aggregations of data.

This post covered a fairly specific case, but I hope it also added some knowledge by showing another way in which Elasticsearch can be fitted in for performance gains.

If you have any doubts, suggestions, or advice, feel free to contact us :)
