How restructuring data can help you increase performance
We all know that performance is affected by the way we structure our data and the way we access it. Wrong decisions can lead to slow or even unusable functionality. Good decisions can make functionality faster and/or easier to maintain. Whenever we set out to build a piece of functionality, we need to consider a few things about our data:
• What data will be used
• Where that data will be stored
• How and how much of it we will access
• How data will be structured
I dare say that the last bullet might just be the most important. It is heavily influenced by the previous three, yet it is also a challenge in itself. Good planning will require us to go back and forth between the first three bullets and the fourth one several times before we start coding. The more thought we give to our task, the easier it will be to maintain the code in the future. And, of course, changes will be less costly.
In this post, we will present the benefits of such an approach by using a real-life example the author was fortunate to be involved in. We will use the example of payment card issuer ranges.
We are part of Levi9, a nearshore software development company with an HQ in Amsterdam and a Delivery Center in Serbia. The team this post refers to is quite diverse, coming from many walks of software development, and we work with a customer in the Netherlands.
The customer in question is a Communication Platform as a Service provider (one of many). They decided to get in on the action of credit card payments. Most of their code was Java, but for a few years now they have been migrating parts of it to Go. As this was brand-new functionality, it was decided at the time that this part of the system would be implemented entirely in Go.
One part of the system stores issuer identification ranges. What is an issuer identification range? It is a range of numbers in which all cards share the same scheme, type, level, and issuer. An issuer is a financial institution that issues payment cards. So, the important data here is as follows (with some examples):
• Card scheme (Mastercard, VISA)
• Card type (Mastercard credit / debit, Maestro, Cirrus, VISA electron)
• Card level (Mastercard credit classic / gold / platinum)
• Issuer name (e.g. Banca Intesa A.D.)
• Issuer country (e.g. Republic of Serbia)
• The first number in the range
• The last number in the range
Numbers in a range are 19 digits long: the first 16 form the PAN (Primary Account Number), followed by the 3-digit CVV (Card Verification Value). Now that we know this, let us see what was found, what the problem was, and how it was improved.
What was in the code
The code was implemented in Go, a very fun programming language to code in. Data was organized in a tree structure: a root node connected to other nodes. The first range encountered would form the root, and all subsequent ranges would branch out from nodes already in the tree. Insertion starts from the root. If the inserted range comes after the node's range, a new node is created to the right; if a node already exists to the right, the inserted range is compared against that node in the same way (and symmetrically to the left).
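As an illustration, a minimal sketch of such a tree insert might look like the following; the type and field names are ours, not the project's actual code, and the range data is reduced to bare numeric bounds:

```go
package main

import "fmt"

// node is a simplified binary-tree node holding one issuer range.
type node struct {
	first, last uint64 // range bounds
	left, right *node
}

// insert walks the tree and attaches the new range as a leaf:
// ranges entirely after the current node go right, earlier ones left.
func insert(root *node, first, last uint64) *node {
	if root == nil {
		return &node{first: first, last: last}
	}
	if first > root.last {
		root.right = insert(root.right, first, last)
	} else {
		root.left = insert(root.left, first, last)
	}
	return root
}

// find returns the node whose range contains pan, or nil.
func find(root *node, pan uint64) *node {
	for root != nil {
		switch {
		case pan < root.first:
			root = root.left
		case pan > root.last:
			root = root.right
		default:
			return root
		}
	}
	return nil
}

func main() {
	var root *node
	root = insert(root, 100, 199)
	root = insert(root, 200, 299) // sorted input: always goes right
	root = insert(root, 300, 399)
	fmt.Println(find(root, 250) != nil) // true
}
```

Note that with already-sorted input every insert takes the right branch, which is exactly the flaw described below.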
This structure is versatile and easy to implement, but it has a major flaw: the file containing the range data is already sorted. With sorted input, every new range ends up to the right of the previous one, so the resulting structure is effectively a linked list, not a tree. It took about 5 minutes to build the tree from the file, and determining which range a card number belongs to took anywhere from 10 μs to 100 ms.
Time to try a different approach
To compare the approaches fairly, we removed all the checks ensuring that ranges do not overlap, and reduced the range data to a single small string. This way, we compare just the effects of the chosen structure. Go has a great built-in feature: benchmark tests. The test code is run many times, and the average execution time, the number of memory allocations, the amount of memory allocated, and so on are reported. In this post, we will focus on the execution time:
Times of the simplified version of the original code (benchmark table not shown):
* ps is a picosecond, or 10⁻¹² seconds
* fs is a femtosecond, or 10⁻¹⁵ seconds
Since the data we will use is already sorted, why don’t we just use an array? We don’t even need to implement a fancy search algorithm. A binary search in the array should work just fine. All figured out, right?
No. There are around 96k ranges for Mastercard alone, and an array that large takes a long time to sort. With a simple sorting algorithm built from two nested for loops, sorting time grows quadratically with array size.
The tables below show the values obtained from the benchmark tests here.
Sorting can also be accelerated by creating a new array and inserting the elements from the original array one by one, each in its proper place in the new array. That way, the array is sorted as we build it.
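The two ideas above, inserting into a sorted array and binary-searching it, can be sketched as follows; `cardRange`, `insertSorted`, and `findRange` are illustrative names, and real entries would carry the scheme, type, level, and issuer data as well:

```go
package main

import (
	"fmt"
	"sort"
)

// cardRange is a stripped-down entry holding only the range bounds.
type cardRange struct {
	first, last uint64
}

// insertSorted places a new range into an already-sorted slice,
// keeping it sorted, so the array is sorted as we build it.
func insertSorted(s []cardRange, r cardRange) []cardRange {
	i := sort.Search(len(s), func(i int) bool { return s[i].first >= r.first })
	s = append(s, cardRange{}) // grow by one
	copy(s[i+1:], s[i:])       // shift the tail right
	s[i] = r                   // drop the new range into its slot
	return s
}

// findRange binary-searches for the range containing pan; -1 if none.
func findRange(s []cardRange, pan uint64) int {
	// First index whose range starts after pan; the only candidate
	// that can contain pan is the element just before it.
	i := sort.Search(len(s), func(i int) bool { return s[i].first > pan })
	if i > 0 && pan <= s[i-1].last {
		return i - 1
	}
	return -1
}

func main() {
	var s []cardRange
	for _, r := range []cardRange{{300, 399}, {100, 199}, {200, 299}} {
		s = insertSorted(s, r)
	}
	fmt.Println(findRange(s, 250)) // 1
	fmt.Println(findRange(s, 450)) // -1
}
```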
How do we improve from here?
The first thing we needed to do was to understand the data we were using. What is the structure of a PAN? The first digit is the MII (Major Industry Identifier). The following 5 to 7 digits identify the card-issuing institution. The rest is the account number allocated by the card issuer. We are guaranteed that all numbers in a range share the same first 6 digits.
Now that we know this, we can create our specialized version of a multi-level map (will be referred to as a multimap in the rest of the text).
The first two digits of the number are the key of the top-level map, whose values are maps themselves. Digits three and four are the keys of that second map, whose values are again maps. Digits five and six are the keys of the third map, whose values are sorted arrays. Roughly, the structure looks like this:
• rootMap is map[first_2_digits]subMap1
• subMap1 is map[second_2_digits]subMap2
• subMap2 is map[third_2_digits]rangeArray
• rangeArray is a sorted array of ranges
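A minimal sketch of this multimap, with hypothetical names and entries reduced to bare bounds, might look like this:

```go
package main

import "fmt"

// cardRange is a simplified entry; the real one also carries
// scheme, type, level and issuer data.
type cardRange struct {
	first, last uint64 // 19-digit range bounds
}

// multimap keys off the first six digits, two at a time, ending in
// a small slice that stays sorted and short.
type multimap map[int]map[int]map[int][]*cardRange

// keys splits the first six digits of a 19-digit number into
// three two-digit map keys.
func keys(n uint64) (int, int, int) {
	prefix := n / 10_000_000_000_000 // keep the first 6 of 19 digits
	return int(prefix / 10_000), int(prefix / 100 % 100), int(prefix % 100)
}

func (m multimap) insert(r *cardRange) {
	k1, k2, k3 := keys(r.first)
	if m[k1] == nil {
		m[k1] = map[int]map[int][]*cardRange{}
	}
	if m[k1][k2] == nil {
		m[k1][k2] = map[int][]*cardRange{}
	}
	// The leaf slices are tiny, so a simple linear insert that keeps
	// them sorted by first bound is fast enough.
	arr := m[k1][k2][k3]
	i := 0
	for i < len(arr) && arr[i].first < r.first {
		i++
	}
	arr = append(arr, nil)
	copy(arr[i+1:], arr[i:])
	arr[i] = r
	m[k1][k2][k3] = arr
}

func (m multimap) find(pan uint64) *cardRange {
	k1, k2, k3 := keys(pan)
	for _, r := range m[k1][k2][k3] {
		if pan >= r.first && pan <= r.last {
			return r
		}
	}
	return nil
}

func main() {
	m := multimap{}
	m.insert(&cardRange{first: 5_355_220_000_000_000_000, last: 5_355_229_999_999_999_999})
	fmt.Println(m.find(5_355_221_234_567_890_123) != nil) // true
}
```

Storing pointers in the leaf slices is what makes cloning and editing cheap, as discussed below.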
Why partition the structure so much? To keep the arrays small. No array holds more than a thousand elements, and most hold about 50 to 130. With arrays this small, we can use the simplest sorting and search algorithms, and execution times will not differ much. Even with all the safety checks and using all the data from the file, all 96k ranges are inserted in just 0.2 seconds. A range search takes about 10 μs.
Not only that: if the array elements are not range data but pointers to range data, cloning and editing the overall structure are both fast and simple. And the ease of maintenance and/or debugging certainly matters. Large arrays are difficult to inspect during a debugging session. Having many pointers can have a detrimental effect on the garbage collector, but the overall improvement in speed and maintainability more than makes up for it. This can be seen from the benchmark tests here.
Why not simply randomize the tree instead?
We can see that tree generation and read times are comparable to those of the multimap. We could simply load the entire input into memory, shuffle it thoroughly, and build the tree from that. Instead of spending time planning and testing new structures, we could have started coding immediately.
The main reason is the need to shuffle. Go already has a good, fast function for shuffling the elements of an array, but it requires all the data to be in one single array. Try debugging and inspecting an array of 100k elements (which are, by the way, pointers to other structures) and you will understand just how much more difficult development and maintenance become.
We could also implement shuffling and exchanging between smaller arrays and/or maps, but in the time it would take to plan and implement that, we could develop another data structure altogether. If we understand the data we will be using, we can build a structure that is both fast and maintainable. Which brings us to maintainability.
Some ranges become deactivated over time; a bank might close, for instance. All the data then needs to be reinserted in a way that avoids ending up with very long branches again. The fastest way would be to collect all the elements, filter out the ones that are no longer valid, and rebuild the tree. Again, we run into the problem of one large array/map.
Any way you slice it, we are left needing to shuffle the elements to keep the tree distribution usable. The multimap implementation we created simply does not need such safeguards. It is, of course, not a silver bullet that solves every problem. It is merely an example demonstrating the benefits of understanding your data and tailoring your structures to its characteristics.
In most cases, you will improve your data structures if your data can be partitioned. Smaller data sets are easier and faster to deal with. The trick is to know HOW to partition it.
For that, we figure out the features and recognizable patterns in our data. Then we know how to split it and “what we can get away with”. In our case, we traded a hundred or so extra small maps for a thousandfold (1000x) increase in speed.
We hope this real-life example and experience from Levi9 serve as a reminder that devoting time to understanding your data is always a win-win. And we hope these comparison tests will help you reach your own solution faster.
Senior Software Developer @ Levi9Serbia