Case Study: kdb+ Anymap with CMO data

Eman Santiano
Version 1
Published Jul 7, 2023 · 13 min read

The code used to generate the table in this blog can be found here.

Recently we’ve been working with a client migrating a very large Collateralized Mortgage Obligation (CMO) file system into a new kdb+ database. This would normally be a straightforward task, had it not been for the complex structure of the CMO data and how the client intended to use it. The data came from several vendors and therefore contained hundreds of different field names that were unique to their source, only some of which were common or at least easily normalized. One option was to set up a mapping that would merge the most common data under one name, but that would take a lot of development time to set up and maintain. Plus, any new fields published wouldn’t be captured if they were unmapped. Another option was to capture the raw upstream feed into a filesystem and have the option of replaying data if and when new fields were added, but that’s a little crude…

So, considering they would only occasionally need to drill down into the individual mortgages, we realised we had a good use case for anymap. Anymap (added in v3.6) allows you to store nested lists of complex, structured data in one column, which is perfect for the individual mortgages of each CMO. This case study explains the benefits and drawbacks of anymap, and how we used it to provide a solution that addressed all of our client’s requirements, which were:

  • Minimise storage costs
  • Efficiently access data when needed
  • Regularly increase data capture using minimal development resources

Case Study: CMO Database

This section will cover how a CMO is structured, what our dataset contains, and how we generated that dataset.

A Collateralized Mortgage Obligation (CMO) is:

  • a security made up of several pools of mortgages
  • each pool is a collection of mortgages that share comparable risk profiles

Table 1: Dummy CMO table

As shown in Table 1 above, a single row shows data for one CMO; however, the Collateral Bag column can contain hundreds of thousands of mortgages per CMO. That’s when we realised that anymap was a viable option for this column, as we could keep the structure shown in Table 1. Given the user requirements, the Collateral Bag column would be excluded from most queries, but it would remain readily available as and when needed (see “Querying an Anymap Column” for how to do this).

Our CMO Table

To show how we have been working with this data, we generated a dummy dataset (same format as Table 1 above). Each row represents a single CMO, with some columns aggregating the data from the pools of mortgages (Table 3 provides full details of these columns). For this case study, we will focus on the collateral bag column (collBag), which holds all the individual mortgages within the pools of a CMO and that we will store as type anymap. Table 2 below describes each column.

Table 2: Description of reference table columns

The full details of each key that a mortgage can have are provided in Table 3 below. For our example, we use seven common keys which are shared across all mortgages, and a randomized subset of the rest. In practice there may be no overlap in keys, and hundreds, if not thousands, of varying keys. The number of keys would depend on the number of data vendors subscribed to.

Table 3: Description of individual mortgages’ dictionary keys
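
To make this shape concrete, the sketch below builds a single dummy row with a tiny collateral bag. The column and key names are borrowed from the queries later in this post; everything else (the values, the key subset, the "loan_" prefix) is illustrative rather than the real generation logic, which lives in the linked repository. Note that each mortgage dictionary starts with a generic null entry, a detail explained later under “Drawbacks”.

/ illustrative only: one mortgage dictionary, keyed on a small subset of fields
mkMortgage:{(``secId`state`currentAmount`loanAge)!
  (::;`$"loan_",string x;rand`NY`FL`CA;rand 1000000f;rand 360f)}

/ a single dummy CMO row with a three-mortgage collateral bag
cmoRow:([]id:enlist"cmo_0001";currency:enlist`USD;rating:enlist`AAA;
  grossCoupon:enlist 3.5;maturityDate:enlist 2045.01.01;
  collBag:enlist mkMortgage each til 3)

/ drilling into the nested column works exactly as in the sample queries later on, e.g.
/ cmoRow[`collBag][;;`state]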

How Anymap Works

Previously, it wasn’t possible to memory-map on-disk data of mixed types or nested lists. Anymap allows mixed/nested data of any depth to be mapped on disk, letting us dig into those complex structures without blowing up memory.

To demonstrate this, we observed the kdb+ memory stats (.Q.w) while querying an on-disk anymap column. We captured the memory map (`mmap) usage before, during and after the query, as well as the actual physical memory used.

// simple query accessing the anymap column, checking `mmap and during the query
q)`before`during`after!(.Q.w[]`mmap;exec first mmap from select mmap:.Q.w[]`mmap,collBag from cmoRef where date=2020.09.14;.Q.w[]`mmap)
before| 0
during| 9570354333
after | 0

// checking physical memory used from the heap during the query
q)beforeQueryMemUsed:.Q.w[]`used
q)memUsedDuring:exec first used from select collBag,used:.Q.w[]`used from cmoRef where date=2020.09.14
q)afterQueryMemUsed:.Q.w[]`used
// normalizing the stats by reducing before memory used to 0
q)`before`during`after!(beforeQueryMemUsed;memUsedDuring;afterQueryMemUsed)-beforeQueryMemUsed
before| 0
during| 4576
after | 1040

As shown in Table 4 below, roughly 9,127 MB was mapped into memory during the query, not physically copied into it. The memory-used stats are negligible, clearly highlighting that kdb+ did not have to request more heap memory from the OS despite querying a huge on-disk column.

Table 4: Memory stats (MB)

Anymap Write Operator

From kdb+ v3.6 onwards, any mixed list column saved to disk is automatically stored as type 77h (anymap), meaning the column itself is mappable. However, the nested data within the column is still un-mappable, so reaching down into the nested data to access or filter it still copies it to the heap. To handle this type of data structure, the anymap write operator 1: needs to be applied to the column file on disk. This makes every nested structure within the column mappable, no matter how deep the complex data goes, so it can be accessed without copying the whole column into memory.

q)mixedNestedList:((`a`b;"c");(10i;11f);(2020.10.01;14:12:00.000);(`foo;0b))
q)save`:mixedNestedList; // save to disk as normal
q)type get`:mixedNestedList // mixed list is stored as type anymap
77h
q)type each get`:mixedNestedList // ...but individual elements of list are not type anymap
0 0 0 0h

q)`:mixedNestedList 1: mixedNestedList; // save to disk using 1:
q)type get`:mixedNestedList
77h
q)type each get`:mixedNestedList // ...now individual list elements are type anymap
77 77 77 77h

Using 1: to write a mixed nested list or column to disk causes kdb+ to create three files:

  1. mixedNestedList: Contains all the offsets into the # file, allowing kdb+ to memory map the data
  2. mixedNestedList#: Stores all the data collapsed together in q binary form
  3. mixedNestedList##: Created only if the nested data contains any symbols
    - Similar concept to the sym file created with splayed/partitioned kdb+ databases
    - All symbols from this file are copied into physical memory when the list/column is queried
    - See the “Drawbacks” section for more on this

When the on-disk list/column is accessed, these files are combined to reconstruct the kdb+ entity. kdb+ can use the stored offsets to deserialize just a portion of the flattened binary into a kdb+ structure, providing random access to the nested data.
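
As an illustration, after the mixedNestedList example above has been written with 1:, listing the working directory shows these files (the ## file appears because that example contains symbols); this is an illustrative check rather than captured output:

q)key`:.   / list the working directory; expect entries along the lines of
/ `mixedNestedList`mixedNestedList#`mixedNestedList##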

Generate Dummy Dataset

We used the 1: operator to save the Collateral Bag column to disk, so that not only the column itself but also each element within it was mappable. Below, I’ve highlighted the write-to-disk logic of the script; the full code used to generate the dummy CMO dataset can be found at https://gitlab.com/EmanSantianoVersion1/kdb-anymap-case-study.

writeToDisk:{[db;t;data;dt](` sv .Q.par[hsym db;dt;t],`) upsert .Q.en[hsym db;data];}

applyAnymap:{[db;t;anymapC;dt]p 1: get p:` sv .Q.par[hsym db;dt;t],anymapC;};

// Save data to disk
writeToDisk[`anymapDB;`cmoRef;cmoRef;]each dates;

// Apply type anymap to the collBag column
applyAnymap[`anymapDB;`cmoRef;`collBag]each dates;
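
As a quick sanity check (a sketch only, assuming the database was written to ./anymapDB and that dates is the list of partition dates used above), an element of the collBag column should now report as type 77h, mirroring the mixedNestedList example earlier:

q)system"l anymapDB"
q)type exec first collBag from cmoRef where date=first dates
/ expect 77h, matching the 1: example above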

Querying an Anymap Column

In this section we give examples of when it may be useful to drill down into the individual mortgage data to better understand what a CMO contains, which can be particularly valuable for risk analysis.

Sample Queries

Below are sample queries showing scenarios where the underlying mortgage data can be put to use.

Fetching CMOs with mortgages in certain US States:

  • Search for any underlying mortgages that have state = `NY or `FL
  • Return those CMOs and their id, currency, rating, gross coupon and average maturity date
select id,currency,rating,grossCoupon,maturityDate
from cmoRef where date=2020.09.14,
any each `NY`FL in/:collBag[;;`state]

Fetching CMOs with mortgages that have over $500k remaining to pay:

  • Search for any underlying mortgages that have over $500,000 left to pay
  • Return those CMOs and their id, average current amount, average original amount and average mortgage age
select id,currentAmount,originalAmount,loanAge
from cmoRef where date=2020.09.15,
any each collBag[;;`currentAmount]>500000

Fetching CMOs with >50% 120-day-delinquent mortgages:

  • Count how many of each delinquency status there is per CMO
  • Convert the counts into percentages relative to each other per CMO
  • Any CMOs that have more than 50% of mortgages with 120+ day delinquency are flagged
countLoanKeys:{key[g]!count each value g:group x};
calcPct:{100*x%sum x};
getPctDict:{key[x]!calcPct[value x]};

select defaultingCMOs:id where
(getPctDict'[countLoanKeys each collBag[;;`delinquencyStatus]]
)[;`$"120+"]>50f // flag CMOs where over 50% of mortgages are 120+ days delinquent
from cmoRef where date=2020.09.16

Alternative Storage Options

Now that we’ve explored the anymap solution in detail, we also wanted to consider alternative storage solutions before reaching a sound decision. One option we considered (which is probably the most obvious one you’ve been thinking of this whole time) is to pivot out the nested mortgage data so that each of their keys becomes its own column in the table.

Well, we tried that… and the on-disk size blew up. Mainly because there are several upstream vendors providing the CMO data and they each have their own key naming conventions. Normalising those keys by mapping them to standard column names would remove the schema flexibility we wanted and could also become an unpleasant overhead for developers to maintain as more and more vendors are sourced in future. Plus, it doesn’t anticipate current vendors adding new keys, which may not be captured if we’re not notified. However, to get a full picture, we tested the impact of flattening out our dataset into a wider table where the keys are pivoted into columns. The code below was added to our dummy dataset code (see appendix) to simulate multiple vendors with varying keys:

numVendors:3;
cmoRef:update
{v:first 1?"vendor",/:string[1+til numVendors],\:"_"; // generate random vendors and prefix to all keys
{(``secId`pool`rating,`$y,/:string key[x] except ``secId`pool`rating)!value x}[;v]each x
}'[collBag]
from cmoRef;

Widened Flat Table

Pivot out the individual mortgages’ keys so that they are individual columns, creating a much wider table which may not be user-friendly if it gets very big.

/ pivot one CMO's mortgages into "loan_"-prefixed columns, union-joining the varying keys
flattenRow:{[row] (count[r]#enlist `collBag _row),'r:(uj/){enlist (`$"loan_",/:string[key x])!value x} each row`collBag};
/ flatten every row of the table and union-join the results into one wide table
flattenTab:{[tab] (uj/)flattenRow each tab};
flatPools:flattenTab[cmoRef];

Key-value Table

Split the CMO reference table into one without the Collateral Bag column, and a new table with CMO ID and the Collateral Bag data extended into two columns — one showing the key, and another showing the value.

keyValTabFn:{[row]
r:(uj/){
([]secId:count[x]#enlist x[`secId]; // CMO IDs
collBagKey:key x; // collateral bag keys
collBagVal:{$[10h~type x;x;string x]}each value x) // collateral bag values as strings
}each row`collBag;
:(count[r]#delete collBag from enlist row),'r;
};

kvTab:raze keyValTabFn each select id,collBag from cmoRef;
basicCmoRef:delete collBag from cmoRef;

Storage Implications

The sizes of each storage option (both compressed and uncompressed) for an increasing number of vendors are shown in Figures 1 and 2 below. There is little to no change in storage size for the anymap and key-value options as vendors increase, whereas the widened flat option shows a significant increase in storage costs with more vendors. The anymap database is smaller than the key-value database, and in the long term is more stable, or predictable, than the flat pools database. This is why the design stage of a kdb+ database is extremely important: had an increase in future vendors not been considered, an unsuitable choice could cost a huge amount of resources at a later stage of the project.

Figure 1: On-disk size of each uncompressed storage option with increasing vendors
Figure 2: On-disk size of each compressed storage option with increasing vendors
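
For reference, this kind of comparison can be reproduced with a simple recursive file-size sum; the sketch below uses the well-known recursive-key idiom, and the root directory names are illustrative (only anymapDB appears earlier in this post):

/ sketch: recursively sum file sizes (bytes) under a database root
diskUsage:{$[11h=type k:key f:hsym x;
  sum .z.s each ` sv/:f,/:k;   / directory: recurse into its contents
  hcount f]}                   / file: its size in bytes

/ illustrative database root names for the three options
diskUsage each`anymapDB`flatDB`kvDB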

Query Performance

We wrote queries for the flat pools database and the key-value database (see code at end of this section) that fetched the same data as those in the “Querying an Anymap Column” section, and measured their execution times. Figures 3 and 4 below show how they performed against the compressed and uncompressed databases.

Figure 3: Query execution times per uncompressed database
Figure 4: Query execution times per compressed database

Analysing Figures 3 and 4 above, the query performance of the anymap database trails behind both the flat pools and key-value databases, both compressed and uncompressed. The flat pools database, while wide and full of nulls, is the simplest structure and thus performed the best. The key-value database initially performed poorly, until we sorted the table storing the Collateral Bag data on the collBagKey column and applied a parted attribute (a sketch of this step follows below). The queries improved after this, performing similarly to the flat pools queries.
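
The sort-and-attribute step can be done in place on disk. The sketch below assumes the key-value table was splayed per date under ./kvDB as collBagTab; the database and table names here are hypothetical, not the exact names from our scripts:

/ sketch: sort each splayed partition on collBagKey, then set the parted attribute
applyParted:{[db;t;dt]
  p:.Q.par[hsym db;dt;t];   / path to the splayed table for this date
  `collBagKey xasc p;       / sort the on-disk table by collBagKey
  @[p;`collBagKey;`p#]};    / apply the parted attribute to that column

applyParted[`kvDB;`collBagTab]each dates;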

Comparing against Anymap Performance

Here we will analyse anymap’s performance against alternatives where a form of the nested structure would still be kept. The anymap solution we’ve discussed in the case study is unique, in the sense that it wasn’t “possible” to store this form of nested data in kdb+ previously. This section will shed further light on what anymap now makes possible in kdb+.

Serializing & Deserializing on Access

Another format of the nested data structure can be maintained. kdb+ provides an operator, -8!, that serializes any data structure into its byte representation, a simple list of bytes, and a counterpart, -9!, that reverses this, deserializing back into a kdb+ entity. Simple lists, such as lists of bytes, can be memory mapped, so let’s experiment with storing the nested data in byte form and see what the performance is like when deserializing on access.

writeToDisk:{[db;t;data;dt](` sv .Q.par[hsym db;dt;t],`) upsert .Q.en[hsym db;data];}
// write to disk creating a new column collBagSerialized
writeToDisk[`anymapDB;`cmoRef;update collBagSerialized:-8!'collBag from cmoRef;2020.09.18]

// example of deserializing using one of sample queries
select id,currency,rating,grossCoupon,maturityDate
from cmoRef where date=2020.09.14,
any each `NY`FL in/:(-9!'collBagSerialized)[;;`state]

Table 5: Execution time and memory usage: Anymap — deserializing

Because the list of bytes must be deserialized before the nested data can be searched, the overhead results in poor performance. If you filter on other fields in the table and only deserialize the output, performance improves. We can demonstrate how this compares to the non-serialized format.

// query non-serialized
select collBag from cmoRef where date=2020.09.29,id like "cmo_8286"
// query deserializing
select -9!'collBagSerialized from cmoRef where date=2020.09.29,id like "cmo_8286"

Table 6: Execution time and memory usage filtering on one id: Anymap — deserializing

Filtering for a single CMO id shows that deserializing the filtered output can perform quite well. The column list of bytes is mappable, and the overhead deserializing the single list of bytes back to a list of dictionaries is small. However, when the number of rows returned to be deserialized increases, the performance begins to drop off drastically.

Figure 5: Memory used by each query with increasing number of rows filtered on return

We can see in Figure 5 above that memory remains constant with increasing rows when using the anymap column, which we expect as the column is mappable. The serialized column, on the other hand, begins to perform poorly with only a small increase in rows. Even though the column list of bytes is mappable, the overhead of deserializing an increasing volume of nested data becomes evident very quickly.
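
A rough way to reproduce this comparison yourself is with \ts, which reports elapsed milliseconds and bytes allocated; the wildcard filter below is simply an illustrative way of varying how many rows are returned:

// compare time (ms) and space (bytes) of the two approaches over 10 runs
q)\ts:10 select collBag from cmoRef where date=2020.09.29,id like "cmo_8*"
q)\ts:10 select -9!'collBagSerialized from cmoRef where date=2020.09.29,id like "cmo_8*"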

Summary

Benefits

Throughout the case study, some of the main benefits we found were:

Schema flexibility:

▹Consuming from multiple data vendors, but not having to map or convert vendor field naming conventions

▹Developers can spend time on more important tasks rather than making configuration or schema updates to handle upstream vendor changes.

▹Scalability — as demonstrated when additional vendors were added to the dummy dataset in Figure 1.

Storage:

▹In practice there would be many more varying fields associated with different mortgage types; including them all was outside the scope of our dataset. Anymap can comfortably handle increases in non-repeating data fields with next to no storage implications. Take, for example, a one-off field that is only present in a small number of mortgages. Anymap only needs storage where that field appears. Flattening out the data, on the other hand, would add a full column to the table's schema just for this small number of values, with the rest of the column populated with nulls.

Drawbacks

Some of the drawbacks encountered were:

Symbol datatype:

▹Initially we decided to cast each underlying mortgage ID field to type symbol. The thinking was that this would give faster retrieval when filtering on the ID, even though we acknowledged the IDs were non-repeating. We discovered that, unlike the sym file in a splayed/partitioned database, anymap's "##" file is created as its enumeration domain. On every data access the "##" file is de-enumerated, meaning its contents are copied to the heap rather than mapped. We saw significant performance overheads with this approach and quickly reverted to storing the IDs as strings.

Query complexity:

Issues occurred with nested fields that don't appear in every mortgage (see the "Querying an Anymap Column" section). For example, when attempting to get the average loanAge across all the underlying mortgages of a CMO, any mortgage that happens not to have this key contributes a null to the returned list, and the type of that null depends on the type of the first value in that mortgage's dictionary. Applying kdb+ operators to the resulting mixed-type list will not work, so to handle this, every mortgage dictionary had a generic null (::) prepended as its first value, ensuring missing keys always return a generic null. Queries can then use a helper function that fills the generic nulls with the appropriate typed null so that the list becomes uniform.
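
A minimal sketch of such a helper is below. The name fillGenericNull and its exact mechanics are ours for illustration, not the production code: it swaps each generic null for a typed null taken from a value that is present, then collapses the list to a simple vector. The usage example follows the style of the sample queries above and assumes at least one mortgage per CMO actually carries the field.

/ sketch: make a mixed list containing generic nulls (::) uniform
fillGenericNull:{
  i:where(::)~/:x;                      / positions holding a generic null
  n:first 0#x first where not(::)~/:x;  / typed null derived from a present value
  raze$[count i;@[x;i;:;n];x]};         / fill, then collapse to a simple list

/ hypothetical usage: average loanAge per CMO, tolerating mortgages missing the key
select id,avgLoanAge:avg each fillGenericNull each collBag[;;`loanAge]
  from cmoRef where date=2020.09.14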

Conclusion

This case study has outlined a suitable use case for storing complex nested data in kdb+ using the anymap datatype. Client requirements favoured schema flexibility and minimal developer effort for any future upstream vendor changes. This, coupled with the frequency with which the client wanted to access the underlying mortgage data, opened the door for anymap as a viable solution.

Query latency was not the number one priority, and while the case study found that anymap query performance lags behind the alternative schema solutions, the speed of retrieval of the underlying loans still met the needs of the client.

With the data uncompressed, anymap's storage footprint is smaller than the alternatives we've discussed, but once compressed the three have relatively similar storage costs. The alternative options compress better as they are populated with many nulls and repeating data. However, anymap triumphs here due to its flexibility and scalability.

About the Authors:

This blog was written collaboratively by the Version 1 kdb+ team.
