To vectorise or not to vectorise?

In this article we discuss the effect of vectorising the input data to ActivePivot and explore the impact on memory usage, load times and query times.

One of the strengths and benefits of ActivePivot over the years has been its ability to load, store, aggregate and perform calculations on vectors of data. The most common example of this on projects is for computing VaR-like measures. Typically, ActivePivot would load a P&L vector (e.g. containing 250 doubles for one year of historical returns/simulations), aggregate this to the required level and perform the VaR or other calculations. We have had clients doing this for years, and we do it very efficiently and quickly.

What do we mean by a vector? In programming terms we are talking about an array of values: for example, [100, 101, 102.5, 99.5, …, 102] is what we would refer to as a vector of doubles. Vector sizes vary widely: a few tens of values, 260 for the ES calculations in FRTB IMA, tens of thousands for PFE Credit Risk use cases, and even millions under IMA DRC. ActivePivot efficiently stores vectors in direct memory, not on the heap, and its API provides highly optimised methods that can be used to calculate measures such as VaR. Other technologies, such as relational and NoSQL databases, are not designed or optimised to store arrays or vectors of numbers. Instead, they store a single value per row, which leads to the issues we often see when this pattern is replicated in ActivePivot.
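
To make this concrete, here is a small, self-contained Java illustration, not ActivePivot code, of a P&L vector and the kind of calculation typically run over it; the values and the naive 99% historical VaR are purely illustrative.

// A P&L vector: one double per scenario (values are purely illustrative)
double[] pnl = { 100, 101, 102.5, 99.5, 102 };

// Summing the vector gives the aggregated P&L
double total = java.util.Arrays.stream(pnl).sum();

// A naive historical VaR: sort the scenarios and read off the worst 1% quantile
double[] sorted = pnl.clone();
java.util.Arrays.sort(sorted);
double var99 = -sorted[(int) Math.floor(0.01 * sorted.length)];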

Although many projects use and store their data as vectors, many do not, and as a result those projects can run into difficulties with memory usage, load times and query times. So, in order to promote the benefits of vectorisation, we decided to run an experiment based on a use case similar to a recent project, and also similar to what we are currently building into the Market Risk Accelerator for sensitivity analysis.

In this toy example, we want to calculate and sum our Delta sensitivities across various time buckets. We assume a simple model: each trade has 100 buckets (from 1D to 100Y), and thus 100 sensitivities. To calculate the total exposure, our Delta.SUM for each trade is the sum of all 100 buckets. So, in our non-vectorised example, we have one row per bucket, i.e. 100 rows of sensitivities per trade, e.g.:

In order to vectorise the data, we run a simple ETL job which produces the following output:

What was 100 rows is now a single row. We have reduced the number of rows, or “facts”, in our datastore by a factor of 100.
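
The ETL step itself does not need to be sophisticated. As a rough sketch of the idea in plain Java, not the actual project code, the job simply groups the per-bucket rows by trade and collapses their Delta values into one array; the SensiRow shape and field names are assumptions made for this example.

// Illustrative input shape: one row per (trade, bucket) carrying a single Delta value.
// Requires java.util.* and java.util.stream.Collectors.
record SensiRow(String tradeId, String bucket, double delta) {}

// Output: one entry per trade with all of its Delta values collapsed into a vector.
static Map<String, double[]> vectorise(List<SensiRow> rows) {
    return rows.stream().collect(Collectors.groupingBy(
        SensiRow::tradeId,
        LinkedHashMap::new,
        Collectors.collectingAndThen(
            Collectors.toList(),
            perTrade -> perTrade.stream().mapToDouble(SensiRow::delta).toArray())));
}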

Note: on a real ActivePivot project, not just a toy example, we would look to move the buckets to a separate store.

What about changing the store config? Well that is very easy:

.withField(DELTA_VALUE, DOUBLE)

Becomes

.withVectorField(DELTA_VALUES, DOUBLE)
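
For context, here is a minimal sketch of the surrounding store description, assuming the ActivePivot 5.x StartBuilding builder API; the store name and the extra fields are illustrative, not the actual project configuration.

StartBuilding.store()
    .withStoreName("Sensitivities")         // illustrative store name
    .withField("TradeId", STRING).asKeyField()
    .withField("Currency", STRING)
    .withVectorField(DELTA_VALUES, DOUBLE)  // the vectorised Delta field
    .build();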

What about the measures we have in the cube? Previously, I was calculating my Delta.SUM measure using CoPPeR by doing:

protected static Dataset sensiMeasures(final BuildingContext context) {
    return context.withFormatter(DOUBLE_FORMATTER).createDatasetFromFacts()
        .agg(
            sum(DELTA_VALUE).as(DELTA_SUM)
        ).publish();
}

As a vector, the calculation changes slightly: we first aggregate the Delta vectors element-wise, then collapse the aggregated vector into a scalar Delta.SUM,

protected static Dataset sensiMeasuresFromVector(final BuildingContext context) {
    return context
        .createDatasetFromFacts()
        // Sum the Delta vectors element-wise across facts
        .agg(sum(DELTA_VALUES).as(DELTA_VALUES_VECTOR))
        // Collapse the aggregated vector into a single scalar Delta.SUM
        .withColumn(DELTA_SUM, col(DELTA_VALUES_VECTOR).map((IVector v) -> v.sumDouble()))
        .agg(
            sum(DELTA_VALUES_VECTOR),
            sum(DELTA_SUM).withFormatter(DOUBLE_FORMATTER)
        );
}

Results

They say a picture is worth a thousand words:

  • Load times decreased by a factor of 8
  • Memory usage down by a factor of 6
  • Query times 15x quicker

The reduction in memory and load times can be seen quite clearly and is expected. By vectorising the data, the underlying file size dropped by a similar factor to the memory size. It is easy to see why the file size drops. Say a row of data has 50 attributes and 1 numerical field. When vectorising this field into vectors of size 100, you save 99 × 50 repeated attribute values across the rows, while the 100 numbers simply move into a single vector column on the one remaining row. Of course the file then becomes a lot smaller.
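
To put rough numbers on that, purely as a back-of-the-envelope illustration: in the non-vectorised form each trade occupies 100 rows of 50 attributes plus 1 number, i.e. 100 × 51 = 5,100 stored values, whereas the vectorised form occupies 1 row of 50 attributes plus a 100-element vector, i.e. 150 stored values. The exact byte savings depend on the relative sizes of the attributes and the doubles, but the arithmetic shows why the file and memory footprints shrink so markedly.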

If the file size is smaller, there are fewer bytes to transfer from the disk drive to memory, and the loading process will be quicker. Similarly, since the data is smaller, it occupies less space in memory. The secondary effect is that any indices, dictionaries and other internal ActivePivot structures are much smaller, which also improves performance. And, as mentioned earlier, ActivePivot stores vectors efficiently off the heap in direct memory.

When it comes to query times, the main improvement comes from the “reduction” mapping. When aggregating data, you have to read some attributes of a fact to figure out where to aggregate the data. That’s called reduction. Say you’re summing up the Delta values by currency: for each fact, you need to read the value of the currency attribute in the currency column, map it in the results dictionary to find the index in the result column at which the Delta value will be aggregated, and finally perform the sum. Although we still add the same number of values together, you can see from this example that summing the doubles themselves is an almost negligible part of the reduction. With vectors, that per-fact bookkeeping is done once for every 100 values instead of once per value, as sketched below.
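
To illustrate why the per-fact bookkeeping dominates, here is a deliberately simplified, non-ActivePivot Java sketch of that reduction; the data shapes and dictionary are assumptions made for the example.

// Scalar facts: one attribute read and dictionary lookup per individual Delta value.
// Requires java.util.Map.
static double[] reduceScalars(String[] currencies, double[] deltas, Map<String, Integer> dictionary) {
    double[] results = new double[dictionary.size()];
    for (int i = 0; i < deltas.length; i++) {
        int slot = dictionary.get(currencies[i]); // read the attribute and map it
        results[slot] += deltas[i];               // the actual addition is cheap
    }
    return results;
}

// Vector facts: one attribute read and lookup now covers a whole vector of Delta values.
static double[][] reduceVectors(String[] currencies, double[][] deltaVectors,
                                Map<String, Integer> dictionary, int vectorSize) {
    double[][] results = new double[dictionary.size()][vectorSize];
    for (int i = 0; i < deltaVectors.length; i++) {
        int slot = dictionary.get(currencies[i]); // one lookup per fact, not per number
        for (int j = 0; j < vectorSize; j++) {
            results[slot][j] += deltaVectors[i][j];
        }
    }
    return results;
}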

Another improvement could be that, at query time, the CPU may be vectorising some of the vector aggregations itself, making use of the SIMD capabilities of modern processor architectures. This is something which will become even more powerful once Project Panama comes to fruition.
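
As a taste of what explicit SIMD could look like, here is a hedged sketch using the incubating Vector API (jdk.incubator.vector) being developed under Project Panama; the method is illustrative and requires running with --add-modules jdk.incubator.vector.

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sum a Delta vector using explicit SIMD lanes, with a scalar loop for the tail
static double simdSum(double[] deltas) {
    VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
    DoubleVector acc = DoubleVector.zero(species);
    int i = 0;
    for (; i < species.loopBound(deltas.length); i += species.length()) {
        acc = acc.add(DoubleVector.fromArray(species, deltas, i));
    }
    double sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < deltas.length; i++) {
        sum += deltas[i];
    }
    return sum;
}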

Conclusions

The results from vectorising the data speak for themselves. Projects which perform analysis across different time buckets, such as sensitivities, are perfect candidates for this change in data structure. Because ActivePivot is an in-memory database, reducing the amount of data stored in memory means we can store more data: more business dates online for analysis, or new products and desks available in the cube. And as we have seen, query times become shorter, not longer!

If you’re interested in the improvements vectors can bring to your project, as well as other performance optimisation techniques, speak to your account manager about attending our new Performance Optimisation course.