Schur Complementary Portfolios — A Unification of Machine Learning and Optimization-Based Allocation

Published in Geek Culture · 19 min read · Nov 21, 2022

This post considers both “modern” and “machine learning” portfolio approaches, and I show that they are sometimes the same. This viewpoint yields new, principled top-down allocation methodologies including a promising new approach inspired by Hierarchical Risk Parity.

Schur complementary portfolios use B to alter A and D used in the recursive step

Goals

I summarize Hierarchical Risk Parity (HRP) developed by Marcos Lopez de Prado in 2016.
I introduce a new family of recursive allocation methodologies that are HRP-inspired (use reordering) but are characterized by a financially motivated augmentation of covariance sub-matrices.

I provide code for those looking to experiment on a partial continuum that starts at HRP and stretches back towards optimization — and evidence that this exploration can be economically significant at an institutional scale.

Along the way, I mention a financial interpretation of the solution of linear systems of equations as minimum variance portfolios. This is intended to help the reader design variations on my theme.

I emphasize that as a pragmatic matter, this is not a complicated “fix” for top-down allocation at all — one merely needs to modify a couple of matrices.

Ideas presented for portfolio construction are also relevant to model ensembles (the connection was noted in Optimizing a Portfolio of Models).

Hierarchical Risk Parity

In this video introducing Hierarchical Risk Parity (HRP) the method is described as “one of the biggest breakthroughs in applying Machine Learning portfolio optimization”. It helped garner Lopez de Prado a Quant of the Year Award (or two). The video provides a better explanation than I can manage in this format, and it focuses on things I will leave out.

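Let us merely say that after a reordering of assets, a covariance matrix can be written in block form, where A and D are the covariance matrices of the two groups and B holds the covariances between them:

$$\Sigma = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix}$$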

Block covariance matrix for assets split into two groups, usually after a seriation of some kind

I skip over plenty with the phrase “reordering of assets”. Those interested in the particular choices of asset reordering are referred to the video or the original paper by Lopez de Prado titled Building Diversified Portfolios that Outperform Out of Sample (SSRN).

I am brief on the matter of reordering because my contribution is orthogonal (there might be half a pun there, we’ll see) and because there are simply too many ways to permute items in a list so that neighbors in the new ordering are more like each other than those further away.

This is termed seriation in statistics and the history goes back at least as far as archeologists inferring the ordering of rock strata. There are references at the end of this post to papers that suggest variations on the theme, such as the seriation survey by Liiv and the graphical survey by Marti et al.

However it is done, the reordering is critical because in top-down approaches one proceeds recursively to define portfolio weights w in a lossy manner:

$$w \propto \begin{pmatrix} \frac{1}{\nu(A)}\, w(A) \\ \frac{1}{\nu(D)}\, w(D) \end{pmatrix}$$

Hierarchical Portfolio Allocation using a scalar metric nu( ) for inter-group allocation

I say lossy because there is something conspicuously absent in the top-down formula: the matrix B has disappeared. That discarding of information is ameliorated by the reordering of assets, but not eliminated — a matter I shall return to as it is the raison d’etre for the new method I propose.

Here the denominators are scalar measures of inverse diversity (speaking loosely) and the w(A) and w(D) are vectors summing to unity. In words, we shall decide how much to split our investment between the first group and the second using some measure of fitness — such as inverse portfolio variance — and then allocate within each group.

The inter-group allocation 1/nu(A) : 1/nu(D) might employ the variance of some theoretical sub-portfolio that is not too inconvenient to compute (such as one that ignores off-diagonal covariance entries). It doesn’t have to be the same as the variance that is eventually arrived at recursively. There are variations, but we’ll set this mostly aside because, in this note, you’ll see that I am less interested in the choice of the metric nu() than I am in the matrix arguments A and D that get passed to it.

We can also decide that w() will terminate when the dimension is small, using some other portfolio technique, rather than continuing to split forever.
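The recursion just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the HRP reference implementation: the assets are taken to be already reordered, the split is a plain bisection, and both nu() and the leaf portfolio are my choice of the inverse-variance (diagonal) portfolio. The function names are hypothetical.

```python
import numpy as np

def inverse_variance_weights(cov):
    """Weights proportional to 1/variance, summing to one (ignores off-diagonals)."""
    iv = 1.0 / np.diag(cov)
    return iv / iv.sum()

def nu(cov):
    """Scalar 'unfitness' of a group: variance of its inverse-variance portfolio."""
    w = inverse_variance_weights(cov)
    return float(w @ cov @ w)

def top_down_weights(cov, min_size=1):
    """Recursive bisection on an already-reordered covariance matrix."""
    n = cov.shape[0]
    if n <= min_size:
        # Terminate with some other portfolio technique once the group is small
        return inverse_variance_weights(cov)
    k = n // 2
    A, D = cov[:k, :k], cov[k:, k:]          # note: B = cov[:k, k:] is never used
    w_A = top_down_weights(A, min_size)
    w_D = top_down_weights(D, min_size)
    a_A, a_D = 1.0 / nu(A), 1.0 / nu(D)      # inter-group split 1/nu(A) : 1/nu(D)
    return np.concatenate([a_A * w_A, a_D * w_D]) / (a_A + a_D)

# Two assets: each is its own group, so the split is inverse to the variances
two_asset_cov = np.array([[4.0, 1.0],
                          [1.0, 1.0]])
w_two = top_down_weights(two_asset_cov)
```

The off-diagonal block never enters the computation, which is the lossy part of the scheme.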

Motivating Hierarchical Risk Parity

A banal mathematical observation might advance the discussion. Notice that in the case of only two assets, the matrices A and D are scalars, and the sub-allocations w(A)=w(D)=[1] must be trivial. Each asset is its own group and we allocate between the two inversely proportional to the measure of unfitness nu(), which, again for concreteness, can be the variance of the respective assets.

Let’s contrast with optimization. Recall that if we seek a minimum variance long-short portfolio with weights summing to unity, then it is well appreciated that the portfolio will be proportional to the inverse of the covariance matrix applied to a vector of ones:

$$w \propto \Sigma^{-1} \mathbf{1}$$

If the covariance is diagonal, this is merely an extension of the n=2 case. Of course, it generally isn’t diagonal, but if a change of basis existed to make it so, we could work in that other basis instead, and again, the allocation would be almost as simple as the n=2 example. Then, naturally, we would need to translate back so we know what to allocate to each asset.
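As a small sketch of that claim, the unconstrained-sign minimum variance portfolio can be computed by solving one linear system and normalizing. The helper name and the numbers are illustrative assumptions; in the diagonal case the result reduces to inverse-variance weighting, as the n=2 discussion suggests.

```python
import numpy as np

def min_variance_weights(cov):
    """Long-short minimum variance weights summing to one: w proportional to cov^{-1} 1."""
    x = np.linalg.solve(cov, np.ones(cov.shape[0]))   # avoids forming the inverse explicitly
    return x / x.sum()

# Diagonal covariance: weights are proportional to 1/variance
diag_cov = np.diag([0.04, 0.01, 0.02])
w_diag = min_variance_weights(diag_cov)
```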

That’s standard fare, and so are the difficulties. Any time one attempts an inversion of a matrix there’s a chance the solution is unstable. The global covariance matrix is always noisy, and usually rank-deficient. That change of basis I speak of is a mirage.

I personally like the description of HRP as a quasi-diagonalization, as that emphasizes the counterpoint: diagonalization itself. In that comparison, we know that there are numerous problems with diagonalization (or taking the inverse of a matrix by whatever means) and we don’t know what the covariance matrix really is, to begin with.

An example makes the point. Take:

then the minimum variance long-short portfolio with weights summing to unity is

However, if we make a tiny modification to our covariance estimate, by multiplying the off-diagonal entries by 0.97, then the optimal portfolio thus defined changes dramatically:

No serious practitioner would want to lean heavily on the former portfolio, for this reason.
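The sensitivity is easy to reproduce. The sketch below uses a hypothetical three-asset correlation matrix of my own choosing (not necessarily the one in the original example) and shrinks the off-diagonal entries by the same factor of 0.97.

```python
import numpy as np

def min_variance_weights(cov):
    """Long-short minimum variance weights summing to one."""
    x = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return x / x.sum()

# Hypothetical covariance estimate (positive definite, but not far from singular)
cov = np.array([[1.0, 0.9, 0.7],
                [0.9, 1.0, 0.4],
                [0.7, 0.4, 1.0]])

# Multiply only the off-diagonal entries by 0.97
cov_perturbed = 0.97 * cov + 0.03 * np.diag(np.diag(cov))

w_before = min_variance_weights(cov)
w_after = min_variance_weights(cov_perturbed)
largest_change = np.max(np.abs(w_after - w_before))
```

With this particular matrix the first weight moves from -1.5 to roughly -0.74, a large shift for a three percent change in the off-diagonal estimates.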

Resorting to top-down allocation, if I may use the term “resorting”, isn’t the only way out of this Modern Portfolio Theory jam. I have compiled a partial list of papers that come at this problem from different angles here, and the usual suspects involve shrinkage, bootstrapping, robust estimators (e.g. Huber means), and other modifications of naively estimated covariance or portfolio optimization. Little of this will overly surprise the statistical reader.

It is striking that HRP side-steps this instability altogether through top-down allocation. In the original paper (SSRN) the author presents evidence that the approach will outperform optimization out of sample — although that optimal portfolio was very concentrated. Readers who are skeptical of whether this outperformance will survive a more involved set of comparisons are welcome to have at it. I maintain a long list of “managers” that might compete with HRP, but that is not the focus of this post.

Interpreting HRP as a Basis Limitation

To stay on point I would rather turn to the question of the interpretation of HRP and its inherent motivation. There are, no doubt, many ways to come at this. But on the topic of seriation, here is one possible nesting:

“let’s try to diagonalize using only a very restricted class of matrices”

Quasi-diagonalization (Figure from Hudson and Thames)

In HRP (Post-Modern Portfolio Theory?) we are subject to a constraint on the set of ways we can change the basis. Instead of the usual set of matrices used to scale and rotate we limit ourselves to zeros and ones (and then to permutation matrices). This self-imposed limitation can prevent us from exploring too far into dubious regions of the portfolio space. And there is a very clear message in the original paper by Ledoit and Wolf: Honey I Shrunk the Sample Covariance Matrix (link).

Parade de cirque with detail — Seurat (Wikipedia)

Portfolio allocation is art, and in some ways, I am defending this high-dimensional pointillism — my experiments seem to provide an upper bound on the loss from this chunky pseudo-diagonalization, a little like when you stand back from a Seurat. When you view the output of an HRP optimization you might also be reasonably pleased — especially if your benchmark is an overfitted optimization where five assets get all the weight, say.

Don’t worry, you can still have pretty plots and diverse allocation if you use Schur Complementary Portfolios instead of Hierarchical Risk Parity. Picture by Munich Re.

Yet one might reasonably ask, given my framing, whether this particular choice of restricted diagonalization is well-motivated. I don’t have an answer but I can show you something that gives you a bit more artistic license.

Conditional or Unconditional Covariance?

There is something else to notice that is closer to my theme and easier to couch in economic terms.

Notice that in HRP we carry to the next step in the recursive allocation the original covariance sub-matrices A and D. We use — just to emphasize this fact that is right under our noses — covariance numbers that are exactly the restrictions of the global covariance matrix (after permutation, of course). Does that accord with your financial intuition? Are you absolutely sure?

Let me prod a little. Suppose you were told in advance of the returns of half of the stocks in your portfolio. Suppose you cannot change the allocation to assets whose returns are now known, but you are as yet free to modify the sub-allocation within the assets whose outcomes haven’t been revealed.

Presumably, this foreknowledge would change your analysis. In particular, you might be tempted to calculate the conditional covariance of the stocks whose relative allocation you are yet to decide on (like so). Well, it is an accident of nature (or at least of its Gaussian approximation) that the conditional covariance does not depend on the revealed returns, so it is known in advance. Where does that leave you?
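For a multivariate normal, the conditional covariance in question is a Schur complement of the block covariance matrix. Here is a minimal sketch, assuming the first k assets are the ones whose returns have been revealed; the function name and the simulated data are illustrative only.

```python
import numpy as np

def conditional_covariance(cov, k):
    """Covariance of assets k..n-1 given the returns of assets 0..k-1 (Gaussian case).

    This is D - B^T A^{-1} B, and the revealed returns themselves do not appear.
    """
    A, B, D = cov[:k, :k], cov[:k, k:], cov[k:, k:]
    return D - B.T @ np.linalg.solve(A, B)

rng = np.random.default_rng(0)
returns = rng.normal(size=(200, 5))
cov = np.cov(returns, rowvar=False)
cond = conditional_covariance(cov, k=2)
```

A useful sanity check is that the inverse of this matrix equals the lower-right block of the full precision matrix.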

A New Class of Top-Down Portfolio Techniques

That last contemplation on conditional covariance could motivate a new style of top-down allocation, could it not?

I mean it can motivate you, the reader, because it is too late for me: I only noticed it afterward. My path was different and had more to do with general mathematical discomfort and a penchant for browsing linear algebra obscurities late at night (not that obscure actually).

It is my view that there is a lack of clear motivation for the top-down allocation division step (a general comment, not just for HRP), though I do think HRP is a clever, important contribution regardless.

I wish to break the mold established by Lopez de Prado and other top-down schemes by revisiting the tacit part of the recursive allocation and changing the covariance matrices that are passed to the next step — and I will try to persuade you of the inherent logic of doing so.

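Consider this more general kind of divide-and-conquer approach:

$$w \propto \begin{pmatrix} \frac{1}{\nu(A')}\, w(A'') \\ \frac{1}{\nu(D')}\, w(D'') \end{pmatrix}$$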

Hierarchical allocation when matrices A and D are augmented

where the original sub-covariance matrices A and D are swapped out for modified versions A’, A’’, D’, and D’’ that I will come to. The idea is that we can sneak some information from B inside these, with a view to improving the final result.

Moreover, I’m going to show that this can move us “in the direction of optimization” while still staying in the top-down world. I use scare quotes because this is quite different to moving in the portfolio space, say with a convex combination of HRP and Markowitz — something that might not always improve on HRP if the latter is overfitted.

We are moving in method space instead, staying within the top-down class. And that way we stand a chance of surviving in very high dimensions — just as the original HRP approach does. You might suppose that you have 5000 assets and only sixty months of historical data to estimate a covariance matrix, for instance.

So, continuing to defer motivation, I posit a choice for A’’:

where the fraction indicates pointwise division. The numerator gives rise to my choice of the method name in the title of this post, and we recognize it as a Schur complement.

I remind the reader that Schur complements arise when we condition a multivariate normal distribution on partial evidence and also in the block-wise inversion of matrices. Scanning halfway down the Wikipedia page for invertible matrices, you will find this unattributed equation:

That identity shall guide us but lest this gets too abstract too quickly, let me give an example.
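For readers who prefer to check such things numerically, here is a sketch of the block inversion identity as I read it, specialized to a symmetric matrix: the inverse equals a block-diagonal matrix of inverted Schur complements times a matrix with identity blocks on the diagonal and -B D^{-1} and -B^T A^{-1} off the diagonal. The exact form shown in the original equation is an assumption on my part.

```python
import numpy as np

def block_inverse(A, B, D):
    """Inverse of [[A, B], [B^T, D]] assembled from Schur complements."""
    nA, nD = A.shape[0], D.shape[0]
    A_c = A - B @ np.linalg.solve(D, B.T)     # Schur complement of D
    D_c = D - B.T @ np.linalg.solve(A, B)     # Schur complement of A
    left = np.block([[np.linalg.inv(A_c), np.zeros((nA, nD))],
                     [np.zeros((nD, nA)), np.linalg.inv(D_c)]])
    right = np.block([[np.eye(nA), -B @ np.linalg.inv(D)],
                      [-B.T @ np.linalg.inv(A), np.eye(nD)]])
    return left @ right

rng = np.random.default_rng(1)
cov = np.cov(rng.normal(size=(300, 6)), rowvar=False)
k = 2
inv_from_blocks = block_inverse(cov[:k, :k], cov[:k, k:], cov[k:, k:])
```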

An Example of “Fixing” a Top-Down Allocation

You’ll notice that the strategy I’m adopting is a consideration of the minimum variance portfolio and the awkwardness of its reconstruction using top-down allocation. That’s not really so different to the motivation for other top-down schemes (note the role of inverse variance) and it does not require one to be wedded to a belief that minimum variance portfolios are best.

With that out of the way, let’s take

which is as trivial as you can get. No seriation helps here and in any top-down bisection we have an unpleasant split:

If you follow the recipe, then, unfortunately (exercise for the reader) we end up with a portfolio w that is not symmetric.
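To make the exercise concrete, here is a sketch with a hypothetical equicorrelated three-asset matrix (my choice for illustration; the matrix in the original figure may differ). The three assets are interchangeable, so any sensible answer should be symmetric, yet a one-versus-two split with inverse-variance nu() is not.

```python
import numpy as np

rho = 0.5
cov = np.full((3, 3), rho) + (1 - rho) * np.eye(3)   # unit variances, correlation rho

# Top-down split: first asset alone, remaining two as a group
A, D = cov[:1, :1], cov[1:, 1:]
w_A = np.array([1.0])
w_D = np.array([0.5, 0.5])                 # the two-asset group is itself symmetric
nu_A = float(w_A @ A @ w_A)                # variance of the first group
nu_D = float(w_D @ D @ w_D)                # variance of the second group
w_top_down = np.concatenate([w_A / nu_A, w_D / nu_D])
w_top_down /= w_top_down.sum()

# Minimum variance portfolio for comparison
w_min_var = np.linalg.solve(cov, np.ones(3))
w_min_var /= w_min_var.sum()
```

Here the top-down weights come out as 3/7, 2/7, 2/7, while the minimum variance portfolio is equally weighted.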

However, I’ve hinted at how to “fix” this. Leaning on that matrix inversion identity we have

where I have denoted Schur complements such as:

Schur complement inverse example

Continuing the algebra we are able to break the symmetry yet recombine it all back into a symmetric portfolio successfully, viz:

Now again, I’m not saying minimum variance is best. I’m saying that it plays the role of a suggestion — just as it does in HRP. You see above that the minimum variance portfolio has a real “split” representation, not a forced imagined one as it does in top-down allocations. Don’t see it? Collapse:

This, I hope, starts to motivate a choice of A’s via b’s that is not just the obvious A’=A’’=A and D’=D’’=D. Still, I need to do a little more work to interpret the right-hand side in a manner that suggests a canonical top-down allocation scheme that you are used to. We need to “read” the right-hand side financially.

A Financial Interpretation of Linear Equations

To that end, a tiny aside. Suppose we define for any covariance matrix Q the weights

This, by the way, finds interpretation as a minimum variance portfolio but not the one we first considered. It solves

and you will note that the constraint uses b, not a vector of ones. With respect to Q, this portfolio has some variance nu(Q) and after computing that and rearranging we notice:

Financial interpretation of the solution of a linear system of equations

In words: the solution of Qx=b can be interpreted as the minimum variance portfolio, normalized by portfolio variance.
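This reading of a linear system can be checked directly. In the sketch below, which uses made-up numbers and a hypothetical helper name, the portfolio minimizing w'Qw subject to b'w = 1 is computed along with its variance, and dividing the former by the latter recovers the solution of Qx = b.

```python
import numpy as np

def min_var_with_constraint(Q, b):
    """Minimize w'Qw subject to b'w = 1. Returns the weights and their variance."""
    x = np.linalg.solve(Q, b)
    w = x / (b @ x)                 # rescale so that the constraint b'w = 1 holds
    return w, float(w @ Q @ w)

Q = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])
b = np.array([1.0, 0.5, 2.0])

w, variance = min_var_with_constraint(Q, b)
x = w / variance                    # portfolio normalized by its own variance
```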

How to “Read” Schur Hierarchical Schemes

I now propose a heuristic for the design of top-down portfolio allocation methodologies. The idea is to use the matrix identity given and also the financial interpretation of linear systems of equations. Let me illustrate, and in doing so arrive at a suggestion for matrices A’, A’’, D’, and D’’ (the A’s and D’s are symmetrical btw). Stare at that right-hand side:

Minimum variance portfolio in a segregated format
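A sketch of the segregated form may help. Applying the block inversion identity to a vector of ones splits the minimum variance portfolio into two independent linear systems, one per group, each with a Schur complement on the left and a modified vector b on the right. The symbols A_c, D_c, b_A, and b_D are my labels and may not match the original notation.

```python
import numpy as np

def segregated_min_variance(cov, k):
    """Minimum variance weights computed group by group via Schur complements."""
    A, B, D = cov[:k, :k], cov[:k, k:], cov[k:, k:]
    ones_A, ones_D = np.ones(k), np.ones(cov.shape[0] - k)
    A_c = A - B @ np.linalg.solve(D, B.T)          # Schur complement seen by group one
    D_c = D - B.T @ np.linalg.solve(A, B)          # Schur complement seen by group two
    b_A = ones_A - B @ np.linalg.solve(D, ones_D)  # right-hand side for group one
    b_D = ones_D - B.T @ np.linalg.solve(A, ones_A)
    x = np.concatenate([np.linalg.solve(A_c, b_A), np.linalg.solve(D_c, b_D)])
    return x / x.sum()

rho = 0.5
cov = np.full((3, 3), rho) + (1 - rho) * np.eye(3)  # hypothetical equicorrelated example
w_split = segregated_min_variance(cov, k=1)
```

For this equicorrelated matrix the split representation returns equal weights, so the symmetry lost by the naive bisection is restored.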

Now swap in the financial interpretation. This yields a minimum variance portfolio with different constraints, and that might not always be the most suggestive — it depends on your portfolio libraries and their APIs, to some extent.

We can go a little further and make a change of coordinates so it looks like we are allocating in such a way that weights sum to unity. I claim this leads to:

where this time w() is a portfolio with weights summing to unity (that makes it easy to modify in your mind’s eye for something else). There are your A’, A’’, D’, and D’’ too, as promised. All you have to do is follow the usual seriate-split-allocate-repeat cycle, but make a modification to the arguments you pass in.

There’s a new operator modifying the covariance A’ that is used to compute the portfolio variance here. My notation is:

An operator interpreted as “pointwise multiplication in the precision domain”

Maybe a reader can suggest a better or existing name for “multiplication in the precision domain”. I’ll spare you the commutative diagram but this arises because we need to transform from one minimum variance portfolio to another — by change of variable. The relevant observation is the following:

That is why we arrive at

for the recursive step and thereby, a top-down scheme. It might be represented schematically:

Schematic for covariance-augmented top-down allocation. Some information from B carries down.

A Partial Continuum

This is where you say “Schurly, you’re joking?”. Because if we recover the solution to the optimization then we must also recover its weaknesses, right?

Yes, absolutely, but usually we never make it there. As a practical matter, the augmented matrices can cause issues for your w(A’’) or w(D’’). The details will rather depend on the assumptions made by software packages or the explicit checks they perform, such as positive definiteness.

Also, we don’t want all the Schur contributions anyway so I slip parameters into the b’s and the Schur complements to finesse this issue, and provide a partial continuum. The idea is to allow you to decide how much information from B to carry through. So the actual scheme is more like the following:

Schur Portfolio construction differs from other top-down allocation schemes such as Hierarchical Risk Parity (HRP) because matrices A and D are modified in the recursive step.

Here

and

to give us some flexibility to travel “in the direction of optimization”, as I put it before. The rightmost column of the table below indicates this traversal, subject to the caveat about not getting there.
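To experiment with the idea of a dial between the two extremes, here is a single-split sketch. The parameterization is an assumption on my part: gamma scales the Schur correction to A and D, and lambda scales the correction to the vectors b. It is not the implementation in any package, and each block is solved directly, so the leaf portfolios are minimum variance rather than HRP's choice.

```python
import numpy as np

def partial_schur_weights(cov, k, gamma=1.0, lam=1.0):
    """One split of a Schur-style allocation that carries part of B through.

    Assumed parameterization (illustrative only):
      A(gamma) = A - gamma * B D^{-1} B^T,   b_A(lam) = 1 - lam * B D^{-1} 1
    and symmetrically for D.
    """
    A, B, D = cov[:k, :k], cov[:k, k:], cov[k:, k:]
    ones_A, ones_D = np.ones(k), np.ones(cov.shape[0] - k)
    A_g = A - gamma * B @ np.linalg.solve(D, B.T)
    D_g = D - gamma * B.T @ np.linalg.solve(A, B)
    b_A = ones_A - lam * B @ np.linalg.solve(D, ones_D)
    b_D = ones_D - lam * B.T @ np.linalg.solve(A, ones_A)
    x = np.concatenate([np.linalg.solve(A_g, b_A), np.linalg.solve(D_g, b_D)])
    return x / x.sum()

rng = np.random.default_rng(3)
cov = np.cov(rng.normal(size=(300, 6)), rowvar=False)

w_no_schur = partial_schur_weights(cov, k=3, gamma=0.0, lam=0.0)   # B ignored entirely
w_halfway = partial_schur_weights(cov, k=3, gamma=0.5, lam=0.5)
w_full = partial_schur_weights(cov, k=3, gamma=1.0, lam=1.0)       # minimum variance
```

At gamma = lambda = 0 the blocks A and D are passed down unchanged, and at gamma = lambda = 1 the global minimum variance portfolio is recovered, weaknesses included.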

Extremal points on a continuum of portfolio allocation methods spanning both hierarchical (recursive) and optimization philosophies. Corresponding parameters for Schur complementary portfolios are shown that can, modulo technicalities, replicate the output. This assumes seriation has been used to reorder assets, an important ingredient in hierarchical risk parity.

As a finer point, what I generally like to do is use the suitability of the augmented matrices as a back-off mechanism determining a maximal choice of gamma, lambda, or both (I usually equate them). So in most of my code, the interpretation of these parameters is a fraction of a maximally allowed “Schurness”.

Does It Work?

I schurlishly suggest my method is more beautiful than HRP and therefore is almost certain to work better. To be serious I really would prefer that the reader decides for themselves. You can play with other ideas and parameterizations as well.

It does appear to me that ironing out some of the mild mathematical inelegance in HRP has more than aesthetic motivation — at least if you are an institutional money manager or someone investing at scale.

But it will depend on your choices of leaf portfolio and the metric nu(), including whatever shrinkage is employed. In my own experiments, I focused on the use of a defensive but “weak” style of shrinkage. There, Schur augmentation in place of the default is probably good for about 5–20 basis points.

It is a small lunch, but probably pretty much a free one, compared to other approaches that try to combat the curse of high dimensions with much more radical departures from HRP (such as direct global optimization). Here I have tried not to throw the HRP baby out with the bathwater.

Three different examples of relative portfolio variance as the gamma parameter (x-axis) is varied from gamma=0 (traditional top-down allocation) to gamma=1 (using information from the Schur complement to the extent that A’ remains positive definite). Numbers are normalized by the portfolio variance obtained using a hierarchical scheme where A’=A’’=A. In this example p=500 assets are used with a true covariance matrix assumed to be the empirical covariance of a=50 samples from a symmetric model with constant off-diagonal correlation rho=0.35. Then o=60 observations are used to estimate the covariance matrix used for top-down allocation. The approximate economic benefit is on the order of 9bps of additional return per year.

In passing, I trust the previous discussion also makes it clear that my method is also ad-hoc — sounds better to say “in the tinkering machine-learning tradition”. I do not fully understand it by any means, and it might also be construed as under-motivated — the same criticism of HRP that led me to take a closer look in the first place.

Another Example

Hierarchical Risk Parity is not the only ad-hoc allocation method avoiding matrix inversion. Other types of grouping from the bottom up are used too, so let’s look at how Schur can also “fix” or improve another example. Consider:

where

and we don’t actually care about the normalizing coefficient (same as before). The minimum variance portfolio is

I choose this example because “indexing” might fail. By that, I refer to the idea of treating the first two assets as one asset and the second two as another — then using the inter-group covariance matrix to inter-allocate. This kind of “indexing” might be regarded as a crude multi-scale approximation. This is an old topic in applied mathematics. (I wrote a note here on “homogenization”, for those who are interested, to illustrate that more careful multi-scale methods can yield insight).

As an aside, in this particular asset allocation example, you might be better served by David Disatnik’s paper where it is argued that placing restrictions on the covariance matrix (that ensure long-only portfolios) can provide enough stricture to beat the 1/n benchmark.

A prespecified block-covariance structure provided by David Disatnik. In contrast, top-down allocation using Schur complements starts with any covariance matrix — but the connection to block results is interesting.

Not everyone wants to work with covariance restrictions, however, especially as the dimension increases. I want to show you that the Schur approach and the motivating inversion identity provide a different way to be “inspired by minimum variance” but at the same time remain leery of the pitfalls of direct optimization.

To work we go again:

Thus:

and:

as before, and

for the other Schur complement. So

and now finally we can start to see why the Schur top-down method might alleviate the awkwardness of “indexing” allocation, because:

That’s a split representation of the global portfolio and gives me a chance to explain a slightly different perspective. If we can find invertible matrices

and

(you might throw in a shrinkage parameter as I did earlier) this allows us to collapse the right-hand side.

Now that we are practiced in the art, we recognize this as a top-down allocation scheme with many obvious generalizations. You can swap in and out different portfolio methods, as before. By the way, we can write down those postulated matrices in this example. They are

and

Nuts and Bolts and Python

Have fun with this. If you would like to try this out without modification it is pretty trivial. In fact, you can even compute Schur-complementary portfolios with one line of code. I recently added some documentation, although there isn’t much to it:

Note also that the precise package also includes covariance forecasting techniques that are absolutely certain to put you in the top decile of the M6 forecasting contest (based on a naive frequentist analysis). You can mix and match.

Since the change I propose is the modification of the A and D matrices, we might even be able to lean on other friends in the open-source portfolio community to provide other implementations of Schur portfolios, if my package is not your style, that is.

References and Further Reading

Journal submission rules prevent me from posting a paper, but you can contact me should you be interested.

As far as related reading goes, here are some papers very close to the topic. Some directly suggest modifications to HRP, such as the choice of seriation, shrinkage, nu, or w, and thus can also be interpreted as suggestions on how to reduce the general idea I have presented to practice. As noted, Lopez de Prado’s paper is on SSRN.

On my to-add list is Adaptive Seriational Risk Parity and Other Extensions for Heuristic Portfolio Construction Using Machine Learning and Graph Theory by Peter Schwendner, Jochen Papenbrock, Markus Jaeger, and Stephan Krügel (link) which can also be used with Schur of course. There’s more in the blogosphere, such as Rafael Nicholas Fermin Cota’s note on the shortfalls of HRP.

There is a wider literature of course. There are many, many more papers dealing with the problem that HRP sets out to solve by different means. I would appreciate pull requests to my little cache of interesting papers on robust portfolio construction — which is far from comprehensive.

About Me

This work is supported by Intech Investments and I’ve benefited from conversations with Adrian Banner and Jose Marques in particular. This post is a bit of an exception as I write mostly about the non-financial aspects of my work (the microprediction project). As of recently, I’m also the author of a book.