Return Forecasting with Moments, Part 2: Term Elimination and Zero Tolerancing

NTTP
20 min read · Dec 4, 2023


Photo by Carlos Irineu da Costa on Unsplash. This Nixie tube display suggests more significant figures than our model will be able to predict.

Once the Excel enthusiast in you starts to understand our prior article on estimating future returns via prior moments, you can move on to refinements of the model, some of which we describe in this article. For now, we stick to linear models via Excel LINEST, and simplify them so that they fit the data even less well… with an eye towards making them predict better.

This article is part 2 of our linear exploration of return moments in Excel. Part 1 is here:

https://medium.com/@nttp/n-day-ahead-stock-return-forecasting-with-moment-like-predictors-5d73793c4f15

An updated Excel file for this part 2 can be found here, if you would like to follow along in the sheet:

https://github.com/diffent/excel/blob/main/forecasttestRoll9TSLAztolB.xlsm

While software plug-ins exist for linear model refinement in Excel — such as for forward and backward stepwise regression, and perhaps even exhaustive best-subsets regression — we can start with a fairly simple backwards stepwise macro for our LINEST refinement. By a stroke of luck (rare enough if you are an overwhelmed software developer), the LINEST function seems to continue to work just fine if you give it one or more columns of zeros in its X range input. This makes building a backwards stepwise macro fairly easy, without any need to move columns of data around and remove constant columns. Some regression codes are touchy and “puke” (to use a technical term) if you give the same column of X data more than once. But at least for multiple zero-filled columns, LINEST seems to work okay.

The reader can examine our backwards stepwise regression macro code, which is as follows:

Sub backwardsStepwise()
    ' Simple backward stepwise regression driver for LINEST.
    ' Make sure all variables are on to start (all masks set to 1).
    Dim tol As Double, minAllowedT As Double
    Dim themin As Double, testvalue As Double, priorMask As Double
    Dim i As Integer, j As Integer, nmasked As Integer

    Call restoreAllVariables
    Application.Calculate

    tol = 0.001                         ' tolerance for matching abs(T) to the min
    minAllowedT = Range("J18").Value    ' user-set abs(T) cutoff

    nmasked = 0
    For j = 0 To 100 ' ludicrously large; we exit early below

        ' crude way to find the min T score variable
        For i = 0 To 5
            themin = Range("Q21").Offset(0, i).Value    ' min abs(T); same formula in all of Q21:W21
            testvalue = Range("Q15").Offset(0, i).Value ' abs(T) of variable i
            priorMask = Range("Q25").Offset(0, i).Value ' current mask flag (read for debugging)
            If ((themin < minAllowedT) And (Abs(themin - testvalue) < tol)) Then
                Range("Q25").Offset(0, i).Value = 0     ' mask this variable out of the model
                Application.Calculate                   ' re-solve LINEST without it
                nmasked = nmasked + 1
            End If
        Next

        ' no more low-T variables? Any cell of Q21:W21 holds the same min formula.
        themin = Range("Q21").Value
        If (themin >= minAllowedT) Then
            Exit For ' exit the ludicrously large loop
        End If
    Next
End Sub
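The macro calls restoreAllVariables, which we do not list in this article. A minimal sketch of what it presumably does (reset the Q25:V25 masks to 1 and recalculate; the body here is our assumption, not the sheet's exact code):

Sub restoreAllVariables()
    ' turn all six X variable masks back on (the constant in column W is never masked)
    Range("Q25:V25").Value = 1
    Application.Calculate
End Sub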

We set up some formulas in the sheet to make the macro simpler for this particular example, but it may be an interesting exercise sometime to write a “drop-in replacement” for LINEST that does backwards stepwise term removal. Just look up “stepwise regression Excel” on Google — or, dare I say: Bing? [Note 4] — and you will find a bunch of cool stuff related to this. However, some of those (commercial) packages are user-interface-driven regression tools and not drop-in formulas that are macro drive-able (which we need for automated backtesting).

Okay, enough background: Let’s proceed to our formulation of backwards term elimination.

T score review

We set up the T scores in the sheet for each coefficient (discussed in the prior article). These indicate, formally, the “statistical significance” [Note 1] of each variable in the model: the farther each T is from zero, the better… that is, if we intend to keep that variable in the model.

Variable mask flags

Then in Q25:V25, we set up an X variable mask. By convention, these mask cells take on only the values 0 or 1, but nothing in the sheet enforces this. So be careful not to put a 2 or something wacky like that in one of those cells, or you will get confused later. Note that we ignore the constant at the right end (column W) for this term masking procedure. We want to always keep the constant in the model for now (see our prior article for some hints on why; short answer: it is due to us using 1-centered returns in this model).

Manual variable masking allowed

During the backwards stepwise method, the term removal macro always re-sets these masks to 1 before weeding out variables. But if you decide to skip the backwards stepwise approach (how to do this is discussed later in this article), the sheet will use the masks as-is, so you can mask out variables manually if you want, then run a backtest.

Hence, if you later set the T tolerance back to zero intending to keep all terms in the model, take care not to accidentally leave any of these masks set to 0.

Variable mask mods to formulas

Next, we apply the masks to our X variables in columns D to I.

The X variable mask setup for column D follows.

Recall the reversed output situation with LINEST:

https://www.excelforum.com/excel-general/353071-linest-why-did-they-do-that.html

The left-most variable mean (in column D) uses the mask ($V$25) from the right-most variable in the LINEST output (aside from the model constant, which we ignore for now).

=AVERAGE(OFFSET(C2,0,0,$J$3,1))*$V$25

The mask flags are applied to columns E thru I similarly.
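To spell out the reversal (given six X variables in columns D through I, and the constant in the right-most LINEST output column W), the mask-to-column mapping works out to:

X column:   D    E    F    G    H    I
mask cell:  V25  U25  T25  S25  R25  Q25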

Now if we manually clear (set to zero) a mask flag in Q25:V25, we can see that (a) the associated X column gets set to zero, and (b) LINEST still solves.

Figure 1: Example of masking out variables tail9505 and tail7525 manually. Coefficients from LINEST then show as 0, and T scores are encoded to 999 in our formulas so we know to ignore these variables in the backwards stepwise algorithm.

When it does solve, LINEST sets both the coefficient and standard error associated with the zero-masked variable to zero in its outputs. Since the stderr gets set to 0 and our T score is coeff/stderr, we adjust the T score formulas in Q14:W14 to report out 999 instead of the Excel #DIV/0! error when stderr = 0. Why set T to 999 instead of zero in this case? This plays into our term selection methodology for our stepwise regression macro (particularly, the way we choose the next variable to eliminate), which we will discuss shortly.
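For illustration only (the row locations here are hypothetical; check where LINEST’s coefficient and stderr rows actually land in your copy of the sheet), a Q14-style T score formula could read:

=IF(Q13 = 0, 999, Q12 / Q13)

where row 12 would hold the coefficients and row 13 the standard errors.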

Absolute T scores

Next, we compute abs(T) in Q15:W15 and then compute the min of this range in every cell of Q21:W21 (the same formula in every cell of Q21:W21).
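As a sketch, those formulas could be (ranges per this sheet’s convention; note that the min range includes the constant’s abs(T) in W15, even though the macro only ever masks Q through V):

Q15: =ABS(Q14)           copied across through W15
Q21: =MIN($Q$15:$W$15)   same formula in every cell of Q21:W21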

Now we have the pre-computed results that we need for our backwards stepwise macro.

Backwards stepwise regression macro

All this macro does is:

A) cycles through the X variables,

B) finds the variable with the lowest abs(T) score (this will be the least statistically significant variable at that step in the procedure, the variable whose coefficient has the most uncertainty or “slop” associated with it) [Note 2],

C) sets the associated mask flag to zero (to get rid of that variable from the regression),

D) recomputes the sheet (and hence, LINEST),

then repeats until all remaining abs(T) scores are at or above a tolerance that the user sets up in J18. If all T scores are above this tolerance to begin with, all variables are kept. If this T tolerance is set to zero, the backtest macro skips the backwards stepwise procedure entirely and leaves the results at the full count X variable regression (or at our pre-masked values, as noted earlier). Setting the T cutoff to zero allows the overall backtest to run faster, since backwards stepwise regression is a bit time consuming in this non-optimized Excel sheet and macro.

Step B (above) is where we need the previously eliminated variables set to a high T (999 in our example) so that we don’t find the same variable again on the next pass. Had we instead set the T of eliminated variables to 0, we would just keep finding the same variable to eliminate over and over again. Hence, we set it to 999 to “get it out of the way” of our macro logic. Since time immemorial (say, 1957), programmers have used 99, 999, 9999, et cetera as “magic number” codes. Such magic numbers are contraindicated in modern software, but this is just an Excel macro, and we thought that we would throw it in there for old times’ sake.

The last bit (whether to do backwards stepwise or not) is set up in the backtesting macro. At each step of the backtest, the full regression is computed, then the backwards stepwise term elimination is performed, optionally, depending on minAllowedT in J18. Remember that this minAllowedT is referring to abs(T), the distance of T from zero. Sometimes we run fast-and-loose with verbiage [Note 3] in the code.

' this is a sub portion of the backtest macro

' recompute regression and whole sheet
Application.Calculate

minAllowedT = Range("J18").Value

' run backwards stepwise term elimination only if the user set a nonzero cutoff
If (minAllowedT > 0) Then
    Call backwardsStepwise
End If

This backwards stepwise term elimination is done separately for every backtest step, because different X variables (our moments and tail ratios as described in part one of this paper) may be important at different times in the past.

Finally, during the backtest macro stepping, we also record the T scores at each step of the backtest in columns AB to AH for later review.

Figure 2: T scores output during an example backtest (columns AB and “to the right” of it). 999 indicates that the associated variable has been removed from the model at that step. Each row corresponds to one step of the backtest.

At a glance, we can see that a T score of 999 indicates that the associated variable is not used at that step (was eliminated). These T score columns during the backtest were present in the older Excel file associated with our prior related article, but we did not use them in the original study — except to indicate by inspection that a stepwise regressor is probably needed for this model. In the above screenshot, after running our backwards stepwise macro, we now see more solid absolute T scores (most > 2, some between 1 and 2; we must have set a T cutoff of 1 for this run in Figure 2). Remember, all |T| > 2 means (to pick an easy example) is that the standard error of a coefficient is less than half of the coefficient’s magnitude. The smaller the stderr, the larger the absolute T score. There is no magic involved. Well… maybe internally in LINEST there is some matrix magic involved to compute those standard errors. But that is another story that you do not need to worry about now, since some Redmondites of old have solved it for you. Maybe they even use LAPACK!

https://www.netlib.org/lapack/lug/node27.html
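To make the |T| > 2 arithmetic concrete:

$$ |T| = \frac{|\mathrm{coeff}|}{\mathrm{stderr}} > 2 \iff \mathrm{stderr} < \frac{|\mathrm{coeff}|}{2} $$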

Okay, so now we can backtest our model with automatic variable elimination, so that we can reduce overfitting. This technique is identical in aim to “feature selection” in ML models.

The following GIF shows X columns being zeroed out in turn during backtesting with backwards stepwise regression enabled (with T cutoff = 1, a fairly weak cutoff; T scores between 1 and 2 are normally also considered fairly weak, but we start slowly with term elimination, to see what happens).

Figure 3: Variable elimination during backtest example. Variables in columns E to I getting zeroed out automatically by the backwards stepwise algorithm. Note the colored cells moving back in time during the backtest to indicate the stepping.

First Trials Of Term Elimination

Let’s recall our prior results for TSLA 5 day ahead forecast: 252 day backtest (1 year), 100 points in regression:

T cutoff = 0
R2 of backtest 0.256
169/252 = 0.670 directionally correct during backtest

T cutoff = 1
R2 of backtest 0.253
159/252 = 0.631

T cutoff = 2
R2 of backtest 0.254
167/252 = 0.663

T cutoff = 3
R2 of backtest 0.217
167/252 = 0.663

Changing the T cutoff to various commonly used levels (1, 2, 3) did not improve this backtest result appreciably (or at all) for TSLA, for a 1 year backtest with the model configuration as-is: number of points in regression = 100, number of points to compute moments = 200, 5 days forward, and so on. In some cases, the backtest R2 was even reduced a bit (the wrong direction… we would prefer a larger R2 in a backtest), but the directional percent correct stayed about the same. Such is the nature of these “looking for a needle in a haystack” models. It could be the case that we merely got lucky forecasting during those few extra correct days during the baseline “no term removal” trial, thus boosting our R2 a little for that baseline case. It’s not really good to boost your R2 by luck, though, because it can make you think that your model is better than it is, and possibly by extension that you are smarter than you actually are.

Similar behavior was observed for a two year backtest (504 trading days) of this same model: Slight reduction in backtest R2 as we increased the T cutoff, but percent correct direction counts remaining about the same.

T cutoff = 0
R2 of backtest 0.119
312/504 = 0.619

T cutoff = 1
R2 of backtest 0.119
314/504 = 0.623

T cutoff = 2
R2 of backtest 0.110
312/504 = 0.619

T cutoff = 3
R2 of backtest 0.101
306/504 = 0.607

Since statisticians generally prefer simpler models when they can maintain forecast quality similar to that of more complicated models (the concept of parsimony), it seems that one might want to set the abs(T) cutoff to 2 or 3 for this model, as long as OOS quality does not degrade appreciably. Maybe a T cutoff of 1 is not unrealistic, since that yielded slightly better results for the 2 year backtest. But again, those small percent improvements could be just due to luck, so… caveat analysta.

There is also some support for setting T = 1 (almost no degradation of backtest results in the 2 year case).

Since the backtest R2 is an aggregate result, there may be a way that we can compute an adjusted backtest R2 (or related metrics: BIC, AIC, and so on):

https://en.wikipedia.org/wiki/Bayesian_information_criterion

… by some sort of average or range of the R2-of-fit values that were computed during the backtest. This might give a better model selection criterion (which T cutoff to use, which other model parameter values to use) than our simple backtest R2 and direction-correct counting metrics. Much like how real estate professionals use the phrase “location, location, location” to indicate what is important in real estate, in modeling, “refinement, refinement, refinement” is a key phrase to remember. And by refinement, we do not mean merely fitting the data better. With today’s technology, we can fit all data exactly, with any number of variables. However, this does not indicate a good model, and it is why we start with linear models instead of fully flexible TensorFlow types of ML models.
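For reference, the classic in-sample adjusted R2 penalizes fit by the number of predictors p relative to the number of observations n; some analogous penalty might be cooked up for an aggregate backtest metric:

$$ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} $$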

Figure 4: R2 of fits during backtest. One row per backtest step. Note that these are suspiciously higher than the out of sample backtest R2… Another thing that makes you go: “hmm…”

Another hint that we should push toward simpler models comes from looking at in-sample R2 values from all of the regressions performed during the backtest. These are significantly higher than the backtest R2 values (sometimes 2x or more). For example, our recent two year backtest with a T variable weed-out cutoff of 3 showed a mean R2 of fit of 0.40 (over 504 days or 2 years’ worth of regressions, one R2 per backtest day). This is much greater than the related OOS backtest R2 of 0.101 (with associated percent directionally correct 0.607, or 60.7%), suggesting over-fitting. Hence, other ways to reduce overfit might be worth looking into. As this is merely a simple linear model, we have limited yet intriguing options to adjust the model, such as increasing the number of points per regression (originally 100) and/or tuning the number of points used to compute the moments (now 200). Maybe we even want to set the T cutoff higher than 3, to leave only “super solid” (technical term?) variables in the model.

And the variables we feed into this model are not the end-all of variables. For example, and as we noted in our prior article, we could compute similar moments for a related (technology) ETF (exchange traded fund) and put those into the model to see if we get better or worse forecasts. Or maybe we look into the moments and tail ratios of SPY, an S&P500 ETF; perhaps we should add these to the model. That is, maybe we can boost an individual asset model’s quality by including more stable sector-wide or market-wide derived data as candidate predictors. Our new stepwise term elimination will tell us if this was a good idea or not. As we noted in our prior article, we could even try to add lagged variables (auto regressive or AR terms), and, going beyond ordinary LINEST models, add MA terms which feed back model errors into the next forecasts.

“The MA part of ARIMA indicates that the forecast error is a linear combination of past respective errors.”

(from the above MA terms link)
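As a sketch of the MA idea in our 5 day ahead context (the symbols here are illustrative, not part of the current sheet), the next forecast would add a term proportional to the previous forecast error:

$$ \hat{r}_t = c + \sum_i \beta_i x_{i,t} + \theta\,\varepsilon_{t-1}, \qquad \varepsilon_{t-1} = r_{t-1} - \hat{r}_{t-1} $$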

Stepwise regression summary

We noted a slight decrease in backtest R2 when using our backwards stepwise term eliminator macro, with directionally correct results being about the same as achieved with full term models, for 1 and 2 year backtests. While the hoped-for result of increased backtest R2 did not occur, we have not done any model tuning in this term elimination context, and term elimination (model simplification) via tossing out uncertain coefficients (those coefficients with large standard errors) is, in general, a good idea; so we have the stepwise method at-the-ready when needed.

Bullish/bearish zero tolerance

Next, we introduce a zero tolerance for our directionally correct forecasts so that we can discard “close to the edge” indeterminate days of forecasts that are on the border between bullish and bearish. For those indeterminate days, if for example we used this type of model to compute trading signals, we would report out “no signal today, indeterminate.” We had noted in prior studies that using a zero tolerance (which we sometimes referred to as a “dead zone” of the model) seems to boost the prediction quality (percent correct) for the points that are outside this indeterminate range. This zero tolerance is set up (by the user) in cell J21. Ordinarily, we would put the computations for these next to the directionally correct computed results in column O. However, to avoid disrupting the location of our LINEST and related formula ranges (since our prior article refers to these specific cells for discussion), we put these indeterminate calculations out to the right on the sheet at columns AL and AM.

Figure 5: Zero tolerance formulas. Since our returns are 1-centered in this sheet (1 = neither bullish nor bearish), we subtract 1 from the estimated return in the column AL formulas.

The formulas are simple: If a forecast is outside the zero tolerance (measured from 1, the neutral in our 1-centered returns definition), we mark it as TRUE (forecast outside of tol) in column AL. Then in column AM, we logically compute if the forecast is both correct and outside the tolerance. Columns AL and AM are summed up in P6 and P8 (note that “correct and outside tolerance” will always be <= “outside of tolerance”). Then the ratio of these is computed in P10. Hence, we can easily compare the full percent correct ratio in P4, with the new “zero toleranced” percent correct in P10. If we set the zero tolerance to 0, cell P10 should report the same number as P4.
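As a hedged sketch of these formulas (the forecast column reference N here is hypothetical, so point it at wherever your forecast lands; we also assume column O holds TRUE/FALSE direction-correct flags, as described earlier):

AL2: =ABS(N2 - 1) > $J$21     forecast far enough from the 1-center to be outside the tolerance?
AM2: =AND(AL2, O2)            correct direction AND outside the tolerance
P6:  =COUNTIF(AL:AL, TRUE)    count of forecasts outside the tolerance
P8:  =COUNTIF(AM:AM, TRUE)    count of correct forecasts among those
P10: =P8/P6                   zero-toleranced percent correct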

We see that if we adjust the zero tolerance in J21 manually, we can observe how the percent correct changes in P10.

Figure 6: Zero tolerance set to 0 (baseline case). Hence % direction correct for the full data set is the same as the % direction correct for the zero toleranced data set (both 0.607 ~= 61%).
Figure 7: Setting zero tolerance to 0.05 boosts the percent correct for forecasts outside of +- 0.05 … to 0.713 (~= 71%)

Since this tolerance is applied to backtest results, we first need to run a backtest, then tune this parameter. de Prado’s advice (roughly stated as “don’t snoop too much” on your data or you will be sorry) applies to tuning this parameter also. Models today are so flexible that one can overfit to supposedly out-of-sample backtests just as well as to in-sample data; yet still, such a model can fail in true out-of-sample future application. Luckily, we are using a linear model, and it can only bend so much. In fact, it cannot bend at all, as it is a linear model! Pure hyperplane flatness, as far as the mind’s eye can see…

The alert reader may be able to visualize that if we reduce the number of points sent into the regression [cell J12], the flat model fits a smaller time window of data and in a sense “tilts” (in multiple dimensions), rather than “bends,” closer to the true tangent of the theoretical surface defined by the data. See our picture of tangent versus secant in our earlier paper, or see it directly at: https://en.wikipedia.org/wiki/Tangent

And yet: Tuning this indeterminate zero tolerance parameter on a backtest is — like all other tuning based upon backtest results — treating formerly out-of-sample points (1 day forward forecasts) as in-sample with respect to this new tuning procedure. Hence, it may be more robust to analyze historical returns and compute an indeterminate range zero tolerance by some other means than from backtest results.

Zero tolerance test

Nonetheless, let’s try setting J21 to 0.1 for the 1 year / 252 day backtest case. And for a baseline, if you’re following along in Excel, let’s keep min T (J18) at 0 (all variables in the model). Remember that if we change the min T (J18), we should re-run the backtest. Setting J21 to 0.1 means that we treat all forecasts in the range [-10%, +10%] as indeterminate, disregarding endpoint considerations. Also meaning: We only want to keep the really strong forecasts, those greater than 10% in magnitude… a strong week averaging a 2% per day return or more. Recall that our model is currently set up for forecasting 5 day returns, not 1 day returns; a 10% weekly gain from TSLA is not that unusual. Neither is a 10% loss the next week.

In fact, if we do a little calculation on the side for column L (the 5 day rolling actual return of TSLA), we can compute that its standard deviation is 0.096 (close to 0.1). Using the rule of thumb that 68% of values fall within 1 standard deviation (approximating with a normal distribution for now), 32% of values fall outside this range, so we would see about 32% of the 5 day returns being greater than +10% or less than -10% in the historical data. Hence, 10% for a zero tolerance is not unreasonable for this case.
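If you want to reproduce that side calculation (cell ranges hypothetical; adjust to your data window):

=STDEV.S(L2:L253)                        stddev of the 5 day rolling returns, ≈ 0.096 here
=2*(1 - NORM.S.DIST(0.1/0.096, TRUE))    two-sided normal tail beyond +-10%, ≈ 0.30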

Our counts show 35 out of 252 backtest days had five day returns forecasted to be outside this 10% range, and of those, 30 were in the correct direction, or 85%.

Figure 8: Setting zero tolerance to 10% for a 1 year backtest, all variables in the model (T tolerance = 0). 85% correct for forecasts outside the zero tolerance range of +- 10% forecast.

Not bad for a weakly tuned linear model trying to forecast 5 days ahead. Does your LSTM GPT transformer [insert more tech buzz words] GPU powered deep learning machine learning model get 85% correct out of sample? Ha! We joke! We kid! Certainly there is a time and place for those models. And again, any hubris we might have related to this high percentage needs to be tempered by asking the question regarding how much data snooping we are doing (overfitting to backtests) by tuning with this parameter while looking at backtests. How much data snooping are we doing? Well, we are tuning a parameter by looking at what we originally classified as out-of-sample data. So: a lot. We are doing a lot of data snooping.

On the other hand, by setting a zero tolerance > 0, we are merely throwing away (what we call) weaker predictive signals, which seems to be a reasonable thing to do when forecasting. The snooping part is tuning the model to a particular indeterminate cutoff by feeding back information from the backtest into our cutoff adjustment.

Another tolerancing hazard

When increasing this zero tolerance, it is possible that the denominator of the percent correct ratio can get small (e.g. you can end up discarding most of the forecasts by classifying them as indeterminate). In this case, we might see percent correct values approaching 100%. But for these “low N” cases, we can start running into issues along the lines of: Well, there is now a non-trivial probability that we could have achieved these good results randomly, just by flipping a coin. We go into this in some detail in our more formal white paper on a related modeler system [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4137779] [see Note 2.2 in that paper], but just to seed the idea from your own experience: It is fairly easy to get 3 out of 4 coin flips to show up how you want by random chance (75% correct). Go ahead and try it! But it is difficult (close to impossible) to get 75 out of 100 coin flips to show up how you want by random chance. It’s not just the percentage correct that you need to look at when analyzing these types of directional forecasts; you also need to look at the number of trials (the denominator in our percent correct case). Experienced statisticians may start to think of binomial trial theory and the like. We do a bit of that in the above linked white paper [Note 2.2 in it], so check it out if you want more background. In this particular study, our denominators are still pretty large even with a high-ish zero tolerance set, so we may not need to be too worried yet about this issue. But it may come up eventually.
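Excel can quantify the coin flip intuition directly with its binomial functions; for example (the 30-of-35 line uses this backtest’s counts, under a fair coin null hypothesis):

=1 - BINOM.DIST(2, 4, 0.5, TRUE)        P(3 or more correct of 4)  ≈ 0.31: easy to hit by luck
=1 - BINOM.DIST(74, 100, 0.5, TRUE)     P(75 or more of 100)       ≈ 3E-07: nearly impossible
=1 - BINOM.DIST(29, 35, 0.5, TRUE)      P(30 or more of 35)        ≈ 1E-05: unlikely by luck alone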

Summary

In this next step of our moment-based model development, we introduced backward stepwise term elimination in an Excel VB macro to work along with the LINEST function. We found that term elimination actually reduced OOS forecast quality a bit for this particular model setup. Since eliminating weak terms from a model — terms with larger standard errors in our case — is recommended both in traditional statistics and in machine learning, the reduction in apparent OOS quality may have been merely due to eliminating a few seemingly correct forecasts that happened by accident. A “thinning of the herd” of weaker forecasts, so to speak.

We also applied a zero tolerance to our forecasts to see if we might get improved directional results in the backtest. Of these two improvements — for this particular model — the zero tolerance seemed to give more “percent correct” improvements in the backtest than the term elimination. However, the analyst must be mindful of over-fitting to backtests when tuning this zero tolerance.

Further reading

To see a nice compact experiment showing the importance of looking at T scores when doing regression (in the context of this study), check out Part 3 of this series, where we replace a candidate predictor X column with random values to see what happens (“Shake it up and watch it fizz, boy!” in the vocabulary of the “Foghorn Leghorn” rooster cartoon):

https://medium.com/@nttp/return-forecasting-with-moments-part-3-the-importance-of-t-scores-6a766ee4cffd

Updates

Dec 6 2023: Add Further Reading section.

Notes

[Note 1] The reader should not get hung up on the term “statistical significance.” T score is merely the coefficient value divided by its error as estimated by the LINEST function, and error is measured in a “standard deviation” type of range, so that +-2*stderr covers about 95% of the estimated range of the coefficient.

[Note 2] Many statistics packages refer to the coefficient’s p-value instead of the T score when determining which variables to discard from a model. However, p-value is merely derived from T score, and this extra level of look-up or indirection can add to user confusion. p-values have their use, but they can also be easily mis-interpreted… so… we don’t bother with them here. There is probably too much lore to overcome with p-values in a short article like this. But T scores… we like those. They’re all right. Also note Cochran’s reminder that a variable may be statistically insignificant, yet may still be economically significant. E.g. if a variable has a large effect on the result, yet is not considered “statistically significant” when looking at T score, we might want to sharpen our pencils a bit and investigate that variable more.

[Note 3] We prefer verbage instead of verbiage, since the “i” seems superfluous, but the dictionary says “verbiage.” So.

[Note 4] We are working on an American college football style “fight song” for Bing, the perpetual underdog, something along the lines of: “And when you want to Google it / why don’t you Google it / on Bing!” For some reason, Capitol Records hasn’t come calling yet.
