A Backtesting Framework

In the May edition of “Technically Speaking”, released by the Market Technicians Association, there was an interesting reprint of a blog post written by Tucker Balch entitled “9 Mistakes Quants Make that Cause Backtests to Lie”. The post is clear and concise and provides an excellent roadmap that aspiring quants can use to generate reliable test results. However, the post fails to mention what I believe to be two extremely important requirements for generating test results that have the highest probability of being realised in a live environment, namely logic verification and sample size. In today’s post I’m going to discuss these two additional points, but before we do, let’s recap Balch’s nine mistakes, with my two additions appended as items 10 and 11:

1) In-sample backtesting
2) Using survivorship-biased data
3) Lookahead bias
4) Ignoring market impact
5) Ignoring liquidity
6) Over-optimization
7) Complex models
8) Stateful strategy luck
9) Data mining fallacy
10) Strategy logic verification
11) Sample size

You can read Tucker’s full post here.

Strategy Logic Verification

There should be a credible reason as to why a strategy works. Either it is related to a human behavioural bias that is not likely to change, or it is related to an institutional feature that cannot be easily changed. Without an intuitive and sound explanation for the perceived market anomaly, the robustness of a strategy is greatly diminished. For example, if one were to run a computer overnight seeking strong correlations between stocks for a pairs trading strategy, it’s highly likely that multiple correlations would be uncovered between unrelated stocks due to pure chance alone. If one were to then build and test a pairs strategy that exploits the strong correlation of any two unrelated stocks, one would likely uncover good test results. However, there’s no reason that the strategy will continue to work because the logic is flawed: in this scenario correlation does not imply causation, so there’s no basis for the correlation to persist through time. If we instead tested a pairs strategy that exploited two related stocks, for instance two gold stocks, we’d be more convinced of the test results.
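To make the overnight correlation search concrete, here is a minimal sketch in Python using NumPy. It is an illustration under my own assumptions rather than a real scan: the "stocks" are independent random return series, and the function name, the 200-stock universe, the 60-day window and the 0.4 threshold are all arbitrary choices made for the example.

```python
import numpy as np


def count_spurious_pairs(n_stocks=200, n_days=60, threshold=0.4, seed=0):
    """Count pairs of unrelated return series that still look strongly correlated."""
    rng = np.random.default_rng(seed)
    # Independent daily returns: by construction no stock is related to any other.
    returns = rng.normal(loc=0.0, scale=0.01, size=(n_days, n_stocks))
    # Correlation matrix with one row and column per stock.
    corr = np.corrcoef(returns, rowvar=False)
    # Take the upper triangle so each pair is counted once and the diagonal is skipped.
    upper = np.triu_indices(n_stocks, k=1)
    pair_corr = corr[upper]
    strong = int(np.sum(np.abs(pair_corr) > threshold))
    return strong, pair_corr.size


strong, total = count_spurious_pairs()
print(f"{strong} of {total} unrelated pairs have |correlation| > 0.4")
```

With 19,900 possible pairs, a handful will clear the threshold even though every series is pure noise. Any pairs strategy built on those matches would be trading a relationship that exists only in the sample.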

Verifying strategy logic by understanding why the strategy works is a crucial ingredient in building a robust strategy that will stand the test of time. Here are a few well-known market inefficiencies, along with the reasons why they work and are likely to persist:

Valuation
Definition: low valuation outperforms high valuation
Why it works: long run mean reversion related to investor herding

Low volatility
Definition: low volatility assets outperform high volatility assets
Why it works: investor gambling preference and institutional intolerance of tracking error

Momentum
Definition: past returns predict future returns
Why it works: investor herding behaviour, multitude of emotional biases causing initial under-reaction and delayed overreaction

Mean reversion
Definition: short run return predicts opposite future return
Why it works: availability bias, aversion to losses

Source: Proactive Advisor Magazine, Dave Walton

Sample Size

Given the same test timeframe, the strategy that generates more trades will, all else being equal, produce the more reliable test results. In my view, sample size is the single most important determinant of future success. Surprisingly, it’s rarely mentioned or discussed in articles, but ignoring it leads to disappointing live performance.

Chance or luck plays a significant role in the markets. As a result, a strategy with a small sample size, meaning a small number of trades in a backtest, is more likely to generate test results that do not represent the true underlying population due to the effects of good or bad luck. The test results of such a model can appear to be overly good or bad, misleading traders and distorting their expectations.
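The effect of luck on a small sample can be shown with a short simulation. This is a rough sketch under stated assumptions: trade returns are drawn from a normal distribution with a true edge of zero and a 2% standard deviation per trade, and the function name and every parameter are hypothetical choices for illustration only.

```python
import numpy as np


def spread_of_measured_edge(n_trades, n_backtests=2000, true_edge=0.0,
                            trade_vol=0.02, seed=1):
    """Std. deviation of the average trade return across many simulated backtests."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated backtest made up of n_trades independent trades.
    trades = rng.normal(true_edge, trade_vol, size=(n_backtests, n_trades))
    # The "measured edge" of each backtest is its average return per trade.
    measured_edge = trades.mean(axis=1)
    return float(measured_edge.std(ddof=1))


small = spread_of_measured_edge(30)
large = spread_of_measured_edge(1000)
print(f"spread of measured edge with   30 trades: {small:.4%}")
print(f"spread of measured edge with 1000 trades: {large:.4%}")
```

The strategy here has no edge at all, yet the 30-trade backtests scatter far more widely around zero than the 1000-trade ones, so a short backtest can easily look like a winner or a loser by luck alone.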

A good rule of thumb, and one I employ, is to shoot for at least 1000 trades in your backtest. You’ll come across people claiming that 30 trades is sufficient, but the truth is that, given the broad range of market conditions, 30 is simply not enough. A 1000-trade sample will reduce the effects of chance and more closely reflect the underlying edge of the strategy. If you can’t generate a large sample, then either abandon the strategy or make fair allowances in your testing and expectations. In a perfect world, we’d prefer to see millions of trades. In reality this is not always possible; just remember that more is always better when it comes to sample size.
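One way to put numbers on the 30 versus 1000 comparison is the standard error of the mean, which shrinks with the square root of the number of trades. The sketch below is a simplification that assumes independent trades with a known per-trade standard deviation. The 2% figure is an illustrative assumption, not a measured value, and the helper name is my own.

```python
import math


def ci_half_width(trade_vol, n_trades, z=1.96):
    """Approximate 95% confidence half-width for the average return per trade."""
    # Standard error of the mean: per-trade volatility divided by sqrt(sample size).
    standard_error = trade_vol / math.sqrt(n_trades)
    return z * standard_error


for n in (30, 1000):
    print(f"{n:>5} trades: average trade return known to within "
          f"+/- {ci_half_width(0.02, n):.2%}")
```

Under these assumptions the uncertainty around the average trade is roughly 0.72% with 30 trades and roughly 0.12% with 1000 trades. If the edge you are hoping for is a fraction of a percent per trade, the smaller sample cannot distinguish it from zero.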

A Stringent Strategy Test

Two studies recently attempted to uncover strategies that have a high likelihood of persistently providing outsized returns in the future. They applied five stringent requirements when evaluating a strategy, and these have proven to be highly effective. I list them below:

1) Performance has been consistent over many years and has survived numerous database revisions as well as extensive out-of-sample data.
2) The strategy has been vetted, replicated, and debated in top academic journals over many years.
3) The strategy works across multiple asset classes.
4) Minor variations in definition/construction do not significantly impact performance.
5) There is a credible reason for the strategy to offer a persistent edge.
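Requirement 4 in the list above lends itself to a simple automated check. The following is only a sketch of one possible approach, not the method used in the studies: the function name, the 30% tolerance and the sample figures are all hypothetical, and the performance numbers are made up purely to show how the check behaves.

```python
def is_parameter_robust(results, baseline, max_relative_drop=0.3):
    """Check that nearby parameter choices do not badly degrade performance.

    results maps a parameter value (e.g. a lookback length) to a performance
    metric (e.g. annual return); baseline is the parameter actually traded.
    """
    base = results[baseline]
    if base <= 0:
        # A baseline with no positive performance has no edge to protect.
        return False
    worst = min(results.values())
    # Relative drop from the baseline to the worst neighbouring variation.
    return (base - worst) / base <= max_relative_drop


# Made-up annual returns for three lookback lengths (illustrative only).
stable = {40: 0.07, 50: 0.08, 60: 0.075}
fragile = {40: 0.01, 50: 0.08, 60: 0.02}
print(is_parameter_robust(stable, baseline=50))   # True
print(is_parameter_robust(fragile, baseline=50))  # False
```

A strategy whose results collapse when the lookback moves from 50 to 40 is more likely a product of over-optimization than a genuine edge.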

Conclusion

Developing a strategy that adheres to the above requirements is challenging and can even take years of work. The effort is certainly worthwhile, however, when one considers what’s at stake: hard-earned capital. I hope this provides you with some useful information to apply in your own testing. If you can build a strategy that checks all eleven requirements and passes the stringent test above, then you’re likely going to enjoy success with your approach. Good luck!

Happy Trading,
PJ