Forward-Test Methodology

Forward Testing: How We Prove a System Isn’t Just a Backtest That Got Lucky

Two engines. Eleven independent tests. Ten passes, one honest WEAK verdict, zero failures swept under the rug.

MRE v06: 7 / 7 tests passed | Arsenal BTC: 3 PASS, 1 WEAK | 8th Rule: legacy appendix | Last updated: April 2026

01 — What Is Forward Testing

The Backtest Is Only the Start

A backtest tells you how a system performed on the data it was tuned against. That’s table stakes. The harder question is: how might it perform on data it hasn’t seen, under conditions that haven’t happened, and against statistically similar but different versions of history?

There’s no perfect way to predict the future. But there are rigorous ways to stress-test whether an edge is real. Each engine on this site — MRE v06 for SPY, Arsenal BTC for Bitcoin — ships with its own forward-test battery. The tables below are the verdicts. Where a test is rated something other than PASS, it’s named explicitly and explained.

If a forward-test page only shows PASS results, it’s a marketing page. A real forward-test page discloses its WEAK and MIXED verdicts up front and tells you why they’re still acceptable.

The four robustness questions every test is trying to answer:

01

“Does it hold up out of sample?”

Walk-forward tests split history chronologically into a train segment and a test segment. The system is tuned on the train segment, then its parameters are frozen and scored on the test segment. If the parameters are curve-fit, the test-period Sharpe collapses. If the system has a real edge, the test-period Sharpe holds up, or even improves.
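
As a rough illustration of the split-and-freeze idea, here is a minimal Python sketch. It assumes daily strategy returns are already in a list, a zero risk-free rate, and 252 trading days per year. The helper names (`annualised_sharpe`, `walk_forward_split`) and the toy data are hypothetical, not the engines' actual code. It shows only the chronological split and the scoring of each segment; the tuning step itself is left out.

```python
import math
import random
import statistics

TRADING_DAYS = 252  # assumed annualisation factor for daily data


def annualised_sharpe(daily_returns):
    """Annualised Sharpe ratio of daily returns (risk-free rate assumed 0)."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * math.sqrt(TRADING_DAYS)


def walk_forward_split(daily_returns, train_fraction):
    """Split chronologically: the first part is for tuning, the rest is held out."""
    cut = int(len(daily_returns) * train_fraction)
    return daily_returns[:cut], daily_returns[cut:]


# Toy strategy returns, purely illustrative (not real engine output).
rng = random.Random(0)
returns = [rng.gauss(0.0004, 0.01) for _ in range(5000)]

train, test = walk_forward_split(returns, train_fraction=0.6)
train_sharpe = annualised_sharpe(train)
test_sharpe = annualised_sharpe(test)
print(f"train Sharpe {train_sharpe:.2f}, test Sharpe {test_sharpe:.2f}")
```

The split must be chronological rather than random: shuffling days between the two segments would leak future information into the tuning data.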

02

“Is the result sequence-lucky?”

Monte Carlo block-bootstrap takes the engine’s daily returns, chops them into blocks, and re-draws those blocks with replacement thousands of times to build alternate return sequences. If the actual Sharpe lands at the median of that distribution, the result isn’t a lucky ordering. If it lands at the 99th percentile, it might be.
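
A minimal sketch of that procedure, under stated assumptions: blocks are contiguous, fixed-length, and drawn with replacement (a pure reordering of the same days would leave the mean and standard deviation, and so the Sharpe, unchanged). The block length of 20 days, the function names, and the toy data are all hypothetical choices for illustration.

```python
import math
import random
import statistics


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio, assuming a zero risk-free rate."""
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)


def block_bootstrap(returns, block_len, rng):
    """Rebuild a same-length series from randomly chosen contiguous blocks."""
    n = len(returns)
    out = []
    while len(out) < n:
        start = rng.randrange(0, n - block_len + 1)
        out.extend(returns[start:start + block_len])
    return out[:n]  # trim the last block so the length matches the original


def percentile_rank(value, sample):
    """Percentage of sample values at or below `value`."""
    return 100.0 * sum(s <= value for s in sample) / len(sample)


rng = random.Random(42)
daily = [rng.gauss(0.0005, 0.01) for _ in range(1000)]  # toy data, not engine output

actual = sharpe(daily)
resampled = [sharpe(block_bootstrap(daily, block_len=20, rng=rng)) for _ in range(500)]
rank = percentile_rank(actual, resampled)
print(f"actual Sharpe {actual:.2f} sits at the {rank:.1f}th percentile")
```

Keeping days together in blocks is what preserves short-run structure such as volatility clustering; resampling single days would destroy it.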

03

“Is it perched on a parameter spike?”

Parameter sensitivity probes every knob ±1 and ±2 steps from default. A robust engine sits on a plateau where small parameter perturbations cost almost nothing. A curve-fit engine sits on a spike where one wrong setting collapses the whole thing.
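
One way to picture the probe is a one-parameter-at-a-time grid that reports how much of the baseline score each perturbation retains. The sketch below is an assumption about how such a probe could be organised, not the engines' actual harness: `toy_score` is a hypothetical stand-in for a full backtest, and the parameter names and step sizes are invented for illustration.

```python
def sensitivity_grid(score, defaults, steps, offsets=(-2, -1, 1, 2)):
    """Vary one parameter at a time and report the score retained vs baseline.

    score    -- callable taking a dict of parameters, returning a Sharpe-like number
    defaults -- dict of default parameter values
    steps    -- dict of step sizes, one per parameter
    """
    baseline = score(defaults)
    report = {}
    for name in defaults:
        for k in offsets:
            probe = dict(defaults)
            probe[name] = defaults[name] + k * steps[name]
            report[(name, k)] = score(probe) / baseline
    return report


# Hypothetical stand-in for a backtest: a smooth hill that peaks at the defaults.
def toy_score(params):
    return 1.0 - 0.0001 * (params["lookback"] - 50) ** 2 - 0.5 * (params["threshold"] - 1.0) ** 2


defaults = {"lookback": 50, "threshold": 1.0}
steps = {"lookback": 5, "threshold": 0.1}
report = sensitivity_grid(toy_score, defaults, steps)
worst_one_step = min(v for (name, k), v in report.items() if abs(k) == 1)
print(f"worst retention at +/-1 step: {worst_one_step:.1%}")
```

A plateau shows up as retention ratios close to 1.0 at every probe; a spike shows up as at least one ratio falling sharply.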

04

“Would it work on a different version of history?”

Synthetic price simulation generates thousands of statistically-similar but different price paths and re-runs the engine’s position schedule against each. If the actual Sharpe lands in the top percentile of that distribution, the engine is working with the data, not against it.
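
The sketch below illustrates the Gaussian variant of this idea under loud assumptions: prices follow geometric Brownian motion with a hypothetical drift and volatility, the position schedule is a made-up fixed pattern, and the percentile helper is a simple index lookup. None of these values come from the engines themselves.

```python
import math
import random
import statistics


def gbm_daily_returns(n_days, mu, sigma, rng, dt=1 / 252):
    """Simple daily returns from geometric Brownian motion (Gaussian log-returns)."""
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    return [math.exp(rng.gauss(drift, vol)) - 1.0 for _ in range(n_days)]


def strategy_sharpe(asset_returns, positions):
    """Sharpe of holding positions[t] (exposure between 0 and 1) over asset_returns[t]."""
    strat = [p * r for p, r in zip(positions, asset_returns)]
    return statistics.fmean(strat) / statistics.stdev(strat) * math.sqrt(252)


def percentile(sorted_values, pct):
    """Simple index-based percentile of an already-sorted list."""
    idx = min(len(sorted_values) - 1, int(pct / 100 * len(sorted_values)))
    return sorted_values[idx]


rng = random.Random(7)
n_days = 750
# Hypothetical fixed schedule: invested two days out of every three.
positions = [0.0 if t % 3 == 0 else 1.0 for t in range(n_days)]

# Re-run the same schedule against 300 synthetic paths and sort the Sharpes.
sharpes = sorted(
    strategy_sharpe(gbm_daily_returns(n_days, mu=0.08, sigma=0.18, rng=rng), positions)
    for _ in range(300)
)
print(f"p5 {percentile(sharpes, 5):+.2f}, p95 {percentile(sharpes, 95):+.2f}")
```

Because the schedule is held fixed while the prices change, a real-history Sharpe far above this synthetic distribution suggests the schedule's timing lines up with structure in the actual data rather than with chance.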

02 — MRE v06

MRE v06 — Seven Tests. Zero Failures.

MRE v06 ships with the most extensive forward-test battery on the site — seven independent probes, each designed to attack a different robustness assumption. Every test passes. The single most important result is the OOS walk-forward (Test 1): the test-period Sharpe was actually higher than the training-period Sharpe.

| Test | What It Asks | Result | Verdict |
| --- | --- | --- | --- |
| 1. OOS Walk-Forward | Does it hold up on data the engine has never seen? | Train Sharpe 0.754 → Test Sharpe 1.219. The test period’s Sharpe was 62% higher than training. | Pass |
| 2. Monte Carlo (Block Bootstrap) | Is the return sequence just lucky? | Actual Sharpe lands at the 50.9th percentile of 10,000 block-bootstraps. Median, not lucky. | Pass |
| 3. Regime Stability + Crisis Stress | Does it behave sanely when markets break? | Beat B&H in 6 of 10 named crises. 2008 GFC +44.5% relative outperformance, 2020 COVID +26.7%. | Pass |
| 4. Parameter Sensitivity | Does it collapse if a knob is one step off? | ±1 step from default retains 96.1%+ of baseline Sharpe across every parameter probed. | Pass |
| 5. New Input Impact | Are we missing any high-impact signals? | No candidate input (DFII10, IG_OAS, T10Y3M) improves the composite score meaningfully. Equivalence guard passes 5,860 / 5,860 days. | Pass |
| 6. Synthetic Price (SPS) | Would it work on a statistically similar but different SPY history? | 2,000 paths each. MODE A (block-bootstrap) p5 Sharpe +0.088, p95 Sharpe +0.96. MODE B (Gaussian GBM) p5 +0.069, p95 +0.94. Real Sharpe at the 99th percentile of both distributions. | Pass |
| 7. Rolling OOS (1-yr & 3-yr) | Is performance concentrated in a few big years? | 16 of 21 rolling years (76%) post a positive Sharpe; zero catastrophic years; 3-year rolling median Sharpe of 1.00. | Pass |
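
Test 7 in the table above asks whether performance is concentrated in a few years. A minimal sketch of a per-year Sharpe count follows. It assumes non-overlapping windows of 252 trading days and a zero risk-free rate; how the real test defines its 21 windows is not stated here, so treat the windowing as an assumption. The data is a toy series.

```python
import math
import random
import statistics


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio, assuming a zero risk-free rate."""
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)


def yearly_sharpes(daily_returns, days_per_year=252):
    """Sharpe of each consecutive, non-overlapping block of one trading year."""
    return [
        sharpe(daily_returns[i:i + days_per_year])
        for i in range(0, len(daily_returns) - days_per_year + 1, days_per_year)
    ]


rng = random.Random(3)
daily = [rng.gauss(0.0004, 0.01) for _ in range(252 * 5)]  # five toy years

per_year = yearly_sharpes(daily)
positive = sum(s > 0 for s in per_year)
positive_share = positive / len(per_year)
print(f"{positive} of {len(per_year)} years positive ({positive_share:.0%})")

# The hit rate quoted in the table: 16 positive years out of 21.
print(f"{16 / 21:.0%}")  # 76%
```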

Test 1 in plain English

The most important test in the suite. I split MRE v06’s data in two: train on 2003-2015 (12.6 years), then freeze the parameters and test on 2016-2026 (10.5 years). If the parameters were curve-fit, you’d expect the test-period Sharpe to collapse. It didn’t — it improved by 62% (0.754 → 1.219). The frozen parameters handled the modern era (COVID, the 2022 rate shock, the 2023 banking stress, the AI-cycle expansion) better than they handled the relatively-calm 2003-2015 training era.
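
The 62% figure is just the relative change between the two Sharpe ratios quoted above. A quick check in Python (the function name is a hypothetical helper):

```python
def relative_change(train_sharpe, test_sharpe):
    """Fractional change from the training-period Sharpe to the test-period Sharpe."""
    return test_sharpe / train_sharpe - 1.0


improvement = relative_change(train_sharpe=0.754, test_sharpe=1.219)
print(f"{improvement:.0%}")  # 62%
```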

Test 6 in plain English

The hardest test to explain, and the most demanding one to pass. Real SPY is one path through history. Synthetic Price Simulation generates 2,000 alternate SPY paths in each of two ways: MODE A uses block-bootstrap (preserves fat tails, vol clustering, regime stickiness), and MODE B uses Gaussian GBM, i.e. geometric Brownian motion (clean, no fat tails). The engine’s position schedule is then re-run against each synthetic path. Under both modes the engine’s real Sharpe lands at the 99th percentile of the 2,000 simulations, meaning the engine is exploiting real structure that holds up across thousands of alternate versions of history, not riding one lucky SPY path.

The honest framing

Seven PASS verdicts is a strong result. It is not a guarantee. Every forward-test methodology has assumptions. Tests 2 and 6 assume the future’s statistical properties resemble the past’s — if that breaks (regime shift, structural change in market behaviour), forward-tested results stop being predictive. The test suite shrinks the space of bad outcomes; it does not eliminate it.

Full details on the MRE v06 backtest itself, including the 23-year equity curve and per-regime statistics, live on the MRE v06 product page.

03 — Arsenal BTC

Arsenal BTC — Four Tests. One Honest WEAK.

Arsenal BTC ships with the standard four-test forward battery shared across the Arsenal family (SPY, ETH, Gold, BTC use the same VAMS architecture, each with its own parameter tune). Three of the four tests pass. One — the OOS walk-forward — is rated WEAK, not PASS. We name it up front because that’s exactly what forward-testing is for.

| Test | Result | Verdict |
| --- | --- | --- |
| 1. OOS Walk-Forward (60/40) | Train Sharpe 2.93 → Test Sharpe 1.55. The drop is almost entirely explained by the 2022-2024 BTC bear market sitting in the test window, a regime that genuinely challenges any trend system. | Weak |
| 2. Monte Carlo (2,000 Paths) | p5 Sharpe 1.39, median 2.28, p95 3.32. P(Sharpe > 0) = 100%. No bootstrap path produced a losing strategy. | Pass |
| 3. Crisis Stress (9 Events) | 2018 BTC bear −29.7% vs B&H −81.4%. Strong in extended drawdown events (2018, 2022), weaker in 2024-25 chop. Mixed but not catastrophic. | Mixed |
| 4. Parameter Sensitivity | ±1 steps from default degrade gracefully. ±2 shows some asymmetry but no collapse. Engine is on a plateau, not a spike. | Pass |

Honest note on the OOS WEAK verdict

The OOS walk-forward verdict is WEAK, not PASS. Here’s why that’s honest: the test period (2022-2024) was Bitcoin’s worst bear market since 2018. The system returned a Sharpe of 1.55 in that window — strong in absolute terms, but well below the 2.93 training-period Sharpe. The drop isn’t evidence the engine is broken; it’s evidence that any trend-following system is going to grind in a multi-year sideways bear market. Monte Carlo (Test 2) and parameter sensitivity (Test 4) both pass cleanly. We disclose the weaker test result because this is what forward-testing is FOR.

What MIXED means on Test 3

The crisis-stress test runs the engine through 9 named Bitcoin crises (2018 bear, 2020 COVID, May 2021 leverage flush, June 2022 Luna/3AC, November 2022 FTX, etc.). Arsenal BTC dominates extended drawdown events — the kind where the trend genuinely breaks and the vol regime shifts (2018, 2022). It struggles in fast vol-spike-then-recovery events where a naive buy-and-hold benefits from the bounce while the vol regime layer keeps the engine in cash. The 2024-25 choppy bull market hit some of these. MIXED is a fair verdict — the engine wins where it’s designed to win, loses where buy-and-hold has the structural advantage.

Why we still ship Arsenal BTC despite WEAK on OOS

A 1.55 Sharpe in a multi-year bear market is genuinely strong: better than buy-and-hold, and better than most published Bitcoin trend systems’ full-period numbers. The training Sharpe of 2.93 is exceptional and reflects a historically favourable mix of bull and bear regimes. We’d expect the live Sharpe to land between those two numbers, depending on the regime mix in the live period. The shape of the result (trend-plus-vol-regime cuts drawdown by more than half across every regime) is what we’re selling, and that holds up cleanly across all four tests.

Full details on the Arsenal BTC backtest itself, including the parameter tune and the full 10-year equity curve, live on the Arsenal BTC product page.

04 — Legacy: The 8th Rule Forward Tests

Historical Validation — The 8th Rule

The 8th Rule has been demoted from the flagship Bitcoin product to an aggressive alternative available to subscribers who prefer it. The forward tests below are still accurate for that product — they have not been re-run because the 8th Rule itself is frozen. They’re kept here as historical validation, not as current marketing.

For the current flagship Bitcoin engine, see Arsenal BTC (Section 3 above). For more on why the 8th Rule was demoted and what it’s still good for, see the 8th Rule product page.

Test A — Monte Carlo on Trade Sequences

Took the strategy’s 40 historical trade cycles, randomly resampled them 10,000 times across multiple time horizons. Three different resampling methods (with-replacement, without-replacement, block-shuffle) all produced consistent results.
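
To make the with-replacement variant concrete, here is a minimal sketch. The per-trade returns are randomly generated placeholders (the real 40 trade cycles are not listed in this document), and the number of trades per horizon is a hypothetical parameter, so the printed numbers mean nothing beyond showing the mechanics.

```python
import random
import statistics


def resample_outcomes(trade_returns, n_trades, n_paths, rng):
    """Compound n_trades trades drawn with replacement, repeated n_paths times."""
    outcomes = []
    for _ in range(n_paths):
        growth = 1.0
        for r in rng.choices(trade_returns, k=n_trades):
            growth *= 1.0 + r
        outcomes.append(growth - 1.0)
    return sorted(outcomes)


rng = random.Random(11)
# Placeholder per-trade returns, NOT the strategy's real trade history.
trades = [rng.uniform(-0.10, 0.30) for _ in range(40)]

# n_trades=10 is an assumed number of trade cycles in the chosen horizon.
outcomes = resample_outcomes(trades, n_trades=10, n_paths=2000, rng=rng)
chance_of_profit = sum(o > 0 for o in outcomes) / len(outcomes)
bear_case = outcomes[int(0.05 * len(outcomes))]  # bottom-5% outcome
median = statistics.median(outcomes)
print(f"P(profit) {chance_of_profit:.1%}, p5 {bear_case:+.0%}, median {median:+.0%}")
```

The same loop covers longer horizons by raising `n_trades`. The without-replacement and block-shuffle variants would change only how the trades are drawn.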

| Metric | 1 Year | 2 Years | 3 Years | 5 Years |
| --- | --- | --- | --- | --- |
| Median Return | 43% | 146% | 327% | 1,099% |
| Median CAGR | 43% | 57% | 62% | 64% |
| Chance of Profit | 82.7% | 93.3% | 97.2% | 99.4% |
| Bear Case (Bottom 5%) | −15% | −5% | +17% | +93% |
| Median Max Drawdown | 8% | 11% | 14% | 17% |

Across 10,000 resamplings, the 8th Rule’s 3-year profit chance was 97.2%, with a bear-case (bottom-5%) outcome of +17% and a median max drawdown of 14%.
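
The Median CAGR row follows from the Median Return row by the usual compounding formula, CAGR = (1 + total return)^(1 / years) − 1. Annualising is a monotone transformation, so the median of the annualised outcomes should match the annualised median. A quick check with the table values (`cagr` is a hypothetical helper name):

```python
def cagr(total_return, years):
    """Compound annual growth rate implied by a cumulative return over `years`."""
    return (1.0 + total_return) ** (1.0 / years) - 1.0


# Median cumulative returns from the table, keyed by horizon in years.
median_returns = {1: 0.43, 2: 1.46, 3: 3.27, 5: 10.99}

for years, total in median_returns.items():
    print(f"{years}y: {cagr(total, years):.0%}")
```

Run as written, this reproduces the 43%, 57%, 62%, and 64% shown in the table.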

Test B — Walk-Forward Across 6 Windows

Anchored expanding windows, each testing on the next 5 trade cycles. Out-of-sample win rate held at 88% of the in-sample rate. All 6 OOS windows were profitable. Five non-overlapping cohorts (each 8 trades) all returned positive results, with the weakest cohort (mid-2021 to mid-2023 chop) still at 1.12× total multiplier.
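
An anchored expanding scheme keeps the start of the training set fixed and grows it by one test block each step. The sketch below assumes the first training window holds 10 trades, which is an assumption chosen because it yields exactly 6 test windows of 5 trades across 40 trade cycles. The actual starting size is not stated here.

```python
def anchored_windows(n_trades, first_train, test_size):
    """Return (train_range, test_range) pairs with the train start fixed at 0."""
    windows = []
    end_train = first_train
    while end_train + test_size <= n_trades:
        train = range(0, end_train)
        test = range(end_train, end_train + test_size)
        windows.append((train, test))
        end_train += test_size  # the tested trades join the next training set
    return windows


windows = anchored_windows(n_trades=40, first_train=10, test_size=5)
for train, test in windows:
    print(f"train trades 0..{train.stop - 1}, test trades {test.start}..{test.stop - 1}")
```

Each test block is scored only with parameters fitted on the trades before it, so no window ever sees its own future.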

Test C — Synthetic Price Paths

Generated 1,000 synthetic Bitcoin price paths with three regimes (bull / bear / chop), fat-tailed returns, and occasional jumps. Re-implemented the full GVTS + VATS strategy against each path. The strategy was profitable on 77.5% of synthetic paths and produced a median terminal value of 2.10× vs buy-and-hold’s 1.61×. Median max drawdown 43% vs buy-and-hold’s 61%.

The full historical write-up for these three tests originally lived on this page in February 2026 and has been preserved verbatim above. The methodology is unchanged. The 8th Rule itself has not been re-tuned since.

05 — Why This Matters

Forward Testing Is Necessary, Not Sufficient

Three different methods. Three different questions. The same answer: the edge is real but not bulletproof. Treat the bear case as your baseline.

Across MRE v06 and Arsenal BTC, eleven independent forward tests were run. Ten passed. One was rated WEAK and is named explicitly. None failed. This is what an honest forward-test page looks like — if every test had passed, I’d be more suspicious of the methodology, not less. Real systems have weak edges in some conditions; the goal of forward testing is to find them and admit them, not hide them.

What forward testing CAN tell you

Whether the edge is curve-fit

OOS walk-forward (Test 1 in both batteries) is the canonical curve-fit detector. MRE v06’s test Sharpe was higher than its training Sharpe, the strongest possible result. Arsenal BTC’s test Sharpe was lower than training but still strongly positive (1.55), a result we’ve named WEAK and explained.

Whether the result is sequence-lucky

Monte Carlo block-bootstrap (Test 2 in both batteries) tells you whether the actual return sequence sits at the median or the tail of the distribution of possible orderings. MRE v06: 50.9th percentile (median). Arsenal BTC: median Sharpe of 2.28, p5 of 1.39, P(Sharpe > 0) = 100%.

Whether the parameters are on a plateau or a spike

Parameter sensitivity (Test 4 in both batteries) probes ±1 / ±2 steps off default. Both engines sit on plateaus, not spikes. MRE v06 retains 96.1%+ of baseline Sharpe at ±1; Arsenal BTC degrades gracefully and shows no collapse at ±2.

What forward testing CANNOT tell you

No forward test can predict regime change

Every forward-test methodology assumes the future’s statistical properties resemble the past’s — that fat tails are still fat-tailed, that regimes still flip the way they have, that Bitcoin still has four regimes and SPY still has growth-and-inflation cycles. If any of those assumptions breaks (regulatory black swan, structural market-microstructure shift, fundamental change in how either asset trades), forward-tested results stop being predictive. Forward testing shrinks the space of bad outcomes; it does not eliminate it.

Use the bear case as your planning baseline. The p5 number (bottom 5% of outcomes) is the one to size around. If you can survive the bear case comfortably, the median and bull case take care of themselves. For Arsenal BTC specifically, the planning baseline is the 1.55 OOS test Sharpe — not the 2.32 full-period Sharpe.

This is the standard we hold the engines to. New tests are added when new robustness questions become tractable; old tests are re-run when an engine is materially re-tuned. The version-stamped “last updated” date at the top of each engine’s page tells you the most recent forward-test refresh.

The Bottom Line
Eleven Tests. Ten Passes. One Honest WEAK.

MRE v06 has the strongest forward-test record on the site — seven probes, all pass, including a rare result where the OOS test Sharpe exceeded the training Sharpe. Arsenal BTC has a strong but not bulletproof record — three passes, one WEAK on OOS walk-forward (which we explain rather than hide). The 8th Rule’s legacy forward tests sit in an appendix because they’re still accurate for that product but it’s no longer the flagship. That’s the full picture — nothing redacted, nothing hidden in a footnote.

— Durden out.

Last updated: April 2026. MRE v06 forward tests cover SPY data from 2003-01-02 through 2026-04-28 (5,867 trading days). Arsenal BTC forward tests cover BTC-USD spot from 2015-09-28 through 2026-04-28 (10.58 years). The 8th Rule legacy forward tests cover Bitcoin from 2014 through early 2026 with 40 historical trade cycles.

This content is for educational and informational purposes only. It does not constitute financial advice, investment advice, or a recommendation to buy or sell any asset. Trading involves substantial risk of loss. Past performance, whether backtested or live, does not guarantee future results. Backtested and forward-tested performance has inherent limitations: it is designed with the benefit of hindsight, does not reflect actual trading, and does not account for all factors that may affect real-world execution. Forward-test results assume the future’s statistical properties resemble the past’s, which is not guaranteed. The author is not a licensed financial advisor. Always do your own research and consult a qualified financial professional before making investment decisions.