Peter Hansen and Allan Timmermann have a fantastic new paper, "Equivalence Between Out-of-Sample Forecast Comparisons and Wald Statistics."
The finite-sample wastefulness of (pseudo-) out-of-sample model comparisons seems obvious, as they effectively discard the (pseudo-) in-sample observations. That intuition should be true for both nested and non-nested comparisons, but it seems most obvious in the nested case: How could anything systematically dominate full-sample Wald, LR or LM for testing nested hypotheses? Hansen and Timmermann consider the nested case and verify the intuition with elegance and precision. In doing so they greatly clarify the misguided nature of most (pseudo-) out-of-sample model comparisons.
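Consider the predictive regression model with $h$-period forecast horizon
$$y_t = \beta_1' X_{1,t-h} + \beta_2' X_{2,t-h} + \varepsilon_t, \quad t = 1, \ldots, n,$$
where $X_{1t} \in \mathbb{R}^k$ and $X_{2t} \in \mathbb{R}^q$. We obtain out-of-sample forecasts with recursively estimated parameter values by regressing $y_s$ on $X_{s-h} = (X_{1,s-h}', X_{2,s-h}')'$ for $s = 1, \ldots, t$ (resulting in the least squares estimate $\hat{\beta}_t = (\hat{\beta}_{1t}', \hat{\beta}_{2t}')'$) and using
$$\hat{y}_{t+h|t}(\hat{\beta}_t) = \hat{\beta}_{1t}' X_{1t} + \hat{\beta}_{2t}' X_{2t}$$
to forecast $y_{t+h}$.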
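Now consider a smaller (nested) regression model,
$$y_t = \delta' X_{1,t-h} + \eta_t.$$
In similar fashion we proceed by regressing $y_s$ on $X_{1,s-h}$ for $s = 1, \ldots, t$ (resulting in the least squares estimate $\hat{\delta}_t$) and using
$$\tilde{y}_{t+h|t}(\hat{\delta}_t) = \hat{\delta}_t' X_{1t}$$
to forecast $y_{t+h}$.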
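The recursive scheme described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function name `recursive_forecasts`, the argument `n_start` (the first observation to be forecast) and the 0-based array layout are all assumptions made here. Passing the full regressor matrix gives the large-model forecasts; passing only the first block of columns gives the nested-model forecasts.

```python
import numpy as np

def recursive_forecasts(y, X, h=1, n_start=20):
    """Recursive least squares forecasts of y[t], each made at time t - h.

    Row X[t] holds the predictors observed at time t (0-based arrays).
    Entries before n_start are left as NaN because no forecast is made there.
    """
    n = len(y)
    fc = np.full(n, np.nan)
    for t in range(n_start, n):
        origin = t - h  # last period whose data may be used
        # Regress y[s] on X[s - h] for s = h, ..., origin.
        coef, *_ = np.linalg.lstsq(X[:origin - h + 1], y[h:origin + 1], rcond=None)
        fc[t] = X[origin] @ coef  # forecast of y[origin + h] = y[t]
    return fc

# Illustrative data: a constant plus one extra predictor (hypothetical values).
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.zeros(n)
y[1:] = X[:-1] @ np.array([1.0, 0.5]) + rng.normal(size=n - 1)

fc_large = recursive_forecasts(y, X)          # uses both blocks of predictors
fc_small = recursive_forecasts(y, X[:, :1])   # nested model: constant only
```

Re-estimating from scratch at every forecast origin is wasteful but keeps the sketch transparent; a recursive least squares update would give the same forecasts more cheaply.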
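In a representative and leading contribution to the (pseudo-) out-of-sample model comparison literature in the tradition of West (1996), McCracken (2007) suggests comparing such nested models via expected loss evaluated at population parameters. Under quadratic loss the null hypothesis is
$$H_0: E[y_t - \hat{y}_{t|t-h}(\beta)]^2 = E[y_t - \tilde{y}_{t|t-h}(\delta)]^2,$$
where $\hat{y}_{t|t-h}(\beta)$ and $\tilde{y}_{t|t-h}(\delta)$ are the large-model and small-model forecasts of $y_t$ evaluated at the population parameters.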
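McCracken considers the test statistic
$$T_n = \frac{\sum_{t=n_\rho+1}^{n} \left[ \left(y_t - \tilde{y}_{t|t-h}(\hat{\delta}_{t-h})\right)^2 - \left(y_t - \hat{y}_{t|t-h}(\hat{\beta}_{t-h})\right)^2 \right]}{\hat{\sigma}_\varepsilon^2},$$
where $\hat{\sigma}_\varepsilon^2$ is a consistent estimator of $\sigma_\varepsilon^2 = \mathrm{var}(\varepsilon_{t+h})$ and $n_\rho$ is the number of observations set aside for the initial estimation of $\beta$, taken to be a fraction $\rho \in (0,1)$ of the full sample, $n$, i.e., $n_\rho = \lfloor n\rho \rfloor$. The asymptotic null distribution of $T_n$ turns out to be rather complicated; McCracken shows that it is a convolution of $q$ independent random variables, each with a distribution of
$$2\int_\rho^1 u^{-1} B(u)\,dB(u) - \int_\rho^1 u^{-2} B(u)^2\,du,$$
where $B(u)$ is a standard Brownian motion.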
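As a rough illustration of how such a statistic could be computed, here is a minimal Python sketch. It is not McCracken's or Hansen and Timmermann's code. The helper names (`ols_forecast`, `mccracken_T`) are hypothetical, and the variance estimate is assumed here to be the out-of-sample mean squared error of the large model, which is one possible choice of consistent estimator.

```python
import numpy as np

def ols_forecast(y, Z, t, h):
    """Forecast y[t] from an OLS fit that uses data through time t - h only."""
    origin = t - h
    # Regress y[s] on Z[s - h] for s = h, ..., origin.
    coef, *_ = np.linalg.lstsq(Z[:origin - h + 1], y[h:origin + 1], rcond=None)
    return Z[origin] @ coef

def mccracken_T(y, X1, X2, rho=0.5, h=1):
    """Scaled difference in out-of-sample squared errors, small minus large model.

    X1 and X2 are 2-D arrays (n x k and n x q); rho is the initial-sample share.
    """
    n = len(y)
    n_rho = int(np.floor(rho * n))      # size of the initial estimation sample
    X = np.column_stack([X1, X2])       # large model uses both blocks
    eval_idx = range(n_rho, n)          # 0-based version of t = n_rho + 1, ..., n
    e_small = np.array([y[t] - ols_forecast(y, X1, t, h) for t in eval_idx])
    e_large = np.array([y[t] - ols_forecast(y, X, t, h) for t in eval_idx])
    sigma2_hat = np.mean(e_large ** 2)  # assumed variance estimator (see lead-in)
    return np.sum(e_small ** 2 - e_large ** 2) / sigma2_hat

# Illustrative data in which the extra predictor genuinely helps (hypothetical values).
rng = np.random.default_rng(0)
n = 200
X1 = np.ones((n, 1))
X2 = rng.normal(size=(n, 1))
y = np.zeros(n)
y[1:] = 1.0 + 2.0 * X2[:-1, 0] + rng.normal(size=n - 1)

T = mccracken_T(y, X1, X2, rho=0.5)
print("T_n:", T)
```

Large positive values favor the larger model. Critical values would have to come from the nonstandard limit distribution, not from a normal or chi-squared table.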
Hansen and Timmermann show that $T_n$ is asymptotically equivalent to the difference between two Wald statistics of the hypothesis that $\beta_2 = 0$, the first based on the full sample and the second based on the initial estimation sample. That is, $T_n$ is just the increase in the Wald statistic obtained by using the full sample as opposed to the initial estimation sample. Hence the power of $T_n$ derives entirely from the post-split sample, so the test must be less powerful than one that uses the entire sample. Indeed Hansen and Timmermann show that power decreases as $\rho$ increases.
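The two Wald statistics in question are easy to compute directly. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses the textbook homoskedastic Wald statistic for the restriction that the coefficients on the second block of predictors are zero, and the function name `wald_beta2_zero` and its argument `m` (the number of observations used) are hypothetical.

```python
import numpy as np

def wald_beta2_zero(y, X1, X2, m, h=1):
    """Homoskedastic Wald statistic for beta_2 = 0 using the first m observations.

    X1 and X2 are 2-D arrays; y[s] is regressed on the predictors dated s - h.
    """
    X = np.column_stack([X1, X2])
    Z, target = X[:m - h], y[h:m]       # pairs (y[s], X[s - h]) for s = h, ..., m - 1
    k = X1.shape[1]
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ coef
    sigma2 = resid @ resid / (len(target) - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)  # estimated covariance of the OLS estimate
    b2 = coef[k:]
    return float(b2 @ np.linalg.solve(cov[k:, k:], b2))

# Illustrative data under a fixed alternative (hypothetical values).
rng = np.random.default_rng(0)
n, rho = 2000, 0.5
X1 = np.ones((n, 1))
X2 = rng.normal(size=(n, 1))
y = np.zeros(n)
y[1:] = 1.0 + 1.0 * X2[:-1, 0] + rng.normal(size=n - 1)

n_rho = int(np.floor(rho * n))
W_full = wald_beta2_zero(y, X1, X2, n)      # full sample
W_init = wald_beta2_zero(y, X1, X2, n_rho)  # initial estimation sample only
print("Wald increase:", W_full - W_init)
```

Under a fixed alternative both statistics grow roughly in proportion to the sample size, so the increase reflects only the information in the post-split observations, while the full-sample statistic reflects all of it.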
On the one hand, the Hansen-Timmermann results render trivial the calculation of $T_n$ and greatly clarify its limit distribution (that of the difference between two independent $\chi^2$-distributions and their convolutions). So if one insists on doing $T_n$-type tests, then the Hansen-Timmermann results are good news. On the other hand, the real news is bad: the Hansen-Timmermann results make clear that, at least in the environments they consider, (pseudo-) out-of-sample model comparison comes at high cost (power reduction) and delivers no extra benefit.
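A short calculation suggests where the chi-squared representation comes from. It is a sketch, assuming $B$ is a standard Brownian motion and $\rho \in (0,1)$; the symbols $U$, $V$, $Z_1$ and $Z_2$ are introduced here purely for illustration. Applying Itô's lemma to $u^{-1}B(u)^2$ and integrating from $\rho$ to $1$ collapses the stochastic integrals:

$$
\begin{aligned}
d\left(u^{-1}B(u)^2\right) &= -u^{-2}B(u)^2\,du + 2u^{-1}B(u)\,dB(u) + u^{-1}\,du, \\
2\int_\rho^1 u^{-1}B(u)\,dB(u) - \int_\rho^1 u^{-2}B(u)^2\,du &= B(1)^2 - \rho^{-1}B(\rho)^2 + \log\rho .
\end{aligned}
$$

Now write $U = B(\rho)/\sqrt{\rho}$ and $V = (B(1) - B(\rho))/\sqrt{1-\rho}$, which are independent standard normals. Then

$$
B(1)^2 - \rho^{-1}B(\rho)^2 = -(1-\rho)\,U^2 + 2\sqrt{\rho(1-\rho)}\,UV + (1-\rho)\,V^2 ,
$$

a quadratic form whose matrix has trace $0$ and determinant $-(1-\rho)$, hence eigenvalues $\pm\sqrt{1-\rho}$. It is therefore distributed as $\sqrt{1-\rho}\,(Z_1^2 - Z_2^2)$ with $Z_1$ and $Z_2$ independent standard normals: a scaled difference of two independent $\chi^2_1$ variables, shifted by the constant $\log\rho$. The terms $B(1)^2$ and $\rho^{-1}B(\rho)^2$ are what one would expect as the limits of the full-sample and initial-sample Wald statistics in the single-restriction case, which is presumably the link to the Wald difference described above.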
[By the way, my paper, "Comparing Predictive Accuracy, Twenty Years Later: A Personal Perspective on the Use and Abuse of Diebold-Mariano Tests," makes many related points. Drafts are here. The final (?) version will be delivered as the JBES Invited Lecture at the January 2014 ASSA meetings in Philadelphia. Commentary at the meeting will be by Andrew Patton and Allan Timmermann. The JBES published version will contain the Patton and Timmermann remarks, plus those of Atsushi Inoue, Lutz Kilian, and Jonathan Wright. Should be entertaining!]