I'll present it as the JBES Lecture at the January 2014 ASSA meetings in Philadelphia. Please join if you're around. It's Friday, January 3, at 2:30, in Pennsylvania Convention Center Room 2004-C (I think).
By the way, the 2010 Peter Hansen paper that I now cite in my final paragraph, "A Winner's Curse for Econometric Models: On the Joint Distribution of In-Sample Fit and Out-of-Sample Fit and its Implications for Model Selection," is tremendously insightful. I saw Peter present it a few years ago at a Stanford summer workshop, but I didn't fully appreciate it and had forgotten about it until he reminded me when he visited Penn last week. He has withheld the 2010 and later revisions from general circulation, evidently because one section still needs work. Let's hope that he gets it revised and posted soon! (A more preliminary 2009 version remains online from a University of Chicago seminar.) One of Peter's key points is that although split-sample model comparisons can be "tricked" by data mining in finite samples, as can all model comparison procedures, split-sample comparisons appear to be harder to trick, in a sense that he makes precise. That's potentially a very big deal.
Comparing Predictive Accuracy, Twenty Years Later: A Personal Perspective on the Use and Abuse of Diebold-Mariano Tests
Abstract: The Diebold-Mariano (DM) test was intended for comparing forecasts; it has been, and remains, useful in that regard. The DM test was not intended for comparing models. Much of the large ensuing literature, however, uses DM-type tests for comparing models, in (pseudo-) out-of-sample environments. In that case, simpler yet more compelling full-sample model comparison procedures exist; they have been, and should continue to be, widely used. The hunch that (pseudo-) out-of-sample analysis is somehow the "only," or "best," or even necessarily a "good" way to provide insurance against in-sample over-fitting in model comparisons proves largely false. On the other hand, (pseudo-) out-of-sample analysis remains useful for certain tasks, most notably for providing information about comparative predictive performance during particular historical episodes.
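For readers who haven't seen the test written out, here is a minimal sketch of a DM-type statistic for two sequences of forecast errors. It is an illustration only, not code from the paper. It assumes squared-error loss and a simple rectangular-kernel estimate of the long-run variance of the loss differential using horizon - 1 autocovariances; the function names `dm_statistic` and `two_sided_p_value` are hypothetical.

```python
import math


def dm_statistic(errors_a, errors_b, horizon=1):
    """DM-type statistic for equal predictive accuracy of two forecasts.

    Assumptions of this sketch: squared-error loss, and a rectangular-kernel
    long-run variance estimate that uses (horizon - 1) autocovariances.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("error sequences must have the same length")
    n = len(errors_a)

    # Loss differential d_t = L(e_a,t) - L(e_b,t) under squared-error loss.
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    d_bar = sum(d) / n

    def autocov(lag):
        # Sample autocovariance of the loss differential at the given lag.
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)) / n

    # Long-run variance: gamma_0 + 2 * (gamma_1 + ... + gamma_{horizon-1}).
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")

    return d_bar / math.sqrt(long_run_var / n)


def two_sided_p_value(stat):
    # Two-sided p-value against the standard normal limiting distribution.
    return math.erfc(abs(stat) / math.sqrt(2.0))


if __name__ == "__main__":
    # Made-up forecast errors, purely for illustration.
    e_a = [1.0, -1.0, 2.0, -2.0]
    e_b = [0.5, -0.5, 1.0, -1.0]
    stat = dm_statistic(e_a, e_b)
    print(stat, two_sided_p_value(stat))
```

A positive statistic means the first forecast had the larger average squared error over the evaluation sample. The inputs are forecast errors, taken as given, which is the forecast-comparison setting described in the abstract; nothing in the sketch knows or cares what models, if any, produced them.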