Let's strip things to a starkly simple stylized case. (The basic idea generalizes to much richer environments.) Consider a time-series forecasting situation with a one-shot structural break in all model parameters at a known past time. Should you simply discard the pre-break data when estimating your model? Your first reaction might be yes, as the pre-break regime is irrelevant moving forward, and your goal is forecasting.
But the correct answer is "not necessarily." Of course, using the full sample will produce a mongrel blend of pre-break and post-break parameters. That is, it will produce biased estimates of the relevant post-break parameters. But using the full sample may also greatly reduce variance, so the estimation mean-squared error of the post-break parameters may be lower, perhaps much lower, under full-sample estimation. That, in turn, may translate into lower out-of-sample mean-squared prediction error (MSPE) when forecasting with the full-sample estimates.
Suppose, to take a stark example, that the break is minuscule, and that it's near the end of a very long sample. The cost of full-sample estimation is then injection of minuscule bias in estimates of the relevant post-break parameters, whereas the benefit is massive variance reduction. That's a very favorable tradeoff under quadratic loss.
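To put rough numbers on that tradeoff, here is a minimal sketch under assumptions of my own choosing, not anything taken from P^3: the only parameter is a mean, it shifts by `delta` after `t1` pre-break observations, `t2` post-break observations follow, and the noise is iid with standard deviation `sigma`. All names and numbers are hypothetical. In that toy setup the two estimation MSEs have simple closed forms.

```python
# Toy setup (an assumption for illustration): y_t = mu_pre + e_t for the first
# t1 observations, y_t = mu_post + e_t for the last t2, with e_t iid noise of
# standard deviation sigma and delta = mu_pre - mu_post.

def mse_post_break_only(t2, sigma):
    """MSE of the post-break sample mean as an estimator of mu_post.

    Unbiased, so the MSE is pure variance: sigma^2 / t2.
    """
    return sigma**2 / t2


def mse_full_sample(t1, t2, delta, sigma):
    """MSE of the full-sample mean as an estimator of mu_post.

    The bias is the pre-break share of the sample times the break size;
    the variance is sigma^2 / (t1 + t2).
    """
    t = t1 + t2
    bias = t1 * delta / t
    variance = sigma**2 / t
    return bias**2 + variance


if __name__ == "__main__":
    # Small break near the end of a long sample.
    print("post-break only:", mse_post_break_only(10, 1.0))
    print("full sample:    ", mse_full_sample(990, 10, 0.1, 1.0))
    # Same sample, but a large break: the bias now dominates.
    print("full, big break:", mse_full_sample(990, 10, 2.0, 1.0))
```

With a small `delta` the squared bias is tiny relative to the variance saved, so the full sample wins. Make `delta` large and the ranking flips, which is the "not necessarily" in action.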
Fine. Good insight. (Interesting historical note: Hashem mentions that it originated in a conversation with the late Benoit Mandelbrot.) But there's much more, and here's where it gets really interesting. Just as it's sub-optimal simply to discard the pre-break data, it's also sub-optimal simply to keep it. It turns out, quite intuitively, that you want to keep the pre-break data but downweight it, and the MSPE-optimal weight-decay scheme turns out to be exponential! In less-rigid forecasting environments involving continuous structural evolution (also considered by P^3), that basically amounts to exponential smoothing. Very, very cool.
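The "keep it but downweight it" point can be seen in a deliberately crude sketch. This is my own simplification, not the exponential scheme P^3 derive: a mean-shift model in which every pre-break observation gets one common weight `w` and every post-break observation gets weight 1. The names `t1`, `t2`, `delta`, and `sigma` are hypothetical. Setting `w = 0` discards the pre-break data, and `w = 1` is plain full-sample estimation.

```python
# Toy setup (an assumption for illustration): t1 pre-break and t2 post-break
# observations, a mean shift of delta at the break, iid noise with standard
# deviation sigma. The estimator of the post-break mean is
#   (w * sum(pre) + sum(post)) / (w * t1 + t2).

def mse_weighted(w, t1, t2, delta, sigma):
    """MSE of the weighted mean when pre-break observations get weight w."""
    denom = w * t1 + t2
    bias = w * t1 * delta / denom
    variance = sigma**2 * (w**2 * t1 + t2) / denom**2
    return bias**2 + variance


def optimal_weight(t1, delta, sigma):
    """Weight minimizing mse_weighted in this toy model.

    Setting the derivative of the MSE with respect to w to zero gives
    w* = sigma^2 / (sigma^2 + t1 * delta^2), which lies strictly between
    0 and 1 whenever delta is nonzero.
    """
    return sigma**2 / (sigma**2 + t1 * delta**2)


if __name__ == "__main__":
    t1, t2, delta, sigma = 100, 20, 0.5, 1.0
    w_star = optimal_weight(t1, delta, sigma)
    print("optimal weight:", w_star)
    print("discard (w=0): ", mse_weighted(0.0, t1, t2, delta, sigma))
    print("keep all (w=1):", mse_weighted(1.0, t1, t2, delta, sigma))
    print("downweight:    ", mse_weighted(w_star, t1, t2, delta, sigma))
```

Even in this stripped-down version the best weight is interior, so neither discarding nor fully keeping the pre-break data is optimal. The weight shrinks as the break gets bigger or the pre-break sample gets longer.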
Strangely, P^3 don't attempt to connect to the well-known and clearly related work of Mike Clements and David Hendry (CH), which P^3 clarify and extend significantly. In several papers, CH take exponential smoothing (ES) as an exogenously existing method and ask why it's often hard to beat, and, relatedly, why the martingale "no change" forecast is often hard to beat. See, e.g., section 7 of their survey, "Forecasting with Breaks," in Elliott, Granger and Timmermann (eds.), Handbook of Economic Forecasting, 2005, Elsevier. (Sorry, not even a working paper version available free online, thanks to Elsevier's greed.) The CH answer is that breaks happen often in economics, and that ES performs well because it adapts to breaks quickly. P^3 instead ask what approach is optimal in various break environments and arrive endogenously at ES. Moreover, the P^3 results are interestingly nuanced. For example, ES is closest to optimality in situations of continuous structural change, not in situations of discrete breaks as emphasized by CH. In any event, the P^3 and CH results are marvelously complementary.
Exponential smoothing is again alive and well, from yet another, and very different, perspective.