Wednesday, January 29, 2014
Hastie-Tibshirani Statistical Learning Course Now Open
Machine learning is hot, hot, hot. I can't imagine better instructors (or scholars) in the area than H&T (great videos), and the course is also a fine way to learn R. It's happening now (started just last week) and runs through late March. Just go to the course site to register. The book is James, Witten, Hastie and Tibshirani (JWHT), Introduction to Statistical Learning, with Applications in R, Springer, 2013. The book homepage has a free pdf download as well as a variety of related information.
Friday, January 17, 2014
Causality and T-Consistency vs. Correlation and P-Consistency
Consider a standard linear regression setting with \(K\) regressors and sample size \(N\). We will say that an estimator \(\hat{\beta}\) is consistent for a treatment effect ("T-consistent") if \(\mathrm{plim} \, \hat{\beta}_k = {\partial E(y|x) }/{\partial x_k}\), \(\forall k = 1, ..., K\); that is, if
$$
\left ( \hat{\beta}_k - \frac{\partial E(y|x) }{\partial x_k} \right ) \rightarrow_p 0, ~ \forall k = 1, ..., K.
$$ Hence in large samples \(\hat{\beta}_k\) provides a good estimate of the effect on \(y\) of a one-unit "treatment" performed on \(x_k\). T-consistency is the standard econometric notion of consistency. Unfortunately, OLS is T-consistent only under highly stringent assumptions. Assessing and establishing the credibility of those assumptions in any given application is what makes significant parts of econometrics so tricky.
Now consider a different notion of consistency. Assuming quadratic loss, the predictive risk of a parameter configuration \(\beta\) is
$$
R(\beta) = {E}(y - x' \beta)^2.
$$ Let \(B\) be a set of \(\beta\)'s and let \(\beta^* \in B\) minimize \(R(\beta)\). We will say that \(\hat{\beta}\) is consistent for a predictive effect ("P-consistent") if \(\mathrm{plim} \, R(\hat{\beta}) = R(\beta^*)\); that is, if
$$
\left ( R(\hat{\beta}) - R(\beta^*) \right ) \rightarrow_p 0.
$$ Hence in large samples \(\hat{\beta}\) provides a good way to predict \(y\) for any hypothetical \(x\): simply use \(x ' \hat{\beta}\). Crucially, OLS is essentially always P-consistent; we require almost no assumptions.
Let me elaborate slightly on P-consistency. That P-consistency holds is intuitively obvious for an extremum estimator like OLS, almost by definition. Of course OLS converges to the parameter configuration that optimizes quadratic predictive risk -- quadratic loss is the objective function that defines the OLS estimator. A rigorous treatment gets a bit involved, however, as do generalizations to allow things of relevance in Big Data environments, like \(K>N\). See for example Greenshtein and Ritov (2004).
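To make this concrete, here is a small R simulation sketch. The data-generating process and sample sizes are my own illustrative choices, not taken from any of the papers cited: even though the true conditional mean is nonlinear, the out-of-sample quadratic risk of OLS approaches the risk of the best linear predictor as \(N\) grows.

```r
## P-consistency in action: under a (deliberately misspecified) nonlinear DGP,
## the predictive risk of OLS approaches that of the best linear predictor.
## Illustrative sketch only; the DGP and sample sizes are arbitrary choices.
set.seed(1)

dgp <- function(n) {
  x <- runif(n, -2, 2)
  y <- sin(2 * x) + 0.5 * x + rnorm(n)   # true E(y|x) is nonlinear
  data.frame(x = x, y = y)
}

## Proxy for the population-optimal linear predictor beta*: OLS on a huge sample
big   <- dgp(1e6)
bstar <- coef(lm(y ~ x, data = big))

## Large independent test set to approximate predictive risk R(beta) = E(y - x'beta)^2
test <- dgp(1e5)
risk <- function(b) mean((test$y - cbind(1, test$x) %*% b)^2)

for (n in c(50, 500, 5000, 50000)) {
  bhat <- coef(lm(y ~ x, data = dgp(n)))
  cat(sprintf("N = %6d   R(bhat) = %.4f   R(bstar) = %.4f\n",
              n, risk(bhat), risk(bstar)))
}
```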
The distinction between P-consistency and T-consistency is clearly linked to the distinction between correlation and causality. As long as \(x\) and \(y\) are correlated, we can exploit the correlation (as captured in \(\hat{\beta}\)) very generally to predict \(y\) given knowledge of \(x\). That is, there will be a nonzero "predictive effect" of \(x\) knowledge on \(y\). But nonzero correlation doesn't necessarily tell us anything about the causal "treatment effect" of \(x\) treatments on \(y\). That requires stringent assumptions. Even if there is a nonzero predictive effect of \(x\) on \(y\) (as captured by \(\hat{\beta}_{OLS}\)), there may or may not be a nonzero treatment effect of \(x\) on \(y\), and even if nonzero it will generally not equal the predictive effect.
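A tiny hypothetical R example drives home the predictive-effect vs. treatment-effect distinction; the confounded data-generating process below is made up purely for illustration.

```r
## Predictive effect vs. treatment effect under an omitted confounder.
## Hypothetical DGP: z drives both x and y; the causal effect of x on y is 1,
## but OLS of y on x alone recovers a larger "predictive" coefficient.
set.seed(1)
n <- 1e5
z <- rnorm(n)
x <- z + rnorm(n)              # x is correlated with the unobserved confounder z
y <- 1 * x + 2 * z + rnorm(n)  # treatment effect of x on y is 1

fit <- lm(y ~ x)
coef(fit)["x"]   # roughly 2: the predictive effect, not the treatment effect
                 # (plim is 1 + 2*Cov(x,z)/Var(x) = 1 + 2*(1/2) = 2)

## Yet prediction of y from observed x is exactly what this regression is good for:
xnew <- data.frame(x = rnorm(10) + rnorm(10))
predict(fit, newdata = xnew)   # fine for forecasting y given x as it naturally occurs

## An intervention that sets x while leaving z alone would move y by about 1 per unit,
## not 2: the predictive and treatment effects differ.
```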
Here's a related reading recommendation. Check out Wasserman's brilliant Normal Deviate post, "Consistency, Sparsistency and Presistency." I agree entirely that the distinction between what he calls consistency (what I call T-consistency above) and what he calls presistency (what I call P-consistency above) is insufficiently appreciated. Indeed it took me twenty years to fully understand the distinction and its ramifications (although I'm slow). And it's crucially important.
The bottom line: In sharp contrast to T-consistency, P-consistency comes almost for free, yet it's the invaluable foundation on which all of (non-causal) predictive modeling builds. Would that such wonderful low-hanging fruit were more widely available!
Wednesday, January 15, 2014
DNS/AFNS Yield Curve Modeling FAQs
It's hard to believe that I haven't yet said anything about yield-curve modeling and forecasting in the dynamic Nelson-Siegel (DNS) tradition, whether the original Diebold-Li (2006) DNS version or the Christensen-Diebold-Rudebusch (2011) arbitrage-free version (AFNS). Here are a few thoughts about where we are and where we're going, expressed as answers to FAQs, drawn in part from the epilogue of a recent book, Diebold and Rudebusch (2012).
1. What's wrong with unrestricted affine equilibrium models?
The classic affine equilibrium models, although beautiful theoretical constructs, perform poorly in empirical practice. In particular, the maximally-flexible canonical \(A_0(N)\) models have notoriously recalcitrant likelihood surfaces. (Notation: \(A_x(N)\) means a model with \(N\) factors, \(x\) of which have stochastic volatility.) See, for example, Hamilton-Wu (2012).
2. What's right with DNS/AFNS?
DNS/AFNS just puts a bit of structure on factor loadings while still maintaining significant flexibility. That gets us to a very good place, involving both theoretical rigor (via imposition of no-arb in AFNS) and empirical tractability. That's all. It really is that simple.
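For readers who want to see the structure, here is a minimal R sketch of the two-step Diebold-Li idea: fix the loading decay parameter \(\lambda\) (the 0.0609 value below is the choice used in Diebold-Li, 2006, for maturities measured in months), then recover each period's level, slope, and curvature factors by cross-sectional OLS on the Nelson-Siegel loadings. The yields below are simulated, not real data.

```r
## Dynamic Nelson-Siegel in the Diebold-Li spirit: fix lambda, then estimate
## the level/slope/curvature factors by cross-sectional OLS each period.
## Sketch only; the yield data here are simulated, not real.
lambda <- 0.0609                              # Diebold-Li value for maturities in months
tau    <- c(3, 6, 12, 24, 36, 60, 84, 120)    # maturities (months)

ns_loadings <- function(tau, lambda) {
  L2 <- (1 - exp(-lambda * tau)) / (lambda * tau)
  cbind(level = 1, slope = L2, curvature = L2 - exp(-lambda * tau))
}
X <- ns_loadings(tau, lambda)

## Simulated "observed" yield curve for one date (hypothetical factor values)
set.seed(1)
beta_true <- c(5, -2, 1)
y <- as.numeric(X %*% beta_true) + rnorm(length(tau), sd = 0.05)

## Step 1 of Diebold-Li: recover the factors by OLS on the fixed loadings
fit <- lm(y ~ X - 1)
coef(fit)   # estimated (level, slope, curvature) for this date
## Step 2 (not shown) would model the time series of these factors, e.g. with AR(1)s.
```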
3. Is AFNS the only tractable \(A_0(3)\) model?
Not any longer, as recent important work has opened new doors. In particular, Joslin-Singleton-Zhu (2011) develop a well-behaved (among other things, identified!) family of Gaussian term structure models, for which trustworthy estimation is very simple, just as with AFNS. Moreover, it turns out that AFNS is nested within their canonical form, corresponding to three extra constraints relative to the maximally-flexible model.
4. Given those alternative tractable models, does AFNS still offer advantages?
Yes! AFNS's structure conveys several important and useful characteristics that are presently difficult or impossible to achieve in competing frameworks. First, as regards specializations, AFNS's parametric simplicity makes it easy to impose restrictions. Second, as regards extensions, that same simplicity makes it similarly easy to increase the number of AFNS latent factors if desired or necessary, as for example with the five-factor model of Christensen-Diebold-Rudebusch (2009). Third, as regards varied uses, the flexible AFNS continuous basis functions facilitate relative pricing, curve interpolation between observed yields, and risk measurement for arbitrary bond portfolios.
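As a quick illustration of the interpolation point, continuing the hypothetical sketch from question 2, a yield at any unobserved maturity is just the fitted factors multiplied by the Nelson-Siegel loadings evaluated at that maturity:

```r
## Interpolate the fitted curve at an unobserved maturity, e.g. 48 months
## (uses ns_loadings(), lambda, and fit from the sketch in question 2)
beta_hat <- coef(fit)
as.numeric(ns_loadings(48, lambda) %*% beta_hat)
```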
And there's more. Fascinating recent work studying AFNS from an approximation-theoretic perspective shows that the Nelson-Siegel form is a low-order Taylor-series approximation to an arbitrary \(A_0(N)\) model. See Krippner (in press).
5. What next?
Job 1 is flexible incorporation of stochastic volatility, moving from \(A_0(N)\) to \(A_x(N)\) for \(x>0\), as bond yields are most definitely conditionally heteroskedastic. Doing so is important for everything from estimating time-varying risk premia to forming correctly-calibrated interval and density forecasts. Work along those lines is starting to appear. Christensen-Lopez-Rudebusch (2010), Creal-Wu (2013) and Mauabbi (2013) are good recent examples.
Friday, January 10, 2014
Next (EC)^2 Meeting December 2014, Barcelona, "Advances in Forecasting"
Ever wonder what (EC)^2 means? It's "European Conferences of the Econom[etr]ics Community." There have been many fine (EC)^2 meetings over the years since its 1990 inception, recently under the capable leadership of Luc Bauwens (2001-2013), and now under the equally capable leadership of Peter Hansen. The next will be in Barcelona in December 2014, with the theme "Advances in Forecasting." The quality is high, and the group size is just right. A call for papers should circulate soon.
Wednesday, January 8, 2014
Elements of Statistical Learning: A Stunningly Good Job of LaTeX to pdf to Web
A very Happy New Year to all! Here's a little thing to start us off.
I happened to be thinking about principal-component regression vs. ridge regression yesterday, so as usual I consulted the Hastie-Tibshirani-Friedman (HTF) classic, Elements of Statistical Learning. Where did I get that gorgeous book pdf? (Look through it; the form is as wonderful as the substance, and see also the similarly wonderful new James, Witten, Hastie and Tibshirani (JWHT), Introduction to Statistical Learning, with Applications in R.) Both are freely (and legally!) available as pdfs on the web. Interestingly, both are also for sale by Springer in the usual ways.
So what's up? In path-breaking arrangements, HTF and JWHT negotiated deals in which they're free to post their books and Springer is free to sell them. And by all accounts the outcomes have been superb for all. Thanks, HTF and JWHT, for promoting best-practice science, and thanks, Springer, for doing the right thing. May many more follow suit.