Friday, January 17, 2014

Causality and T-Consistency vs. Correlation and P-Consistency

Consider a standard linear regression setting with $K$ regressors and sample size $N$. We will say that an estimator $\hat{\beta}$ is consistent for a treatment effect ("T-consistent") if $\text{plim} \, \hat{\beta}_k = \partial E(y|x) / \partial x_k$, $k = 1, \ldots, K$; that is, if
$$\left( \hat{\beta}_k - \frac{\partial E(y|x)}{\partial x_k} \right) \stackrel{p}{\rightarrow} 0, \quad k = 1, \ldots, K.$$
Hence in large samples $\hat{\beta}_k$ provides a good estimate of the effect on $y$ of a one-unit "treatment" performed on $x_k$. T-consistency is the standard econometric notion of consistency. Unfortunately, however, OLS is of course T-consistent only under highly stringent assumptions. Assessing and establishing the credibility of those assumptions in any given application is what makes significant parts of econometrics so tricky.
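To see how T-consistency can fail, consider a minimal simulation sketch (in Python, with an assumed data-generating process of my own choosing) in which an omitted confounder $z$ drives both $x$ and $y$: the true treatment effect of $x$ is 1, yet the OLS slope converges to 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Assumed DGP: confounder z drives both the regressor x and the outcome y.
z = rng.normal(size=N)
x = z + rng.normal(size=N)                   # x is correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=N)   # true treatment effect of x is 1.0

# OLS of y on x alone (z omitted):
# plim slope = Cov(x, y) / Var(x) = (2 + 2) / 2 = 2, not the treatment effect 1.
X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat[1])  # ~2.0
```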

Now consider a different notion of consistency. Assuming quadratic loss, the predictive risk of a parameter configuration $\beta$ is
$$R(\beta) = E(y - x'\beta)^2.$$
Let $B$ be a set of $\beta$'s and let $\beta^* \in B$ minimize $R(\beta)$. We will say that $\hat{\beta}$ is consistent for a predictive effect ("P-consistent") if $\text{plim} \, R(\hat{\beta}) = R(\beta^*)$; that is, if
$$\left( R(\hat{\beta}) - R(\beta^*) \right) \stackrel{p}{\rightarrow} 0.$$
Hence in large samples $\hat{\beta}$ provides a good way to predict $y$ for any hypothetical $x$: simply use $x'\hat{\beta}$. Crucially, OLS is essentially always P-consistent; we require almost no assumptions.
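Continuing the sketch above (same assumed confounded DGP), the out-of-sample quadratic risk of the OLS fit nonetheless converges to the minimal risk $R(\beta^*) = 3$ attainable by any linear predictor of $y$ from $x$ alone; the slope of 2 is useless as a treatment effect, yet optimal for prediction:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    """Same assumed confounded design: x = z + u, y = x + 2z + e."""
    z, u, e = rng.normal(size=(3, n))
    x = z + u
    y = x + 2 * z + e
    return x, y

x_test, y_test = simulate(500_000)  # large fresh evaluation sample
for n in (100, 10_000, 1_000_000):
    x, y = simulate(n)
    b0, b1 = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]
    risk = np.mean((y_test - (b0 + b1 * x_test)) ** 2)
    print(n, risk)  # approaches R(beta*) = Var(z - u + e) = 3 as n grows
```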

Let me elaborate slightly on P-consistency. That P-consistency holds is intuitively obvious for an extremum estimator like OLS, almost by definition: of course OLS converges to the parameter configuration that optimizes quadratic predictive risk, because quadratic loss is precisely the objective function that defines the OLS estimator. A rigorous treatment gets a bit involved, however, as do generalizations that allow features of relevance in Big Data environments, such as $K > N$. See, for example, Greenshtein and Ritov (2004).
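For a flavor of the $K > N$ case: OLS is not even defined there, since $X'X$ is singular, but $\ell_1$-penalized least squares (the lasso) still delivers good predictive risk, in the spirit of Greenshtein and Ritov's "persistence" results. A small sketch, with an assumed sparse design of my own:

```python
import numpy as np
from sklearn.linear_model import Lasso  # l1-penalized least squares

rng = np.random.default_rng(2)
N, K = 100, 500                      # far more regressors than observations
X = rng.normal(size=(N, K))
beta = np.zeros(K)
beta[:5] = 1.0                       # assumed sparse truth: only 5 regressors matter
y = X @ beta + rng.normal(size=N)

# OLS is undefined here (X'X is singular for K > N), but the lasso still
# achieves low out-of-sample quadratic risk.
fit = Lasso(alpha=0.1).fit(X, y)

X_new = rng.normal(size=(100_000, K))
y_new = X_new @ beta + rng.normal(size=100_000)
print(np.mean((y_new - fit.predict(X_new)) ** 2))  # near the noise variance of 1
```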

The distinction between P-consistency and T-consistency is clearly linked to the distinction between correlation and causality. As long as $x$ and $y$ are correlated, we can exploit the correlation (as captured in $\hat{\beta}$) very generally to predict $y$ given knowledge of $x$. That is, there will be a nonzero "predictive effect" of $x$ knowledge on $y$. But nonzero correlation doesn't necessarily tell us anything about the causal "treatment effect" of $x$ treatments on $y$. That requires stringent assumptions. Even if there is a nonzero predictive effect of $x$ on $y$ (as captured by $\hat{\beta}_{OLS}$), there may or may not be a nonzero treatment effect of $x$ on $y$, and even if nonzero, it will generally not equal the predictive effect.

Here's a related reading recommendation. Check out Wasserman's brilliant Normal Deviate post, "Consistency, Sparsistency and Presistency." I agree entirely that the distinction between what he calls consistency (what I call T-consistency above) and what he calls presistency (what I call P-consistency above) is insufficiently appreciated. Indeed it took me twenty years to fully understand the distinction and its ramifications (although I'm slow). And it's crucially important.

The bottom line: In sharp contrast to T-consistency, P-consistency comes almost for free, yet it's the invaluable foundation on which all of (non-causal) predictive modeling builds. Would that such wonderful low-hanging fruit were more widely available!
