$$\left( \hat{\beta}_k - \frac{\partial E(y|x)}{\partial x_k} \right) \to_p 0, \quad \forall \, k = 1, \ldots, K.$$
Hence in large samples $\hat{\beta}_k$ provides a good estimate of the effect on $y$ of a one-unit ``treatment" performed on $x_k$. T-consistency is the standard econometric notion of consistency. Unfortunately, however, OLS is T-consistent only under highly stringent assumptions. Assessing and establishing the credibility of those assumptions in any given application is what makes much of econometrics so tricky.
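To make T-inconsistency concrete, here's a minimal simulation sketch (a toy design of my own, not from any particular application): $x$ is endogenous, so OLS converges to something other than the true treatment effect of 1.0, no matter how large the sample.

```python
# Minimal sketch: OLS fails T-consistency under endogeneity.
# The design (coefficients, correlation structure) is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

eps = rng.normal(size=n)
x = rng.normal(size=n) + 0.5 * eps  # endogenous regressor: corr(x, eps) != 0
y = 1.0 * x + eps                   # true treatment effect of x on y is 1.0

beta_hat = (x @ y) / (x @ x)        # OLS slope (no intercept; everything is mean zero)
print(beta_hat)  # ~1.4 = 1 + cov(x, eps) / var(x), not 1.0
```

More data doesn't help: the bias sits in the probability limit itself.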
Now consider a different notion of consistency. Assuming quadratic loss, the predictive risk of a parameter configuration $\beta$ is
$$R(\beta) = E\left[ (y - x'\beta)^2 \right].$$
Let $B$ be a set of $\beta$'s and let $\beta^* \in B$ minimize $R(\beta)$. We will say that $\hat{\beta}$ is consistent for a predictive effect (``P-consistent") if $\operatorname{plim} R(\hat{\beta}) = R(\beta^*)$; that is, if
$$\left( R(\hat{\beta}) - R(\beta^*) \right) \to_p 0.$$
Hence in large samples $\hat{\beta}$ provides a good way to predict $y$ for any hypothetical $x$: simply use $x'\hat{\beta}$. Crucially, OLS is essentially always P-consistent; we require almost no assumptions.
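And here's the flip side, continuing the same illustrative endogenous design from the sketch above: the coefficient is biased for the treatment effect, yet the predictive risk of the OLS fit converges to the minimal achievable risk $R(\beta^*)$ (which, one can verify, works out to $0.8$ in this design, attained at $\beta^* = 1.4$).

```python
# Minimal sketch: the same endogenous design is nevertheless P-consistent.
# For this design, beta* = cov(x, y) / var(x) = 1.4 and R(beta*) = 0.8.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    eps = rng.normal(size=n)
    x = rng.normal(size=n) + 0.5 * eps
    y = 1.0 * x + eps
    return x, y

beta_star, risk_star = 1.4, 0.8

for n in [100, 10_000, 1_000_000]:
    x, y = simulate(n)                  # training sample
    beta_hat = (x @ y) / (x @ x)        # OLS slope
    x_new, y_new = simulate(1_000_000)  # fresh draws to estimate predictive risk
    risk_hat = np.mean((y_new - beta_hat * x_new) ** 2)
    print(n, risk_hat - risk_star)      # risk gap shrinks toward zero
```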
Let me elaborate slightly on P-consistency. That P-consistency holds is intuitively obvious for an extremum estimator like OLS, almost by definition: OLS converges to the parameter configuration that minimizes quadratic predictive risk, because quadratic loss is precisely the objective function that defines the OLS estimator. A rigorous treatment gets a bit involved, however, as do generalizations that allow for features of Big Data environments, like $K > N$. See, for example, Greenshtein and Ritov (2004).
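For a flavor of the $K > N$ case, here's a sketch in the spirit of Greenshtein and Ritov (2004); the design, the penalty level, and the use of scikit-learn's Lasso are my illustrative choices, not theirs. With 500 regressors and 100 observations, OLS isn't even defined, yet an $\ell_1$-penalized fit still delivers predictive risk not far above the irreducible noise variance.

```python
# Sketch of the K > N point via an l1-penalized fit.
# Design (sparse truth, alpha = 0.1) is an illustrative assumption.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, k = 100, 500

beta_true = np.zeros(k)
beta_true[:5] = 1.0                     # only 5 of 500 regressors matter

X = rng.normal(size=(n, k))
y = X @ beta_true + rng.normal(size=n)  # noise variance 1.0

fit = Lasso(alpha=0.1).fit(X, y)        # OLS is infeasible here: X'X is singular

X_new = rng.normal(size=(100_000, k))   # fresh sample for risk estimation
y_new = X_new @ beta_true + rng.normal(size=100_000)
print(np.mean((y_new - fit.predict(X_new)) ** 2))  # not far above 1.0
```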
The distinction between P-consistency and T-consistency is clearly linked to the distinction between correlation and causality. As long as $x$ and $y$ are correlated, we can exploit the correlation (as captured in $\hat{\beta}$) very generally to predict $y$ given knowledge of $x$. That is, there will be a nonzero ``predictive effect" of $x$ knowledge on $y$. But nonzero correlation doesn't necessarily tell us anything about the causal ``treatment effect" of $x$ treatments on $y$. That requires stringent assumptions. Even if there is a nonzero predictive effect of $x$ on $y$ (as captured by $\hat{\beta}_{OLS}$), there may or may not be a nonzero treatment effect of $x$ on $y$, and even if nonzero, it will generally not equal the predictive effect.
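One more toy simulation of my own drives the point home: below, $x$ has exactly zero treatment effect on $y$, yet a solidly nonzero, and genuinely useful, predictive effect, because both are driven by a common factor $z$.

```python
# Sketch: zero treatment effect, nonzero predictive effect.
# x and y share a common driver z; x itself plays no causal role.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

z = rng.normal(size=n)
x = z + rng.normal(size=n)        # x is a noisy proxy for z
y = 2.0 * z + rng.normal(size=n)  # y is caused by z alone

beta_hat = (x @ y) / (x @ x)
print(beta_hat)  # ~1.0: a real predictive effect, exploitable for forecasting
# But intervening on x while holding z fixed would not move y at all:
# the treatment effect is exactly zero.
```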
Here's a related reading recommendation. Check out Wasserman's brilliant Normal Deviate post, "Consistency, Sparsistency and Presistency." I agree entirely that the distinction between what he calls consistency (what I call T-consistency above) and what he calls presistency (what I call P-consistency above) is insufficiently appreciated. Indeed it took me twenty years to fully understand the distinction and its ramifications (although I'm slow). And it's crucially important.
The bottom line: In sharp contrast to T-consistency, P-consistency comes almost for free, yet it's the invaluable foundation on which all of (non-causal) predictive modeling builds. Would that such wonderful low-hanging fruit were more widely available!