## Friday, January 17, 2014

### Causality and T-Consistency vs. Correlation and P-Consistency

Consider a standard linear regression setting with $$K$$ regressors and sample size $$N$$. We will say that an estimator $$\hat{\beta}$$ is consistent for a treatment effect ("T-consistent") if $$\text{plim} \, \hat{\beta}_k = {\partial E(y|x) }/{\partial x_k}$$, $$\forall k = 1, ..., K$$; that is, if
$$\left ( \hat{\beta}_k - \frac{\partial E(y|x) }{\partial x_k} \right ) \rightarrow_p 0, ~ \forall k = 1, ..., K.$$ Hence in large samples $$\hat{\beta}_k$$ provides a good estimate of the effect on $$y$$ of a one-unit "treatment" performed on $$x_k$$. T-consistency is the standard econometric notion of consistency. Unfortunately, however, OLS is of course T-consistent only under highly stringent assumptions. Assessing and establishing the credibility of those assumptions in any given application is what makes significant parts of econometrics so tricky.
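A small simulation may help fix ideas. The following is a hypothetical Python sketch (not from the original post; all numbers and variable names are illustrative): the true model contains a confounder $$z$$ correlated with $$x$$, but we regress $$y$$ on $$x$$ alone, so OLS converges to a biased probability limit rather than to the treatment effect.

```python
import numpy as np

# Hypothetical illustration of T-inconsistency via an omitted variable.
# True model: y = 1.0*x + 2.0*z + eps, but we regress y on x alone.
# Since x = 0.8*z + u, plim(beta_hat) = 1.0 + 2.0*cov(x,z)/var(x), not 1.0.
rng = np.random.default_rng(0)
N = 200_000
z = rng.normal(size=N)
x = 0.8 * z + rng.normal(size=N)          # x is correlated with the omitted z
y = 1.0 * x + 2.0 * z + rng.normal(size=N)

beta_hat = (x @ y) / (x @ x)              # OLS of y on x (no intercept)
plim = 1.0 + 2.0 * 0.8 / (0.8**2 + 1.0)   # theoretical probability limit ~ 1.98
print(beta_hat, plim)
```

Even as $$N$$ grows without bound, $$\hat{\beta}$$ homes in on the biased limit, nowhere near the true treatment effect of 1.0.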

Now consider a different notion of consistency. Assuming quadratic loss, the predictive risk of a parameter configuration $$\beta$$ is
$$R(\beta) = {E}(y - x' \beta)^2.$$ Let $$B$$ be a set of $$\beta$$'s and let $$\beta^* \in B$$ minimize $$R(\beta)$$. We will say that $$\hat{\beta}$$ is consistent for a predictive effect ("P-consistent") if $$\text{plim} \, R(\hat{\beta}) = R(\beta^*)$$; that is, if
$$\left ( R(\hat{\beta}) - R(\beta^*) \right ) \rightarrow_p 0.$$ Hence in large samples $$\hat{\beta}$$ provides a good way to predict $$y$$ for any hypothetical $$x$$: simply use $$x ' \hat{\beta}$$. Crucially, OLS is essentially always P-consistent; we require almost no assumptions.

Let me elaborate slightly on P-consistency. That P-consistency holds is intuitively obvious for an extremum estimator like OLS, almost by definition. Of course OLS converges to the parameter configuration that optimizes quadratic predictive risk -- quadratic loss is the objective function that defines the OLS estimator. A rigorous treatment gets a bit involved, however, as do generalizations to allow things of relevance in Big Data environments, like $$K>N$$. See for example Greenshtein and Ritov (2004).
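Continuing the hypothetical omitted-variable sketch from above, we can check P-consistency numerically: even though the model is misspecified for causal purposes, the predictive risk of the OLS fit converges to the best risk achievable within the (linear-in-$$x$$) class. The setup below is illustrative, not from the original post.

```python
import numpy as np

# Hypothetical sketch of P-consistency under an omitted variable:
# the OLS fit's predictive risk approaches the minimum achievable
# within the misspecified class of predictors linear in x alone.
rng = np.random.default_rng(1)

def draw(n):
    z = rng.normal(size=n)
    x = 0.8 * z + rng.normal(size=n)
    y = 1.0 * x + 2.0 * z + rng.normal(size=n)
    return x, y

beta_star = 1.0 + 2.0 * 0.8 / (0.8**2 + 1.0)   # risk minimizer within the class
x_eval, y_eval = draw(1_000_000)               # large sample approximates R(.)
risk = lambda b: np.mean((y_eval - b * x_eval) ** 2)

gaps = []
for n in (100, 10_000, 1_000_000):
    x, y = draw(n)
    b_hat = (x @ y) / (x @ x)                  # OLS of y on x
    gaps.append(risk(b_hat) - risk(beta_star))
print(gaps)  # excess predictive risk should shrink toward 0 as N grows
```

Note that no causal assumptions were needed: the excess risk vanishes because quadratic loss is exactly the criterion OLS optimizes.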

The distinction between P-consistency and T-consistency is clearly linked to the distinction between correlation and causality. As long as $$x$$ and $$y$$ are correlated, we can exploit the correlation (as captured in $$\hat{\beta}$$) very generally to predict $$y$$ given knowledge of $$x$$. That is, there will be a nonzero "predictive effect" of $$x$$ knowledge on $$y$$. But nonzero correlation doesn't necessarily tell us anything about the causal "treatment effect" of $$x$$ treatments on $$y$$. That requires stringent assumptions. Even if there is a nonzero predictive effect of $$x$$ on $$y$$ (as captured by $$\hat{\beta}_{OLS}$$), there may or may not be a nonzero treatment effect of $$x$$ on $$y$$, and even if nonzero it will generally not equal the predictive effect.
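The starkest case is a nonzero predictive effect alongside a treatment effect of exactly zero. Here is a hypothetical Python sketch (again illustrative, not from the original post): $$y$$ is driven entirely by $$z$$, and $$x$$ merely co-moves with $$z$$, yet OLS of $$y$$ on $$x$$ converges to a nonzero coefficient.

```python
import numpy as np

# Hypothetical example: x has NO treatment effect on y, yet a nonzero
# predictive effect, because x co-moves with the true driver z.
rng = np.random.default_rng(2)
N = 200_000
z = rng.normal(size=N)
x = z + rng.normal(size=N)          # correlated with z, causally inert
y = 2.0 * z + rng.normal(size=N)    # treatment effect of x on y is exactly 0

beta_hat = (x @ y) / (x @ x)        # plim = 2*cov(x,z)/var(x) = 2*1/2 = 1
print(beta_hat)
```

Knowing $$x$$ genuinely improves predictions of $$y$$ here, so $$x'\hat{\beta}$$ is a perfectly good forecasting device, even though intervening on $$x$$ would do nothing to $$y$$.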

Here's a related reading recommendation. Check out Wasserman's brilliant Normal Deviate post, "Consistency, Sparsistency and Presistency." I agree entirely that the distinction between what he calls consistency (what I call T-consistency above) and what he calls presistency (what I call P-consistency above) is insufficiently appreciated. Indeed it took me twenty years to fully understand the distinction and its ramifications (although I'm slow). And it's crucially important.

The bottom line: In sharp contrast to T-consistency, P-consistency comes almost for free, yet it's the invaluable foundation on which all of (non-causal) predictive modeling builds. Would that such wonderful low-hanging fruit were more widely available!