Friday, January 17, 2014

Causality and T-Consistency vs. Correlation and P-Consistency

Consider a standard linear regression setting with \(K\) regressors and sample size \(N\). We will say that an estimator \(\hat{\beta}\) is consistent for a treatment effect ("T-consistent") if \(\operatorname{plim} \hat{\beta}_k = {\partial E(y|x) }/{\partial x_k}\), \(\forall k = 1, ..., K\); that is, if
$$
\left ( \hat{\beta}_k - \frac{\partial E(y|x) }{\partial x_k} \right ) \rightarrow_p 0, ~ \forall k = 1, ..., K.
$$ Hence in large samples \(\hat{\beta}_k\) provides a good estimate of the effect on \(y\) of a one-unit "treatment" performed on \(x_k\). T-consistency is the standard econometric notion of consistency. Unfortunately, OLS is T-consistent only under highly stringent assumptions. Assessing and establishing the credibility of those assumptions in any given application is what makes significant parts of econometrics so tricky.

Now consider a different notion of consistency. Assuming quadratic loss, the predictive risk of a parameter configuration \(\beta\) is
$$
R(\beta) = {E}(y - x' \beta)^2.
$$ Let \(B\) be a set of \(\beta\)'s and let \(\beta^* \in B\) minimize \(R(\beta)\). We will say that \(\hat{\beta}\) is consistent for a predictive effect ("P-consistent") if \(\operatorname{plim} R(\hat{\beta}) = R(\beta^*)\); that is, if
$$
\left ( R(\hat{\beta}) - R(\beta^*) \right ) \rightarrow_p 0.
$$ Hence in large samples \(\hat{\beta}\) provides a good way to predict \(y\) for any hypothetical \(x\): simply use \(x ' \hat{\beta}\). Crucially, OLS is essentially always P-consistent; we require almost no assumptions.
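Here is a small simulation sketch of P-consistency under misspecification. Everything in it is my own illustrative assumption rather than anything from the references: the data-generating process \(y = x^2 + x + \varepsilon\) with \(x\) and \(\varepsilon\) independent standard normals, the function names, and the sample sizes. A straight line is the wrong model for \(E(y|x)\) here, yet for this particular process the risk of any line \(a + bx\) works out to \(3 + (a-1)^2 + (b-1)^2\), so \(\beta^* = (1, 1)\) and the excess risk of the fitted line can be computed exactly.

```python
import random


def simulate(n, rng):
    """Draw n observations from the hypothetical process y = x^2 + x + eps."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x * x + x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def ols_line(xs, ys):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def risk(intercept, slope):
    """Exact quadratic predictive risk E(y - a - b x)^2 for this process.

    Uses E x^2 = 1, E x^3 = 0, E x^4 = 3 and Var(eps) = 1.
    """
    return 3.0 + (intercept - 1.0) ** 2 + (slope - 1.0) ** 2


def excess_risk(n, seed=0):
    """R(beta_hat) - R(beta_star) for one simulated sample of size n."""
    rng = random.Random(seed)
    intercept, slope = ols_line(*simulate(n, rng))
    return risk(intercept, slope) - risk(1.0, 1.0)


if __name__ == "__main__":
    for n in (100, 10_000, 200_000):
        print(n, excess_risk(n))
```

In runs like this the excess risk should shrink toward zero as \(N\) grows, even though no linear \(\beta\) recovers the true, nonconstant derivative \(2x + 1\) of \(E(y|x)\).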

Let me elaborate slightly on P-consistency. That P-consistency holds is intuitively obvious for an extremum estimator like OLS, almost by definition. Of course OLS converges to the parameter configuration that optimizes quadratic predictive risk: quadratic loss is the objective function that defines the OLS estimator. A rigorous treatment gets a bit involved, however, as do generalizations to settings relevant in Big Data environments, such as \(K>N\). See for example Greenshtein and Ritov (2004).
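To make the "almost by definition" intuition a bit more concrete, here is a heuristic sketch for the unrestricted case \(B = \mathbb{R}^K\) with \(K\) fixed, assuming (my assumption, for illustration) that \(E(xx')\) exists and is invertible. Setting the gradient of \(R(\beta)\) to zero gives
$$
\frac{\partial R(\beta)}{\partial \beta} = -2 E \left [ x (y - x' \beta) \right ] = 0 \quad \Longrightarrow \quad \beta^* = \left [ E(xx') \right ]^{-1} E(xy),
$$
and OLS simply replaces those population moments with sample averages:
$$
\hat{\beta}_{OLS} = \left ( \frac{1}{N} \sum_{i=1}^N x_i x_i' \right )^{-1} \left ( \frac{1}{N} \sum_{i=1}^N x_i y_i \right ).
$$
So whenever a law of large numbers applies to the two sample moments, \(\hat{\beta}_{OLS} \rightarrow_p \beta^*\), and because \(R\) is a continuous (quadratic) function of \(\beta\), \(R(\hat{\beta}_{OLS}) \rightarrow_p R(\beta^*)\). Nothing in this argument requires \(E(y|x)\) to be linear in \(x\).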

The distinction between P-consistency and T-consistency is clearly linked to the distinction between correlation and causality. As long as \(x\) and \(y\) are correlated, we can exploit the correlation (as captured in \(\hat{\beta}\)) very generally to predict \(y\) given knowledge of \(x\). That is, there will be a nonzero "predictive effect" of knowledge of \(x\) on \(y\). But nonzero correlation doesn't necessarily tell us anything about the causal "treatment effect" of \(x\) treatments on \(y\). That requires stringent assumptions. Even if there is a nonzero predictive effect of \(x\) on \(y\) (as captured by \(\hat{\beta}_{OLS}\)), there may or may not be a nonzero treatment effect of \(x\) on \(y\), and even if it is nonzero it will generally not equal the predictive effect.
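A toy simulation may help fix ideas. It is only a sketch under assumptions I am making up for illustration: an unobserved confounder \(z\), a hypothetical structural equation \(y = \tau x + \gamma z + e\) with \(\tau = 1\) and \(\gamma = 2\), and a reading of the "treatment effect" as the effect of setting \(x\) by intervention while \(z\) is left alone. In observational data \(x = z + u\), so \(x\) carries information about \(z\); in "experimental" data \(x\) is drawn independently of \(z\). All noise terms are independent standard normals, and the names below are hypothetical.

```python
import random

TAU = 1.0    # hypothetical structural (treatment) effect of x on y
GAMMA = 2.0  # hypothetical effect of the unobserved confounder z on y


def draw(n, rng, intervene):
    """Simulate (x, y) pairs; z is generated but never shown to the analyst."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        if intervene:
            # x is assigned independently of z (same variance, 2, as below)
            x = rng.gauss(0.0, 2.0 ** 0.5)
        else:
            # observational regime: x is correlated with the confounder
            x = z + rng.gauss(0.0, 1.0)
        y = TAU * x + GAMMA * z + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return xs, ys


def ols_slope(xs, ys):
    """Slope from OLS of y on a constant and x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def observational_risk(b):
    """Exact E(y - b x)^2 in the observational regime (zero intercept).

    Follows from Var(x) = 2, Cov(x, z) = 1, Var(z) = Var(e) = 1.
    """
    return 3.0 + 2.0 * (b - 2.0) ** 2


def run(n=100_000, seed=0):
    rng = random.Random(seed)
    b_obs = ols_slope(*draw(n, rng, intervene=False))
    b_exp = ols_slope(*draw(n, rng, intervene=True))
    return b_obs, b_exp


if __name__ == "__main__":
    b_obs, b_exp = run()
    print("observational OLS slope:", b_obs)
    print("experimental OLS slope: ", b_exp)
    print("risk at OLS limit vs at TAU:",
          observational_risk(2.0), observational_risk(TAU))
```

In this setup the observational slope settles near \(\tau + \gamma \, Cov(x,z)/Var(x) = 2\), which is the best linear predictor of \(y\) from \(x\) in observational data, while the slope under intervention settles near \(\tau = 1\). Using \(\tau\) for prediction in the observational regime would actually raise the quadratic risk (5 versus 3 here), which is the predictive-versus-treatment distinction in miniature.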

Here's a related reading recommendation. Check out Wasserman's brilliant Normal Deviate post, "Consistency, Sparsistency and Presistency." I agree entirely that the distinction between what he calls consistency (what I call T-consistency above) and what he calls presistency (what I call P-consistency above) is insufficiently appreciated. Indeed it took me twenty years to fully understand the distinction and its ramifications (although I'm slow). And it's crucially important.

The bottom line: In sharp contrast to T-consistency, P-consistency comes almost for free, yet it's the invaluable foundation on which all of (non-causal) predictive modeling builds. Would that such wonderful low-hanging fruit were more widely available!