No Hesitations: Causality and T-Consistency vs. Correlation and P-Consistency

Friday, January 17, 2014

Consider a standard linear regression setting with $K$ regressors and sample size $N$. We will say that an estimator $\hat{\beta}$ is consistent for a treatment effect ("T-consistent") if $\operatorname{plim} \hat{\beta}_k = {\partial E(y|x) }/{\partial x_k}$, $\forall k = 1, \ldots, K$; that is, if
$$
\left ( \hat{\beta}_k - \frac{\partial E(y|x) }{\partial x_k} \right ) \rightarrow_p 0, ~ \forall k = 1, ..., K.
$$ Hence in large samples $\hat{\beta}_k$ provides a good estimate of the effect on $y$ of a one-unit "treatment" performed on $x_k$. T-consistency is the standard econometric notion of consistency. Unfortunately, however, OLS is T-consistent only under highly stringent assumptions. Assessing and establishing the credibility of those assumptions in any given application is what makes significant parts of econometrics so tricky.
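
As a rough illustration of how T-consistency can fail, here is a minimal simulation sketch. Everything in it is an assumption made for illustration, not something from the discussion above: the data-generating process, the unobserved confounder `z`, the coefficient values, and the helper name `simulate_ols_slope`. It also reads the "treatment effect" in the causal sense, as the effect on $y$ of intervening on $x$ with everything else held fixed, which here equals 1.0 by construction.

```python
import numpy as np

def simulate_ols_slope(n, seed=0):
    """OLS slope of y on x when an unobserved confounder z drives both.

    Hypothetical DGP (an assumption for illustration only):
        x = z + e,   y = 1.0 * x + 2.0 * z + noise
    so the structural (treatment) effect of x on y is 1.0.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                       # unobserved confounder
    x = z + rng.normal(size=n)                   # regressor, correlated with z
    y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # outcome
    X = np.column_stack([np.ones(n), x])         # intercept + regressor; z is omitted
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat[1]

slope = simulate_ols_slope(200_000)
print(f"OLS slope: {slope:.3f}  (structural effect is 1.0)")
```

In this design $Cov(x, y)/Var(x) = 4/2 = 2$, so the OLS slope settles near 2.0 rather than 1.0 however large the sample gets: the omitted confounder violates exactly the kind of assumption that T-consistency needs.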

Now consider a different notion of consistency. Assuming quadratic loss, the predictive risk of a parameter configuration $\beta$ is
$$
R(\beta) = {E}(y - x' \beta)^2.
$$ Let $B$ be a set of $\beta$'s and let $\beta^* \in B$ minimize $R(\beta)$. We will say that $\hat{\beta}$ is consistent for a predictive effect ("P-consistent") if $\operatorname{plim} R(\hat{\beta}) = R(\beta^*)$; that is, if
$$
\left ( R(\hat{\beta}) - R(\beta^*) \right ) \rightarrow_p 0.
$$ Hence in large samples $\hat{\beta}$ provides a good way to predict $y$ for any hypothetical $x$: simply use $x ' \hat{\beta}$. Crucially, OLS is essentially always P-consistent; we require almost no assumptions.
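
Here is a minimal sketch of what P-consistency looks like numerically. The data-generating process (with an unobserved confounder `z`), the helper names `draw` and `risk`, and the sample sizes are all hypothetical choices for illustration. For this particular design the population risk minimizer over linear predictors in a constant and $x$ is $\beta^* = (0, 2)$, even though the structural coefficient on $x$ is 1.0. Predictive risk is approximated on a large held-out sample.

```python
import numpy as np

def draw(n, rng):
    """Draw (X, y) from a hypothetical confounded DGP (illustration only)."""
    z = rng.normal(size=n)                       # unobserved confounder
    x = z + rng.normal(size=n)
    y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # structural effect of x is 1.0
    return np.column_stack([np.ones(n), x]), y

def risk(beta, X, y):
    """Monte Carlo estimate of quadratic predictive risk E(y - x'beta)^2."""
    return np.mean((y - X @ beta) ** 2)

rng = np.random.default_rng(1)
X_test, y_test = draw(200_000, rng)              # large holdout approximates the expectation
beta_star = np.array([0.0, 2.0])                 # population risk minimizer for this DGP

excess = {}
for n in (50, 500, 50_000):
    X, y = draw(n, rng)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Excess risk R(beta_hat) - R(beta_star), evaluated on the same holdout
    excess[n] = risk(beta_hat, X_test, y_test) - risk(beta_star, X_test, y_test)
    print(f"N = {n:>6}: excess risk = {excess[n]:.5f}")
```

The excess risk shrinks toward zero as $N$ grows, even though no assumption about causality or exogeneity holds in this design.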

Let me elaborate slightly on P-consistency. That P-consistency holds is intuitively obvious for an extremum estimator like OLS, almost by definition. Of course OLS converges to the parameter configuration that optimizes quadratic predictive risk -- quadratic loss is the objective function that defines the OLS estimator. A rigorous treatment gets a bit involved, however, as do generalizations to allow things of relevance in Big Data environments, like $K>N$. See for example Greenshtein and Ritov (2004).

The distinction between P-consistency and T-consistency is clearly linked to the distinction between correlation and causality. As long as $x$ and $y$ are correlated, we can exploit the correlation (as captured in $\hat{\beta}$) very generally to predict $y$ given knowledge of $x$. That is, there will be a nonzero "predictive effect" of knowing $x$ on our prediction of $y$. But nonzero correlation doesn't necessarily tell us anything about the causal "treatment effect" of $x$ treatments on $y$. That requires stringent assumptions. Even if there is a nonzero predictive effect of $x$ on $y$ (as captured by $\hat{\beta}_{OLS}$), there may or may not be a nonzero treatment effect of $x$ on $y$, and even if nonzero it will generally not equal the predictive effect.
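
To make the gap between the two effects concrete, here is a short sketch under assumptions not stated above: suppose $B = \mathbb{R}^K$, $E(xx')$ is invertible, and the data are generated by $y = x'\beta_T + u$, where $\beta_T$ (my notation) is the vector of treatment effects and $u$ is a disturbance that may be correlated with $x$. The first-order condition for minimizing $R(\beta)$ then gives
$$
\beta^* = [E(xx')]^{-1} E(xy) = \beta_T + [E(xx')]^{-1} E(xu).
$$ Hence the predictive effect $\beta^*$ coincides with the treatment effect $\beta_T$ only when $E(xu) = 0$. Otherwise, under the usual regularity conditions, OLS still converges to $\beta^*$, which delivers the best linear prediction but not the causal effect.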

Here's a related reading recommendation. Check out Wasserman's brilliant Normal Deviate post, "Consistency, Sparsistency and Presistency." I agree entirely that the distinction between what he calls consistency (what I call T-consistency above) and what he calls presistency (what I call P-consistency above) is insufficiently appreciated. Indeed it took me twenty years to fully understand the distinction and its ramifications (although I'm slow). And it's crucially important.

The bottom line: In sharp contrast to T-consistency, P-consistency comes almost for free, yet it's the invaluable foundation on which all of (non-causal) predictive modeling builds. Would that such wonderful low-hanging fruit were more widely available!
