Monday, October 6, 2014

Intuition for Prediction Under Bregman Loss

Elements of the Bregman family of loss functions, denoted \(B(y, \hat{y})\), take the form:
$$
B(y, \hat{y}) = \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}),
$$
where \(\phi: \mathcal{Y} \rightarrow \mathbb{R}\) is any strictly convex, differentiable function, and \(\mathcal{Y}\) is the support of \(Y\).
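
As a quick check on the definition, take \(\phi(y) = y^2\) (an illustrative choice of generator). The Bregman form then collapses to familiar quadratic loss:
$$
B(y, \hat{y}) = y^2 - \hat{y}^2 - 2\hat{y}(y - \hat{y}) = y^2 - 2 y \hat{y} + \hat{y}^2 = (y - \hat{y})^2.
$$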

Several readers have asked for intuition for the equivalence between the predictive optimality of the conditional mean, \( E[y|\mathcal{F}]\), and the Bregman loss function, \(B(y, \hat{y})\).  The simplest answer comes from the proof itself, which is straightforward.

First consider \(B(y, \hat{y}) \Rightarrow E[y|\mathcal{F}]\).  The derivative of expected Bregman loss with respect to \(\hat{y}\) is
$$
\frac{\partial}{\partial \hat{y}} E[B(y, \hat{y})] = \frac{\partial}{\partial \hat{y}} \int B(y,\hat{y}) \;f(y|\mathcal{F}) \; dy
$$
$$
=  \int \frac{\partial}{\partial \hat{y}} \left ( \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}) \right ) \; f(y|\mathcal{F}) \; dy
$$
$$
=  \int (-\phi'(\hat{y}) - \phi''(\hat{y}) (y-\hat{y}) + \phi'(\hat{y})) \; f(y|\mathcal{F}) \; dy
$$
$$
= -\phi''(\hat{y}) \left( E[y|\mathcal{F}] - \hat{y} \right).
$$
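Hence the first-order condition is
$$
-\phi''(\hat{y}) \left(E[y|\mathcal{F}] - \hat{y} \right) = 0.
$$
Because \(\phi\) is strictly convex, \(\phi''(\hat{y})\) is positive (assuming \(\phi\) is twice differentiable with a nonvanishing second derivative), so the condition holds only at \(\hat{y} = E[y|\mathcal{F}]\). The derivative is negative below that point and positive above it, so the optimal forecast is the conditional mean, \( E[y|\mathcal{F}] \).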
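
The result can also be checked numerically. The following is a minimal sketch, not part of the proof: it draws a skewed sample, treats the sample distribution as the conditional distribution of \(y\), and grid-searches for the forecast that minimizes average Bregman loss under three illustrative generators. The function names, the gamma sample, and the grid are all assumptions made for the illustration.

```python
import math
import random


def bregman_loss(phi, dphi, y, y_hat):
    """B(y, y_hat) = phi(y) - phi(y_hat) - phi'(y_hat) * (y - y_hat)."""
    return phi(y) - phi(y_hat) - dphi(y_hat) * (y - y_hat)


def expected_loss(phi, dphi, sample, y_hat):
    """Sample average of Bregman loss for a given forecast y_hat."""
    return sum(bregman_loss(phi, dphi, y, y_hat) for y in sample) / len(sample)


def grid_minimizer(phi, dphi, sample, grid):
    """Forecast on the grid with the smallest average loss."""
    return min(grid, key=lambda y_hat: expected_loss(phi, dphi, sample, y_hat))


# Illustrative strictly convex generators, each paired with its derivative.
GENERATORS = {
    "quadratic": (lambda v: v * v, lambda v: 2.0 * v),
    "exponential": (math.exp, math.exp),
    "entropy": (lambda v: v * math.log(v), lambda v: math.log(v) + 1.0),
}

random.seed(0)
# Skewed, strictly positive draws (positivity is needed for the entropy generator).
sample = [random.gammavariate(2.0, 1.0) for _ in range(500)]
sample_mean = sum(sample) / len(sample)

# Candidate forecasts from 1.00 to 3.00 in steps of 0.01.
grid = [1.0 + 0.01 * i for i in range(201)]

minimizers = {
    name: grid_minimizer(phi, dphi, sample, grid)
    for name, (phi, dphi) in GENERATORS.items()
}

print(f"sample mean: {sample_mean:.4f}")
for name, y_hat in minimizers.items():
    print(f"{name:12s} grid minimizer: {y_hat:.2f}")
```

Although the three loss functions look quite different, each grid minimizer should land within one grid step of the sample mean, as the first-order condition predicts.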

Now consider \( E[y|\mathcal{F}] \Rightarrow B(y, \hat{y}) \). It's a simple reverse-engineering task. We need the first-order condition to be of the form
$$
\text{const} \times \left(E[y|\mathcal{F}] - \hat{y} \right) = 0,
$$
where "const" means a nonzero factor that does not depend on \(y\) (it may depend on \(\hat{y}\)), so that the optimal forecast is the conditional mean, \( E[y|\mathcal{F}] \). Inspection reveals that \( B(y, \hat{y}) \) (and only \( B(y, \hat{y}) \)) does the trick.

One might still want more intuition for the optimality of the conditional mean under Bregman loss, despite its asymmetry.  The answer, I conjecture, is that the Bregman family is not asymmetric! At least not for an appropriate definition of asymmetry in the general \(L(y, \hat{y})\) case, which is more complicated and subtle than the \(L(e)\) case.  Asymmetric loss plots like those in Patton (2014), on which I reported last week, are for fixed \(y\) (in Patton's case, \(y=2\) ), whereas for a complete treatment we need to look across all \(y\). More on that soon.
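
To make the fixed-\(y\) point concrete, here is a small sketch that holds \(y = 2\) and compares the loss from over-predicting and under-predicting by the same amount. The exponential generator \(\phi(y) = e^y\) is an illustrative assumption, not necessarily the one behind Patton's plots, and the offsets are arbitrary.

```python
import math


def bregman_loss(phi, dphi, y, y_hat):
    """B(y, y_hat) = phi(y) - phi(y_hat) - phi'(y_hat) * (y - y_hat)."""
    return phi(y) - phi(y_hat) - dphi(y_hat) * (y - y_hat)


def square(v):
    return v * v


def two_times(v):
    return 2.0 * v


y = 2.0                     # fixed realization
offsets = [0.25, 0.5, 1.0]  # arbitrary forecast-error magnitudes

# Each entry is (loss when over-predicting by d, loss when under-predicting by d).
quadratic = [
    (bregman_loss(square, two_times, y, y + d),
     bregman_loss(square, two_times, y, y - d))
    for d in offsets
]
exponential = [
    (bregman_loss(math.exp, math.exp, y, y + d),
     bregman_loss(math.exp, math.exp, y, y - d))
    for d in offsets
]

for d, (q_over, q_under), (e_over, e_under) in zip(offsets, quadratic, exponential):
    print(f"d={d:.2f}  quadratic: over={q_over:.4f} under={q_under:.4f}  "
          f"exponential: over={e_over:.4f} under={e_under:.4f}")
```

For fixed \(y\), the quadratic member treats the two errors identically while the exponential member penalizes over-prediction more heavily. This is the sense in which such plots look asymmetric, even though the conditional mean remains optimal under both.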

[I would like to thank -- without implicating -- Minchul Shin for helpful discussions.]