## Monday, October 6, 2014

### Intuition for Prediction Under Bregman Loss

Elements of the Bregman family of loss functions, denoted $$B(y, \hat{y})$$, take the form:
$$B(y, \hat{y}) = \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y})$$ where $$\phi: \mathcal{Y} \rightarrow \mathbb{R}$$ is any strictly convex, differentiable function, and $$\mathcal{Y}$$ is the support of $$Y$$.
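
To make the definition concrete, here is a minimal Python sketch of the formula above. The function name `bregman_loss` and the two generators $$\phi$$ used (quadratic and exponential) are illustrative choices of mine, not part of the definition.

```python
import math


def bregman_loss(y, y_hat, phi, dphi):
    """Bregman loss B(y, y_hat) generated by phi, whose derivative is dphi."""
    return phi(y) - phi(y_hat) - dphi(y_hat) * (y - y_hat)


# phi(y) = y^2 recovers ordinary squared-error loss, (y - y_hat)^2.
quadratic = bregman_loss(3.0, 1.0, lambda y: y * y, lambda y: 2.0 * y)

# phi(y) = exp(y) is an illustrative non-quadratic generator (an assumption).
exponential = bregman_loss(3.0, 1.0, math.exp, math.exp)

# A perfect forecast incurs zero loss whatever the generator.
perfect = bregman_loss(2.0, 2.0, math.exp, math.exp)

print(quadratic, exponential, perfect)
```

With the quadratic generator the loss collapses to the familiar $$(y-\hat{y})^2$$, which is why quadratic loss is the best-known member of the family.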

Several readers have asked for intuition for the equivalence between the predictive optimality of the conditional mean, $$E[y|\mathcal{F}]$$, and the Bregman loss function $$B(y, \hat{y})$$.  The simplest answer comes from the proof itself, which is straightforward.

First consider $$B(y, \hat{y}) \Rightarrow E[y|\mathcal{F}]$$.  Assuming we can differentiate under the integral sign, the derivative of conditionally expected Bregman loss with respect to $$\hat{y}$$ is
$$\frac{\partial}{\partial \hat{y}} E[B(y, \hat{y})|\mathcal{F}] = \frac{\partial}{\partial \hat{y}} \int B(y,\hat{y}) \;f(y|\mathcal{F}) \; dy$$
$$= \int \frac{\partial}{\partial \hat{y}} \left ( \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}) \right ) \; f(y|\mathcal{F}) \; dy$$
$$= \int (-\phi'(\hat{y}) - \phi''(\hat{y}) (y-\hat{y}) + \phi'(\hat{y})) \; f(y|\mathcal{F}) \; dy$$
$$= -\phi''(\hat{y}) \left( E[y|\mathcal{F}] - \hat{y} \right).$$
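Hence the first-order condition is
$$-\phi''(\hat{y}) \left(E[y|\mathcal{F}] - \hat{y} \right) = 0.$$
Strict convexity of $$\phi$$ gives $$\phi''(\hat{y}) > 0$$ (assuming $$\phi$$ is twice differentiable with a nonvanishing second derivative), so the only solution is $$\hat{y} = E[y|\mathcal{F}]$$. Moreover, the derivative is negative for $$\hat{y} < E[y|\mathcal{F}]$$ and positive for $$\hat{y} > E[y|\mathcal{F}]$$, so the solution is a minimum: the optimal forecast is the conditional mean, $$E[y|\mathcal{F}]$$.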
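
The result is easy to check numerically. The sketch below is only an illustration under assumptions of my own: a made-up three-point conditional distribution and three example generators $$\phi$$. It minimizes expected Bregman loss over a grid of candidate forecasts and compares the minimizer with the conditional mean.

```python
import math

# Hypothetical conditional distribution of y (illustrative numbers only).
values = [1.0, 2.0, 5.0]
probs = [0.5, 0.3, 0.2]
cond_mean = sum(p * v for p, v in zip(probs, values))

# Example generators phi paired with their derivatives (all strictly convex on y > 0).
generators = {
    "quadratic": (lambda y: y * y, lambda y: 2.0 * y),
    "exponential": (math.exp, math.exp),
    "entropy": (lambda y: y * math.log(y), lambda y: math.log(y) + 1.0),
}


def expected_loss(y_hat, phi, dphi):
    """Expected Bregman loss of forecast y_hat under the distribution above."""
    return sum(
        p * (phi(v) - phi(y_hat) - dphi(y_hat) * (v - y_hat))
        for p, v in zip(probs, values)
    )


# Candidate forecasts 0.50, 0.51, ..., 5.00.
grid = [i / 100 for i in range(50, 501)]

minimizers = {}
for name, (phi, dphi) in generators.items():
    minimizers[name] = min(grid, key=lambda y_hat: expected_loss(y_hat, phi, dphi))

print(cond_mean, minimizers)
```

The grid minimizer is the same for every generator and coincides with the conditional mean, even though the three loss functions look very different.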

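Now consider $$E[y|\mathcal{F}] \Rightarrow B(y, \hat{y})$$. It's a simple task of reverse-engineering. We need the f.o.c. to be of the form
$$\text{const} \times \left(E[y|\mathcal{F}] - \hat{y} \right) = 0,$$
where "const" means a nonzero factor that is free of $$y$$ (it may depend on $$\hat{y}$$), so that the optimal forecast is the conditional mean, $$E[y|\mathcal{F}]$$. Taking the factor to be $$-\phi''(\hat{y})$$, the loss derivative must be $$-\phi''(\hat{y})(y - \hat{y})$$; integrating with respect to $$\hat{y}$$ gives $$-\phi(\hat{y}) - \phi'(\hat{y})(y-\hat{y}) + c(y)$$, and normalizing so that a perfect forecast has zero loss gives $$c(y) = \phi(y)$$. Hence $$B(y, \hat{y})$$ (and only $$B(y, \hat{y})$$) does the trick.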

One might still want more intuition for the optimality of the conditional mean under Bregman loss, despite its apparent asymmetry.  The answer, I conjecture, is that the Bregman family is not asymmetric! At least not for an appropriate definition of asymmetry in the general $$L(y, \hat{y})$$ case, which is more complicated and subtle than the $$L(e)$$ case.  Asymmetric loss plots like those in Patton (2014), on which I reported last week, are for fixed $$y$$ (in Patton's case, $$y=2$$), whereas for a complete treatment we need to look across all $$y$$. More on that soon.
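
The fixed-$$y$$ asymmetry is easy to see in numbers. The sketch below fixes $$y=2$$ and compares the loss from forecasting one unit too low with the loss from forecasting one unit too high. The exponential generator is my own illustrative choice and is not necessarily the one behind Patton's plots.

```python
import math


def bregman_loss(y, y_hat, phi, dphi):
    """Bregman loss B(y, y_hat) generated by phi, whose derivative is dphi."""
    return phi(y) - phi(y_hat) - dphi(y_hat) * (y - y_hat)


y = 2.0  # realization held fixed

# Exponential generator (an assumption): losses differ for equal-sized errors.
exp_under = bregman_loss(y, y - 1.0, math.exp, math.exp)  # forecast too low
exp_over = bregman_loss(y, y + 1.0, math.exp, math.exp)   # forecast too high

# Quadratic generator: losses coincide for equal-sized errors.
quad_under = bregman_loss(y, y - 1.0, lambda v: v * v, lambda v: 2.0 * v)
quad_over = bregman_loss(y, y + 1.0, lambda v: v * v, lambda v: 2.0 * v)

print(exp_under, exp_over)
print(quad_under, quad_over)
```

For this fixed $$y$$ the exponential-generated loss penalizes the over-forecast more than the under-forecast, while the quadratic member treats them equally. This is only the fixed-$$y$$ slice; it says nothing yet about the behavior across all $$y$$.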

[I would like to thank -- without implicating -- Minchul Shin for helpful discussions.]