$$

B(y, \hat{y}) = \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}),

$$

where \(\phi: \mathcal{Y} \rightarrow \mathbb{R}\) is any strictly convex function, and \(\mathcal{Y}\) is the support of \(Y\).
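
As a quick check on the definition, consider one illustrative choice of generator (my example, picked for simplicity): \(\phi(y) = y^2\), so that \(\phi'(\hat{y}) = 2\hat{y}\). Then

$$

B(y, \hat{y}) = y^2 - \hat{y}^2 - 2\hat{y}(y-\hat{y}) = y^2 - 2y\hat{y} + \hat{y}^2 = (y-\hat{y})^2,

$$

which is ordinary quadratic loss. Other strictly convex choices of \(\phi\) give other members of the family.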

Several readers have asked for intuition for the equivalence between the predictive optimality of \( E[y|\mathcal{F}]\) and the Bregman loss function \(B(y, \hat{y})\). The simplest answer comes from the proof itself, which is straightforward.

First consider \(B(y, \hat{y}) \Rightarrow E[y|\mathcal{F}]\). The derivative of expected Bregman loss with respect to \(\hat{y}\) is

$$

\frac{\partial}{\partial \hat{y}} E[B(y, \hat{y})] = \frac{\partial}{\partial \hat{y}} \int B(y,\hat{y}) \;f(y|\mathcal{F}) \; dy

$$

$$

= \int \frac{\partial}{\partial \hat{y}} \left ( \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}) \right ) \; f(y|\mathcal{F}) \; dy

$$

$$

= \int (-\phi'(\hat{y}) - \phi''(\hat{y}) (y-\hat{y}) + \phi'(\hat{y})) \; f(y|\mathcal{F}) \; dy

$$

$$

= -\phi''(\hat{y}) \left( E[y|\mathcal{F}] - \hat{y} \right).

$$

Hence the first-order condition is

$$

-\phi''(\hat{y}) \left(E[y|\mathcal{F}] - \hat{y} \right) = 0,

$$

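and since strict convexity of \(\phi\) gives \(\phi''(\hat{y}) > 0\) (for twice-differentiable \(\phi\) with nonvanishing second derivative), the optimal forecast is the conditional mean, \( E[y|\mathcal{F}] \).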
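
The result is easy to check numerically. The following is only a sketch under assumptions that are mine, not part of the argument above: the generator \(\phi(y) = e^y\), the right-skewed gamma draws, and the grid of candidate forecasts are all illustrative, and the sample average stands in for the conditional expectation.

```python
import numpy as np


def bregman_loss(y, y_hat, phi, dphi):
    """Bregman loss B(y, y_hat) = phi(y) - phi(y_hat) - phi'(y_hat) * (y - y_hat)."""
    return phi(y) - phi(y_hat) - dphi(y_hat) * (y - y_hat)


# Illustrative assumptions: phi(y) = exp(y), so phi'(y) = exp(y) as well,
# and y drawn from a right-skewed gamma distribution (mean differs from median).
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=0.25, size=5000)

# Average loss for every candidate forecast on a grid (rows: draws, columns: forecasts).
grid = np.linspace(0.0, 2.0, 201)
step = grid[1] - grid[0]
avg_loss = bregman_loss(y[:, None], grid[None, :], np.exp, np.exp).mean(axis=0)

# The grid point with the smallest average loss should sit next to the sample mean.
best = grid[np.argmin(avg_loss)]
print("loss-minimizing forecast on grid:", best)
print("sample mean:                     ", y.mean())
print("sample median:                   ", np.median(y))
```

Because the draws are skewed, the sample mean and median differ, so the check can tell them apart: the minimizer should land within one grid step of the mean even though this particular loss is far from quadratic.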

Now consider \( E[y|\mathcal{F}] \Rightarrow B(y, \hat{y}) \). It's a simple task of reverse-engineering. We need the f.o.c. to be of the form

$$

\text{const} \times \left(E[y|\mathcal{F}] - \hat{y} \right) = 0,

$$

so that the optimal forecast is the conditional mean, \( E[y|\mathcal{F}] \). Inspection reveals that \( B(y, \hat{y}) \) (and only \( B(y, \hat{y}) \)) does the trick.
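
The "inspection" can be sketched as follows, under simplifying assumptions that are mine and made only for illustration: that the derivative of the loss itself (not just of its expectation) is proportional to \(y - \hat{y}\), with a positive factor that may depend on \(\hat{y}\) but not on \(y\), and that loss is normalized so that \(L(y, y) = 0\). Writing the factor as \(\phi''(\hat{y})\) for some strictly convex \(\phi\), we need

$$

\frac{\partial}{\partial \hat{y}} L(y, \hat{y}) = -\phi''(\hat{y}) (y-\hat{y}).

$$

Integrating with respect to \(\hat{y}\) gives

$$

L(y, \hat{y}) = -\phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}) + C(y),

$$

as differentiating the right-hand side confirms, and the normalization \(L(y, y) = 0\) forces \(C(y) = \phi(y)\). The result is exactly \(B(y, \hat{y})\).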

One might still want more intuition for the optimality of the conditional mean under Bregman loss, *despite its asymmetry*. The answer, I conjecture, is that the Bregman family is *not* asymmetric! At least not for an appropriate definition of asymmetry in the general \(L(y, \hat{y})\) case, which is more complicated and subtle than the \(L(e)\) case. Asymmetric loss plots like those in Patton (2014), on which I reported last week, are for *fixed* \(y\) (in Patton's case, \(y=2\)), whereas for a complete treatment we need to look across *all* \(y\). More on that soon.
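
To see what the fixed-\(y\) asymmetry looks like in numbers, here is a minimal sketch. The generator \(\phi(y) = e^y\) and the miss size are illustrative assumptions of mine, not necessarily the parameterization plotted in Patton (2014); only \(y = 2\) is taken from the discussion above.

```python
import math


def exp_bregman(y, y_hat):
    """Bregman loss generated by phi(y) = exp(y) (an illustrative choice)."""
    return math.exp(y) - math.exp(y_hat) - math.exp(y_hat) * (y - y_hat)


y = 2.0  # fixed realization
d = 1.0  # size of the forecast miss (hypothetical value)

over = exp_bregman(y, y + d)   # forecast too high by d
under = exp_bregman(y, y - d)  # forecast too low by d
print("loss when forecasting y + d:", over)
print("loss when forecasting y - d:", under)
```

For this generator, equal-sized misses above and below the fixed realization are penalized differently, which is the asymmetry the fixed-\(y\) plots display. It says nothing yet about what happens across all \(y\).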

[I would like to thank -- without implicating -- Minchul Shin for helpful discussions.]