No Hesitations: Intuition for Prediction Under Bregman Loss

Monday, October 6, 2014

Intuition for Prediction Under Bregman Loss

Elements of the Bregman family of loss functions, denoted $B(y, \hat{y})$, take the form:
$$
B(y, \hat{y}) = \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}),
$$
where $\phi: \mathcal{Y} \rightarrow \mathbb{R}$ is any strictly convex, differentiable function, and $\mathcal{Y}$ is the support of $Y$.
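As a quick worked case, take $\phi(y) = y^2$ (an illustrative choice, not one singled out here). Then $\phi'(\hat{y}) = 2\hat{y}$ and
$$
B(y, \hat{y}) = y^2 - \hat{y}^2 - 2\hat{y}(y - \hat{y}) = (y - \hat{y})^2,
$$
so quadratic loss is a member of the family. A non-quadratic choice such as $\phi(y) = e^y$ gives
$$
B(y, \hat{y}) = e^y - e^{\hat{y}} - e^{\hat{y}}(y - \hat{y}),
$$
which depends on $y$ and $\hat{y}$ separately, not just on the error $e = y - \hat{y}$.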

Several readers have asked for intuition for the equivalence between the predictive optimality of $E[y|\mathcal{F}]$ and the Bregman loss function $B(y, \hat{y})$. The simplest answers come from the proof itself, which is straightforward.

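First consider $B(y, \hat{y}) \Rightarrow E[y|\mathcal{F}]$, that is, Bregman loss implies that the conditional mean is optimal. Assuming $\phi$ is twice differentiable and that we can differentiate under the integral, the derivative of expected Bregman loss with respect to $\hat{y}$ is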
$$
\frac{\partial}{\partial \hat{y}} E[B(y, \hat{y})] = \frac{\partial}{\partial \hat{y}} \int B(y,\hat{y}) \;f(y|\mathcal{F}) \; dy
$$
$$
= \int \frac{\partial}{\partial \hat{y}} \left ( \phi(y) - \phi(\hat{y}) - \phi'(\hat{y}) (y-\hat{y}) \right ) \; f(y|\mathcal{F}) \; dy
$$
$$
= \int (-\phi'(\hat{y}) - \phi''(\hat{y}) (y-\hat{y}) + \phi'(\hat{y})) \; f(y|\mathcal{F}) \; dy
$$
$$
= -\phi''(\hat{y}) \left( E[y|\mathcal{F}] - \hat{y} \right).
$$
Hence the first order condition is
$$
-\phi''(\hat{y}) \left(E[y|\mathcal{F}] - \hat{y} \right) = 0,
$$
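so the optimal forecast is the conditional mean, $E[y|\mathcal{F}]$. Indeed, strict convexity implies that $\phi'' \geq 0$ and that $\phi''$ does not vanish on any interval, so expected loss is strictly decreasing in $\hat{y}$ below $E[y|\mathcal{F}]$ and strictly increasing above it: the first-order condition picks out the unique minimum.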
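The result is easy to check numerically. The following is a minimal sketch, not part of the original argument: it assumes the generator $\phi(y) = y \log y - y$ on $y > 0$ and uses skewed gamma draws as a stand-in for $f(y|\mathcal{F})$; both choices, and all names in the code, are illustrative assumptions.

```python
import math
import random
import statistics


def bregman(phi, dphi, y, yhat):
    """Bregman loss B(y, yhat) = phi(y) - phi(yhat) - phi'(yhat) * (y - yhat)."""
    return phi(y) - phi(yhat) - dphi(yhat) * (y - yhat)


# Assumed (illustrative) generator: phi(y) = y*log(y) - y, so phi'(y) = log(y).
def phi(v):
    return v * math.log(v) - v


dphi = math.log

# Skewed stand-in for f(y | F): gamma draws with shape 2 and scale 1.
random.seed(0)
ys = [random.gammavariate(2.0, 1.0) for _ in range(2001)]
sample_mean = sum(ys) / len(ys)
sample_median = statistics.median(ys)


def avg_loss(yhat):
    """Sample analogue of expected Bregman loss at forecast yhat."""
    return sum(bregman(phi, dphi, y, yhat) for y in ys) / len(ys)


# Brute-force search over candidate forecasts on a grid with step 0.01.
grid = [0.5 + 0.01 * i for i in range(351)]
best_yhat = min(grid, key=avg_loss)

print(f"sample mean        = {sample_mean:.3f}")
print(f"sample median      = {sample_median:.3f}")
print(f"loss-minimizing yhat = {best_yhat:.2f}")
```

Even though this loss is not a function of $y - \hat{y}$ alone and the sample is skewed, the grid minimizer should land within one grid step of the sample mean, well away from the median.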

Now consider $E[y|\mathcal{F}] \Rightarrow B(y, \hat{y})$. It's a simple task of reverse-engineering. We need the first-order condition to be of the form
$$
\text{const} \times \left(E[y|\mathcal{F}] - \hat{y} \right) = 0,
$$
where the "constant" may depend on $\hat{y}$ but not on $y$, so that the optimal forecast is the conditional mean, $E[y|\mathcal{F}]$. Inspection reveals that $B(y, \hat{y})$ (and only $B(y, \hat{y})$) does the trick.
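For contrast, here is a small sketch of what happens with a loss outside the Bregman family. It uses absolute-error loss $|y - \hat{y}|$ and the same kind of skewed gamma sample; the distribution and names are assumptions for illustration, and this is a single instance, not a proof of the "only" claim.

```python
import random
import statistics

# Skewed stand-in for f(y | F): gamma draws with shape 2 and scale 1.
random.seed(0)
ys = [random.gammavariate(2.0, 1.0) for _ in range(2001)]
sample_mean = sum(ys) / len(ys)
sample_median = statistics.median(ys)


def avg_abs_loss(yhat):
    """Sample analogue of expected absolute-error loss at forecast yhat."""
    return sum(abs(y - yhat) for y in ys) / len(ys)


# Brute-force search over candidate forecasts on a grid with step 0.01.
grid = [0.5 + 0.01 * i for i in range(351)]
best_yhat = min(grid, key=avg_abs_loss)

print(f"sample mean        = {sample_mean:.3f}")
print(f"sample median      = {sample_median:.3f}")
print(f"loss-minimizing yhat = {best_yhat:.2f}")
```

Here the minimizer tracks the sample median rather than the mean, consistent with the first-order condition for absolute error not having the required form.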

One might still want more intuition for the optimality of the conditional mean under Bregman loss, despite its asymmetry. The answer, I conjecture, is that the Bregman family is not asymmetric! At least not for an appropriate definition of asymmetry in the general $L(y, \hat{y})$ case, which is more complicated and subtle than the $L(e)$ case. Asymmetric loss plots like those in Patton (2014), on which I reported last week, are for fixed $y$ (in Patton's case, $y=2$), whereas for a complete treatment we need to look across all $y$. More on that soon.
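The fixed-$y$ asymmetry is easy to see numerically. The sketch below fixes $y = 2$ and compares the loss from over- and under-predicting by the same amount. The generator $\phi(y) = y \log y - y$ is an assumed example and is not necessarily the one behind Patton's plots; quadratic $\phi$ is included as a symmetric benchmark.

```python
import math


def bregman(phi, dphi, y, yhat):
    """Bregman loss B(y, yhat) = phi(y) - phi(yhat) - phi'(yhat) * (y - yhat)."""
    return phi(y) - phi(yhat) - dphi(yhat) * (y - yhat)


# Assumed (illustrative) asymmetric generator: phi(y) = y*log(y) - y, phi'(y) = log(y).
def phi_entropy(v):
    return v * math.log(v) - v


# Quadratic generator, which yields squared-error loss.
def phi_quad(v):
    return v * v


def dphi_quad(v):
    return 2.0 * v


y = 2.0  # fixed realization
gaps = []
for d in (0.5, 1.0, 1.5):
    over = bregman(phi_entropy, math.log, y, y + d)   # forecast too high by d
    under = bregman(phi_entropy, math.log, y, y - d)  # forecast too low by d
    q_over = bregman(phi_quad, dphi_quad, y, y + d)
    q_under = bregman(phi_quad, dphi_quad, y, y - d)
    gaps.append((under - over, q_under - q_over))
    print(f"d={d}: over={over:.4f}, under={under:.4f}, "
          f"quadratic over={q_over:.4f}, quadratic under={q_under:.4f}")
```

For this generator, under-prediction is penalized more heavily than over-prediction of the same size, while the quadratic case is symmetric. This only describes the slice at $y = 2$; it says nothing about behavior across all $y$.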

[I would like to thank -- without implicating -- Minchul Shin for helpful discussions.]
