Thursday, October 29, 2015

Viewing Emailed Posts that Contain Math

A reminder for those of you who subscribe to the email feed:

MathJax doesn't display in email, so when you look at the emailed post, the math will just be in LaTeX. Simply click/tap on the post's title in your email (e.g., in the latest case, “The HAC Emperor has no Clothes”). It's hyperlinked to the actual blog site, which should display fine on all devices. 

Wednesday, October 28, 2015

The HAC Emperor has no Clothes

Well, at least in time-series settings. (I'll save cross sections for a later post.)

Consider a time-series regression with possibly heteroskedastic and/or autocorrelated disturbances, 

\( y_t = x_t' \beta + \varepsilon_t  \). 
A popular approach is to punt on the potentially non-iid disturbance, instead simply running OLS with kernel-based heteroskedasticity and autocorrelation consistent (HAC) standard errors.

Punting via kernel-HAC estimation is a bad idea in time series, for several reasons:

(1) [Kernel-HAC is not likely to produce good \(\beta\) estimates.] It stays with OLS and hence gives up on efficient estimation of \(\beta\). In huge samples the efficiency loss from using OLS rather than GLS/ML is likely negligible, but time-series samples are often smallish. For example, samples like 1960Q1-2014Q4 are typical in macroeconomics -- just a couple hundred observations of highly-serially-correlated data.

(2) [Kernel-HAC is not likely to produce good \(\beta\) inference.] Its standard errors are not tailored to a specific parametric approximation to \(\varepsilon\) dynamics. Proponents will quickly counter that that's a benefit, not a cost, and in some settings the proponents may be correct. But not in time series settings. In time series, \(\varepsilon\) dynamics are almost always accurately and parsimoniously approximated parametrically (ARMA for conditional mean dynamics in \(\varepsilon\), and GARCH for conditional variance dynamics in \(\varepsilon\)). Hence kernel-HAC standard errors may be unnecessarily unreliable in small samples, even if they're accurate asymptotically. And again, time-series sample sizes are often smallish. 

(3) [Most crucially, kernel-HAC fails to capture invaluable predictive information.] Time series econometrics is intimately concerned with prediction, and explicit parametric modeling of dynamic heteroskedasticity and autocorrelation in \(\varepsilon\) can be used for improved prediction of \(y\). Autocorrelation can be exploited for improved point prediction, and dynamic conditional heteroskedasticity can be exploited for improved interval and density prediction. Punt on them and you're potentially leaving a huge amount of money on the table.
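A minimal worked case for the point-prediction part of (3), assuming purely for illustration that the disturbance is AR(1) with known parameters and that \(x_{T+1}\) is known:

\( \varepsilon_t = \rho \varepsilon_{t-1} + v_t, \quad v_t \sim iid(0, \sigma_v^2), \quad |\rho| < 1 \).
The optimal one-step-ahead point forecast then adds the predictable part of the disturbance,

\( \hat{y}_{T+1} = x_{T+1}' \beta + \rho \varepsilon_T \),
with forecast-error variance \(\sigma_v^2\). Forecasting with \(x_{T+1}' \beta\) alone leaves forecast-error variance \(\sigma_v^2 / (1-\rho^2)\), which is more than five times larger when \(\rho = 0.9\).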

The clearly preferable approach is traditional parametric disturbance heteroskedasticity / autocorrelation modeling, with GLS/ML estimation. Simply allow for ARMA(p,q)-GARCH(P,Q) disturbances (say), with p, q, P and Q selected by AIC (say). (In many applications something like AR(3)-GARCH(1,1) or ARMA(1,1)-GARCH(1,1) would be more than adequate.) Note that the traditional approach is actually fully non-parametric when appropriately viewed as a sieve, and moreover it features automatic bandwidth selection.
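To make the contrast concrete, here is a minimal Python sketch under strong simplifying assumptions: the "regression" is just a constant, so the only question is the standard error of a sample mean, and the parametric side is a plain AR(1) rather than a full ARMA-GARCH fit by GLS/ML. The function names, the bandwidth of 4 (a common rule-of-thumb value for n = 200) and the simulated data are all illustrative choices, not anything from the literature cited here.

```python
import math
import random


def autocovariance(e, lag):
    """Sample autocovariance of a demeaned series at the given lag (divisor n)."""
    n = len(e)
    return sum(e[t] * e[t - lag] for t in range(lag, n)) / n


def bartlett_lrv(x, bandwidth):
    """Kernel (Bartlett / Newey-West) estimate of the long-run variance of x."""
    mean = sum(x) / len(x)
    e = [v - mean for v in x]
    lrv = autocovariance(e, 0)
    for lag in range(1, bandwidth + 1):
        weight = 1.0 - lag / (bandwidth + 1.0)
        lrv += 2.0 * weight * autocovariance(e, lag)
    return lrv


def ar1_lrv(x):
    """Parametric long-run variance from a fitted AR(1): s2_v / (1 - rho)^2."""
    mean = sum(x) / len(x)
    e = [v - mean for v in x]
    rho = sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(v * v for v in e[:-1])
    resid = [e[t] - rho * e[t - 1] for t in range(1, len(e))]
    s2_v = sum(r * r for r in resid) / len(resid)
    return s2_v / (1.0 - rho) ** 2


def simulate_ar1(n, rho, seed):
    """Simulate n observations of a zero-mean AR(1) with unit-variance Gaussian shocks."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for _ in range(n + 100):  # the first 100 draws are burn-in
        prev = rho * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x[100:]


x = simulate_ar1(n=200, rho=0.8, seed=0)
n = len(x)
# The standard error of the sample mean is sqrt(LRV / n) under either approach.
print("true s.e.       :", math.sqrt(1.0 / (1.0 - 0.8) ** 2 / n))
print("kernel-HAC s.e. :", math.sqrt(bartlett_lrv(x, bandwidth=4) / n))
print("AR(1)-based s.e.:", math.sqrt(ar1_lrv(x) / n))
```

Changing the seed, the sample size, or the persistence shows how each standard error behaves in macro-sized samples. A serious comparison would of course use a proper Monte Carlo and the full regression setting.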

Kernel-HAC people call the traditional strategy "pre-whitening," to be done prior to kernel-HAC estimation. But the real point is that it's all -- or at least mostly all -- in the pre-whitening.

In closing, I might add that the view expressed here is strongly supported by top-flight research. On my point (2) and my general recommendation, for example, see the insightful work of den Haan and Levin (2000). It fell on curiously deaf ears and remains unpublished many years later. (It's on Wouter den Haan's web site in a section called "Sleeping and Hard to Get"!) In the interim much of the world jumped on the kernel-HAC bandwagon. It's time to jump off.

Sunday, October 25, 2015

Predictive Accuracy Rankings by MSE vs. MAE

We've all ranked forecast accuracy by mean squared error (MSE) and mean absolute error (MAE), the two great workhorses of relative accuracy comparison. MSE-rankings and MAE-rankings often agree, but they certainly don't have to -- they're simply different loss functions -- which is why we typically calculate and examine both.

Here's a trivially simple question: Under what conditions will MSE-rankings and MAE-rankings agree? It turns out that the answer is not at all trivial -- indeed it's unknown. Things get very difficult, very quickly.

With \(N(\mu, \sigma^2)\) forecast errors we have that

\( E(|e|) = \sigma \sqrt{2/\pi} \exp\left( -\frac{\mu^{2}}{2 \sigma^{2}}\right) + \mu \left[1-2 \Phi\left(-\frac{\mu}{\sigma} \right) \right], \)
where \(\Phi(\cdot)\) is the standard normal cdf. This relates MAE to the two components of MSE, bias (\(\mu\)) and variance (\(\sigma^2\)), but the relationship is complex. In the unbiased Gaussian case (\(\mu=0\)), the result collapses to \(MAE = \sigma \sqrt{2/\pi} \propto \sigma \) while \(MSE = \sigma^2\), so both are increasing in \(\sigma\) and MSE-rankings and MAE-rankings must agree. But the unbiased Gaussian case is very, very special, and little else is known.
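A quick numerical sketch of the formula above, in Python. It checks the expression against simulation and then shows one made-up pair of Gaussian forecasts, one unbiased and one biased, for which MSE and MAE rank the forecasts in opposite order. The parameter values and function names are illustrative assumptions only.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_mae(mu, sigma):
    """E|e| for e ~ N(mu, sigma^2), using the formula in the text."""
    return (sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu ** 2 / (2.0 * sigma ** 2))
            + mu * (1.0 - 2.0 * normal_cdf(-mu / sigma)))


def gaussian_mse(mu, sigma):
    """E(e^2) = bias^2 + variance."""
    return mu ** 2 + sigma ** 2


# Monte Carlo check of the formula for one biased case.
rng = random.Random(0)
draws = [abs(rng.gauss(0.9, 0.4)) for _ in range(200_000)]
print("formula:", gaussian_mae(0.9, 0.4), " simulated:", sum(draws) / len(draws))

# Forecast A is unbiased; forecast B is biased but has a small variance.
a = (0.0, 1.0)   # (mu, sigma), illustrative values
b = (0.9, 0.4)
print("MSE  A, B:", gaussian_mse(*a), gaussian_mse(*b))  # B has the lower MSE
print("MAE  A, B:", gaussian_mae(*a), gaussian_mae(*b))  # A has the lower MAE
```

So even within the Gaussian family, a little bias is enough to break the agreement.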

Some energetic grad student should crack this nut, giving necessary and sufficient conditions for identical MSE and MAE rankings in general environments. Two leads: See section 5 of Diebold-Shin (2014), who give a numerical characterization in the biased Gaussian case, and section 2 of Ardakani et al. (2015), who make analytical progress using the SED representation of expected loss.

Friday, October 23, 2015

Victor Zarnowitz and the Jewish Diaspora

Let me be clear: I'm a German/Irish Philadelphia Catholic (with a bit more mixed in...), a typical present-day product of nineteenth-century U.S. immigration. So what do I really know about the Jewish experience, and why am I fit to pontificate (so to speak)? Of course I'm not, but that never stopped me before.

Credible estimates suggest that between 1939 and the end of WWII, ten million Jews left Europe. One of them was Victor Zarnowitz. He and I and other close colleagues with similar interests, not least Glenn Rudebusch, saw a lot of each other in the eighties and nineties and zeros, learned in spades from each other, and immensely enjoyed the ride.

But that's just the professional side. For Victor's full story, see his Fleeing the Nazis, Surviving the Gulag, and Arriving in the Free World: My Life and Times. I cried when reading it several years ago. All I'll say is that you should get it and read it. His courage and strength are jaw-dropping and intensely inspirational. And he's just one among millions.

[The impetus for this post came from the outpouring of emails for another recent post that mentioned Victor. Thanks everyone for your memories. Sorry that I had to disable blog comments. Maybe someday I'll bring them back.]

Saturday, October 17, 2015

Athey and Imbens on Machine Learning and Econometrics

Check out Susan Athey and Guido Imbens' NBER Summer Institute 2015 "Lectures on Machine Learning". (Be sure to scroll down, as there are four separate videos.) I missed the lectures this summer, and I just remembered that they're on video. Great stuff, reflecting parts of an emerging blend of machine learning (ML), time-series econometrics (TSE) and cross-section econometrics (CSE).

The characteristics of ML are basically (1) emphasis on overall modeling, for prediction (as opposed, for example, to emphasis on inference), (2) moreover, emphasis on non-causal modeling and prediction, (3) emphasis on computationally-intensive methods and algorithmic development, and (4) emphasis on large and often high-dimensional datasets.

Readers of this blog will recognize the ML characteristics as closely matching those of TSE! Rob Engle's V-Lab at NYU Stern's Volatility Institute, for example, embeds all of (1)-(4). So TSE and ML have a lot to learn from each other, but the required bridge is arguably quite short.

Interestingly, Athey and Imbens come not from the TSE tradition, but rather from the CSE tradition, which typically emphasizes causal estimation and inference. That makes for a longer required CSE-ML bridge, but it may also make for a larger payoff from building and crossing it (in both directions).

In any event I share Athey and Imbens' excitement, and I welcome any and all cross-fertilization of ML, TSE and CSE.

Sunday, October 11, 2015

On Forecast Intervals "too Wide to be Useful"

I keep hearing people say things like this or that forecast interval is "too wide to be useful." 

In general, equating "wide" intervals with "useless" intervals is nonsense. A good (useful) forecast interval is one that's correctly conditionally calibrated; see Christoffersen (International Economic Review, 1998). If a correctly-conditionally-calibrated interval is wide, then so be it. If conditional risk is truly high, then a wide interval is appropriate and desirable.

[Note well:  The relevant calibration concept is conditional. It's not enough for a forecast interval to be merely correctly unconditionally calibrated, which means that an allegedly x percent interval actually winds up containing the realization x percent of the time. That's necessary, but not sufficient, for correct conditional calibration. Again, see Christoffersen.]
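A small Python sketch of the unconditional vs. conditional distinction, using a made-up hit sequence. It only looks at first-order dependence in the hits, which is a crude stand-in for the formal tests in Christoffersen's framework. The function names and data are hypothetical.

```python
def hit_sequence(realizations, lowers, uppers):
    """1 if the realization fell inside its forecast interval, else 0."""
    return [1 if lo <= y <= hi else 0
            for y, lo, hi in zip(realizations, lowers, uppers)]


def unconditional_coverage(hits):
    """Share of periods in which the interval contained the realization."""
    return sum(hits) / len(hits)


def _rate(values):
    """Mean of a 0/1 list, or NaN if the list is empty."""
    return sum(values) / len(values) if values else float("nan")


def coverage_given_previous(hits):
    """Coverage conditional on last period's outcome: (after a hit, after a miss)."""
    pairs = list(zip(hits, hits[1:]))
    after_hit = [cur for prev, cur in pairs if prev == 1]
    after_miss = [cur for prev, cur in pairs if prev == 0]
    return _rate(after_hit), _rate(after_miss)


# Made-up hits for a nominal 90% interval: 40 periods, all four misses bunched together.
clustered = [1] * 36 + [0] * 4

print("unconditional coverage:", unconditional_coverage(clustered))
print("coverage after a hit, after a miss:", coverage_given_previous(clustered))
```

The interval here is perfectly calibrated unconditionally (90 percent coverage), yet coverage after a miss is nowhere near 90 percent. Under correct conditional calibration both conditional rates should be close to the nominal level.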

Of course all this holds as well for density forecasts. Whether a density forecast is "good" has nothing to do with its dispersion. Rather, in precise parallel to interval forecasts, a good density forecast is one that's correctly conditionally calibrated; see Diebold, Gunther and Tay (International Economic Review, 1998).
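Here is a rough Python illustration of the dispersion point, using the probability integral transform (PIT) of the realizations under each density forecast. The setup is an assumption made for the sketch: realizations are iid N(0, 4), one forecaster issues the correct (wide) density and another issues a tighter but wrong one. A flat PIT histogram is only the unconditional part of the story; correct conditional calibration also requires the PITs to be serially independent.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pit(realization, mean, sd):
    """Probability integral transform of a realization under a Gaussian density forecast."""
    return normal_cdf((realization - mean) / sd)


def histogram(values, bins=5):
    """Counts of values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


rng = random.Random(0)
truth = [rng.gauss(0.0, 2.0) for _ in range(5000)]   # realizations are N(0, 2^2)

good = [pit(y, 0.0, 2.0) for y in truth]   # wide, and correct
bad = [pit(y, 0.0, 1.0) for y in truth]    # tighter, but too tight

print("forecast sd = 2:", histogram(good))   # roughly flat
print("forecast sd = 1:", histogram(bad))    # piles up in the extreme bins
```

The wide forecast is the good one, and the tight one is the bad one.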

Sunday, October 4, 2015

Whither Econometric Principal-Components Regressions?

Principal-components regression (PCR) is routine in applied time-series econometrics.

Why so much PCR, and so little ridge regression? Ridge and PCR are both shrinkage procedures involving PCs. The difference is that ridge effectively includes all PCs and shrinks according to sizes of associated eigenvalues, whereas PCR effectively shrinks some PCs completely to zero (those not included) and doesn't shrink others at all (those included).
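The comparison is easy to see numerically. In the sketch below (Python, with made-up eigenvalues of \(X'X\) and hypothetical function names), each number is the factor by which the least-squares fit along a given PC direction gets multiplied: \(d_j^2/(d_j^2+\lambda)\) for ridge with penalty \(\lambda\), and either 1 or 0 for PCR.

```python
def ridge_shrinkage(eigenvalues, penalty):
    """Ridge factor for each PC direction: d_j^2 / (d_j^2 + penalty),
    where d_j^2 is the j-th eigenvalue of X'X."""
    return [ev / (ev + penalty) for ev in eigenvalues]


def pcr_shrinkage(eigenvalues, n_components):
    """PCR factor: 1 for the n_components largest-eigenvalue PCs, 0 for the rest."""
    order = sorted(range(len(eigenvalues)), key=lambda j: eigenvalues[j], reverse=True)
    kept = set(order[:n_components])
    return [1.0 if j in kept else 0.0 for j in range(len(eigenvalues))]


eigenvalues = [8.0, 4.0, 1.0, 0.5]   # made-up eigenvalues of X'X

print("ridge:", ridge_shrinkage(eigenvalues, penalty=1.0))
print("PCR  :", pcr_shrinkage(eigenvalues, n_components=2))
```

Ridge shrinks smoothly, with low-variance directions shrunk most. PCR imposes a hard cutoff.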

Does not ridge resonate as more natural and appropriate? 

This recognition is hardly new or secret. It's in standard texts, like the beautiful Hastie et al. Elements of Statistical Learning.  

Econometricians should pay more attention to ridge.  

Thursday, October 1, 2015

Balke et al. on Real-Time Nowcasting

Check out the new paper, "Incorporating the Beige Book in a Quantitative Index of Economic Activity," by Nathan Balke, Michael Fulmer and Ren Zhang (BFZ).

[The Beige Book (BB) is a written description of U.S. economic conditions, produced by the Federal Reserve System. It is released eight times a year, roughly two weeks before each FOMC meeting.]

Basically BFZ include BB in an otherwise-standard FRB Philadelphia ADS Index.  Here's the abstract:  
We apply customized text analytics to the written description contained in the BB to obtain a quantitative measure of current economic conditions. This quantitative BB measure is then included into a dynamic factor index model that also contains other commonly used quantitative economic data. We find that at the time the BB is released, the BB has information about current economic activity not contained in other quantitative data. This is particularly the case during recessionary periods. However, by three weeks after its release date, "old" BB contain little additional information about economic activity not already contained in other quantitative data.

The paper is interesting for several reasons.

First, from a technical viewpoint, BFZ take mixed-frequency data to the max, because Beige Book releases are unequally spaced. Their modified ADS has quarterly, monthly, weekly, and now unequally-spaced, variables.  But the Kalman filter handles it all, seamlessly.   
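The reason the Kalman filter copes so easily is that a period with no release is just a period with no update step. Here is a toy Python sketch, assuming a scalar AR(1) state observed with noise. This is not BFZ's model or the actual ADS specification, and every name and parameter value is hypothetical.

```python
def kalman_filter(observations, phi, q, r, a0=0.0, p0=1.0):
    """Scalar Kalman filter for a_t = phi*a_{t-1} + eta_t, y_t = a_t + eps_t.

    q and r are the variances of eta_t and eps_t. An observation of None means
    nothing was released that period, so the update step is skipped.
    Returns a list of (filtered state, filtered variance) pairs.
    """
    a, p = a0, p0
    out = []
    for y in observations:
        # Prediction step: always performed.
        a, p = phi * a, phi * phi * p + q
        if y is not None:
            # Update step: only when an observation is available.
            gain = p / (p + r)
            a = a + gain * (y - a)
            p = (1.0 - gain) * p
        out.append((a, p))
    return out


# Hypothetical daily series with irregularly spaced releases (None = no release).
y = [1.0, None, None, 0.5, None, 0.8]
for t, (a, p) in enumerate(kalman_filter(y, phi=1.0, q=1.0, r=1.0)):
    print(t, round(a, 3), round(p, 3))
```

Between releases the state estimate is simply carried forward while its variance grows; each new release pulls the variance back down. The same logic extends to a vector of indicators observed at different frequencies.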

Second, including Beige Book -- basically "the view of the Federal Reserve System" -- is a novel and potentially large expansion of the nowcast information set.

Third, BFZ approach the evaluation problem in a very clever way, not revealed in the abstract. They view the initial ADS releases (with vs. without BB included) as forecasts of final-revised ADS (without BB included). They find large gains from including BB in estimating time t activity using time t vintage data, but little gain from including BB in estimating time t-30 (days) activity using time t vintage data. That is, including BB in ADS improves real-time nowcasting, even if it evidently adds little to retrospective historical assessment.