Sunday, October 16, 2016

Machine Learning vs. Econometrics, III

I emphasized here that both machine learning (ML) and econometrics (E) prominently feature prediction, one distinction being that ML tends to focus on non-causal prediction, whereas a significant part of E focuses on causal prediction. So they're both focused on prediction, but there's a non-causal vs. causal distinction.  [Alternatively, as Dean Foster notes, you can think of both ML and E as focused on estimation, but with different estimands.  ML tends to focus on estimating conditional expectations, whereas the causal part of E focuses on estimating partial derivatives.]

In any event, there's another key distinction between much of ML and Econometrics/Statistics (E/S):  E/S tends to be more concerned with probabilistic assessment of uncertainty.  Whereas ML is often satisfied with point forecasts, E/S often wants interval, and ultimately density, forecasts.

There are at least two classes of reasons for the difference.

First, E/S recognizes that uncertainty is often of intrinsic economic interest.  Think market risk, credit risk, counter-party risk, systemic risk, inflation risk, business cycle risk, etc.

Second, E/S is evidently uncomfortable with ML's implicit certainty-equivalence approach of simply plugging point forecasts into decision rules obtained under perfect foresight.  Evidently the linear-quadratic-Gaussian world in which certainty equivalence holds resonates less than completely with E/S types.  That sounds right to me.  [By the way, see my earlier piece on optimal prediction under asymmetric loss.]
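The point vs. interval vs. density distinction above is easy to make concrete. A minimal sketch, assuming for illustration a Gaussian one-step-ahead forecast density (the numbers are made up):

```python
import math

# Suppose a model delivers a Gaussian one-step-ahead forecast density
# (an illustrative assumption): mean mu, standard deviation sigma.
mu, sigma = 2.0, 0.5

# Point forecast: just the mean (ML often stops here).
point = mu

# 95% interval forecast: mean +/- 1.96 standard deviations.
lo, hi = mu - 1.96 * sigma, mu + 1.96 * sigma

# Density forecast: the full predictive distribution, e.g. its pdf.
def forecast_pdf(y):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

print(point)      # 2.0
print((lo, hi))   # approximately (1.02, 2.98)
```

The interval and density carry exactly the uncertainty information that a certainty-equivalence plug-in of the point forecast throws away.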

Monday, October 10, 2016

Machine Learning vs. Econometrics, II

My last post focused on one key distinction between machine learning (ML) and econometrics (E):   non-causal ML prediction vs. causal E prediction.  I promised later to highlight another, even more important, distinction.  I'll get there in the next post.

But first let me note a key similarity.  ML vs. E in terms of non-causal vs. causal prediction is really only comparing ML to "half" of E (the causal part).  The other part of E (and of course statistics, so let's call it E/S), going back a century or so, focuses on non-causal prediction, just like ML.  The leading example is time-series E/S.  Just take a look at an E/S text like Elliott and Timmermann (contents and first chapter here; index here).  A lot of it looks like parts of ML.  But it's not "E/S people chasing ML ideas"; rather, E/S has been in the game for decades, often well ahead of ML.

For this reason much of the E/S crowd views "ML" and "data science" as the same old wine in a new bottle.  (The joke goes, Q: What is a data scientist?  A: A statistician who lives in San Francisco.)  ML/DataScience is not the same old wine, but it's a blend, and a significant part of the blend is indeed E/S.

To be continued...

Sunday, October 2, 2016

Machine Learning vs. Econometrics, I

[If you're reading this in email, remember to click through on the title to get the math to render.]

Machine learning (ML) is almost always centered on prediction; think "$$\hat{y}$$".   Econometrics (E) is often, but not always, centered on prediction.  Instead it is often interested in estimation and associated inference; think "$$\hat{\beta}$$".

Or so the story usually goes. But that misses the real distinction. Both ML and E as described above are centered on prediction.  The key difference is that ML focuses on non-causal prediction (if a new person $$i$$ arrives with covariates $$X_i$$, what is my minimum-MSE guess of her $$y_i$$?), whereas the part of econometrics highlighted above focuses on causal prediction (if I intervene and give person $$i$$ a certain treatment, what is my minimum-MSE guess of $$\Delta y_i$$?).
It just happens that, assuming linearity, a "minimum-MSE guess of $$\Delta y_i$$" is the same as a "minimum-MSE estimate of $$\beta_i$$".
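The non-causal vs. causal distinction can be seen in a toy simulation. A minimal sketch under an assumed DGP of my own construction (not from the post): a confounder $$z$$ drives both the "treatment" $$x$$ and the outcome $$y$$, so the best predictive slope differs from the causal one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative DGP (an assumption for this sketch): confounder z drives
# both the "treatment" x and the outcome y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true causal effect of x is 1.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Non-causal prediction: E[y | x]. The slope is the best predictive
# coefficient, not the causal one -- it absorbs z's influence.
b_pred = ols(np.column_stack([np.ones(n), x]), y)[1]

# Causal prediction: condition on the confounder (or randomize x).
b_causal = ols(np.column_stack([np.ones(n), x, z]), y)[1]

print(b_pred)    # close to 2.0: great for predicting y, wrong for interventions
print(b_causal)  # close to 1.0: the minimum-MSE guess of the effect of treating i
```

Both regressions are "prediction"; they simply answer different questions, exactly the non-causal vs. causal split above.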

So there is a ML vs. E distinction here, but it's not "prediction vs. estimation" -- it's all prediction.  Instead, the issue is non-causal prediction vs. causal prediction.

But there's another ML vs. E difference that's even more fundamental.  TO BE CONTINUED...

Monday, September 26, 2016

Fascinating Conference at Chicago

I just returned from the University of Chicago conference, "Machine Learning: What's in it for Economics?"  Lots of cool things percolating.  I'm teaching a Penn Ph.D. course later this fall on aspects of the ML/econometrics interface.  Feeling really charged.

By the way, hadn't yet been to the new Chicago economics "cathedral" (Saieh Hall for Economics) and Becker-Friedman Institute.  Wow.  What an institution, both intellectually and physically.

Tuesday, September 20, 2016

On "Shorter Papers"

Journals should not corral shorter papers into sections like "Shorter Papers".  Doing so sends a subtle (actually unsubtle) message that shorter papers are basically second-class citizens, somehow less good, or less important, or less something -- not just less long -- than longer papers.  If a paper is above the bar, then it's above the bar, and regardless of its length it should then be published simply as a paper, not a "shorter paper", or a "note", or anything else.  Many shorter papers are much more important than the vast majority of longer papers.

Monday, September 12, 2016

Time-Series Econometrics and Climate Change

It's exciting to see time series econometrics contributing to the climate change discussion.

Check out the upcoming CREATES conference, "Econometric Models of Climate Change", here.

Here are a few good examples of recent time-series climate research, in chronological order.  (There are many more.  Look through the reference lists, for example, in the 2016 and 2017 papers below.)

Jim Stock et al. (2009) in Climatic Change.

Pierre Perron et al. (2013) in Nature.

Peter Phillips et al. (2016) in Nature.

Proietti and Hillebrand (2017), forthcoming in Journal of the Royal Statistical Society.

Tuesday, September 6, 2016

Inane Journal "Impact Factors"

Why are journals so obsessed with "impact factors"? (The five-year impact factor is average citations/article in a five-year window.)  They're often calculated to three decimal places, and publishers trumpet victory when they go from (say) 1.225 to 1.311!  It's hard to think of a dumber statistic, or dumber over-interpretation.  Are the numbers after the decimal point anything more than noise, and for that matter, are the numbers before the decimal much more than noise?

Why don't journals instead use the same citation indexes used for individuals? The leading index seems to be the h-index, which is the largest integer h such that an individual has h papers, each cited at least h times. I don't know who cooked up the h-index, and surely it has issues too, but the gurus love it, and in my experience it tells the truth.

Even better, why not stop obsessing over clearly-insufficient statistics of any kind? I propose instead looking at what I'll call a "citation signature plot" (CSP), simply plotting the number of cites for the most-cited paper, the number of cites for the second-most-cited paper, and so on. (Use whatever window(s) you want.) The CSP reveals everything, instantly and visually. How high is the CSP for the top papers? How quickly, and with what pattern, does it approach zero? etc., etc. It's all there.
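Both statistics are trivial to compute from a list of per-paper citation counts. A minimal sketch (the citation counts are made up for illustration):

```python
# Per-paper citation counts for some journal or individual
# (made-up numbers for illustration).
cites = [310, 150, 90, 40, 22, 9, 7, 3, 1, 0]

def h_index(cites):
    # Largest h such that h papers each have at least h citations.
    # After sorting in descending order, the condition c_i >= i holds
    # for an initial run of papers; h is the length of that run.
    s = sorted(cites, reverse=True)
    return sum(1 for i, c in enumerate(s, start=1) if c >= i)

# The citation signature plot (CSP) is just the citation counts sorted
# in descending order: most-cited paper first, then second-most, etc.
csp = sorted(cites, reverse=True)

print(h_index(cites))  # 7
print(csp[:3])         # [310, 150, 90]
```

Plotting `csp` against rank gives the CSP directly; its height and rate of decay toward zero convey far more than any single index.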

Google-Scholar CSP's are easy to make for individuals, and they're tremendously informative. They'd be only slightly harder to make for journals. I'd love to see some.

Monday, August 29, 2016

On Credible Cointegration Analyses

I may not know whether some $$I(1)$$ variables are cointegrated, but if they are, I often have a very strong view about the likely number and nature of cointegrating combinations. Single-factor structure is common in many areas of economics and finance, so if cointegration is present in an $$N$$-variable system, for example, a natural benchmark is 1 common trend ($$N-1$$ cointegrating combinations).  Moreover, the natural cointegrating combinations are almost always spreads or ratios (which of course are spreads in logs). For example, log consumption and log income may or may not be cointegrated, but if they are, then the obvious benchmark cointegrating combination is $$(\ln C - \ln Y)$$. Similarly, the obvious benchmark for $$N$$ government bond yields $$y$$ is $$N-1$$ cointegrating combinations, given by term spreads relative to some reference yield; e.g., $$y_2 - y_1$$, $$y_3 - y_1$$, ..., $$y_N - y_1$$.
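The single-common-trend benchmark is easy to simulate. A minimal sketch, assuming an illustrative DGP of my own (two log series sharing one random-walk trend, so the spread is the stationary cointegrating combination):

```python
import random

random.seed(1)
T = 5000

# Illustrative single-common-trend DGP (an assumption for this sketch):
# ln C and ln Y share one random-walk trend tau, plus stationary noise.
tau, lnC, lnY = 0.0, [], []
for _ in range(T):
    tau += random.gauss(0.0, 1.0)              # common stochastic trend
    lnC.append(tau + random.gauss(0.0, 1.0))   # ln C = trend + noise
    lnY.append(tau + random.gauss(0.0, 1.0))   # ln Y = trend + noise

# The benchmark cointegrating combination: the (log) spread.
spread = [c - y for c, y in zip(lnC, lnY)]

def var(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# The levels wander (sample variance grows with T); the spread does not.
print(var(lnC) > 10 * var(spread))  # True: the spread is stationary
```

In an $$N$$-variable version of this DGP, the same logic delivers $$N-1$$ stationary spreads against a reference series, i.e., $$N-1$$ prespecified cointegrating vectors of the Horvath-Watson type.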

There's not much literature exploring this perspective. (One notable exception is Horvath and Watson, "Testing for Cointegration When Some of the Cointegrating Vectors are Prespecified", Econometric Theory, 11, 952-984.) We need more.

Sunday, August 21, 2016

More on Big Data and Mixed Frequencies

I recently blogged on Big Data and mixed-frequency data, arguing that Big Data (wide data, in particular) leads naturally to mixed-frequency data.  (See here for the tall data / wide data / dense data taxonomy.)  The obvious just occurred to me, namely that it's also true in the other direction. That is, mixed-frequency situations also lead naturally to Big Data, and with a subtle twist: the nature of the Big Data may be dense rather than wide. The theoretically-pure way to set things up is as a state-space system laid out at the highest observed frequency, appropriately treating most of the lower-frequency data as missing, as in ADS.  By construction, the system is dense if any of the series are dense, as the system is laid out at the highest frequency.
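The "laid out at the highest observed frequency" layout is simple to picture. A minimal sketch with made-up monthly and quarterly numbers (the ADS index itself comes from a full state-space model; this only shows the data arrangement):

```python
# Lay the system out at the highest observed frequency (monthly here),
# treating the lower-frequency series as missing in between observations.
# (Illustrative numbers, not real data.)
monthly = [1.2, 0.8, 1.1, 0.9, 1.3, 1.0]   # observed every month
quarterly_obs = [2.5, 2.7]                 # observed every third month

NA = float("nan")
quarterly = [NA] * len(monthly)
for q, val in enumerate(quarterly_obs):
    quarterly[3 * q + 2] = val  # quarterly value lands on the quarter-end month

print(quarterly)  # [nan, nan, 2.5, nan, nan, 2.7]
```

A Kalman filter run over this layout handles the missing entries automatically, which is why the system inherits the density of its densest series.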

Wednesday, August 17, 2016

On the Evils of Hodrick-Prescott Detrending

[If you're reading this in email, remember to click through on the title to get the math to render.]

Jim Hamilton has a very cool new paper, "Why You Should Never Use the Hodrick-Prescott (HP) Filter".

Of course we've known of the pitfalls of HP ever since Cogley and Nason (1995) brought them into razor-sharp focus decades ago.  The title of the even-earlier Nelson and Kang (1981) classic, "Spurious Periodicity in Inappropriately Detrended Time Series", says it all.  Nelson-Kang made the spurious-periodicity case against polynomial detrending of I(1) series.  Hamilton makes the spurious-periodicity case against HP detrending of many types of series, including I(1).  (Or, more precisely, Hamilton adds even more weight to the Cogley-Nason spurious-periodicity case against HP.)

But the main contribution of Hamilton's paper is constructive, not destructive.  It provides a superior detrending method, based only on a simple linear projection.

Here's a way to understand what "Hamilton detrending" does and why it works, based on a nice connection to Beveridge-Nelson (1981) detrending not noticed in Hamilton's paper.

First consider Beveridge-Nelson (BN) trend for I(1) series.  BN trend is just a very long-run forecast based on an infinite past.  [You want a very long-run forecast in the BN environment because the stationary cycle washes out from a very long-run forecast, leaving just the forecast of the underlying random-walk stochastic trend, which is also the current value of the trend since it's a random walk.  So the BN trend at any time is just a very long-run forecast made at that time.]  Hence BN trend is implicitly based on the projection: $$y_t ~ \rightarrow ~ c, ~ y_{t-h}, ~...,~ y_{t-h-p}$$, for $$h \rightarrow \infty$$ and $$p \rightarrow \infty$$.

Now consider Hamilton trend.  It is explicitly based on the projection: $$y_t ~ \rightarrow ~ c, ~ y_{t-h}, ~...,~ y_{t-h-p}$$, for $$p = 3$$.  (Hamilton also uses a benchmark of  $$h = 8$$.)

So BN and Hamilton are both "linear projection trends", differing only in choice of $$h$$ and $$p$$!  BN takes an infinite forecast horizon and projects on an infinite past.  Hamilton takes a medium forecast horizon and projects on just the recent past.

Much of Hamilton's paper is devoted to defending the choice of $$p = 3$$, which turns out to perform well for a wide range of data-generating processes (not just I(1)).  The BN choice of $$h = p = \infty$$, in contrast, although optimal for I(1) series, is less robust to other DGP's.  (And of course estimation of the BN projection as written above is infeasible, which people avoid in practice by assuming low-ordered ARIMA structure.)
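The Hamilton projection is just one OLS regression. A minimal sketch on a simulated random walk with drift (an illustrative series, not one of Hamilton's applications), using his benchmark $$h = 8$$, $$p = 3$$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative I(1) series: a random walk with drift.
T = 400
y = np.cumsum(0.1 + rng.normal(size=T))

# Hamilton's projection: regress y_t on a constant and
# y_{t-h}, ..., y_{t-h-p}, with benchmark h = 8, p = 3.
h, p = 8, 3
rows = list(range(h + p, T))
X = np.column_stack([np.ones(len(rows))] +
                    [y[[t - h - j for t in rows]] for j in range(p + 1)])
yy = y[rows]

beta, *_ = np.linalg.lstsq(X, yy, rcond=None)
trend = X @ beta    # fitted values: the "Hamilton trend"
cycle = yy - trend  # residuals: the detrended series

# Including a constant forces the estimated cycle to have mean zero.
print(abs(cycle.mean()) < 1e-8)  # True
```

Replacing $$h = 8$$, $$p = 3$$ with $$h \rightarrow \infty$$, $$p \rightarrow \infty$$ in the same regression gives the (infeasible) BN projection, which is exactly the connection drawn above.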