Monday, April 21, 2014

On Kaggle Forecasting Competitions, Part 1: The Hold-Out Sample(s)

Kaggle competitions are potentially pretty cool. Kaggle supplies in-sample data ("training data"), and you build a model and forecast out-of-sample data that they withhold ("test data"). The winner gets a significant prize, often $100,000 or more. Kaggle typically runs several such competitions simultaneously.

The Kaggle paradigm is clever because it effectively removes the ability for modelers to peek at the test data, which is a key criticism of model-selection procedures that claim to insure against finite-sample over-fitting by use of split samples. (See my earlier post, Comparing Predictive Accuracy, Twenty Years Later, and the associated paper of the same name.)

Well, sort of. Actually, Kaggle reveals part of the test data. In the time before a competition deadline, participants are typically allowed to submit one forecast per day, which Kaggle scores against part of the test data. Then, when the deadline arrives, forecasts are actually scored against the remaining test data. Suppose, for example, that there are 100 observations in total. Kaggle gives you 1, ..., 60 (training) and holds out 61, ..., 100 (test). But each day before the deadline, you can submit a forecast for 61, ..., 75, which they score against the held-out realizations of 61, ..., 75 and use to update the "leaderboard." Then when the deadline arrives, you submit your forecast for 61, ..., 100, but they score it only against the truly held-out realizations 76, ..., 100. So honesty is enforced for 76, ..., 100 (good), but convoluted games are played with 61, ..., 75 (bad). Is having a leaderboard really that important? Why not cut the games? Simply give people 1, ..., 75 and ask them to forecast 76, ..., 100.
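The split described above can be sketched in a few lines of code. This is an illustrative toy, not Kaggle's actual scoring code: the observation numbers come from the example in the text, while the scoring rule (mean squared error) and the function names are my own assumptions.

```python
# Toy sketch of the split described above (NOT Kaggle's actual code):
# observations 1..60 are training data, 61..75 drive the daily public
# leaderboard, and 76..100 are the truly held-out private test set.
# Mean squared error is an assumed scoring rule for illustration.

def mse(forecast, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual)

def score_submission(forecast_61_to_100, actual_61_to_100):
    """Score a 40-observation forecast for periods 61..100.

    Returns (public_score, private_score): the public score uses only
    periods 61..75 and updates the leaderboard before the deadline;
    the private score uses periods 76..100 and decides the winner.
    """
    public = mse(forecast_61_to_100[:15], actual_61_to_100[:15])   # 61..75
    private = mse(forecast_61_to_100[15:], actual_61_to_100[15:])  # 76..100
    return public, private

# A forecast tuned to the revealed slice can top the leaderboard yet
# fail on the private slice -- the "convoluted games" noted above.
actual = [float(t) for t in range(61, 101)]
gamed = [float(t) for t in range(61, 76)] + [0.0] * 25  # perfect on 61..75
public_score, private_score = score_submission(gamed, actual)
```

The point of the toy example is that repeated daily feedback on the 61, ..., 75 slice lets participants effectively fit that slice, so only the 76, ..., 100 scores remain an honest out-of-sample evaluation.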

To be continued.