Monday, July 25, 2016

The Action is in Wide and/or Dense Data

I recently blogged on varieties of Big Data: (1) tall, (2) wide, and (3) dense.

Presumably tall data are the least interesting insofar as the only way to get a long calendar span is to sit around and wait, in contrast to wide and dense data, which now appear routinely.

But it occurs to me that tall data are also the least interesting for another reason: wide data make tall data impossible from a certain perspective. In particular, non-parametric estimation in high dimensions (that is, with wide data) is always subject to the fundamental and inescapable "curse of dimensionality": the rate at which estimation error vanishes gets hopelessly slow, very quickly, as dimension grows. [Wonks will recall that the Stone-optimal rate in \(d\) dimensions, for twice-differentiable functions, is \( \sqrt{T^{1- \frac{d}{d+4}}} = T^{\frac{2}{d+4}}\), as against the familiar parametric \( \sqrt{T} \).]

The upshot:  As our datasets get wider, they also implicitly get less tall. That's all the more reason to downplay tall data.  The action is in wide and dense data (whether separately or jointly).
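
To put rough numbers on "less tall," here is a minimal sketch. It rests on one reading of the Stone rate, an assumption made for illustration: the quantity under the square root, \( T^{1- \frac{d}{d+4}} \), is treated as an "effective" sample size, that is, the number of observations that would deliver the same rate under the parametric \( \sqrt{T} \) benchmark. The function name `effective_tallness` and the sample size of 10,000 are hypothetical.

```python
def effective_tallness(T: float, d: int) -> float:
    """Return T**(1 - d/(d+4)), the quantity under the square root
    in the Stone-optimal rate sqrt(T**(1 - d/(d+4))).

    Assumption (illustrative only): this is read as the sample size that
    would give the same rate under the parametric sqrt(T) benchmark.
    """
    if T <= 0 or d < 0:
        raise ValueError("need T > 0 and d >= 0")
    return T ** (1.0 - d / (d + 4.0))


# Hypothetical dataset: 10,000 observations, increasingly wide.
T = 10_000
for d in (1, 2, 5, 10, 20, 50):
    print(f"d = {d:2d}: effective tallness ~ {effective_tallness(T, d):8.1f}")
```

Under this reading, effective tallness falls monotonically in \(d\) for any fixed \(T > 1\). At \(d = 4\), for example, 10,000 observations are worth only \( \sqrt{10{,}000} = 100 \).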