Wednesday, April 13, 2016

Big Data: Tall, Wide, and Dense

It strikes me that "tall", "wide", and "dense" might be useful words for conceptualizing aspects of Big Data relevant in time-series econometrics.

Think of a regression situation, with a (T x K) "X matrix" for T "days" (or whatever) of data for each of K variables. Now imagine sampling intra-day, m times per day. Then X is (mT x K). Big data correspond to huge-X situations arising because one or more of T, K, and m is huge. (Of course there will always be subjectivity associated with "how huge is huge".)
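To make the dimension arithmetic concrete, here is a minimal sketch in Python. The function name x_shape is my own, and the value m = 390 assumes a 6.5-hour trading day sampled once per minute; neither is part of the setup above, and other markets would give a different m.

```python
def x_shape(T, K, m=1):
    """Return (rows, columns) of X: m*T rows (observations), K columns (variables)."""
    return (m * T, K)


# Daily sampling (m = 1) versus 1-minute sampling.
# m = 390 is an illustrative assumption (6.5 trading hours x 60 minutes).
daily = x_shape(T=2500, K=5000)
intraday = x_shape(T=2500, K=5000, m=390)

print(daily)     # (2500, 5000)
print(intraday)  # (975000, 5000)
```

Note that raising m lengthens X just as raising T does, which is one reason to keep the two sources of row growth conceptually separate.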

T, K, and m are usefully considered separately.

-- As T gets large we have "tall data" (in reference to the tall X matrix, due to the large number of time periods, i.e., the long calendar span of data)

-- As K gets large we have "wide data" (in reference to the wide X matrix due to the large number of regressors)

-- As m gets large we have "dense data" (in reference to the high-frequency intra-day sampling, regardless of whether the data are tall)

A few examples:

--  Consider 2500 days of 1-minute returns for each of 5000 stocks.  The data are tall, wide, and dense.

--  Consider 25 days of 1-minute returns for each of 50 stocks.  The data are dense, but neither tall nor wide.

--  Consider 2500 days of daily returns for each of 5000 stocks.   The data are tall and wide, but not dense.
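The three examples can be labeled mechanically, as in the rough Python sketch below. The cutoffs are purely hypothetical, picked only so that the examples come out as described ("how huge is huge" remains subjective), and m = 390 again assumes a 6.5-hour trading day of 1-minute returns.

```python
# Hypothetical cutoffs, for illustration only; they are not proposed definitions.
TALL_MIN_T = 1000    # days
WIDE_MIN_K = 1000    # variables
DENSE_MIN_M = 100    # intra-day observations per day

MINUTES_PER_DAY = 390  # assumption: 6.5-hour trading day, 1-minute sampling


def describe(T, K, m):
    """Return the labels (tall, wide, dense) that apply to a data set."""
    labels = []
    if T >= TALL_MIN_T:
        labels.append("tall")
    if K >= WIDE_MIN_K:
        labels.append("wide")
    if m >= DENSE_MIN_M:
        labels.append("dense")
    return labels


examples = [
    (2500, 5000, MINUTES_PER_DAY),  # 2500 days, 1-minute returns, 5000 stocks
    (25, 50, MINUTES_PER_DAY),      # 25 days, 1-minute returns, 50 stocks
    (2500, 5000, 1),                # 2500 days, daily returns, 5000 stocks
]

for T, K, m in examples:
    print((T, K, m), describe(T, K, m), "X is", (m * T, K))
```

The point of treating T, K, and m as separate arguments is that each label depends on exactly one of them, even though T and m both feed the row count of X.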