Think of a regression situation, with a (T x K) "X matrix" for T "days" (or whatever) of data for each of K variables. Now imagine sampling intra-day, m times per day. Then X is (mT x K). Big data correspond to huge-X situations arising because one or more of T, K, and m is huge. (Of course there will always be subjectivity associated with "how huge is huge".)
T, K, and m are usefully considered separately.
-- As T gets large we have "tall data" (in reference to the tall X matrix, due to the large number of time periods, i.e., the long calendar span of data)
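-- As K gets large we have "wide data" (in reference to the wide X matrix, due to the large number of regressors)
-- As m gets large we have "dense data" (in reference to the high-frequency intra-day sampling within each day)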
A few examples:
-- Consider 2500 days of 1-minute returns for each of 5000 stocks. The data are tall, wide and dense.
-- Consider 25 days of 1-minute returns for each of 50 stocks. The data are dense, but neither tall nor wide.
-- Consider 2500 days of daily returns for each of 5000 stocks. The data are tall and wide, but not dense.
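To make the three examples concrete, here is a rough Python sketch that computes the (mT x K) shape of X for each one and attaches the tall/wide/dense labels. Two things in it are my assumptions, not part of the definitions above: 390 one-minute observations per day (a 6.5-hour trading session), and the numeric cutoffs for "huge", which are purely illustrative given that "how huge is huge" is subjective. The `Panel` class and its method names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: a 6.5-hour trading session gives 390 one-minute returns per day.
MINUTES_PER_DAY = 390


@dataclass(frozen=True)
class Panel:
    T: int  # number of days (calendar span)
    K: int  # number of variables (e.g., stocks)
    m: int  # intra-day observations per day

    def shape(self):
        """Dimensions of the X matrix: (mT x K)."""
        return (self.m * self.T, self.K)

    def labels(self, tall_at=1000, wide_at=1000, dense_at=100):
        """Label the panel using illustrative, subjective cutoffs."""
        out = []
        if self.T >= tall_at:
            out.append("tall")
        if self.K >= wide_at:
            out.append("wide")
        if self.m >= dense_at:
            out.append("dense")
        return out


examples = {
    "2500 days, 1-minute, 5000 stocks": Panel(T=2500, K=5000, m=MINUTES_PER_DAY),
    "25 days, 1-minute, 50 stocks": Panel(T=25, K=50, m=MINUTES_PER_DAY),
    "2500 days, daily, 5000 stocks": Panel(T=2500, K=5000, m=1),
}

for name, panel in examples.items():
    rows, cols = panel.shape()
    print(f"{name}: X is ({rows} x {cols}), labels = {panel.labels()}")
```

Under these assumptions the first X has 975,000 rows and 5000 columns, while the third has the same T and K but only 2500 rows, which is the sense in which it is tall and wide but not dense. Changing the cutoffs changes the labels, which is exactly the subjectivity noted at the start.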