ABSTRACT
Very large multidimensional arrays are commonly used in data intensive scientific computations as well ason-line analytical processing applications referred to as MOLAP. The storage organization of such arrays on disks is done by partitioning the large global array into fixed size sub-arrays called chunks or tiles that form the units of data transfer between disk and memory. Typical queries involve the retrieval of sub-arrays in a manner that access all chunks that overlap the query results. An important metric of the storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is "what shapes of array chunks give theminimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker [11] who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real life workloads, show that our solutions are consistently within than 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of [11] can deviate considerably from the true result with increasing number of dimensions and also may lead suboptimal chunk shapes.
- A. V. Aho and J. D. Ullman. Optimal partial-match retrieval when fields are independently specified. ACM Trans. on Database Syst}, 4(2):168 -- 179, Jun. 1979. Google ScholarDigital Library
- S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004. Google ScholarDigital Library
- S. Goil and A. N. Choudhary. Sparse data storage schemes for multidimensional data for olap and data mining. Technical Report CPDC--TR--9801--005, Center for Parallel and Dist. Comput, Northwestern Univ., Evanston, IL--60208, 1997.Google Scholar
- J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1):29--53, 1997. Google ScholarDigital Library
- Hierachical Data Format (HDF) group. HDF5 User's Guide. National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana-Champaign, Illinois, Urbana-Champaign, Nov. 2004.Google Scholar
- H. V. Jagadish. Linear clustering of objects with multiple attributes. In SIGMOD '90: Proc. Int'l. Conf. on Management of Data, pages 332--342, New York, NY, USA, 1990. ACM Press. Google ScholarDigital Library
- N. Karayannidis and T. Sellis. Sisyphus: The implementation of a chunk-based storage manager for olap data cubes. Data snf Knowl. Eng., 45(2):155--180, 2003. Google ScholarDigital Library
- E. J. Otoo and D. Rotem. Efficient storage allocation of large-scale extendible multi-dimensional scientific datasets. In Proc. 18th Int'l. Conf. Scientific and Statistical Database Management (SSDBM'06), Vienna, Austria, Jul. 3 -- 5 2006. Google ScholarDigital Library
- E. J. Otoo, D. Rotem, and S. Seshadri. Chunking of large multidimensional arrays. Technical Report LBNL--63230, LBNL, University of California, I Cyclotron Road, Berkeley, CA 94720, USA, Jul 2006.Google Scholar
- D. Rotem and J. L. Zhao. Extendible arrays for statistical databases and {OLAP} applications. In 8th Int'l. Conf. on Sc. and Stat. Database Management (SSDBM'96), pages 108--117, Stockholm, Sweden, 1996. Google ScholarDigital Library
- S. Sarawagi and M. Stonebraker. Efficient organization of large multidimenional arrays. In Proc. 10th Int'l. Conf. Data Eng.}, pages 328 -- 336, Feb 1994. Google ScholarDigital Library
- K. E. Seamons and M. Winslett. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc. 7th Int'l. Conf. on Scientific and Statistical Database Management, pages 218--227, Washington, DC, USA, 1994. IEEE Computer Society. Google ScholarDigital Library
- A. Shoshani. OLAP and statistical databases: Similarities and differences. In Proc. ACM--PODS Conf., pages 185--196, 1997. Google ScholarDigital Library
- Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. ACM--SIGMOD Conf., pages 159--170, 1997. Google ScholarDigital Library
Index Terms
- Optimal chunking of large multidimensional arrays for data warehousing
Recommendations
Multilevel chunking of multidimensional arrays
AICCSA '05: Proceedings of the ACS/IEEE 2005 International Conference on Computer Systems and ApplicationsSummary form only given. Multidimensional arrays have become very common in scientific and business applications. Such arrays are usually very large and hence they are stored on secondary or tertiary storage systems. For processing, various parts of a ...
Optimal chunking and partial caching in information-centric networks
Caching is widely used to reduce network traffic and improve user experience. Traditionally caches store complete objects, but video files and the recent emergence of information-centric networking have highlighted a need for understanding how partial ...
Chunked extendible dense arrays for scientific data storage
Several meetings of the Extremely Large Databases Community for large scale scientific applications advocate the use of multidimensional arrays as the appropriate model for representing scientific databases. Scientific databases gradually grow to ...
Comments