skip to main content
10.1145/1317331.1317337acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Optimal chunking of large multidimensional arrays for data warehousing

Published:09 November 2007Publication History

ABSTRACT

Very large multidimensional arrays are commonly used in data intensive scientific computations as well ason-line analytical processing applications referred to as MOLAP. The storage organization of such arrays on disks is done by partitioning the large global array into fixed size sub-arrays called chunks or tiles that form the units of data transfer between disk and memory. Typical queries involve the retrieval of sub-arrays in a manner that access all chunks that overlap the query results. An important metric of the storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is "what shapes of array chunks give theminimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker [11] who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real life workloads, show that our solutions are consistently within than 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of [11] can deviate considerably from the true result with increasing number of dimensions and also may lead suboptimal chunk shapes.

References

  1. A. V. Aho and J. D. Ullman. Optimal partial-match retrieval when fields are independently specified. ACM Trans. on Database Syst}, 4(2):168 -- 179, Jun. 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Goil and A. N. Choudhary. Sparse data storage schemes for multidimensional data for olap and data mining. Technical Report CPDC--TR--9801--005, Center for Parallel and Dist. Comput, Northwestern Univ., Evanston, IL--60208, 1997.Google ScholarGoogle Scholar
  4. J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1):29--53, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Hierachical Data Format (HDF) group. HDF5 User's Guide. National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana-Champaign, Illinois, Urbana-Champaign, Nov. 2004.Google ScholarGoogle Scholar
  6. H. V. Jagadish. Linear clustering of objects with multiple attributes. In SIGMOD '90: Proc. Int'l. Conf. on Management of Data, pages 332--342, New York, NY, USA, 1990. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. N. Karayannidis and T. Sellis. Sisyphus: The implementation of a chunk-based storage manager for olap data cubes. Data snf Knowl. Eng., 45(2):155--180, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. E. J. Otoo and D. Rotem. Efficient storage allocation of large-scale extendible multi-dimensional scientific datasets. In Proc. 18th Int'l. Conf. Scientific and Statistical Database Management (SSDBM'06), Vienna, Austria, Jul. 3 -- 5 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. E. J. Otoo, D. Rotem, and S. Seshadri. Chunking of large multidimensional arrays. Technical Report LBNL--63230, LBNL, University of California, I Cyclotron Road, Berkeley, CA 94720, USA, Jul 2006.Google ScholarGoogle Scholar
  10. D. Rotem and J. L. Zhao. Extendible arrays for statistical databases and {OLAP} applications. In 8th Int'l. Conf. on Sc. and Stat. Database Management (SSDBM'96), pages 108--117, Stockholm, Sweden, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Sarawagi and M. Stonebraker. Efficient organization of large multidimenional arrays. In Proc. 10th Int'l. Conf. Data Eng.}, pages 328 -- 336, Feb 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. K. E. Seamons and M. Winslett. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc. 7th Int'l. Conf. on Scientific and Statistical Database Management, pages 218--227, Washington, DC, USA, 1994. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Shoshani. OLAP and statistical databases: Similarities and differences. In Proc. ACM--PODS Conf., pages 185--196, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. ACM--SIGMOD Conf., pages 159--170, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Optimal chunking of large multidimensional arrays for data warehousing

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              DOLAP '07: Proceedings of the ACM tenth international workshop on Data warehousing and OLAP
              November 2007
              112 pages
              ISBN:9781595938275
              DOI:10.1145/1317331

              Copyright © 2007 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 9 November 2007

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article

              Acceptance Rates

              Overall Acceptance Rate29of79submissions,37%

              Upcoming Conference

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader