skip to main content
10.1145/1317331.1317337acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Optimal chunking of large multidimensional arrays for data warehousing

Published: 09 November 2007 Publication History

Abstract

Very large multidimensional arrays are commonly used in data intensive scientific computations as well ason-line analytical processing applications referred to as MOLAP. The storage organization of such arrays on disks is done by partitioning the large global array into fixed size sub-arrays called chunks or tiles that form the units of data transfer between disk and memory. Typical queries involve the retrieval of sub-arrays in a manner that access all chunks that overlap the query results. An important metric of the storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is "what shapes of array chunks give theminimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker [11] who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real life workloads, show that our solutions are consistently within than 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of [11] can deviate considerably from the true result with increasing number of dimensions and also may lead suboptimal chunk shapes.

References

[1]
A. V. Aho and J. D. Ullman. Optimal partial-match retrieval when fields are independently specified. ACM Trans. on Database Syst}, 4(2):168 -- 179, Jun. 1979.
[2]
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.
[3]
S. Goil and A. N. Choudhary. Sparse data storage schemes for multidimensional data for olap and data mining. Technical Report CPDC--TR--9801--005, Center for Parallel and Dist. Comput, Northwestern Univ., Evanston, IL--60208, 1997.
[4]
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1):29--53, 1997.
[5]
Hierachical Data Format (HDF) group. HDF5 User's Guide. National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana-Champaign, Illinois, Urbana-Champaign, Nov. 2004.
[6]
H. V. Jagadish. Linear clustering of objects with multiple attributes. In SIGMOD '90: Proc. Int'l. Conf. on Management of Data, pages 332--342, New York, NY, USA, 1990. ACM Press.
[7]
N. Karayannidis and T. Sellis. Sisyphus: The implementation of a chunk-based storage manager for olap data cubes. Data snf Knowl. Eng., 45(2):155--180, 2003.
[8]
E. J. Otoo and D. Rotem. Efficient storage allocation of large-scale extendible multi-dimensional scientific datasets. In Proc. 18th Int'l. Conf. Scientific and Statistical Database Management (SSDBM'06), Vienna, Austria, Jul. 3 -- 5 2006.
[9]
E. J. Otoo, D. Rotem, and S. Seshadri. Chunking of large multidimensional arrays. Technical Report LBNL--63230, LBNL, University of California, I Cyclotron Road, Berkeley, CA 94720, USA, Jul 2006.
[10]
D. Rotem and J. L. Zhao. Extendible arrays for statistical databases and {OLAP} applications. In 8th Int'l. Conf. on Sc. and Stat. Database Management (SSDBM'96), pages 108--117, Stockholm, Sweden, 1996.
[11]
S. Sarawagi and M. Stonebraker. Efficient organization of large multidimenional arrays. In Proc. 10th Int'l. Conf. Data Eng.}, pages 328 -- 336, Feb 1994.
[12]
K. E. Seamons and M. Winslett. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc. 7th Int'l. Conf. on Scientific and Statistical Database Management, pages 218--227, Washington, DC, USA, 1994. IEEE Computer Society.
[13]
A. Shoshani. OLAP and statistical databases: Similarities and differences. In Proc. ACM--PODS Conf., pages 185--196, 1997.
[14]
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. ACM--SIGMOD Conf., pages 159--170, 1997.

Cited By

View all
  • (2022)Replicated layout for in-memory database systemsProceedings of the VLDB Endowment10.14778/3503585.350360615:4(984-997)Online publication date: 14-Apr-2022
  • (2022)Efficient Partitioning Method for Optimizing the Compression on Array DataJournal of Computer Science and Technology10.1007/s11390-022-2371-737:5(1049-1067)Online publication date: 30-Sep-2022
  • (2022)Chunk-oriented dimension ordering for efficient range query processing on sparse multidimensional dataWorld Wide Web10.1007/s11280-022-01098-z26:4(1395-1433)Online publication date: 9-Sep-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
DOLAP '07: Proceedings of the ACM tenth international workshop on Data warehousing and OLAP
November 2007
112 pages
ISBN:9781595938275
DOI:10.1145/1317331
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 November 2007

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. chunking
  2. data warehousing
  3. multi-dimensional arrays

Qualifiers

  • Research-article

Conference

CIKM07

Acceptance Rates

Overall Acceptance Rate 29 of 79 submissions, 37%

Upcoming Conference

CIKM '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)8
  • Downloads (Last 6 weeks)1
Reflects downloads up to 07 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Replicated layout for in-memory database systemsProceedings of the VLDB Endowment10.14778/3503585.350360615:4(984-997)Online publication date: 14-Apr-2022
  • (2022)Efficient Partitioning Method for Optimizing the Compression on Array DataJournal of Computer Science and Technology10.1007/s11390-022-2371-737:5(1049-1067)Online publication date: 30-Sep-2022
  • (2022)Chunk-oriented dimension ordering for efficient range query processing on sparse multidimensional dataWorld Wide Web10.1007/s11280-022-01098-z26:4(1395-1433)Online publication date: 9-Sep-2022
  • (2021)Decentralized Storage for Scientific Data2021 IEEE International Conference on Big Data (Big Data)10.1109/BigData52589.2021.9671480(3760-3769)Online publication date: 15-Dec-2021
  • (2021)A scalable array storage for efficient maintenance of future dataThe Journal of Supercomputing10.1007/s11227-020-03554-xOnline publication date: 2-Jan-2021
  • (2017)QETL: An approach to on-demand ETL from non-owned data sourcesData & Knowledge Engineering10.1016/j.datak.2017.09.002112(17-37)Online publication date: Nov-2017
  • (2017)Efficient representation of higher-dimensional arrays by dimension transformationsThe Journal of Supercomputing10.1007/s11227-016-1954-x73:6(2801-2822)Online publication date: 1-Jun-2017
  • (2016)History-Pattern Encoding for Large-Scale Dynamic Multidimensional Datasets and Its EvaluationsIEICE Transactions on Information and Systems10.1587/transinf.2015DAP0025E99.D:4(989-999)Online publication date: 2016
  • (2016)A scalable storage system for structured data based on higher order index arrayProceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies10.1145/3006299.3006333(247-252)Online publication date: 6-Dec-2016
  • (2016)Accelerating range queries for large-scale unstructured meshes2016 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2016.7840641(502-511)Online publication date: Dec-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media