research-article

Optimal chunking of large multidimensional arrays for data warehousing

Authors: E. J. Otoo, Doron Rotem, Sridhar Seshadri

DOLAP '07: Proceedings of the ACM tenth international workshop on Data warehousing and OLAP

Pages 25 - 32

https://doi.org/10.1145/1317331.1317337

Published: 09 November 2007

Abstract

Very large multidimensional arrays are commonly used in data-intensive scientific computations as well as in on-line analytical processing applications referred to as MOLAP. Such arrays are stored on disk by partitioning the large global array into fixed-size sub-arrays called chunks or tiles, which form the units of data transfer between disk and memory. Typical queries retrieve sub-arrays and must access all chunks that overlap the query result. An important metric of storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is: "what shapes of array chunks give the minimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker [11], who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real-life workloads, show that our solutions are consistently within 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of [11] can deviate considerably from the true result as the number of dimensions increases, and may also lead to suboptimal chunk shapes.
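The metric described in the abstract, the expected number of chunks a sub-array query touches, can be illustrated with a small sketch. It assumes that a query box of fixed shape is placed at a uniformly random cell offset relative to the chunk grid and that array boundaries are ignored. The function names and the per-dimension closed form 1 + (q - 1)/c are illustrative assumptions of this sketch; they are not taken from the paper and do not reproduce its models or its steepest descent and geometric programming solutions.

```python
from itertools import product
from math import prod


def chunks_touched(start, query_shape, chunk_shape):
    """Count the chunks overlapped by a query box whose lowest corner is `start`."""
    return prod(
        (s + q - 1) // c - s // c + 1
        for s, q, c in zip(start, query_shape, chunk_shape)
    )


def expected_chunks_bruteforce(query_shape, chunk_shape):
    """Average chunks_touched over every query offset within one chunk period."""
    offsets = product(*(range(c) for c in chunk_shape))
    total = sum(chunks_touched(o, query_shape, chunk_shape) for o in offsets)
    return total / prod(chunk_shape)


def expected_chunks_closed_form(query_shape, chunk_shape):
    """Closed form under the uniform-offset assumption: prod(1 + (q - 1) / c)."""
    return prod(1 + (q - 1) / c for q, c in zip(query_shape, chunk_shape))


def best_chunk_shape(query_shape, candidate_shapes):
    """Pick the candidate chunk shape with the fewest expected chunks per query."""
    return min(
        candidate_shapes,
        key=lambda shape: expected_chunks_closed_form(query_shape, shape),
    )


if __name__ == "__main__":
    query = (10, 3)  # hypothetical query shape, in cells
    # Hypothetical candidate chunk shapes, all holding 16 cells.
    candidates = [(4, 4), (8, 2), (2, 8), (16, 1), (1, 16)]
    for shape in candidates:
        print(shape, expected_chunks_closed_form(query, shape))
    print("best:", best_chunk_shape(query, candidates))
```

Under these assumptions, a 10 x 3 query touches 4.25 chunks on average with an 8 x 2 chunk but 4.875 with a square 4 x 4 chunk of the same volume, which is the kind of shape sensitivity that motivates optimizing chunk shape against a query workload.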

References

[1] A. V. Aho and J. D. Ullman. Optimal partial-match retrieval when fields are independently specified. ACM Trans. on Database Syst., 4(2):168-179, Jun. 1979.

[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.

[3] S. Goil and A. N. Choudhary. Sparse data storage schemes for multidimensional data for OLAP and data mining. Technical Report CPDC-TR-9801-005, Center for Parallel and Dist. Comput., Northwestern Univ., Evanston, IL 60208, 1997.

[4] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1):29-53, 1997.

[5] Hierarchical Data Format (HDF) group. HDF5 User's Guide. National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana-Champaign, Nov. 2004.

[6] H. V. Jagadish. Linear clustering of objects with multiple attributes. In SIGMOD '90: Proc. Int'l. Conf. on Management of Data, pages 332-342, New York, NY, USA, 1990. ACM Press.

[7] N. Karayannidis and T. Sellis. Sisyphus: The implementation of a chunk-based storage manager for OLAP data cubes. Data and Knowl. Eng., 45(2):155-180, 2003.

[8] E. J. Otoo and D. Rotem. Efficient storage allocation of large-scale extendible multi-dimensional scientific datasets. In Proc. 18th Int'l. Conf. Scientific and Statistical Database Management (SSDBM'06), Vienna, Austria, Jul. 3-5, 2006.

[9] E. J. Otoo, D. Rotem, and S. Seshadri. Chunking of large multidimensional arrays. Technical Report LBNL-63230, LBNL, University of California, 1 Cyclotron Road, Berkeley, CA 94720, USA, Jul. 2006.

[10] D. Rotem and J. L. Zhao. Extendible arrays for statistical databases and OLAP applications. In 8th Int'l. Conf. on Sc. and Stat. Database Management (SSDBM'96), pages 108-117, Stockholm, Sweden, 1996.

[11] S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. In Proc. 10th Int'l. Conf. Data Eng., pages 328-336, Feb. 1994.

[12] K. E. Seamons and M. Winslett. Physical schemas for large multidimensional arrays in scientific computing applications. In Proc. 7th Int'l. Conf. on Scientific and Statistical Database Management, pages 218-227, Washington, DC, USA, 1994. IEEE Computer Society.

[13] A. Shoshani. OLAP and statistical databases: Similarities and differences. In Proc. ACM-PODS Conf., pages 185-196, 1997.

[14] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. ACM-SIGMOD Conf., pages 159-170, 1997.

Published In

DOLAP '07: Proceedings of the ACM tenth international workshop on Data warehousing and OLAP
November 2007, 112 pages
ISBN: 9781595938275
DOI: 10.1145/1317331
General Chair: Il-Yeol Song, Drexel University, USA
Program Chair: Torben Bach Pedersen, Aalborg University, Denmark

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

CIKM07

Sponsor:

CIKM07: Conference on Information and Knowledge Management

November 9, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 29 of 79 submissions, 37%

Cited By

Sudhir S, Cafarella M, Madden S (2022). Replicated layout for in-memory database systems. Proceedings of the VLDB Endowment, 15:4 (984-997). Online publication date: 14-Apr-2022. https://dl.acm.org/doi/10.14778/3503585.3503606

Han S, Liu X, Li J (2022). Efficient Partitioning Method for Optimizing the Compression on Array Data. Journal of Computer Science and Technology, 37:5 (1049-1067). Online publication date: 30-Sep-2022. https://doi.org/10.1007/s11390-022-2371-7

Han S, Liu X, Li J (2022). Chunk-oriented dimension ordering for efficient range query processing on sparse multidimensional data. World Wide Web, 26:4 (1395-1433). Online publication date: 9-Sep-2022. https://doi.org/10.1007/s11280-022-01098-z

Patel S, Rhodes P (2021). Decentralized Storage for Scientific Data. 2021 IEEE International Conference on Big Data (Big Data) (3760-3769). Online publication date: 15-Dec-2021. https://doi.org/10.1109/BigData52589.2021.9671480

Omar M, Azharul Hasan K, Tsuji T (2021). A scalable array storage for efficient maintenance of future data. The Journal of Supercomputing. Online publication date: 2-Jan-2021. https://doi.org/10.1007/s11227-020-03554-x

Baldacci L, Golfarelli M, Graziani S, Rizzi S (2017). QETL: An approach to on-demand ETL from non-owned data sources. Data & Knowledge Engineering, 112 (17-37). Online publication date: Nov-2017. https://doi.org/10.1016/j.datak.2017.09.002

Azharul Hasan K, Shaikh M (2017). Efficient representation of higher-dimensional arrays by dimension transformations. The Journal of Supercomputing, 73:6 (2801-2822). Online publication date: 1-Jun-2017. https://dl.acm.org/doi/10.1007/s11227-016-1954-x

Makino M, Tsuji T, Higuchi K (2016). History-Pattern Encoding for Large-Scale Dynamic Multidimensional Datasets and Its Evaluations. IEICE Transactions on Information and Systems, E99.D:4 (989-999). Online publication date: 2016. https://doi.org/10.1587/transinf.2015DAP0025

Omar M, Hasan K, Anjum A, Zhao X (2016). A scalable storage system for structured data based on higher order index array. Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (247-252). Online publication date: 6-Dec-2016. https://dl.acm.org/doi/10.1145/3006299.3006333

Nguyen C, Rhodes P (2016). Accelerating range queries for large-scale unstructured meshes. 2016 IEEE International Conference on Big Data (Big Data) (502-511). Online publication date: Dec-2016. https://doi.org/10.1109/BigData.2016.7840641
