Scalable in situ scientific data encoding for analytical query processing

Authors:
Sriram Lakshminarasimhan

North Carolina State University, Oak Ridge National Laboratory, Raleigh, NC, USA

David A. Boyuka

North Carolina State University, Oak Ridge National Laboratory, Raleigh, NC, USA

Saurabh V. Pendse

North Carolina State University, Oak Ridge National Laboratory, Raleigh, NC, USA

Xiaocheng Zou

North Carolina State University, Oak Ridge National Laboratory, Raleigh, NC, USA

John Jenkins

North Carolina State University, Oak Ridge National Laboratory, Raleigh, NC, USA

Venkatram Vishwanath

Argonne National Laboratory, Argonne, IL, USA

Michael E. Papka

Argonne National Laboratory and Northern Illinois University, Argonne, IL, USA

Nagiza F. Samatova

North Carolina State University, Oak Ridge National Laboratory, Raleigh, NC, USA


HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, June 2013, Pages 1–12
https://doi.org/10.1145/2493123.2465527

Published: 17 June 2013

ABSTRACT

The process of scientific data analysis in high-performance computing environments has been evolving along with the advancement of computing capabilities. With the onset of exascale computing, the increasing gap between compute performance and I/O bandwidth has rendered the traditional method of post-simulation processing a tedious process. Despite the challenges due to increased data production, there exists an opportunity to benefit from "cheap" computing power to perform query-driven exploration and visualization during simulation time. To accelerate such analyses, applications traditionally augment raw data with large indexes, post-simulation, which are then repeatedly utilized for data exploration. However, the generation of current state-of-the-art indexes involves compute- and memory-intensive processing, thus rendering them inapplicable in an in situ context. In this paper we propose DIRAQ, a parallel in situ, in-network data encoding and reorganization technique that enables the transformation of simulation output into a query-efficient form, with negligible runtime overhead to the simulation run. DIRAQ begins with an effective core-local, precision-based encoding approach, which incorporates an embedded compressed index that is 3–6x smaller than current state-of-the-art indexing schemes. DIRAQ then applies an in-network index merging strategy, enabling the creation of aggregated indexes ideally suited for spatial-context querying that speed up query responses by up to 10x versus alternative techniques. We also employ a novel aggregation strategy that is topology-, data-, and memory-aware, resulting in efficient I/O and yielding overall end-to-end encoding and I/O time that is less than that required to write the raw data with MPI collective I/O.
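The abstract does not spell out the encoding itself, so the following is only an illustrative sketch of the general idea, not DIRAQ's actual algorithm. It assumes that "precision-based" binning can be approximated by grouping double-precision values on their most significant bits, that the embedded index is an inverted list of positions per bin (delta-encoded here as a stand-in for real compression), and that merging shifts each core's positions by its offset in the combined data. All function names (`bin_key`, `build_index`, `decode`, `merge_indexes`, `candidates`) are hypothetical.

```python
import struct
from collections import defaultdict


def bin_key(value, bits=16):
    """Top `bits` bits of the IEEE 754 double (sign, exponent, leading mantissa)."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    return raw >> (64 - bits)


def build_index(values, bits=16):
    """Core-local toy index: bin key -> delta-encoded list of positions."""
    positions = defaultdict(list)
    for pos, v in enumerate(values):
        positions[bin_key(v, bits)].append(pos)
    index = {}
    for key, plist in positions.items():
        # Store the first position, then the gaps between consecutive positions.
        index[key] = [plist[0]] + [b - a for a, b in zip(plist, plist[1:])]
    return index


def decode(deltas):
    """Turn a delta-encoded list back into absolute positions."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def merge_indexes(local_indexes, local_sizes):
    """Merge per-core indexes into one index over the concatenated data.

    Each core's positions are shifted by the number of values held by the
    cores before it (an assumed layout; a real system would also account
    for network topology and memory limits).
    """
    merged = defaultdict(list)
    offset = 0
    for index, size in zip(local_indexes, local_sizes):
        for key, deltas in index.items():
            merged[key].extend(p + offset for p in decode(deltas))
        offset += size
    return dict(merged)


def candidates(merged, lo, hi, bits=16):
    """Positions whose bin overlaps [lo, hi]; assumes 0 <= lo <= hi.

    Bins are coarse, so the result may include false positives that a
    real query would filter against the stored low-order bits.
    """
    k_lo, k_hi = bin_key(lo, bits), bin_key(hi, bits)
    return sorted(p for k, plist in merged.items() if k_lo <= k <= k_hi for p in plist)


if __name__ == "__main__":
    core0 = [1.0, 2.5, 1.0001]
    core1 = [100.0, 1.0]
    merged = merge_indexes(
        [build_index(core0), build_index(core1)], [len(core0), len(core1)]
    )
    print(candidates(merged, 1.0, 1.05))
```

In this sketch, a range query touches only the bins whose keys fall between the keys of the query bounds, which is why a merged index over many cores can answer a query without reading each core's raw values.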

Index Terms

Scalable in situ scientific data encoding for analytical query processing
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        1. Memory management
        2. Secondary storage

Published in
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
June 2013
276 pages
ISBN: 9781450319102
DOI: 10.1145/2493123
General Chairs: Manish Parashar (Rutgers University, USA), Jon Weissman (University of Minnesota, USA)
Program Chairs: Dick Epema (Delft University of Technology and Eindhoven University of Technology, The Netherlands), Renato Figueiredo (University of Florida, USA and Vrije Universiteit, The Netherlands)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
compression
exascale computing
indexing
query processing
Qualifiers
- research-article
Acceptance Rates
HPDC '13 Paper Acceptance Rate: 20 of 131 submissions, 15%
Overall Acceptance Rate: 166 of 966 submissions, 17%