ABSTRACT
As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format---without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer is determined. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority - by as much as two orders of magnitude - of the proposed framework over existing techniques in terms of cache overhead and workload execution time.
- M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI 2016. Google ScholarDigital Library
- A. Abouzied et al. Invisible Loading: Access-Driven Data Transfer from Raw Files into Database Systems. In EDBT/ICDT 2013. Google ScholarDigital Library
- M. Abrams, C. R. Standridge, G. Abdulla, S. Williams, and E. A. Fox. Caching Proxies: Limitations and Potentials. 1995.Google Scholar
- I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. In SIGMOD 2012. Google ScholarDigital Library
- M. Altinel, C. Bornhövd, S. Krishnamurthy, C. Mohan, H. Pirahesh, and B. Reinwald. Cache Tables: Paving the Way for an Adaptive Database Cache. In VLDB 2003. Google ScholarDigital Library
- N. Alur et al. IBM DataStage Data Flow and Job Design. 2008.Google Scholar
- T. Azim, M. Karpathiotakis, and A. Ailamaki. ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data. PVLDB, 11(3), 2017. Google ScholarDigital Library
- P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The Multidimensional Database System RasDaMan. In SIGMOD 1998. Google ScholarDigital Library
- S. Blanas, K. Wu, S. Byna, B. Dong, and A. Shoshani. Parallel Data Analysis Directly on Scientific File Formats. In SIGMOD 2014. Google ScholarDigital Library
- P. Brown et al. Overview of SciDB: Large Scale Array Storage, Processing and Analysis. In SIGMOD 2010. Google ScholarDigital Library
- J. B. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. Brandt. SciHadoop: Array-based Query Processing in Hadoop. In SC 2011. Google ScholarDigital Library
- P. Cao and S. Irani. Cost-Aware WWW Proxy Caching Algorithms. In USENIX ITS 1997. Google ScholarDigital Library
- Y. Cheng and F. Rusu. Parallel In-Situ Data Processing with Speculative Loading. In SIGMOD 2014. Google ScholarDigital Library
- Y. Cheng and F. Rusu. Formal Representation of the SS-DB Benchmark and Experimental Evaluation in EXTASCID. Distrib. and Parallel Databases, 2014. Google ScholarDigital Library
- Y. Cheng, W. Zhao, and F. Rusu. Bi-Level Online Aggregation on Raw Data. In SSDBM 2017. Google ScholarDigital Library
- H.-T. Chou and D.J. DeWitt. An Evaluation of Buffer Management Strategies for Relational Database Systems. Algorithmica, 1(1--4):311--336, 1986.Google Scholar
- R. Cornacchia, S. Heman, M. Zukowski, A. P. de Vries, and P. Boncz. Flexible and Efficient IR using Array Databases. VLDB Journal (VLDBJ), 17, 2008. Google ScholarDigital Library
- P. Cudre-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, S. Madden, M. Stonebraker, S. B. Zdonik, and P. G. Brown. SS-DB: A Standard Science DBMS Benchmark. http://www.xldb.org/science-benchmark/.Google Scholar
- P. Cudre-Mauroux, E. Wu, and S. Madden. TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets. In ICDE 2010.Google Scholar
- S. Dar, M. J. Franklin, B. T. Jonsson, D. Srivastava, M. Tan, et al. Semantic Data Caching and Replacement. In VLDB 1996. Google ScholarDigital Library
- A. D. Sarma, Y. He, and S. Chaudhuri. ClusterJoin: A Similarity Joins Framework using MapReduce. PVLDB, 7, 2014. Google ScholarDigital Library
- J. Duggan, O. Papaemmanouil et al. Skew-Aware Join Optimization for Array Databases. In SIGMOD 2015. Google ScholarDigital Library
- A. Floratou, N. Megiddo, N. Potti, F. Özcan, U. Kale, and J. Schmitz-Hermes. Adaptive Caching in Big SQL using the HDFS Cache. In SoCC 2016. Google ScholarDigital Library
- P. Furtado and P. Baumann. Storage of Multidimensional Arrays Based on Arbitrary Tiling. In ICDE 1999.Google Scholar
- A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD 1984. Google ScholarDigital Library
- D. Han, Y.-M. Nam, and M.-S. Kim. A Distributed In-Situ Analysis Method for Large-Scale Scientific Data. In BigComp 2017.Google Scholar
- H. Hu, J. Xu, W. S. Wong, B. Zheng, D. L. Lee, and W.-C. Lee. Proactive Caching for Spatial Queries in Mobile Environments. In ICDE 2005. Google ScholarDigital Library
- S. Idreos et al. MonetDB: Two Decades of Research in Column-Oriented Database Architectures. IEEE Data Eng. Bull., 35(1), 2012.Google Scholar
- S. Idreos, I. Alagiannis et al. Here Are My Data Files. Here Are My Queries. Where Are My Results? In CIDR 2011.Google Scholar
- S. Idreos, M. L. Kersten, and S. Manegold. Database Cracking In CIDR 2007.Google Scholar
- S. Irani. Page Replacement with Multi-Size Pages and Applications to Web Caching. In STOC 1997. Google ScholarDigital Library
- M. Ivanova, M. L. Kersten, and S. Manegold. Data Vaults: A Symbiosis between Database Technology and Scientific File Repositories. In SSDBM 2012. Google ScholarDigital Library
- M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries over Heterogeneous Data through Engine Customization. PVLDB, 9(12):972--983, 2016. Google ScholarDigital Library
- M. Karpathiotakis, I. Alagiannis, T. Heinis, M. Branco, and A. Ailamaki. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. In CIDR 2015.Google Scholar
- M. Karpathiotakis, M. Branco, I. Alagiannis, and A. Ailamaki. Adaptive Query Processing on RAW Data. PVLDB, 7, 2014. Google ScholarDigital Library
- M. Kornacker et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR 2015.Google Scholar
- N. M. Law et al. The Palomar Transient Factory: System Overview, Performance and First Results. CoRR, abs/0906.5350, 2009.Google Scholar
- D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies. IEEE Trans. on Computers, 50(12):1352--1361, 2001. Google ScholarDigital Library
- H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In SoCC 2014. Google ScholarDigital Library
- K.-T. Lim, D. Maier, J. Becla, M. Kersten, Y. Zhang, and M. Stonebraker. ArrayQL Syntax. http://www.xldb.org/wp-content/uploads/2012/09/ArrayQL-Draft-4.pdf. {Online; February 2017}.Google Scholar
- C. Lübbe. Issues on Distributed Caching of Spatial Data. University of Stuttgart, Germany, 2017.Google Scholar
- Y. Ma, F. Rusu, and M. Torres. Stochastic Gradient Descent on Highly-Parallel Architectures. CoRR, abs/1802.08800, 2018.Google Scholar
- D. Maier. ArrayQL Algebra: version 3. http://www.xldb.org/wp-content/uploads/2012/09/ArrayQL_Algebra_v3+.pdf. {Online; February 2017}.Google Scholar
- T. Muhlbauer, W. Rodiger, R. Seilbeck et al. Instant Loading for Main Memory Databases. PVLDB, 6(14), 2013. Google ScholarDigital Library
- J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin. Latency-Tolerant Software Distributed Shared Memory. In USENIX ATC 2015. Google ScholarDigital Library
- M. Olma, M. Karpathiotakis, I. Alagiannis, M. Athanassoulis, and A. Ailamaki. Slalom: Coasting through Raw Data via Adaptive Partitioning and Indexing. PVLDB, 10(10):1106--1117, 2017. Google ScholarDigital Library
- E. J. O'Neil, P. E. O'Neil, and G. Weikum. The LRU-K Page Replacement Algorithm for Database Disk Buffering. In SIGMOD 1993. Google ScholarDigital Library
- J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, D. Ongaro, and G. Parulkar. The Case for RAMCloud. Communications of the ACM, 54(7):121--130, 2011. Google ScholarDigital Library
- S. Papadopoulos, K. Datta, S. Madden, and T. Mattson. The TileDB Array Data Storage Manager. PVLDB, 10(4):349--360, 2016. Google ScholarDigital Library
- F. Rusu and Y. Cheng. A Survey on Array Storage, Query Languages, and Systems. CoRR, abs/1302.0103, 2013.Google Scholar
- S. Sarawagi and M. Stonebraker. Efficient Organization of Large Multidimensional Arrays. In ICDE 1994. Google ScholarDigital Library
- E. Soroush, M. Balazinska, and D. Wang. ArrayStore: A Storage Manager for Complex Parallel Array Processing. In SIGMOD 2011. Google ScholarDigital Library
- E. Sparks et al. MLI: An API for Distributed Machine Learning. In ICDM 2013.Google Scholar
- V. Sourlas, L. Gkatzikis, P. Flegkas, and L. Tassiulas. Distributed Cache Management in Information-Centric Networks. IEEE Trans. on Network and Service Management, 10(3):286--299, 2013.Google ScholarCross Ref
- D. Tahara, T. Diamond, and D. J. Abadi. Sinew: A SQL System for Multi-Structured Data. In SIGMOD 2014. Google ScholarDigital Library
- F. Tauheedy, T. Heinis, and A. Ailamaki. THERMAL-JOIN: A Scalable Spatial Join for Dynamic Workloads. In SIGMOD 2015. Google ScholarDigital Library
- A. R. van Ballegooij. RAM: A Multidimensional Array DBMS. In EDBT 2004. Google ScholarDigital Library
- X. Wang, T. Malik, R. Burns, S. Papadomanolakis, and A. Ailamaki. A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching. In DASFAA 2007. Google ScholarDigital Library
- A. Witkowski, M. Colgan, A. Brumm, T. Cruanes, and H. Baer. Performant and Scalable Data Loading with Oracle Database 11g, 2011. Oracle Corp.Google Scholar
- R. P. Wooster and M. Abrams. Proxy Caching that Estimates Page Load Delays. Computer Networks and ISDN Systems, 29(8):977--986, 1997. Google ScholarDigital Library
- H. Xing, S. Floratos, S. Blanas, S. Byna, K. Wu, P. Brown, et al. ArrayBridge: Interweaving Declarative Array Processing with High-Performance Computing. arXiv preprint arXiv.1702.08327, 2017.Google Scholar
- Y. Zhang, H. Herodotos, and J. Yang. RIOT: I/O-Efficient Numerical Computing without SQL. In CIDR 2009.Google Scholar
- Y. Zhang, M. Kersten, M. Ivanova, and N. Nes. SciQL: Bridging the Gap between Science and Relational DBMS. In IDEAS 2011. Google ScholarDigital Library
- W. Zhao, Y. Cheng, and F. Rusu. Vertical Partitioning for Query Processing over Raw Data. In SSDBM 2015. Google ScholarDigital Library
- W. Zhao, F. Rusu, B. Dong, and K. Wu. Similarity Join over Array Data. In SIGMOD 2016. Google ScholarDigital Library
- W. Zhao, F. Rusu, B. Dong, K. Wu, and P. Nugent. Incremental View Maintenance over Array Data. In SIGMOD 2017. Google ScholarDigital Library
- Distributed caching for processing raw arrays
Recommendations
Selective Victim Caching: A Method to Improve the Performance of Direct-Mapped Caches
Although direct-mapped caches suffer from higher miss ratios as compared to set-associative caches, they are attractive for today's high-speed pipelined processors that require very low access times. Victim caching was proposed by Jouppi [1] as an ...
Distributed cooperative caching
PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniquesThis paper presents the Distributed Cooperative Caching, a scalable and energy-efficient scheme to manage chip multiprocessor (CMP) cache resources. The proposed configuration is based in the Cooperative Caching framework [3] but it is intended for ...
Cooperative Caching for GPUs
The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad ...
Comments