skip to main content
10.1145/3221269.3221295acmotherconferencesArticle/Chapter ViewAbstractPublication PagesssdbmConference Proceedingsconference-collections
research-article

Distributed caching for processing raw arrays

Published:09 July 2018Publication History

ABSTRACT

As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format---without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer is determined. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority - by as much as two orders of magnitude - of the proposed framework over existing techniques in terms of cache overhead and workload execution time.

References

  1. M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Abouzied et al. Invisible Loading: Access-Driven Data Transfer from Raw Files into Database Systems. In EDBT/ICDT 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Abrams, C. R. Standridge, G. Abdulla, S. Williams, and E. A. Fox. Caching Proxies: Limitations and Potentials. 1995.Google ScholarGoogle Scholar
  4. I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. In SIGMOD 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Altinel, C. Bornhövd, S. Krishnamurthy, C. Mohan, H. Pirahesh, and B. Reinwald. Cache Tables: Paving the Way for an Adaptive Database Cache. In VLDB 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. N. Alur et al. IBM DataStage Data Flow and Job Design. 2008.Google ScholarGoogle Scholar
  7. T. Azim, M. Karpathiotakis, and A. Ailamaki. ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data. PVLDB, 11(3), 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The Multidimensional Database System RasDaMan. In SIGMOD 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Blanas, K. Wu, S. Byna, B. Dong, and A. Shoshani. Parallel Data Analysis Directly on Scientific File Formats. In SIGMOD 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Brown et al. Overview of SciDB: Large Scale Array Storage, Processing and Analysis. In SIGMOD 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. B. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. Brandt. SciHadoop: Array-based Query Processing in Hadoop. In SC 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. P. Cao and S. Irani. Cost-Aware WWW Proxy Caching Algorithms. In USENIX ITS 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Y. Cheng and F. Rusu. Parallel In-Situ Data Processing with Speculative Loading. In SIGMOD 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Y. Cheng and F. Rusu. Formal Representation of the SS-DB Benchmark and Experimental Evaluation in EXTASCID. Distrib. and Parallel Databases, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Y. Cheng, W. Zhao, and F. Rusu. Bi-Level Online Aggregation on Raw Data. In SSDBM 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H.-T. Chou and D.J. DeWitt. An Evaluation of Buffer Management Strategies for Relational Database Systems. Algorithmica, 1(1--4):311--336, 1986.Google ScholarGoogle Scholar
  17. R. Cornacchia, S. Heman, M. Zukowski, A. P. de Vries, and P. Boncz. Flexible and Efficient IR using Array Databases. VLDB Journal (VLDBJ), 17, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. P. Cudre-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, S. Madden, M. Stonebraker, S. B. Zdonik, and P. G. Brown. SS-DB: A Standard Science DBMS Benchmark. http://www.xldb.org/science-benchmark/.Google ScholarGoogle Scholar
  19. P. Cudre-Mauroux, E. Wu, and S. Madden. TrajStore: An Adaptive Storage System for Very Large Trajectory Data Sets. In ICDE 2010.Google ScholarGoogle Scholar
  20. S. Dar, M. J. Franklin, B. T. Jonsson, D. Srivastava, M. Tan, et al. Semantic Data Caching and Replacement. In VLDB 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. D. Sarma, Y. He, and S. Chaudhuri. ClusterJoin: A Similarity Joins Framework using MapReduce. PVLDB, 7, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Duggan, O. Papaemmanouil et al. Skew-Aware Join Optimization for Array Databases. In SIGMOD 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. Floratou, N. Megiddo, N. Potti, F. Özcan, U. Kale, and J. Schmitz-Hermes. Adaptive Caching in Big SQL using the HDFS Cache. In SoCC 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. P. Furtado and P. Baumann. Storage of Multidimensional Arrays Based on Arbitrary Tiling. In ICDE 1999.Google ScholarGoogle Scholar
  25. A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. D. Han, Y.-M. Nam, and M.-S. Kim. A Distributed In-Situ Analysis Method for Large-Scale Scientific Data. In BigComp 2017.Google ScholarGoogle Scholar
  27. H. Hu, J. Xu, W. S. Wong, B. Zheng, D. L. Lee, and W.-C. Lee. Proactive Caching for Spatial Queries in Mobile Environments. In ICDE 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. S. Idreos et al. MonetDB: Two Decades of Research in Column-Oriented Database Architectures. IEEE Data Eng. Bull., 35(1), 2012.Google ScholarGoogle Scholar
  29. S. Idreos, I. Alagiannis et al. Here Are My Data Files. Here Are My Queries. Where Are My Results? In CIDR 2011.Google ScholarGoogle Scholar
  30. S. Idreos, M. L. Kersten, and S. Manegold. Database Cracking In CIDR 2007.Google ScholarGoogle Scholar
  31. S. Irani. Page Replacement with Multi-Size Pages and Applications to Web Caching. In STOC 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Ivanova, M. L. Kersten, and S. Manegold. Data Vaults: A Symbiosis between Database Technology and Scientific File Repositories. In SSDBM 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries over Heterogeneous Data through Engine Customization. PVLDB, 9(12):972--983, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Karpathiotakis, I. Alagiannis, T. Heinis, M. Branco, and A. Ailamaki. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. In CIDR 2015.Google ScholarGoogle Scholar
  35. M. Karpathiotakis, M. Branco, I. Alagiannis, and A. Ailamaki. Adaptive Query Processing on RAW Data. PVLDB, 7, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. M. Kornacker et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR 2015.Google ScholarGoogle Scholar
  37. N. M. Law et al. The Palomar Transient Factory: System Overview, Performance and First Results. CoRR, abs/0906.5350, 2009.Google ScholarGoogle Scholar
  38. D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies. IEEE Trans. on Computers, 50(12):1352--1361, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In SoCC 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. K.-T. Lim, D. Maier, J. Becla, M. Kersten, Y. Zhang, and M. Stonebraker. ArrayQL Syntax. http://www.xldb.org/wp-content/uploads/2012/09/ArrayQL-Draft-4.pdf. {Online; February 2017}.Google ScholarGoogle Scholar
  41. C. Lübbe. Issues on Distributed Caching of Spatial Data. University of Stuttgart, Germany, 2017.Google ScholarGoogle Scholar
  42. Y. Ma, F. Rusu, and M. Torres. Stochastic Gradient Descent on Highly-Parallel Architectures. CoRR, abs/1802.08800, 2018.Google ScholarGoogle Scholar
  43. D. Maier. ArrayQL Algebra: version 3. http://www.xldb.org/wp-content/uploads/2012/09/ArrayQL_Algebra_v3+.pdf. {Online; February 2017}.Google ScholarGoogle Scholar
  44. T. Muhlbauer, W. Rodiger, R. Seilbeck et al. Instant Loading for Main Memory Databases. PVLDB, 6(14), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin. Latency-Tolerant Software Distributed Shared Memory. In USENIX ATC 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. M. Olma, M. Karpathiotakis, I. Alagiannis, M. Athanassoulis, and A. Ailamaki. Slalom: Coasting through Raw Data via Adaptive Partitioning and Indexing. PVLDB, 10(10):1106--1117, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. E. J. O'Neil, P. E. O'Neil, and G. Weikum. The LRU-K Page Replacement Algorithm for Database Disk Buffering. In SIGMOD 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, D. Ongaro, and G. Parulkar. The Case for RAMCloud. Communications of the ACM, 54(7):121--130, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. S. Papadopoulos, K. Datta, S. Madden, and T. Mattson. The TileDB Array Data Storage Manager. PVLDB, 10(4):349--360, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. F. Rusu and Y. Cheng. A Survey on Array Storage, Query Languages, and Systems. CoRR, abs/1302.0103, 2013.Google ScholarGoogle Scholar
  51. S. Sarawagi and M. Stonebraker. Efficient Organization of Large Multidimensional Arrays. In ICDE 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. E. Soroush, M. Balazinska, and D. Wang. ArrayStore: A Storage Manager for Complex Parallel Array Processing. In SIGMOD 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. E. Sparks et al. MLI: An API for Distributed Machine Learning. In ICDM 2013.Google ScholarGoogle Scholar
  54. V. Sourlas, L. Gkatzikis, P. Flegkas, and L. Tassiulas. Distributed Cache Management in Information-Centric Networks. IEEE Trans. on Network and Service Management, 10(3):286--299, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  55. D. Tahara, T. Diamond, and D. J. Abadi. Sinew: A SQL System for Multi-Structured Data. In SIGMOD 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. F. Tauheedy, T. Heinis, and A. Ailamaki. THERMAL-JOIN: A Scalable Spatial Join for Dynamic Workloads. In SIGMOD 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. A. R. van Ballegooij. RAM: A Multidimensional Array DBMS. In EDBT 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. X. Wang, T. Malik, R. Burns, S. Papadomanolakis, and A. Ailamaki. A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching. In DASFAA 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. A. Witkowski, M. Colgan, A. Brumm, T. Cruanes, and H. Baer. Performant and Scalable Data Loading with Oracle Database 11g, 2011. Oracle Corp.Google ScholarGoogle Scholar
  60. R. P. Wooster and M. Abrams. Proxy Caching that Estimates Page Load Delays. Computer Networks and ISDN Systems, 29(8):977--986, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. H. Xing, S. Floratos, S. Blanas, S. Byna, K. Wu, P. Brown, et al. ArrayBridge: Interweaving Declarative Array Processing with High-Performance Computing. arXiv preprint arXiv.1702.08327, 2017.Google ScholarGoogle Scholar
  62. Y. Zhang, H. Herodotos, and J. Yang. RIOT: I/O-Efficient Numerical Computing without SQL. In CIDR 2009.Google ScholarGoogle Scholar
  63. Y. Zhang, M. Kersten, M. Ivanova, and N. Nes. SciQL: Bridging the Gap between Science and Relational DBMS. In IDEAS 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. W. Zhao, Y. Cheng, and F. Rusu. Vertical Partitioning for Query Processing over Raw Data. In SSDBM 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. W. Zhao, F. Rusu, B. Dong, and K. Wu. Similarity Join over Array Data. In SIGMOD 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. W. Zhao, F. Rusu, B. Dong, K. Wu, and P. Nugent. Incremental View Maintenance over Array Data. In SIGMOD 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Distributed caching for processing raw arrays

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Other conferences
            SSDBM '18: Proceedings of the 30th International Conference on Scientific and Statistical Database Management
            July 2018
            314 pages
            ISBN:9781450365055
            DOI:10.1145/3221269

            Copyright © 2018 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 9 July 2018

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            SSDBM '18 Paper Acceptance Rate30of75submissions,40%Overall Acceptance Rate56of146submissions,38%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader