DOI: 10.1145/2882903.2903744
research-article

Building the Enterprise Fabric for Big Data with Vertica and Spark Integration

Published: 14 June 2016

ABSTRACT

Enterprise customers increasingly require greater flexibility in how they access and process their Big Data, while continuing to request advanced analytics and access to diverse data sources. Yet they still require the robustness of enterprise-class analytics for their mission-critical data. In this paper, we present our initial efforts toward a solution that satisfies these requirements by integrating the HPE Vertica enterprise database with Apache Spark's open source big data computation engine. In particular, the integration enables fast, reliable data transfer between Vertica and Spark, and the deployment of machine learning models created by Spark into Vertica for predictive analytics on Vertica data. It provides a fabric on which our customers get the best of both worlds: it extends Vertica's extensive SQL analytics capabilities with Spark's machine learning library (MLlib), giving Vertica users access to a wide range of ML functions, and it lets customers use Spark as an advanced ETL engine for all data that require the guarantees offered by Vertica.


    Published in

      SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
      June 2016
      2300 pages
      ISBN: 9781450335317
      DOI: 10.1145/2882903

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
