ABSTRACT
Enterprise customers increasingly require greater flexibility in how they access and process their Big Data, while continuing to request advanced analytics and access to diverse data sources. At the same time, they still require the robustness of enterprise-class analytics for their mission-critical data. In this paper, we present our initial efforts toward a solution that satisfies these requirements by integrating the HPE Vertica enterprise database with Apache Spark, an open source big data computation engine. In particular, the integration enables fast, reliable data transfer between Vertica and Spark, and the deployment of machine learning models created in Spark into Vertica for predictive analytics on Vertica data. This integration provides a fabric on which our customers get the best of both worlds: it extends Vertica's extensive SQL analytics capabilities with Spark's machine learning library (MLlib), giving Vertica users access to a wide range of ML functions, and it enables customers to leverage Spark as an advanced ETL engine for all data that require the guarantees offered by Vertica.