Building the Enterprise Fabric for Big Data with Vertica and Spark Integration

Authors:
Jeff LeFevre, Rui Liu, Cornelio Inigo, Lupita Paz, Edward Ma, Malu Castellanos, Meichun Hsu
HPE Vertica, Sunnyvale, CA, USA

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, June 2016, Pages 63–75
https://doi.org/10.1145/2882903.2903744

Published: 14 June 2016

ABSTRACT

Enterprise customers increasingly require greater flexibility in the way they access and process their Big Data, while continuing to request advanced analytics and access to diverse data sources. Yet customers still require the robustness of enterprise-class analytics for their mission-critical data. In this paper, we present our initial efforts toward a solution that satisfies these requirements by integrating the HPE Vertica enterprise database with Apache Spark's open source big data computation engine. In particular, the integration enables fast, reliable transfer of data between Vertica and Spark, and the deployment of machine learning models created by Spark into Vertica for predictive analytics on Vertica data. This integration provides a fabric on which our customers get the best of both worlds: it extends Vertica's extensive SQL analytics capabilities with Spark's machine learning library (MLlib), giving Vertica users access to a wide range of ML functions; it also enables customers to leverage Spark as an advanced ETL engine for all data that require the guarantees offered by Vertica.

Index Terms

Applied computing
  Enterprise computing
    Enterprise interoperability
      Information integration and interoperability

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)
Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
PMML
analytics
big data
connector
database
spark
vertica
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%