DOI: 10.1145/3132847.3132974

BoostVHT: Boosting Distributed Streaming Decision Trees

Published: 06 November 2017

ABSTRACT

Online boosting improves the accuracy of classifiers for unbounded streams of data by chaining them into an ensemble. Due to its sequential nature, boosting has proven hard to parallelize, even more so in the online setting. This paper introduces BoostVHT, a technique to parallelize online boosting algorithms. Our proposal leverages a recently-developed model-parallel learning algorithm for streaming decision trees as a base learner. This design allows us to neatly separate the boosting of the model from its training. As a result, BoostVHT provides a flexible learning framework that can employ any existing online boosting algorithm, while at the same time leveraging the computing power of modern parallel and distributed cluster environments. We implement our technique on Apache SAMOA, an open-source platform for mining big data streams that can run on several distributed execution engines, and demonstrate order-of-magnitude speedups compared to the state of the art.
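To make the separation between boosting and training concrete, the sketch below shows a minimal, illustrative Oza-Russell-style online boosting loop with a pluggable base learner. All names here (OnlineBooster, MajorityClassLearner) are hypothetical and are not the BoostVHT or Apache SAMOA API; the trivial stand-in learner merely occupies the slot that the model-parallel Vertical Hoeffding Tree fills in the paper.

```python
# Minimal, illustrative sketch of Oza-Russell-style online boosting with a
# pluggable base learner. Hypothetical names throughout; this is NOT the
# BoostVHT / Apache SAMOA API. BoostVHT plugs a model-parallel Vertical
# Hoeffding Tree into the slot held here by MajorityClassLearner.
import math
import random


class MajorityClassLearner:
    """Trivial stand-in base learner: predicts the most frequent label seen."""

    def __init__(self):
        self.counts = {}

    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None


class OnlineBooster:
    """Each instance flows through the ensemble members in sequence; its Poisson
    training weight grows on members that misclassify it (online boosting)."""

    def __init__(self, ensemble_size=10, base_factory=MajorityClassLearner, seed=42):
        self.members = [base_factory() for _ in range(ensemble_size)]
        self.correct = [0.0] * ensemble_size  # weight of correctly classified instances
        self.wrong = [0.0] * ensemble_size    # weight of misclassified instances
        self.rng = random.Random(seed)

    def train(self, x, y):
        lam = 1.0
        for m, member in enumerate(self.members):
            # Simulate weighted sampling: train k ~ Poisson(lambda) times.
            for _ in range(self._poisson(lam)):
                member.train(x, y)
            if member.predict(x) == y:
                self.correct[m] += lam
                lam *= (self.correct[m] + self.wrong[m]) / (2.0 * self.correct[m])
            else:
                self.wrong[m] += lam
                lam *= (self.correct[m] + self.wrong[m]) / (2.0 * self.wrong[m])

    def predict(self, x):
        # Weighted vote: members with lower weighted error get a larger log-odds vote.
        votes = {}
        for m, member in enumerate(self.members):
            total = self.correct[m] + self.wrong[m]
            if total == 0:
                continue
            err = self.wrong[m] / total
            if err >= 0.5:
                continue
            weight = 10.0 if err == 0.0 else math.log((1.0 - err) / err)  # clamp err == 0
            pred = member.predict(x)
            if pred is not None:
                votes[pred] = votes.get(pred, 0.0) + weight
        return max(votes, key=votes.get) if votes else None

    def _poisson(self, lam):
        # Knuth's algorithm; adequate for the small lambdas that arise here.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while p > threshold:
            k += 1
            p *= self.rng.random()
        return k - 1
```

As the abstract suggests, the boosting loop itself stays sequential, but because each member's training and prediction are delegated to a model-parallel base learner (the VHT in the paper), the expensive work is what gets distributed; the sketch above only mimics that interface with a trivial single-threaded learner.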


Published in

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017, 2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847

Copyright © 2017 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

CIKM '17 paper acceptance rate: 171 of 855 submissions, 20%
Overall acceptance rate: 1,861 of 8,427 submissions, 22%
