ABSTRACT
Online boosting improves the accuracy of classifiers for unbounded streams of data by chaining them into an ensemble. Due to its sequential nature, boosting has proven hard to parallelize, even more so in the online setting. This paper introduces BoostVHT, a technique to parallelize online boosting algorithms. Our proposal leverages a recently-developed model-parallel learning algorithm for streaming decision trees as a base learner. This design allows us to neatly separate the boosting of the model from its training. As a result, BoostVHT provides a flexible learning framework that can employ any existing online boosting algorithm, while at the same time leveraging the computing power of modern parallel and distributed cluster environments. We implement our technique on Apache SAMOA, an open-source platform for mining big data streams that can run on several distributed execution engines, and demonstrate order-of-magnitude speedups compared to the state of the art.
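To make the setting concrete, here is a minimal sketch of one classic online boosting algorithm of the kind BoostVHT can plug in, OzaBoost (Oza and Russell, 2001): each arriving example flows through the ensemble members in sequence, and its Poisson sampling rate is raised whenever a member misclassifies it, so later members focus on harder examples. The `NaiveCountLearner` weak learner below is a hypothetical stand-in for a real streaming base learner such as a Hoeffding tree; it is not part of the paper.

```python
import math
import random
from collections import defaultdict


class NaiveCountLearner:
    """Toy weak learner: predicts the highest-weight label seen for an input.

    A hypothetical stand-in for a streaming base learner such as a
    Hoeffding tree.
    """
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, x, y, weight=1.0):
        self.counts[x][y] += weight

    def predict(self, x):
        if x not in self.counts:
            return None
        labels = self.counts[x]
        return max(labels, key=labels.get)


class OzaBoost:
    """Online boosting (Oza & Russell, 2001), simplified.

    Each example passes through the members in sequence; its Poisson
    rate `lam` grows when a member misclassifies it and shrinks when
    the member is correct.
    """
    def __init__(self, n_members, rng=None):
        self.members = [NaiveCountLearner() for _ in range(n_members)]
        self.sc = [0.0] * n_members  # total example weight each member got right
        self.sw = [0.0] * n_members  # total example weight each member got wrong
        self.rng = rng or random.Random(42)

    def train(self, x, y):
        lam = 1.0
        for m, member in enumerate(self.members):
            # Train the member k ~ Poisson(lam) times on this example.
            for _ in range(self._poisson(lam)):
                member.train(x, y)
            if member.predict(x) == y:
                self.sc[m] += lam
                lam *= (self.sc[m] + self.sw[m]) / (2 * self.sc[m])
            else:
                self.sw[m] += lam
                lam *= (self.sc[m] + self.sw[m]) / (2 * self.sw[m])

    def predict(self, x):
        votes = defaultdict(float)
        for m, member in enumerate(self.members):
            if self.sw[m] == 0.0:
                weight = 1.0  # member has never erred (or never voted)
            else:
                eps = self.sw[m] / (self.sc[m] + self.sw[m])
                if eps >= 0.5:
                    continue  # worse than chance: skip this member
                weight = math.log((1 - eps) / eps)
            pred = member.predict(x)
            if pred is not None:
                votes[pred] += weight
        return max(votes, key=votes.get) if votes else None

    def _poisson(self, lam):
        # Knuth's method; adequate for the small rates used here.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= threshold:
                return k
            k += 1
```

Note how the boosting logic itself (updating `lam` and the per-member weight counters) is sequential but cheap, while all the expensive work happens inside each member's `train` call. This separation is what lets BoostVHT keep the boosting chain intact and delegate training to a model-parallel base learner.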
BoostVHT: Boosting Distributed Streaming Decision Trees