DOI: 10.1145/3132847.3132974

BoostVHT: Boosting Distributed Streaming Decision Trees

Published: 06 November 2017

ABSTRACT

Online boosting improves the accuracy of classifiers for unbounded streams of data by chaining them into an ensemble. Due to its sequential nature, boosting has proven hard to parallelize, even more so in the online setting. This paper introduces BoostVHT, a technique to parallelize online boosting algorithms. Our proposal leverages a recently-developed model-parallel learning algorithm for streaming decision trees as a base learner. This design allows us to neatly separate the boosting of the model from its training. As a result, BoostVHT provides a flexible learning framework that can employ any existing online boosting algorithm, while at the same time leveraging the computing power of modern parallel and distributed cluster environments. We implement our technique on Apache SAMOA, an open-source platform for mining big data streams that can run on several distributed execution engines, and demonstrate order-of-magnitude speedups compared to the state of the art.
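To make the separation between boosting and training concrete, the sketch below shows a minimal, illustrative Oza-Russell-style online boosting loop with a pluggable base learner. All names here (OnlineBooster, MajorityClassLearner) are hypothetical and are not the BoostVHT or Apache SAMOA API; the trivial stand-in learner merely occupies the slot that the model-parallel Vertical Hoeffding Tree fills in the paper.

```python
# Minimal, illustrative sketch of Oza-Russell-style online boosting with a
# pluggable base learner. Hypothetical names throughout; this is NOT the
# BoostVHT / Apache SAMOA API. BoostVHT plugs a model-parallel Vertical
# Hoeffding Tree into the slot held here by MajorityClassLearner.
import math
import random


class MajorityClassLearner:
    """Trivial stand-in base learner: predicts the most frequent label seen."""

    def __init__(self):
        self.counts = {}

    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None


class OnlineBooster:
    """Each instance flows through the ensemble members in sequence; its Poisson
    training weight grows on members that misclassify it (online boosting)."""

    def __init__(self, ensemble_size=10, base_factory=MajorityClassLearner, seed=42):
        self.members = [base_factory() for _ in range(ensemble_size)]
        self.correct = [0.0] * ensemble_size  # weight of correctly classified instances
        self.wrong = [0.0] * ensemble_size    # weight of misclassified instances
        self.rng = random.Random(seed)

    def train(self, x, y):
        lam = 1.0
        for m, member in enumerate(self.members):
            # Simulate weighted sampling: train k ~ Poisson(lambda) times.
            for _ in range(self._poisson(lam)):
                member.train(x, y)
            if member.predict(x) == y:
                self.correct[m] += lam
                lam *= (self.correct[m] + self.wrong[m]) / (2.0 * self.correct[m])
            else:
                self.wrong[m] += lam
                lam *= (self.correct[m] + self.wrong[m]) / (2.0 * self.wrong[m])

    def predict(self, x):
        # Weighted vote: members with lower weighted error get a larger log-odds vote.
        votes = {}
        for m, member in enumerate(self.members):
            total = self.correct[m] + self.wrong[m]
            if total == 0:
                continue
            err = self.wrong[m] / total
            if err >= 0.5:
                continue
            weight = 10.0 if err == 0.0 else math.log((1.0 - err) / err)  # clamp err == 0
            pred = member.predict(x)
            if pred is not None:
                votes[pred] = votes.get(pred, 0.0) + weight
        return max(votes, key=votes.get) if votes else None

    def _poisson(self, lam):
        # Knuth's algorithm; adequate for the small lambdas that arise here.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while p > threshold:
            k += 1
            p *= self.rng.random()
        return k - 1
```

As the abstract suggests, the boosting loop itself stays sequential, but because each member's training and prediction are delegated to a model-parallel base learner (the VHT in the paper), the expensive work is what gets distributed; the sketch above only mimics that interface with a trivial single-threaded learner.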


Published in

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017, 2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847

Copyright © 2017 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

CIKM '17 paper acceptance rate: 171 of 855 submissions, 20%
Overall acceptance rate: 1,861 of 8,427 submissions, 22%
