DOI: 10.1145/2002945.2002947
Research article

Distributed tuning of machine learning algorithms using MapReduce Clusters

Published: 21 August 2011

ABSTRACT

Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism problem. We show how model accuracy improves as a function of the percent of parameter space explored, that accuracy can be hurt by exploring parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters.
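The parallelism the abstract describes is embarrassingly parallel: each candidate parameter setting can be trained and evaluated independently (a map task), and a single reduce step selects the best setting on held-out data. The sketch below illustrates that map/reduce structure under stated assumptions: a hypothetical parameter grid, synthetic data, a gradient-boosted-tree model as a stand-in for the paper's boosted trees, and a local process pool standing in for the cluster's map phase. It is not the authors' implementation.

```python
# Illustrative sketch: hyperparameter grid search in MapReduce style.
# Each map task trains and scores one parameter setting; the reduce step
# keeps the best one. Grid, data, and model choices here are hypothetical.
from itertools import product
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

GRID = list(product([0.01, 0.05, 0.1],   # learning_rate
                    [2, 4, 6],           # max_depth
                    [100, 300]))         # n_estimators

def map_task(params):
    """Map phase: train and score one parameter setting on held-out data."""
    lr, depth, n_est = params
    model = GradientBoostingClassifier(learning_rate=lr, max_depth=depth,
                                       n_estimators=n_est, random_state=0)
    model.fit(X_tr, y_tr)
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return params, score

def reduce_task(results):
    """Reduce phase: keep the setting with the best validation score."""
    return max(results, key=lambda kv: kv[1])

if __name__ == "__main__":
    with Pool() as pool:
        # On a MapReduce cluster, each grid point would be an independent task.
        results = pool.map(map_task, GRID)
    best_params, best_score = reduce_task(results)
    print("best params:", best_params, "validation AUC:", round(best_score, 4))
```

Note that selecting the maximum over many validation scores is exactly where the overfitting risk mentioned in the abstract arises: the more settings evaluated against the same held-out set, the more likely the winner is good by chance.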



    Reviews

    Amrinder Arora

While machine learning algorithms have been around for a very long time, they invariably have a human component in the form of tuning, that is, finding the right values for the parameters specific to a training set. This can take a long time, because the tuning step must be repeated many times and each repetition is itself expensive. MapReduce and cloud technologies, which combine the power of distributed processing with the option of massively scaling the hardware, can shorten that step. This paper does exactly that, presenting ideas for tuning machine learning algorithms by distributing the work with MapReduce.

The authors consider two different machine learning tasks. The first is related to the ranking of results; the authors use the LambdaMART algorithm and the NDCG@k evaluation metric for this task. The second is a binary classification task, detecting vandalism in Wikipedia edits; the authors use a roughly balanced random forest (RBRF) algorithm and the area under the curve (AUC) evaluation metric for this task. These are two specific, but practical and important, contributions of this work. The results indicate some progress, at least in the context of the specific evaluation metrics. Using MapReduce, the authors present ideas on how the tuning steps can be shortened, thereby saving practitioners countless hours.

However, the disconnect between the two problems and their corresponding algorithms and evaluation metrics is hard to miss. The only thread that ties them together is that they are both machine learning algorithms; in that sense, this paper is an amalgam of two mini-papers, and the thread connecting them is rather weak. For example, it is unclear whether the presented work would remain useful for the same problem and the same algorithm if only the evaluation metric were changed; at the very least, no results are presented that suggest this. If the two insights are simply that MapReduce is a good framework for distributing and harnessing the almost limitless computing power of the cloud, and that machine learning algorithms are good candidates for the MapReduce framework, then that much is accepted without reservation. No further general insights on the matter are presented.

Online Computing Reviews Service
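The review refers to NDCG@k without defining it; the metric discounts graded relevance gains by rank position and normalizes against the ideal ordering of the same documents. A minimal sketch of the standard formulation (illustrative only, not code from the paper):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: graded relevance labels of documents in the order the ranker returned them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```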

Published in

LDMTA '11: Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
August 2011, 45 pages
ISBN: 9781450308441
DOI: 10.1145/2002945
Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States
