ABSTRACT
Obtaining the best accuracy in machine learning usually requires carefully tuning the learning algorithm's parameters for each problem. This parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism detection problem. We show how model accuracy improves as a function of the percentage of the parameter space explored, that accuracy can be hurt by exploring the parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a double-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but it also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters.
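The parallelization pattern described here can be illustrated with a minimal map/reduce sketch: each map task trains and evaluates one model for a single point in the hyperparameter grid, and the reduce step selects the grid point with the best held-out score. The sketch below is not the authors' Hadoop implementation; it simulates the map phase with a local process pool and uses scikit-learn's gradient-boosted trees as a stand-in for the boosted-tree learner tuned in the paper. The dataset, grid values, and validation split are illustrative assumptions.

```python
# Minimal sketch of map/reduce-style parallel hyperparameter search.
# Assumptions: synthetic data, an ad-hoc grid, and scikit-learn's
# GradientBoostingClassifier in place of the paper's learners.
import itertools
from functools import reduce
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: a synthetic binary classification task with a held-out split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Regularization-style hyperparameter grid; one grid point per map task.
GRID = list(itertools.product(
    [0.05, 0.1, 0.2],   # learning_rate (shrinkage)
    [2, 3, 4],          # max_depth (tree size)
    [0.5, 0.8, 1.0],    # subsample (stochastic gradient boosting)
))

def map_task(params):
    """Map phase: train one model for one grid point, emit (params, validation score)."""
    lr, depth, sub = params
    model = GradientBoostingClassifier(
        learning_rate=lr, max_depth=depth, subsample=sub,
        n_estimators=100, random_state=0,
    )
    model.fit(X_tr, y_tr)
    return params, model.score(X_val, y_val)

def reduce_task(best, current):
    """Reduce phase: keep the grid point with the higher validation score."""
    return current if current[1] > best[1] else best

if __name__ == "__main__":
    with Pool() as pool:                 # stands in for the cluster's mappers
        scored = pool.map(map_task, GRID)
    best_params, best_score = reduce(reduce_task, scored)
    print(f"best params {best_params} -> validation accuracy {best_score:.3f}")
```

On an actual MapReduce cluster the same decomposition applies: each grid point becomes an input record, the map function trains and scores one model on a worker node, and the reduce step aggregates the (parameters, score) pairs to find the best configuration. Because the grid points are independent, the search is embarrassingly parallel, which is what makes exploring a very large parameter space feasible, and also what makes the overfitting risk discussed in the abstract easy to run into.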