DOI: 10.1145/2002945.2002947
Research article

Distributed tuning of machine learning algorithms using MapReduce Clusters

Published: 21 August 2011

ABSTRACT

Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism problem. We show how model accuracy improves as a function of the percent of parameter space explored, that accuracy can be hurt by exploring parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters.
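The parallelism the abstract describes is embarrassingly parallel: each candidate parameter setting can be trained and evaluated independently (a map task), and a single reduce step selects the best setting on held-out data. The sketch below illustrates that map/reduce structure under stated assumptions: a hypothetical parameter grid, synthetic data, a gradient-boosted-tree model as a stand-in for the paper's boosted trees, and a local process pool standing in for the cluster's map phase. It is not the authors' implementation.

```python
# Illustrative sketch: hyperparameter grid search in MapReduce style.
# Each map task trains and scores one parameter setting; the reduce step
# keeps the best one. Grid, data, and model choices here are hypothetical.
from itertools import product
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

GRID = list(product([0.01, 0.05, 0.1],   # learning_rate
                    [2, 4, 6],           # max_depth
                    [100, 300]))         # n_estimators

def map_task(params):
    """Map phase: train and score one parameter setting on held-out data."""
    lr, depth, n_est = params
    model = GradientBoostingClassifier(learning_rate=lr, max_depth=depth,
                                       n_estimators=n_est, random_state=0)
    model.fit(X_tr, y_tr)
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    return params, score

def reduce_task(results):
    """Reduce phase: keep the setting with the best validation score."""
    return max(results, key=lambda kv: kv[1])

if __name__ == "__main__":
    with Pool() as pool:
        # On a MapReduce cluster, each grid point would be an independent task.
        results = pool.map(map_task, GRID)
    best_params, best_score = reduce_task(results)
    print("best params:", best_params, "validation AUC:", round(best_score, 4))
```

Note that selecting the maximum over many validation scores is exactly where the overfitting risk mentioned in the abstract arises: the more settings evaluated against the same held-out set, the more likely the winner is good by chance.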



    Reviews

    Amrinder Arora

While machine learning algorithms have been around for a very long time, they invariably have a human component in the form of tuning, that is, finding the right values for the parameters specific to a training set. This can take a long time, because the tuning step must be repeated many times and each repetition is itself expensive. MapReduce and cloud technologies, which combine the power of distributed processing with the option of massively scaling the hardware, can shorten that step. This paper does exactly that, presenting ideas for tuning machine learning algorithms by distributing the work with MapReduce.

The authors consider two different machine learning tasks. The first is related to the ranking of results; the authors use the LambdaMART algorithm and the NDCG@k evaluation metric for this task. The second is a binary classification task, detecting vandalism in Wikipedia edits; the authors use a roughly balanced random forest (RBRF) algorithm and the area under the curve (AUC) evaluation metric for this task. These are two specific, but practical and important, contributions of this work. The results indicate some progress, at least in the context of the specific evaluation metrics. Using MapReduce, the authors present ideas on how the tuning steps can be shortened, thereby saving practitioners countless hours.

However, the disconnect between the two problems and their corresponding algorithms and evaluation metrics is hard to miss. The only thread that ties them together is that they are both machine learning algorithms; in that sense, this paper is an amalgam of two mini-papers, and the thread connecting them is rather weak. For example, it is unclear whether the presented work would remain useful for the same problem and the same algorithm if only the evaluation metric were changed; at the very least, no results are presented that suggest this. If the two insights are simply that MapReduce is a good framework for distributing and harnessing the almost limitless computing power of the cloud, and that machine learning algorithms are good candidates for the MapReduce framework, then that much is accepted without reservation. No further general insights on the matter are presented.

Online Computing Reviews Service
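The review refers to NDCG@k without defining it; the metric discounts graded relevance gains by rank position and normalizes against the ideal ordering of the same documents. A minimal sketch of the standard formulation (illustrative only, not code from the paper):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: graded relevance labels of documents in the order the ranker returned them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```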

Published in

LDMTA '11: Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
August 2011, 45 pages
ISBN: 9781450308441
DOI: 10.1145/2002945
Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States
