Research Article
DOI: 10.1145/2020408.2020464

NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on MapReduce

Published: 21 August 2011

ABSTRACT

In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.
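To make the abstract's point concrete, the sketch below shows the raw MapReduce pattern that the paper calls "too low-level" for ML-DM work: even a simple ML primitive such as computing per-class feature means requires hand-writing the map, shuffle, and reduce stages. This is an illustrative single-process simulation of the paradigm, not NIMBLE's actual API (which the abstract does not specify); all function names here are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (class_label, (feature_value, count)) pairs, one per record.
    for label, value in records:
        yield label, (value, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the MR runtime would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a per-class mean.
    result = {}
    for label, values in groups.items():
        total = sum(v for v, _ in values)
        count = sum(c for _, c in values)
        result[label] = total / count
    return result

records = [("a", 1.0), ("a", 3.0), ("b", 4.0)]
means = reduce_phase(shuffle(map_phase(records)))
print(means)  # {'a': 2.0, 'b': 4.0}
```

NIMBLE's contribution, per the abstract, is to hide this plumbing behind reusable serial and parallel building blocks that the runtime (here, Hadoop) executes on the user's behalf.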


Published in

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011, 1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
