skip to main content
10.1145/1562868.1562870acmotherconferencesArticle/Chapter ViewAbstractPublication PagesecoopConference Proceedingsconference-collections
research-article

Multi-core parallelization in Clojure: a case study

Published:06 July 2009Publication History

ABSTRACT

In recent years, the demand for computational power in data mining applications has increased due to rapidly growing data sets. As a consequence, standard algorithms need to be parallelized for fast processing of the generated data sets. Unfortunately, most approaches for parallelizing algorithms require a careful software design and a deep knowledge about thread-safe programming. As a consequence they are hardly applicable for rapid prototyping of new algorithms. We outline the process of multi-core parallelization using Clojure, a new functional programming language utilizing the Java Virtual Machine (JVM) that does not require knowledge of thread-safe programming. We provide some benchmark results for our multi-core algorithm to demonstrate its computational power. The rationale behind Clojure is combining the industry-standard JVM with functional programming, immutable data structures, and a built-in concurrency support via software transactional memory. This makes it a suitable tool for parallelization and rapid prototyping in many areas. In this case study we present a multi-core parallel implementation of the k-means cluster algorithm. The multi-core algorithm shows an increase in computation speed up to a factor of 10 compared to R or network based parallelization.

References

  1. A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha. Unlocking concurrency. ACM Queue, 4(10):24--33, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Ben-David, D. Pál, and H. U. Simon. Stability of k-means clustering. In N. H. Bshouty and C. Gentile, editors, Conference on Learning Theory, volume 4539 of Lecture Notes in Artificial Intelligence, pages 20--34, Berlin, 2007. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Ben-David, U. von Luxburg, and D. Pál. A sober look at clustering stability. In J. G. Carbonell and J. Siekmann, editors, Conference on Learning Theory, volume 4005 of Lecture Notes in Artificial Intelligence, pages 5--19, Berlin, 2006. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2):185--221, 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Bertoni and G. Valentini. Random projections for assessing gene expression cluster stability. In Proceedings of the IEEE-International Joint Conference on Neural Networks (IJCNN), volume 1, pages 149--154. IEEE Computer Society, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  6. B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, Cambridge, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553--569, 1983.Google ScholarGoogle ScholarCross RefCross Ref
  8. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, Cambridge, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Halloway. Programming Clojure. Pragmatic Programmers, Raleigh, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Handl, J. Knowles, and D. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201--3212, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Hill, M. Hambley, T. Forster, M. Mewissen, T. M. Sloan, F. Scharinger, A. Trew, and P. Ghazal. Sprint: A new parallel framework for r. BMC Bioinformatics, 9(558), 2008.Google ScholarGoogle Scholar
  12. L. Hubert and P. Arabie. Comparing partitions. Journal of Mathematical Classification, 2:193--218, 1985.Google ScholarGoogle ScholarCross RefCross Ref
  13. P. Jaccard. Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des sciences naturelles, 44:223--270, 1908.Google ScholarGoogle Scholar
  14. A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. K. Jain and J. V. Moreau. Bootstrap technique in cluster analysis. Pattern Recognition, 20(5):547--568, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H. A. Kestler, A. Müller, F. Schwenker, T. Gress, T. Mattfeldt, and G. Palm. Cluster analysis of comparative genomic hybridization data. Lecture Notes NATO ASI: Aritificial Intelligence and Heuristic Methods for Bioinformatics, pages S--40, 2001. Abstract.Google ScholarGoogle Scholar
  17. D. Koenig, A. Glover, P. King, G. Laforge, and J. Skeet. Groovy in Action. Manning Publications Co., Greenwich, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe. Parakmeans: Implementation of a parallelized k-means algorithm suitable for general laboratory use. BMC Bioinformatics, 9(200), 2008.Google ScholarGoogle Scholar
  19. T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299--1323, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. D. Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison Wesley, Boston, 2nd edition, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. MacQueen. Some methods for classification and analysis of multivariate observations. In J. Neyman and L. L. Cam, editors, Proceedings of the 5th Berkeley Symposium on Math, Statistics and Probability, volume 1, pages 281--297, Berkely, 1967. University of California Press.Google ScholarGoogle Scholar
  22. C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. Stamp: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of the IEEE International Symposium on Workload Characterization, pages 35--46, Los Alamitos, 2008. IEEE Computer Society.Google ScholarGoogle Scholar
  23. M. Odersky, L. Spoon, and B. Venners. Programming in Scala. Artima, Mountain View, 2008.Google ScholarGoogle Scholar
  24. S. Peyton-Jones. Beautiful concurrency. In A. Oram and G. Wilson, editors, Beautiful code, chapter 24. O'Reilly, Sebastopol, 2007.Google ScholarGoogle Scholar
  25. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, 2009. ISBN 3-900051-07-0.Google ScholarGoogle Scholar
  26. R. Rajwar and J. Goodman. Transactional execution: Toward reliable, high-performance multithreading. IEEE Micro, 23(6):117--125, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Rakhlin and A. Caponnetto. Stability of k-means clustering. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1121--128. MIT Press, Cambridge, 2007.Google ScholarGoogle Scholar
  28. W. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846--850, 1971.Google ScholarGoogle ScholarCross RefCross Ref
  29. N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pages 204--213, New York, 1995. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Thompson. Haskell: The Craft of Functional Programming. Addison Wesley, Boston, 2nd edition, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Multi-core parallelization in Clojure: a case study

      Recommendations

      Reviews

      Arthur Gittleman

      Presented at the 2009 European Lisp Workshop (ELW), this paper shows, via a case study, how Clojure, a new language in the Lisp family, can be used for the rapid prototyping of multicore data mining applications. Clojure is a functional Java Virtual Machine (JVM) language with concurrency support that uses software transactional memory. It is easier to implement multicore parallelization in Clojure than in languages that require "a deep knowledge about thread-safe programming." This makes Clojure a good candidate for the "rapid prototyping of new algorithms" and for researchers who are more interested in applications than in the complexities of concurrent programming. Section 1 introduces parallel programming concepts, and Section 2 briefly introduces Clojure, without any details. Section 3 describes a k -means clustering case study. Data can be grouped into clusters, where each cluster contains a group of data items so that the items in the cluster are more similar to each other than to data in other clusters. Although there are actually several algorithms used for k -means clustering, "The k -means Clustering Algorithm" is a subsection of Section 3. In Section 4, "Experiments and Results," the first results used simulated data without clustering structure, with two datasets: a smaller set, with 10,000 cases and 100 dimensions, and a larger one, with 1,000,000 cases and 200 dimensions. Both results used 20 clusters. The test hardware was a dual quad-core machine. The authors compared Clojure to ParaKMeans-a Web application-and R-a software environment for statistical computing that has a k -means function. ParaKMeans could not handle the larger dataset; considering it is a Web application, it is not a good comparison for Clojure. The results show that Clojure performs worse than R for the smaller dataset, but better than R for the larger dataset. As Kraus and Kestler conclude, for the smaller dataset, "the computational overhead of the parallelization negatively effects [sic] the runtime." Unfortunately, the authors do not include the algorithms used, although the R documentation mentions that it uses a better algorithm than the one commonly used. Appendix A includes the 113-line Clojure source code that runs if a source file and a sampling function are provided. While this paper may present Clojure's value, the benchmark results require more analysis. Online Computing Reviews Service

      Access critical reviews of Computing literature here

      Become a reviewer for Computing Reviews.

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        ELW '09: Proceedings of the 6th European Lisp Workshop
        July 2009
        35 pages
        ISBN:9781605585390
        DOI:10.1145/1562868

        Copyright © 2009 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 July 2009

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader