ABSTRACT
In recent years, the demand for computational power in data mining applications has increased due to rapidly growing data sets. As a consequence, standard algorithms need to be parallelized for fast processing of the generated data sets. Unfortunately, most approaches for parallelizing algorithms require a careful software design and a deep knowledge about thread-safe programming. As a consequence they are hardly applicable for rapid prototyping of new algorithms. We outline the process of multi-core parallelization using Clojure, a new functional programming language utilizing the Java Virtual Machine (JVM) that does not require knowledge of thread-safe programming. We provide some benchmark results for our multi-core algorithm to demonstrate its computational power. The rationale behind Clojure is combining the industry-standard JVM with functional programming, immutable data structures, and a built-in concurrency support via software transactional memory. This makes it a suitable tool for parallelization and rapid prototyping in many areas. In this case study we present a multi-core parallel implementation of the k-means cluster algorithm. The multi-core algorithm shows an increase in computation speed up to a factor of 10 compared to R or network based parallelization.
- A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha. Unlocking concurrency. ACM Queue, 4(10):24--33, 2006. Google ScholarDigital Library
- S. Ben-David, D. Pál, and H. U. Simon. Stability of k-means clustering. In N. H. Bshouty and C. Gentile, editors, Conference on Learning Theory, volume 4539 of Lecture Notes in Artificial Intelligence, pages 20--34, Berlin, 2007. Springer. Google ScholarDigital Library
- S. Ben-David, U. von Luxburg, and D. Pál. A sober look at clustering stability. In J. G. Carbonell and J. Siekmann, editors, Conference on Learning Theory, volume 4005 of Lecture Notes in Artificial Intelligence, pages 5--19, Berlin, 2006. Springer. Google ScholarDigital Library
- P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2):185--221, 1981. Google ScholarDigital Library
- A. Bertoni and G. Valentini. Random projections for assessing gene expression cluster stability. In Proceedings of the IEEE-International Joint Conference on Neural Networks (IJCNN), volume 1, pages 149--154. IEEE Computer Society, 2005.Google ScholarCross Ref
- B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, Cambridge, 2007. Google ScholarDigital Library
- E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553--569, 1983.Google ScholarCross Ref
- W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, Cambridge, 1999. Google ScholarDigital Library
- S. Halloway. Programming Clojure. Pragmatic Programmers, Raleigh, 2009. Google ScholarDigital Library
- J. Handl, J. Knowles, and D. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201--3212, 2005. Google ScholarDigital Library
- J. Hill, M. Hambley, T. Forster, M. Mewissen, T. M. Sloan, F. Scharinger, A. Trew, and P. Ghazal. Sprint: A new parallel framework for r. BMC Bioinformatics, 9(558), 2008.Google Scholar
- L. Hubert and P. Arabie. Comparing partitions. Journal of Mathematical Classification, 2:193--218, 1985.Google ScholarCross Ref
- P. Jaccard. Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des sciences naturelles, 44:223--270, 1908.Google Scholar
- A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988. Google ScholarDigital Library
- A. K. Jain and J. V. Moreau. Bootstrap technique in cluster analysis. Pattern Recognition, 20(5):547--568, 1987. Google ScholarDigital Library
- H. A. Kestler, A. Müller, F. Schwenker, T. Gress, T. Mattfeldt, and G. Palm. Cluster analysis of comparative genomic hybridization data. Lecture Notes NATO ASI: Aritificial Intelligence and Heuristic Methods for Bioinformatics, pages S--40, 2001. Abstract.Google Scholar
- D. Koenig, A. Glover, P. King, G. Laforge, and J. Skeet. Groovy in Action. Manning Publications Co., Greenwich, 2007. Google ScholarDigital Library
- P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe. Parakmeans: Implementation of a parallelized k-means algorithm suitable for general laboratory use. BMC Bioinformatics, 9(200), 2008.Google Scholar
- T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299--1323, 2004. Google ScholarDigital Library
- D. Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison Wesley, Boston, 2nd edition, 2000. Google ScholarDigital Library
- J. MacQueen. Some methods for classification and analysis of multivariate observations. In J. Neyman and L. L. Cam, editors, Proceedings of the 5th Berkeley Symposium on Math, Statistics and Probability, volume 1, pages 281--297, Berkely, 1967. University of California Press.Google Scholar
- C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. Stamp: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of the IEEE International Symposium on Workload Characterization, pages 35--46, Los Alamitos, 2008. IEEE Computer Society.Google Scholar
- M. Odersky, L. Spoon, and B. Venners. Programming in Scala. Artima, Mountain View, 2008.Google Scholar
- S. Peyton-Jones. Beautiful concurrency. In A. Oram and G. Wilson, editors, Beautiful code, chapter 24. O'Reilly, Sebastopol, 2007.Google Scholar
- R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, 2009. ISBN 3-900051-07-0.Google Scholar
- R. Rajwar and J. Goodman. Transactional execution: Toward reliable, high-performance multithreading. IEEE Micro, 23(6):117--125, 2003. Google ScholarDigital Library
- A. Rakhlin and A. Caponnetto. Stability of k-means clustering. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1121--128. MIT Press, Cambridge, 2007.Google Scholar
- W. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846--850, 1971.Google ScholarCross Ref
- N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pages 204--213, New York, 1995. ACM Press. Google ScholarDigital Library
- S. Thompson. Haskell: The Craft of Functional Programming. Addison Wesley, Boston, 2nd edition, 1999. Google ScholarDigital Library
Index Terms
- Multi-core parallelization in Clojure: a case study
Recommendations
A history of Clojure
Clojure was designed to be a general-purpose, practical functional language, suitable for use by professionals wherever its host language, e.g., Java, would be. Initially designed in 2005 and released in 2007, Clojure is a dialect of Lisp, but is not a ...
Clojure for Number Crunching on Multicore Machines
Clojure is a Lisp language designed to run on a Java Virtual Machine (JVM) and interoperate automatically with all Java libraries. However, compared to Java, Clojure has a concurrency API that encourages programmers to take advantage of multicore ...
Accelerating critical section execution with asymmetric multi-core architectures
ASPLOS 2009To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads which execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only ...
Comments