|
ABSTRACT
High dimensional directional data is becoming increasingly important in contemporary applications such as analysis of text and gene-expression data. A natural model for multi-variate directional data is provided by the von Mises-Fisher (vMF) distribution on the unit hypersphere that is analogous to the multi-variate Gaussian distribution in Rd. In this paper, we propose modeling complex directional data as a mixture of vMF distributions. We derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the parameters of this mixture. We also propose two clustering algorithms corresponding to these variants. An interesting aspect of our methodology is that the spherical kmeans algorithm (kmeans with cosine similarity) can be shown to be a special case of both our algorithms. Thus, modeling text data by vMF distributions lends theoretical validity to the use of cosine similarity which has been widely used by the information retrieval community. As part of experimental validation, we present results on modeling high-dimensional text and gene-expression data as a mixture of vMF distributions. The results indicate that our approach yields superior clusterings especially for difficult clustering tasks in high-dimensional spaces.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on hyperspheres using Expectation Maximization. Technical Report TR-03-07, Department of Computer Sciences, University of Texas, February 2003.
|
| |
3
|
A. Banerjee and J. Ghosh. Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres. In Proceedings International Joint Conference on Neural Networks, pages 1590--1595, May 2002.
|
| |
4
|
P. S. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, Microsoft Research, May 2000.
|
| |
5
|
M. Collins. The EM algorithm. In fulfillment of Written Preliminary Exam II requirement, September 1997.
|
| |
6
|
|
| |
7
|
|
| |
8
|
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1--38, 1977.
|
| |
9
|
I. Dhillon and J. Kogan, editors. Proc. Workshop on Clustering High Dimensional Data and its Applications. SIAM, 2002.
|
| |
10
|
|
| |
11
|
|
| |
12
|
I. S. Dhillon and S. Sra. Modeling data using directional distributions. Technical Report TR-06-03, University of Texas, Dept. of Coputer Sciences, February 2003.
|
| |
13
|
M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci., 95:14863--14868, 1998.
|
| |
14
|
T. R. H. et. al Functional discovery via a compendium of expression profiles. Cell, 102:109--126, 2000.
|
| |
15
|
J. Ghosh. Scalable clustering methods for data mining. In N. Ye, editor, Handbook of Data Mining, pages 247--277. Lawrence Erlbaum, 2003.
|
| |
16
|
G. K. Gupta and J. Ghosh. Detecting seasonal and divergent trends and visualization for very high dimensional transactional data. In Proc. 1st SIAM Intl. Conf. on Data Mining, April 2001.
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
M. Kearns, Y. Mansour, and A. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In 13th Annual Conf. on Uncertainty in Artificial Intelligence (UA197), 1997.
|
| |
21
|
K. V. Mardia. Statistical Distributions in Scienctific Work, volume 3, chapter Characteristics of directional distributions, pages 365--385. Reidel, Dordrecht, 1975.
|
| |
22
|
K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., 2nd edition, 2000.
|
| |
23
|
|
| |
24
|
|
| |
25
|
C. R. Rao. Linear Statistical Inference and its Applications. Wiley, New York, 2nd edition, 1973.
|
 |
26
|
Badrul Sarwar , George Karypis , Joseph Konstan , John Reidl, Item-based collaborative filtering recommendation algorithms, Proceedings of the 10th international conference on World Wide Web, p.285-295, May 01-05, 2001, Hong Kong, Hong Kong
[doi> 10.1145/371920.372071]
|
| |
27
|
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.
|
| |
28
|
|
| |
29
|
P. Smyth. Clustering sequences with hidden Markov models. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing, volume 9, pages 648--654. MIT Press, 1997.
|
| |
30
|
A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proc 7th Natl Conf on Artificial Intelligence: Workshop of AI for Web Search (AAAI 2000), pages 58--64. AAAI, July 2000.
|
CITED BY 9
|
|
|
|
|
|
|
|
|
|
|
|
Bo Long , Xiaoyun Wu , Zhongfei (Mark) Zhang , Philip S. Yu, Unsupervised learning on k-partite graphs, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
Hui Han , Lee Giles , Hongyuan Zha , Cheng Li , Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries, June 07-11, 2004, Tuscon, AZ, USA
|
|
|
|
Peer to Peer - Readers of this Article have also read:
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
-
Putting innovation to work: adoption strategies for multimedia communication systems
Communications of the ACM
34, 12
Ellen Francik
, Susan Ehrlich Rudman
, Donna Cooper
, Stephen Levine
-
An intelligent component database for behavioral synthesis
Proceedings of the 27th ACM/IEEE conference on Design automation
Gwo-Dong Chen
, Daniel D. Gajski
|