|
ABSTRACT
Using a mixture of random variables to model data is a tried-and-tested method common in data mining, machine learning, and statistics. By using mixture modeling it is often possible to accurately model even complex, multimodal data via very simple components. However, the classical mixture model assumes that a data point is generated by a single component in the model. A lot of datasets can be modeled closer to the underlying reality if we drop this restriction. We propose a probabilistic framework, the mixture-of-subsets (MOS) model, by making two fundamental changes to the classical mixture model. First, we allow a data point to be generated by a set of components, rather than just a single component. Next, we limit the number of data attributes that each component can influence. We also propose an EM framework to learn the MOS model from a dataset, and experimentally evaluate it on real, high-dimensional datasets. Our results show that the MOS model learned from the data represents the underlying nature of the data accurately.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Charu C. Aggarwal , Joel L. Wolf , Philip S. Yu , Cecilia Procopiuc , Jong Soo Park, Fast algorithms for projected clustering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.61-72, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
2
|
|
 |
3
|
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States
|
| |
4
|
|
| |
5
|
Aldous, D. J. 1985. Exchangeability and related topics. In ‘Ecole d’ “Et” e de Probabilities de SainFlour XII. Lecture Notes in Math, vol. 1117. Springer, Berlin.
|
| |
6
|
|
 |
7
|
Arindam Banerjee , Inderjit Dhillon , Joydeep Ghosh , Srujana Merugu , Dharmendra S. Modha, A generalized maximum entropy approach to bregman co-clustering and matrix approximation, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
[doi> 10.1145/1014052.1014111]
|
| |
8
|
Bilmes, J. 1998. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Tech. Rep. ICSI-TR-97-021, University of Berkeley.
|
 |
9
|
Igor V. Cadez , Scott Gaffney , Padhraic Smyth, A general probabilistic framework for clustering individuals and objects, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.140-149, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347119]
|
 |
10
|
Igor V. Cadez , Padhraic Smyth , Heikki Mannila, Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, p.37-46, August 26-29, 2001, San Francisco, California
[doi> 10.1145/502512.502523]
|
 |
11
|
|
 |
12
|
Chun-Hung Cheng , Ada Waichee Fu , Yi Zhang, Entropy-based subspace clustering for mining numerical data, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.84-93, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312199]
|
| |
13
|
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the em algorithm. J. Royal Statist. Soc. B-39, 1--39.
|
| |
14
|
|
 |
15
|
|
| |
16
|
Friedman, J. and Meulman, J. 2004. Clustering objects on subsets of attributes. J. Royal Statist. Soc. Series B(Statist. Methodol.) 66, 4, 815--849.
|
 |
17
|
Bin Gao , Tie-Yan Liu , Xin Zheng , Qian-Sheng Cheng , Wei-Ying Ma, Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
[doi> 10.1145/1081870.1081879]
|
| |
18
|
Graham, M. and Miller, D. 2006. Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection. IEEE Trans. Signal Proc. 54, 4, 1289--1303.
|
| |
19
|
Griffiths, T. and Ghahramani, Z. 2006. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems 18, Y. Weiss, et al., eds. MIT Press, Cambridge, MA, 475--482.
|
 |
20
|
Bing Liu , Yiyuan Xia , Philip S. Yu, Clustering through decision tree construction, Proceedings of the ninth international conference on Information and knowledge management, p.20-29, November 06-11, 2000, McLean, Virginia, United States
[doi> 10.1145/354756.354775]
|
| |
21
|
McLachlan, G. J. and Basford, K. E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
|
| |
22
|
McLachlan, G. J., Bean, R. W., and Peel, D. 2002. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 3, 413--422.
|
| |
23
|
McLachlan, G. J. and Peel, D. 2000. Finite Mixture Models. Wiley, New York.
|
| |
24
|
Nagesh, H., Goil, S., and Choudhary, A. 1999. Mafia: Efficient and scalable subspace clustering for very large datasets. Tech. Rep. CPDC-TR-9906-010. Center for Parallel and Distributed Computing, North Western University.
|
| |
25
|
Pitman, J. 2002. Combinatorial stochastic processes. Notes for Saint Flour Summer School.
|
 |
26
|
|
| |
27
|
Woo, K.-G., Lee, J.-H., Kim, M.-H., and Lee, Y.-J. 2004. Findit: A fast and intelligent subspace clustering algorithm using dimension voting. Inf. Softw. Technol. 46, 4, 255--271.
|
| |
28
|
|
|