Generalized component analysis for text with heterogeneous attributes
Source
Conference on Knowledge Discovery in Data
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
San Jose, California, USA
Session: Research track papers
Pages: 794-803
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Xuerui Wang  University of Massachusetts
Chris Pal  University of Massachusetts
Andrew McCallum  University of Massachusetts
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281192.1281277

ABSTRACT

We present a class of richly structured, undirected hidden variable models suitable for simultaneously modeling text along with other attributes encoded in different modalities. Our model generalizes techniques such as principal component analysis to heterogeneous data types. In contrast to other approaches, this framework allows modalities such as words, authors and timestamps to be captured in their natural, probabilistic encodings. A latent space representation for a previously unseen document can be obtained through a fast matrix multiplication using our method. We demonstrate the effectiveness of our framework on the task of author prediction from 13 years of the NIPS conference proceedings and for a recipient prediction task using a 10-month academic email archive of a researcher. Our approach should be more broadly applicable to many real-world applications where one wishes to efficiently make predictions for a large number of potential outputs using dimensionality reduction in a well-defined probabilistic framework.
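
The abstract's remark that an unseen document can be mapped into the latent space by a matrix multiplication can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_document`, the weight matrices `W_words` and `W_authors`, the toy sizes, and the choice to sum one linear projection per modality are all assumptions made for illustration, and the weights here are random rather than learned.

```python
import numpy as np

def project_document(word_counts, author_indicators, W_words, W_authors):
    """Map one document's observed attributes to a latent vector.

    Hypothetical sketch: each modality has its own weight matrix, and the
    latent representation is the sum of the per-modality linear projections.
    """
    return word_counts @ W_words + author_indicators @ W_authors

# Toy sizes (assumed): 6 vocabulary words, 3 candidate authors, 2 latent dimensions.
rng = np.random.default_rng(0)
W_words = rng.normal(size=(6, 2))    # stand-in for learned word weights
W_authors = rng.normal(size=(3, 2))  # stand-in for learned author weights

# A previously unseen document: word counts plus a binary author indicator vector.
words = np.array([2.0, 0.0, 1.0, 0.0, 0.0, 3.0])
authors = np.array([1.0, 0.0, 1.0])

z = project_document(words, authors, W_words, W_authors)
print(z.shape)  # one latent coordinate per latent dimension
```

Because the mapping is a single pair of matrix-vector products, projecting a new document costs time proportional to the number of attributes times the latent dimension, with no iterative inference step in this simplified setting.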


