| An investigation of linguistic features and clustering algorithms for topical document clustering |
| Source
|
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Athens, Greece
Pages: 224 - 231
Year of Publication: 2000
ISBN:1-58113-226-3
|
|
Authors
|
|
Vasileios Hatzivassiloglou
|
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
|
|
Luis Gravano
|
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
|
|
Ankineedu Maganti
|
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
|
|
| Bibliometrics |
Downloads (6 Weeks): 7, Downloads (12 Months): 99, Citation Count: 11
|
|
|
ABSTRACT
We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in the official TDT2 competition.
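As a rough illustration of the ideas named in the abstract, the Python sketch below combines two per-pair similarity scores through a logistic (log-linear) model and then runs a simple agglomerative clustering with a selectable linkage rule. It is not the authors' implementation: the weights, the feature names `words` and `names`, the 0.5 threshold, and the toy similarity values are all hypothetical, and the single-pass method is not covered.

```python
import math
from itertools import combinations

# Hypothetical weights. In practice they would be fitted by regression
# on labeled document pairs rather than chosen by hand.
WEIGHTS = {"bias": -4.0, "words": 5.0, "names": 3.0}


def combined_similarity(features):
    """Map several similarity scores in [0, 1] to one probability-like score."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Each linkage rule reduces the list of cross-cluster pair similarities to one score.
LINKAGE = {
    "single": max,                            # most similar cross-cluster pair
    "complete": min,                          # least similar cross-cluster pair
    "average": lambda xs: sum(xs) / len(xs),  # mean over all cross-cluster pairs
}


def agglomerate(n_docs, sim, linkage="average", threshold=0.5):
    """Greedily merge the best-scoring cluster pair until none reaches the threshold.

    sim maps frozenset({i, j}) to the similarity of documents i and j.
    """
    link = LINKAGE[linkage]
    clusters = [frozenset([i]) for i in range(n_docs)]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for a, b in combinations(clusters, 2):
            score = link([sim[frozenset((i, j))] for i in a for j in b])
            if score > best_score:
                best_score, best_pair = score, (a, b)
        if best_score < threshold:
            break
        a, b = best_pair
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


# Toy data (hypothetical): four documents, two feature scores per pair.
toy_features = {
    (0, 1): {"words": 0.9, "names": 0.8},
    (2, 3): {"words": 0.8, "names": 0.9},
    (1, 2): {"words": 0.6, "names": 0.5},
    (0, 2): {"words": 0.1, "names": 0.0},
    (0, 3): {"words": 0.1, "names": 0.0},
    (1, 3): {"words": 0.1, "names": 0.0},
}
toy_sim = {frozenset(pair): combined_similarity(f) for pair, f in toy_features.items()}

for rule in ("single", "complete", "average"):
    print(rule, agglomerate(4, toy_sim, linkage=rule))
```

On this toy input, single-link chains all four documents together through the moderately similar pair (1, 2), while complete-link and average-link keep two clusters, which hints at why the choice of algorithm can affect the resulting partition.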
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
John Aberdeen, John Burger, David Day, Lynette Hirschman, Patricia Robinson, Marc Vilain, MITRE: description of the Alembic system used for MUC-6, Proceedings of the 6th conference on Message understanding, November 06-08, 1995, Columbia, Maryland
[doi> 10.3115/1072399.1072413]
|
| |
2
|
D. M. Bates and D. G. Watts. Nonlinear Regression Analysis and Its Applications. Wiley, New York, 1988.
|
 |
3
|
|
| |
4
|
J. Fiscus, G. Doddington, J. Garofolo, and A. Martin. NIST's 1998 Topic Detection and Tracking evaluation (TDT2). In Proceedings of the 1999 DARPA Broadcast News Workshop, pages 19-24, Herndon, Virginia, February-March 1999.
|
| |
5
|
|
| |
6
|
|
 |
7
|
|
| |
8
|
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.
|
| |
9
|
Mark Liberman. Topic Detection and Tracking Principal Investigators meeting, 1998.
|
| |
10
|
Stephen A. Lowe. The beta-binomial mixture model and its application to TDT tracking and detection. In Proceedings of the 1999 DARPA Broadcast News Workshop, pages 127-131, Herndon, Virginia, February-March 1999.
|
| |
11
|
Kathleen R. McKeown, Judith L. Klavans, Vasileios Hatzivassiloglou, Regina Barzilay, Eleazar Eskin, Towards multidocument summarization by reformulation: progress and prospects, Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference, p.453-460, July 18-22, 1999, Orlando, Florida, United States
|
| |
12
|
National Institute of Standards and Technology. The Topic Detection and Tracking Phase 2 (TDT2) evaluation plan, 1998. Version 3.7, August 3rd, 1998. Available from http://www.itl.nist.gov/iaui/894.01/tdt98/doc/tdt2.eval.plan.98.v3.7.pdf.
|
| |
13
|
Ron Papka, James Allan, and Victor Lavrenko. UMass approaches to detection and tracking at TDT2. In Proceedings of the 1999 DARPA Broadcast News Workshop, pages 111-116, Herndon, Virginia, February-March 1999.
|
| |
14
|
|
| |
15
|
|
 |
16
|
|
| |
17
|
T. J. Santner and D. E. Duffy. The Statistical Analysis of Discrete Data. Springer-Verlag, New York, 1989.
|
| |
18
|
|
| |
19
|
N. Wacholder. Simplex NPs clustered by head: A method for identifying significant topics in a document. In Proceedings of the COLING/ACL Workshop on the Computational Treatment of Nominals, pages 70-79, Montreal, Canada, October 1998.
|
| |
20
|
|
 |
21
|
|
 |
22
|
|
CITED BY 11
|
|
James Henderson, Paola Merlo, Ivan Petroff, Gerold Schneider, Using syntactic analysis to increase efficiency in visualizing text collections, Proceedings of the 19th international conference on Computational linguistics, p.1-7, August 24-September 01, 2002, Taipei, Taiwan
|
|
|
|
|
Kathleen R. McKeown, Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Judith L. Klavans, Ani Nenkova, Carl Sable, Barry Schiffman, Sergey Sigelman, Tracking and summarizing news on a daily basis with Columbia's Newsblaster, Proceedings of the second international conference on Human Language Technology Research, p.280-285, March 24-27, 2002, San Diego, California
|
|
|
|
|
|
|
|
Dou Shen, Qiang Yang, Jian-Tao Sun, Zheng Chen, Thread detection in dynamic text message streams, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 06-11, 2006, Seattle, Washington, USA
|
|
|
|
D. L. Chan, R. W. P. Luk, W. K. Mak, H. V. Leong, E. K. S. Ho, Q. Lu, Multiple related document summary and navigation using concept hierarchies for mobile clients, Proceedings of the 2002 ACM symposium on Applied computing, March 11-14, 2002, Madrid, Spain
|
|
Martin Franz, Todd Ward, J. Scott McCarley, Wei-Jing Zhu, Unsupervised and supervised clustering for topic tracking, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.310-317, September 2001, New Orleans, Louisiana, United States
|
|
|
|
|
|
Hong Yu, Minsuk Lee, David Kaufman, John Ely, Jerome A. Osheroff, George Hripcsak, James Cimino, Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians, Journal of Biomedical Informatics, v.40 n.3, p.236-251, June 2007
|