Clustering documents in a web directory
Source
Workshop On Web Information And Data Management
Proceedings of the 5th ACM international workshop on Web information and data management
New Orleans, Louisiana, USA
SESSION: Web clustering and usage mining
Pages: 66-73
Year of Publication: 2003
ISBN: 1-58113-725-7
ABSTRACT
Hierarchical categorization of documents is receiving growing interest due to the proliferation of topic hierarchies for text documents. The main drawback of hierarchical supervised classifiers is their high demand for labeled examples, the number of which depends on the number of topics in the taxonomy. Hence, bootstrapping a huge hierarchy with a proper set of labeled examples is a critical issue. In this paper, we propose some solutions to the bootstrapping problem that implicitly or explicitly use a taxonomy definition: a baseline approach, in which documents are classified according to class labels, and two clustering approaches, in which training is constrained by a priori knowledge of the taxonomy structure at both the terminological and the topological level. In particular, we propose the TaxSOM model, which clusters a set of documents in a predefined hierarchy of classes, directly exploiting knowledge of both their topological organization and their lexical description. Experimental evaluation was performed on a set of taxonomies taken from the Google Web directory.
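As a rough illustration of what classifying documents "according to class labels" could look like, the sketch below assigns a document to the taxonomy node whose label words appear most often in its text. This is not the authors' baseline: the abstract gives no implementation details, so the word-overlap scoring, the function names `tokenize` and `classify_by_labels`, and the toy taxonomy paths are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify_by_labels(document, taxonomy_paths):
    """Return the taxonomy path whose label words occur most often in the document.

    Assumed scoring (hypothetical): sum the document frequency of each
    distinct word in the path label. Ties go to the path listed first;
    None is returned when no label word appears in the document.
    """
    counts = Counter(tokenize(document))
    best_path, best_score = None, 0
    for path in taxonomy_paths:
        label_words = set(tokenize(path))
        score = sum(counts[word] for word in label_words)
        if score > best_score:
            best_path, best_score = path, score
    return best_path


# Toy taxonomy with made-up paths (not taken from the paper's data).
taxonomy = ["Arts/Music/Jazz", "Arts/Movies", "Science/Physics"]
doc = "A review of a jazz concert: the music was superb."
print(classify_by_labels(doc, taxonomy))  # prints "Arts/Music/Jazz"
```

No labeled training examples are needed here, which is the appeal of a label-driven baseline for bootstrapping; the price is that a document using none of the label words stays unclassified.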