ABSTRACT
This paper shows how citation-based information and structural content (e.g., title, abstract) can be combined to improve the classification of text documents into predefined categories. We evaluate different measures of similarity, five derived from the citation information of the collection and three derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments with the ACM Computing Classification Scheme, using documents from the ACM Digital Library, indicate that GP can discover similarity functions superior to those based solely on a single type of evidence. When the discovered similarity functions are applied through simple majority voting, their effectiveness exceeds that of both content-based and combination-based Support Vector Machine classifiers. We also conducted experiments comparing GP with other fusion techniques, such as Genetic Algorithms (GA) and linear fusion. The empirical results show that GP discovered better similarity functions than GA or the other fusion techniques.
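To make the idea of fusing evidence concrete, the following is a minimal sketch of the linear-fusion baseline mentioned above, followed by a simple majority vote among the most similar labeled documents. It is not the GP-discovered similarity function from the paper. The evidence names (`title_sim`, `abstract_sim`, `citation_sim`), the weights, the neighbourhood size `k`, and the use of nearest-neighbour voting are all assumptions chosen for illustration.

```python
from collections import Counter


def fuse_linear(scores, weights):
    """Linear fusion: weighted sum of per-evidence similarity scores.

    `scores` and `weights` are dicts keyed by (hypothetical) evidence names.
    """
    return sum(weights[name] * value for name, value in scores.items())


def classify_by_vote(query_sims, train_labels, weights, k=3):
    """Label a query document by majority vote among its k most similar
    training documents, ranked by the fused similarity score.

    `query_sims` maps a training-document id to that document's
    per-evidence similarity to the query.
    """
    fused = {doc: fuse_linear(s, weights) for doc, s in query_sims.items()}
    top_k = sorted(fused, key=fused.get, reverse=True)[:k]
    votes = Counter(train_labels[doc] for doc in top_k)
    return votes.most_common(1)[0][0]


# Illustrative (made-up) similarities between one query and four
# labeled documents; labels mimic ACM classification codes.
weights = {"title_sim": 0.3, "abstract_sim": 0.3, "citation_sim": 0.4}
query_sims = {
    "d1": {"title_sim": 0.9, "abstract_sim": 0.8, "citation_sim": 0.7},
    "d2": {"title_sim": 0.2, "abstract_sim": 0.1, "citation_sim": 0.9},
    "d3": {"title_sim": 0.6, "abstract_sim": 0.7, "citation_sim": 0.5},
    "d4": {"title_sim": 0.1, "abstract_sim": 0.2, "citation_sim": 0.1},
}
train_labels = {"d1": "H.3", "d2": "I.2", "d3": "H.3", "d4": "I.2"}

predicted = classify_by_vote(query_sims, train_labels, weights, k=3)
```

In this toy data the three highest fused scores belong to `d1`, `d3`, and `d2`, so the vote favours `H.3`. A GP approach would instead search over tree-structured combinations of the same evidence rather than fixing a weighted sum in advance.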