DOI: 10.1145/1183614.1183649
Article

Multi-evidence, multi-criteria, lazy associative document classification

Published: 06 November 2006

Abstract

We present a novel approach for classifying documents that transparently combines different pieces of evidence (e.g., textual features of documents, links, and citations) through a data mining technique that generates rules associating these pieces of evidence with predefined classes. These rules can contain any number and mixture of the available evidence and are associated with several quality criteria, which can be used in conjunction to choose the "best" rule to apply at classification time. Our method can perform evidence enhancement by link forwarding/backwarding (i.e., navigating among documents related through citation), so that new pieces of link-based evidence are derived when necessary. Furthermore, instead of inducing a single model (or rule set) that is good on average for all predictions, the proposed approach employs a lazy method that delays the inductive process until a document is given for classification, thereby taking advantage of better qualitative evidence coming from the document. We conducted a systematic evaluation of the proposed approach using documents from the ACM Digital Library and from a Brazilian Web directory. In both collections, our approach outperformed all classifiers based on the best available evidence in isolation, as well as state-of-the-art multi-evidence classifiers. We also evaluated our approach on the standard WebKB collection, where it showed gains of 1% in accuracy while being 25 times faster. Further, our approach is extremely efficient in terms of computational performance, showing gains of more than one order of magnitude when compared against other multi-evidence classifiers.
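
The lazy, rule-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `lazy_classify`, the `text:`/`cites:` feature prefixes, the toy training data, and the use of only two quality criteria (confidence, then support) are all assumptions made for illustration. Link forwarding/backwarding is not modeled. The sketch only shows the general pattern: project the training set onto the evidence present in the document being classified, mine rules from that projection on demand, and apply the best-ranked rule.

```python
from collections import Counter
from itertools import combinations


def lazy_classify(train, doc_features, max_rule_size=2, min_support=1):
    """Classify one document by mining rules only from its own features.

    train: list of (feature_set, label) pairs (hypothetical toy format).
    doc_features: set of evidence items for the document to classify.
    Returns (label, antecedent, confidence, support), or None if no rule qualifies.
    """
    # Lazy projection: keep only the features shared with the test document.
    projected = [(feats & doc_features, label) for feats, label in train]

    antecedent_counts = Counter()  # how often each antecedent occurs
    rule_counts = Counter()        # how often antecedent and label co-occur
    for feats, label in projected:
        for size in range(1, max_rule_size + 1):
            for antecedent in combinations(sorted(feats), size):
                antecedent_counts[antecedent] += 1
                rule_counts[(antecedent, label)] += 1

    best = None
    for (antecedent, label), support in rule_counts.items():
        if support < min_support:
            continue
        confidence = support / antecedent_counts[antecedent]
        # Assumed ranking: confidence, then support, then shorter rules first.
        key = (confidence, support, -len(antecedent))
        if best is None or key > best[0]:
            best = (key, label, antecedent, confidence, support)

    if best is None:
        return None
    _, label, antecedent, confidence, support = best
    return label, antecedent, confidence, support


# Toy data: textual and citation evidence mixed in one feature set.
train = [
    ({"text:mining", "text:rules", "cites:d7"}, "databases"),
    ({"text:mining", "cites:d7"}, "databases"),
    ({"text:mining", "text:kernel"}, "learning"),
    ({"text:kernel", "cites:d9"}, "learning"),
]
doc = {"text:mining", "cites:d7", "text:unseen"}
result = lazy_classify(train, doc)
print(result)
```

Because rules are mined only from the projection, features that never occur in the document under classification generate no candidate rules. This is one plausible reading of why a lazy approach can be cheaper than inducing a global rule set.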

Published In

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
November 2006
916 pages
ISBN: 1595934332
DOI: 10.1145/1183614
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. data mining
  3. lazy algorithms

Qualifiers

  • Article

Conference

CIKM06: Conference on Information and Knowledge Management
November 6 - 11, 2006
Arlington, Virginia, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2020) Automatic Disambiguation of Author Names in Bibliographic Repositories. Synthesis Lectures on Information Concepts, Retrieval, and Services 12:1 (1-146). DOI: 10.2200/S01011ED1V01Y202005ICR070. Online publication date: 28-May-2020
  • (2018) A MODEL FOR AUTOMATED MATCHING BETWEEN JOB MARKET DEMAND AND UNIVERSITY CURRICULA OFFER. SEEU Review 12:2 (188-217). DOI: 10.1515/seeur-2017-0024. Online publication date: 11-May-2018
  • (2017) An association rules based method for classifying product offers from e-shopping. Intelligent Data Analysis 21:3 (637-660). DOI: 10.3233/IDA-150444. Online publication date: 29-Jun-2017
  • (2017) A survey on tag recommendation methods. Journal of the Association for Information Science and Technology 68:4 (830-844). DOI: 10.1002/asi.23736. Online publication date: 1-Apr-2017
  • (2016) Supervised Microalgae Classification in Imbalanced Dataset. 2016 5th Brazilian Conference on Intelligent Systems (BRACIS) (49-54). DOI: 10.1109/BRACIS.2016.020. Online publication date: Oct-2016
  • (2014) Disambiguating publication venue titles using association rules. Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (77-85). DOI: 10.5555/2740769.2740783. Online publication date: 8-Sep-2014
  • (2014) Disambiguating publication venue titles using association rules. IEEE/ACM Joint Conference on Digital Libraries (77-86). DOI: 10.1109/JCDL.2014.6970153. Online publication date: Sep-2014
  • (2014) Self-training author name disambiguation for information scarce scenarios. Journal of the Association for Information Science and Technology 65:6 (1257-1278). DOI: 10.1002/asi.22992. Online publication date: 1-Jun-2014
  • (2013) Certainty-based active learning for sampling imbalanced datasets. Neurocomputing 119 (350-358). DOI: 10.1016/j.neucom.2013.03.023. Online publication date: Nov-2013
  • (2013) SpaDeS. Computer Networks: The International Journal of Computer and Telecommunications Networking 57:2 (526-539). DOI: 10.1016/j.comnet.2012.07.015. Online publication date: 1-Feb-2013
