Multi-evidence, multi-criteria, lazy associative document classification

Authors:

Adriano Veloso,

Wagner Meira, Jr.,

Marcos Gonçalves,

Mohammed Zaki

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

Pages 218 - 227

https://doi.org/10.1145/1183614.1183649

Published: 06 November 2006

Abstract

We present a novel approach for classifying documents that transparently combines different pieces of evidence (e.g., textual features of documents, links, and citations) through a data mining technique that generates rules associating these pieces of evidence with predefined classes. These rules can contain any number and mixture of the available evidence and are associated with several quality criteria, which can be used in conjunction to choose the "best" rule to apply at classification time. Our method performs evidence enhancement by link forwarding/backwarding (i.e., navigating among documents related through citation), so that new pieces of link-based evidence are derived when necessary. Furthermore, instead of inducing a single model (or rule set) that is good on average for all predictions, the proposed approach employs a lazy method that delays the inductive process until a document is given for classification, thereby taking advantage of better qualitative evidence coming from the document. We conducted a systematic evaluation of the proposed approach using documents from the ACM Digital Library and from a Brazilian Web directory. In both collections, our approach outperformed all classifiers based on the best available evidence in isolation, as well as state-of-the-art multi-evidence classifiers. We also evaluated our approach on the standard WebKB collection, where it showed gains of 1% in accuracy while being 25 times faster. Further, our approach is highly efficient computationally, showing gains of more than one order of magnitude when compared against other multi-evidence classifiers.
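The lazy, rule-based idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `classify_lazily`, the feature prefixes (`t:` for terms, `c:` for citations), the parameters `max_rule_size` and `min_support`, and the choice of criteria (confidence first, support as a tie-breaker) are all assumptions made for illustration. The sketch projects the training data onto the evidence present in the document being classified, mines rules only from that projection, and applies the best-ranked rule. Link forwarding/backwarding is not covered.

```python
from collections import Counter
from itertools import combinations


def classify_lazily(train, doc_features, max_rule_size=2, min_support=1):
    """Return (predicted_class, best_rule) for one document, mining rules on demand."""
    # Projection: keep only the evidence each training document shares with the query.
    projected = [(feats & doc_features, label) for feats, label in train]
    projected = [(feats, label) for feats, label in projected if feats]

    best = None  # (confidence, support, antecedent, label)
    for size in range(1, max_rule_size + 1):
        for antecedent in combinations(sorted(doc_features), size):
            items = set(antecedent)
            # Class distribution among training documents that contain the antecedent.
            covered = Counter(label for feats, label in projected if items <= feats)
            total = sum(covered.values())
            for label, support in covered.items():
                if support < min_support:
                    continue
                confidence = support / total
                candidate = (confidence, support, antecedent, label)
                # Assumed multi-criteria ranking: confidence first, then support.
                if best is None or candidate[:2] > best[:2]:
                    best = candidate
    if best is None:
        return None, None
    return best[3], best


# Hypothetical toy data mixing textual ("t:") and citation ("c:") evidence.
train = [
    ({"t:mining", "t:rules", "c:doc7"}, "Databases"),
    ({"t:mining", "c:doc7"}, "Databases"),
    ({"t:mining", "t:kernel"}, "Learning"),
    ({"t:svm", "c:doc9"}, "Learning"),
]
query = {"t:mining", "c:doc7", "t:kernel"}
predicted, rule = classify_lazily(train, query)
print(predicted, rule)
```

Because rules are generated only from the query's own features, the search space stays small even though rules may freely mix textual and citation evidence. Other quality criteria could be appended to the ranking tuple in the same way.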

Index Terms

Multi-evidence, multi-criteria, lazy associative document classification
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval

Published In

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

November 2006

916 pages

ISBN:1595934332

DOI:10.1145/1183614

General Chair:
Philip S. Yu, IBM T.J. Watson Research Center (USA)

Program Chairs:
Vassilis Tsotras, University of California-Riverside (USA)
Edward Fox, Virginia Tech (USA)
Bing Liu, University of Illinois at Chicago (USA)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CIKM06: Conference on Information and Knowledge Management

November 6 - 11, 2006

Arlington, Virginia, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 22
Total Downloads: 472
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 2

Reflects downloads up to 06 Jan 2025

Cited By

Ferreira A, Gonçalves M, Laender A (2020). Automatic Disambiguation of Author Names in Bibliographic Repositories. Synthesis Lectures on Information Concepts, Retrieval, and Services, 12:1 (1-146). Online publication date: 28-May-2020.
https://doi.org/10.2200/S01011ED1V01Y202005ICR070
Januzaj Y, Luma A, Selimi B, Aliu A, Raufi B, Snopçe H (2018). A MODEL FOR AUTOMATED MATCHING BETWEEN JOB MARKET DEMAND AND UNIVERSITY CURRICULA OFFER. SEEU Review, 12:2 (188-217). Online publication date: 11-May-2018.
https://doi.org/10.1515/seeur-2017-0024
Oliveira C, Pereira D (2017). An association rules based method for classifying product offers from e-shopping. Intelligent Data Analysis, 21:3 (637-660). Online publication date: 29-Jun-2017.
https://doi.org/10.3233/IDA-150444
Belém F, Almeida J, Gonçalves M (2017). A survey on tag recommendation methods. Journal of the Association for Information Science and Technology, 68:4 (830-844). Online publication date: 1-Apr-2017.
https://dl.acm.org/doi/10.1002/asi.23736
Correa I, Drews P, Souza M, Tavano V (2016). Supervised Microalgae Classification in Imbalanced Dataset. 2016 5th Brazilian Conference on Intelligent Systems (BRACIS) (49-54). Online publication date: Oct-2016.
https://doi.org/10.1109/BRACIS.2016.020
Pereira D, da Silva E, Esmin A, Buchanan G, Klein M, Rauber A, Cunningham S (2014). Disambiguating publication venue titles using association rules. Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (77-85). Online publication date: 8-Sep-2014.
https://dl.acm.org/doi/10.5555/2740769.2740783
Pereira D, Braga da Silva E, Esmin A (2014). Disambiguating publication venue titles using association rules. IEEE/ACM Joint Conference on Digital Libraries (77-86). Online publication date: Sep-2014.
https://doi.org/10.1109/JCDL.2014.6970153
Ferreira A, Veloso A, Gonçalves M, Laender A (2014). Self-training author name disambiguation for information scarce scenarios. Journal of the Association for Information Science and Technology, 65:6 (1257-1278). Online publication date: 1-Jun-2014.
https://dl.acm.org/doi/10.1002/asi.22992
Fu J, Lee S (2013). Certainty-based active learning for sampling imbalanced datasets. Neurocomputing, 119 (350-358). Online publication date: Nov-2013.
https://doi.org/10.1016/j.neucom.2013.03.023
Las-Casas P, Guedes D, Almeida J, Ziviani A, Marques-Neto H (2013). SpaDeS. Computer Networks: The International Journal of Computer and Telecommunications Networking, 57:2 (526-539). Online publication date: 1-Feb-2013.
https://dl.acm.org/doi/10.1016/j.comnet.2012.07.015
