A concept-based model for enhancing text categorization
Source
Conference on Knowledge Discovery in Data
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
San Jose, California, USA
SESSION: Research track papers
Pages: 629 - 637  
Year of Publication: 2007
ISBN:978-1-59593-609-7
Authors
Shady Shehata  University of Waterloo
Fakhri Karray  University of Waterloo
Mohamed Kamel  University of Waterloo
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281192.1281260

ABSTRACT

Most text categorization techniques are based on word and/or phrase analysis of the text. Statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other. Thus, the underlying model should identify terms that capture the semantics of the text. In this case, the model can capture terms that represent the concepts of the sentence, which leads to the discovery of the topic of the document. A new concept-based model is introduced that analyzes terms on the sentence and document levels rather than on the document level only, as in traditional analysis. The concept-based model can effectively discriminate between terms that are unimportant with respect to sentence semantics and terms that hold the concepts representing the sentence meaning. The proposed model consists of a concept-based statistical analyzer, a conceptual ontological graph representation, and a concept extractor. A term that contributes to the sentence semantics is assigned two different weights, one by the concept-based statistical analyzer and one by the conceptual ontological graph representation. These two weights are combined into a new weight. The concepts that have maximum combined weights are selected by the concept extractor. A set of experiments using the proposed concept-based model on different datasets in text categorization is conducted. The experiments compare traditional weighting with the concept-based weighting obtained by the combined approach of the concept-based statistical analyzer and the conceptual ontological graph. The evaluation of results relies on two quality measures, the Macro-averaged F1 and the Error rate. These quality measures improve when the newly developed concept-based model is used to enhance the quality of text categorization.
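
For readers unfamiliar with the two quality measures named in the abstract, the following is a minimal sketch of how they are commonly computed for single-label categorization. It assumes the standard definitions (macro-averaged F1 as the unweighted mean of per-class F1 scores, error rate as the fraction of misclassified documents); the paper's exact formulation may differ, and the function names `macro_f1` and `error_rate` are illustrative, not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard definition, assumed here)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct assignment to this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall for this class.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def error_rate(y_true, y_pred):
    """Fraction of documents assigned to the wrong category (assumed definition)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    wrong = sum(1 for truth, pred in zip(y_true, y_pred) if truth != pred)
    return wrong / len(y_true)
```

Because every class counts equally in the macro average, small categories influence the score as much as large ones, which is why it is often reported alongside a plain error rate.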

