ACM Home Page
Please provide us with feedback. Feedback
A structure-sensitive framework for text categorization
Full text PdfPdf (91 KB)
Source Conference on Information and Knowledge Management archive
Proceedings of the 14th ACM international conference on Information and knowledge management table of contents
Bremen, Germany
POSTER SESSION: Poster Session table of contents
Pages: 337 - 338  
Year of Publication: 2005
ISBN:1-59593-140-6
Authors
Ganesh Ramakrishnan  IBM IRL, New Delhi, India
Deepa Paranjpe  Shankar Nagar, Nagpur, India
Byron Dom  Yahoo! Inc., Sunnyvale, CA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 2,   Downloads (12 Months): 33,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1099554.1099655
What is a DOI?

ABSTRACT

This paper presents a framework called Structure Sensitive CATegorization(SSCAT), that exploits document structure for improved categorization. There are two parts to this framework, viz. (1) Documents often have layout structure, such that logically coherent text is grouped together into fields using some mark-up language. We use a log-linear model, which associates one or more features with each field. Weights associated with the field features are learnt from training data and these weights quantify the per-class importance of the field features in determining the category for the document. (2) We employ a technique that exploits the parse tree of fields that are phrasal constructs, such as title and associates weights with words in these constructs while boosting weights of important words called focus words. These weights are learnt from example instances of phrasal constructs, marked with the corresponding focus words. The learning is accomplished by training a classifier that uses linguistic features obtained from the text's parse structure. The weighted words, in fields with phrasal constructs, are used in obtaining features for the corresponding fields in the overall framework. SSCAT was tested on the supervised categorization task of over one million products from Yahoo!'s on-line shopping data. With an accuracy of over 90%, our classifier outperforms Naive Bayes and Support Vector Machines. This not only shows the effectiveness of SSCAT but also strengthens our belief that linguistic features based on natural language structure can improve tasks such as text categorization.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
A. K. McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
 
2
G. Ramakrishnan, D. Paranjpe, and B. Dom. A structure sensitive framework for text categorizaton. http://www.cse.iitb.ac.in/ hare/papers/sscat-TR.pdf.
 
3

Collaborative Colleagues:
Ganesh Ramakrishnan: colleagues
Deepa Paranjpe: colleagues
Byron Dom: colleagues