ABSTRACT
The text categorization module described here provides a front-end filtering function for the larger DR-LINK text retrieval system [Liddy and Myaeng 1993]. The model evaluates a large incoming stream of documents to determine which documents are sufficiently similar to a profile at the broad subject level to warrant more refined representation and matching. To accomplish this task, each substantive word in a text is first categorized using a feature set based on the semantic Subject Field Codes (SFCs) assigned to individual word senses in a machine-readable dictionary. When tested on 50 user profiles and 550 megabytes of documents, results indicate that the feature set that is the basis of the text categorization module and the algorithm that establishes the boundary of categories of potentially relevant documents accomplish their tasks with a high level of performance.
This means that the category of potentially relevant documents for most profiles would contain at least 80% of all documents later determined to be relevant to the profile. The number of documents in this set would be uniquely determined by the system's category-boundary predictor, and this set is likely to contain fewer than 5% of the incoming stream of documents.
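The filtering idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy lexicon and its codes are invented, word sense disambiguation is skipped (each word gets one code), cosine similarity stands in for whatever matching function DR-LINK uses, and a fixed `cutoff` stands in for the category-boundary predictor.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon mapping words to Subject Field Codes (SFCs).
# Real SFCs come from a machine-readable dictionary; these codes are invented.
TOY_SFC_LEXICON = {
    "bank": "ECON",
    "loan": "ECON",
    "merger": "ECON",
    "court": "LAW",
    "lawsuit": "LAW",
    "vaccine": "MED",
    "patient": "MED",
}


def sfc_vector(text):
    """Return a normalized SFC frequency vector (code -> relative frequency)."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    codes = [TOY_SFC_LEXICON[w] for w in words if w in TOY_SFC_LEXICON]
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_stream(profile_text, documents, cutoff):
    """Score documents against the profile; keep those at or above cutoff.

    The fixed cutoff is an assumption standing in for a learned boundary.
    Results are (score, document) pairs sorted by descending score.
    """
    profile = sfc_vector(profile_text)
    scored = [(cosine(profile, sfc_vector(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, doc) for score, doc in scored if score >= cutoff]
```

Calling `filter_stream("bank loan merger", docs, cutoff=0.5)` on a small list of sentences keeps only those whose subject-code distribution is close to the profile's. Lowering the cutoff widens the set passed on to later, more expensive matching stages, which is the trade-off the abstract describes between retaining relevant documents and shrinking the incoming stream.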
REFERENCES
[1] M. J. Blosseville, G. Hébrail, M. G. Monteil, N. Pénot, Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together, Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, p.51-58, June 21-24, 1992, Copenhagen, Denmark [doi> 10.1145/133160.133175]
[3] CHOUEKA, Y. AND LUSIGNAN, S. 1985. Disambiguation by short contexts. Comput. Hum. 19, 3, 147-157.
[4] HALE, R. L. 1990. MYSTAT Statistical Applications. Course Technology, Inc., Cambridge, Mass.
[7] KELLY, E. F. AND STONE, P. J. 1975. Computer Recognition of English Word Senses. North Holland, Amsterdam.
[8] KROVETZ, R. 1991. Lexical acquisition and information retrieval. In Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, U. Zernik, Ed., Lawrence Erlbaum, Hillsdale, N.J.
[11] LIDDY, E. D. 1994. Development and implementation of a discourse model for newspaper texts. In Proceedings of the Dagstuhl on Summarizing Text for Intelligent Communication (Saarbrücken, Germany). International Conference and Research Center for Computer Science in Schloss Dagstuhl. To be published.
[13] LIDDY, E. D. AND PAIK, W. 1992. Statistically-guided word sense disambiguation. In Proceedings of AAAI Fall '92 Symposium on Probabilistic Approaches to Natural Language (Boston, Mass.). AAAI, Menlo Park, Calif.
[15] MCGILL, M., KOLL, M., AND NOREAULT, T. 1979. An evaluation of factors affecting document ranking by information retrieval systems. Final Report to National Science Foundation. Syracuse Univ., Syracuse, N.Y.
[16] METEER, M., SCHWARTZ, R., AND WEISCHEDEL, R. 1991. POST: Using probabilities in language processing. In Proceedings of the 12th International Joint Conference on Artificial Intelligence (Sydney, Australia). Morgan Kaufmann, San Mateo, Calif.
[17] PAIK, W., LIDDY, E. D., YU, E. S., AND MCKENNA, M. 1993. Extracting and classifying proper nouns in documents. In Proceedings of the Human Language Technology Workshop (Princeton, N.J.). ARPA, Washington, D.C.
[18] SAGER, W. K. H. AND LOCKEMANN, P. C. 1976. Classification of ranking algorithms. Int. Forum Inf. Doc. 1, 4, 2-25.
[19] SLATOR, B. 1991. Using context for sense preference. In Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, U. Zernik, Ed., Lawrence Erlbaum, Hillsdale, N.J.
[21] TANIMOTO, T. 1958. An elementary mathematical theory of classification and prediction. Int. Rep., IBM Corp., Watson Research Center, Kingston, N.Y.
[22] WALKER, D. E. AND AMSLER, R. A. 1986. The use of machine-readable dictionaries in sublanguage analysis. In Analyzing Language in Restricted Domains: Sublanguage Description and Processing, R. Grishman and R. Kittredge, Eds., Lawrence Erlbaum, Hillsdale, N.J.
CITED BY 6
Allen Brewer, Wei Ding, Karla Hahn, Anita Komlodi, The role of intermediary services in emerging digital libraries, Proceedings of the first ACM international conference on Digital libraries, p.29-35, March 20-23, 1996, Bethesda, Maryland, United States
J. Mostafa, S. Mukhopadhyay, M. Palakal, W. Lam, A multilevel approach to intelligent information filtering: model, system, and evaluation, ACM Transactions on Information Systems (TOIS), v.15 n.4, p.368-399, Oct. 1997
REVIEW
Reviewer: Richard S. Marcus
The authors describe a module of their DR-LINK text retrieval system. This module filters texts (in this case from a database of Wall Street Journal news stories) as likely to be relevant to a Text Retrieval ...