ABSTRACT
We discuss an approach to the automatic expansion of
domain-specific lexicons, that is, to the problem of
extending, for each c_i in a predefined set
C = {c_1, …, c_m} of semantic domains, an initial lexicon
L_0^i into a larger lexicon L_1^i. Our approach relies on
term categorization, defined as the task of labeling
previously unlabeled terms according to a predefined set of
domains. We approach this as a supervised learning problem in which
term classifiers are built using the initial lexicons as training
data. Dually to classic text categorization tasks in which
documents are represented as vectors in a space of terms, we
represent terms as vectors in a space of documents. We present the
results of a number of experiments in which we use a boosting-based
learning device for training our term classifiers. We test the
effectiveness of our method by using WordNetDomains, a well-known
large set of domain-specific lexicons, as a benchmark. Our
experiments are performed using the documents in the Reuters Corpus
Volume 1 as implicit representations for our terms.
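
The dual representation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the seed lexicons, the function names `term_vectors` and `split_terms`, and the use of raw term frequencies as weights. It is not the weighting scheme or the boosting learner used in the experiments.

```python
from collections import Counter


def term_vectors(documents):
    """Return {term: vector}, with one vector coordinate per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Coordinate j of a term's vector is its raw frequency in document j
    # (an assumed weighting, chosen only to keep the sketch short).
    return {term: [c[term] for c in counts] for term in vocabulary}


def split_terms(vectors, seed_lexicons):
    """Separate terms covered by some seed lexicon from unlabeled terms."""
    labeled = set().union(*seed_lexicons.values()) & set(vectors)
    unlabeled = set(vectors) - labeled
    return labeled, unlabeled


# Hypothetical toy corpus and seed lexicons (one per semantic domain).
docs = [
    "the bank raised interest rates",
    "the striker scored a goal",
    "interest rates fell as the bank cut lending",
]
seeds = {"finance": {"bank", "rates"}, "sport": {"goal", "striker"}}

vectors = term_vectors(docs)
labeled, unlabeled = split_terms(vectors, seeds)
```

Here each term, such as `bank`, becomes a vector with one coordinate per document. The terms in the seed lexicons would serve as training data for the term classifiers, and the remaining terms are the candidates to be labeled.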
REVIEW
"Peter Patton : Reviewer"
The major problems in computational linguistics processing arise from the differing genres and semantic domains of similar texts. In addition, some genres and domains are more amenable to automatic language translation and other linguistic process