Automatic expansion of domain-specific lexicons by term categorization
Source: ACM Transactions on Speech and Language Processing (TSLP)
Volume 3, Issue 1 (May 2006)
Pages: 1-30
Year of Publication: 2006
ISSN: 1550-4875
Authors
Henri Avancini (Consiglio Nazionale delle Ricerche, Pisa, Italy)
Alberto Lavelli (ITC-irst, Povo (TN), Italy)
Fabrizio Sebastiani (Consiglio Nazionale delle Ricerche, Pisa, Italy)
Roberto Zanoli (ITC-irst, Povo (TN), Italy)
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1138379.1138380

ABSTRACT

We discuss an approach to the automatic expansion of domain-specific lexicons, that is, to the problem of extending, for each c_i in a predefined set C = {c_1, …, c_m} of semantic domains, an initial lexicon L^i_0 into a larger lexicon L^i_1. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks in which documents are represented as vectors in a space of terms, we represent terms as vectors in a space of documents. We present the results of a number of experiments in which we use a boosting-based learning device for training our term classifiers. We test the effectiveness of our method by using WordNetDomains, a well-known large set of domain-specific lexicons, as a benchmark. Our experiments are performed using the documents in the Reuters Corpus Volume 1 as implicit representations for our terms.
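
The dual representation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy documents, and the threshold are hypothetical, and a simple centroid-similarity rule stands in for the boosting-based learner used in the paper. Each term becomes a sparse vector indexed by the documents it occurs in, and each seed lexicon acts as training data for its domain.

```python
import math
from collections import Counter

def term_vectors(docs):
    """Represent each term as a sparse vector {document index: count}."""
    vectors = {}
    for j, doc in enumerate(docs):
        for term, count in Counter(doc.lower().split()).items():
            vectors.setdefault(term, {})[j] = count
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[j] for j, w in u.items() if j in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vecs):
    """Average a list of sparse vectors."""
    total = Counter()
    for v in vecs:
        total.update(v)
    return {j: w / len(vecs) for j, w in total.items()}

def expand_lexicons(seeds, vectors, candidates, threshold=0.5):
    """Add each candidate term to every domain whose seed centroid it resembles.

    Stand-in classifier (an assumption of this sketch): centroid similarity
    with a fixed threshold, not the boosting-based learner of the paper.
    """
    expanded = {domain: set(terms) for domain, terms in seeds.items()}
    for domain, terms in seeds.items():
        center = centroid([vectors[t] for t in terms if t in vectors])
        for term in candidates:
            if term in vectors and cosine(vectors[term], center) >= threshold:
                expanded[domain].add(term)
    return expanded

# Toy corpus and seed lexicons, invented for illustration only.
docs = [
    "the bank raised interest rates",
    "interest rates and loan costs rose at the bank",
    "the striker scored a goal",
    "a late goal by the keeper",
]
seeds = {"economy": {"bank", "rates"}, "sport": {"goal"}}
vectors = term_vectors(docs)
expanded = expand_lexicons(seeds, vectors, candidates=["interest", "striker"])
```

In this toy setting, "interest" occurs in the same documents as the economy seeds and "striker" shares a document with the sport seed, so each candidate joins only the corresponding lexicon. A term may be added to more than one domain if it resembles several centroids.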




REVIEW

Reviewer: Peter Patton

The major problems in computational linguistics processing arise from the differing genres and semantic domains of similar texts. In addition, some genres and domains are more amenable to automatic language translation and other linguistic process ...
