
An application of text categorization methods to gene ontology annotation

Authors:

Javed Mostafa

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 138 - 145

https://doi.org/10.1145/1076034.1076060

Published: 15 August 2005

Abstract

This paper describes an application of information retrieval (IR) and text categorization methods to a highly practical problem in biomedicine: Gene Ontology (GO) annotation. GO annotation, which describes gene functions using a controlled vocabulary, is a major activity in most model organism database projects. As a first step toward automatic GO annotation, we aim to assign GO domain codes given a specific gene and an article in which the gene appears, one of the tasks posed at the TREC 2004 Genomics Track. We approached the task with careful consideration of the specialized terminology, paying special attention to the various forms of gene synonyms so as to locate the occurrences of the target gene exhaustively. We extracted the words around the gene occurrences and used them to represent the gene for GO domain code annotation. As a classifier, we adopted a variant of k-Nearest Neighbor (kNN) with supervised term weighting schemes to improve performance, which placed our method among the top-performing systems in the TREC official evaluation. Moreover, we demonstrate that the proposed framework can be successfully applied to another task of the Genomics Track, with results comparable to those of the best-performing system.
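
The pipeline outlined in the abstract (find gene mentions via synonyms, collect the surrounding words, classify with a term-weighted kNN) can be illustrated with a minimal Python sketch. The abstract does not state the window size, the weighting scheme, the similarity measure, or the kNN variant, so everything here is an assumption made for illustration: a fixed token window, single-token synonym matching, chi-square as the supervised term weight, cosine similarity, and similarity-summed voting. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def context_words(text, synonyms, window=5):
    """Bag of words within `window` tokens of any (single-token) synonym."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text.lower())
    names = {s.lower() for s in synonyms}
    bag = Counter()
    for i, tok in enumerate(tokens):
        if tok in names:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            bag.update(t for t in tokens[lo:hi] if t not in names)
    return bag


def chi_square_weights(train):
    """Map each term to its largest chi-square score over all labels.

    `train` is a list of (Counter of words, set of labels) pairs.
    """
    n = len(train)
    labels = set().union(*(ys for _, ys in train))
    df = Counter()                   # number of examples containing the term
    df_label = defaultdict(Counter)  # same count, restricted to each label
    n_label = Counter()              # number of examples carrying each label
    for bag, ys in train:
        df.update(bag.keys())
        for y in ys:
            n_label[y] += 1
            df_label[y].update(bag.keys())
    weights = {}
    for term, dt in df.items():
        best = 0.0
        for y in labels:
            a = df_label[y][term]    # term present, label present
            b = dt - a               # term present, label absent
            c = n_label[y] - a       # term absent, label present
            d = n - a - b - c        # term absent, label absent
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        weights[term] = best
    return weights


def weighted(bag, weights):
    """Term frequency times supervised weight; unseen terms are dropped."""
    return {t: tf * weights[t] for t, tf in bag.items() if weights.get(t, 0.0) > 0}


def cosine(u, v):
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_scores(query_bag, train, weights, k=3):
    """Sum the similarities of the k nearest training examples per label."""
    q = weighted(query_bag, weights)
    sims = [(cosine(q, weighted(bag, weights)), ys) for bag, ys in train]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    scores = Counter()
    for sim, ys in sims[:k]:
        for y in ys:
            scores[y] += sim
    return scores


# Invented toy data: "MF" and "CC" stand in for two GO domain codes.
train = [
    (Counter({"binding": 2, "kinase": 1}), {"MF"}),
    (Counter({"binding": 1, "receptor": 1}), {"MF"}),
    (Counter({"nucleus": 2, "membrane": 1}), {"CC"}),
    (Counter({"nucleus": 1, "cytoplasm": 1}), {"CC"}),
]
weights = chi_square_weights(train)
bag = context_words("The Pax6 protein shows DNA binding in vitro", ["Pax6"])
scores = knn_scores(bag, train, weights, k=3)
print(scores.most_common())
```

In practice a threshold on the per-label score would decide which domain codes to assign, and synonym matching would need to handle multi-word names and spelling variants, which this sketch deliberately leaves out.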



Published In

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

August 2005

708 pages

ISBN: 1595930345

DOI: 10.1145/1076034

General Chairs:
Ricardo Baeza-Yates (University of Chile, Chile)
Nivio Ziviani (Federal University of Minas Gerais, Brazil)

Program Chairs:
Gary Marchionini (University of North Carolina, USA)
Alistair Moffat (University of Melbourne, Australia)
John Tait (University of Sunderland, UK)

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR05

Sponsor:

SIGIR

SIGIR05: The 28th ACM/SIGIR International Symposium on Information Retrieval 2005

August 15 - 19, 2005

Salvador, Brazil

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

Fan Y, Arora C, Treude C (2023). Stop Words for Processing Software Engineering Documents: Do they Matter? 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE), pp. 40-47. Online publication date: May 2023.
https://doi.org/10.1109/NLBSE59153.2023.00016
Sarica S, Luo J (2021). Stopwords in technical language processing. PLOS ONE, 16(8), e0254937. Online publication date: 5 Aug 2021.
https://doi.org/10.1371/journal.pone.0254937
Xu S, Li L, An X, Hao L, Yang G (2021). An approach for detecting the commonality and specialty between scientific publications and patents. Scientometrics, 126(9), pp. 7445-7475. Online publication date: 5 Jul 2021.
https://doi.org/10.1007/s11192-021-04085-9
Makrehchi M, Kamel M (2017). Extracting domain-specific stopwords for text classifiers. Intelligent Data Analysis, 21(1), pp. 39-62. Online publication date: 1 Jan 2017.
https://dl.acm.org/doi/10.3233/IDA-150390
Park S, Chang J, Kihl T (2013). Application of Web Search Results for Document Classification. Future Information Communication Technology and Applications, pp. 293-298. Online publication date: 25 May 2013.
https://doi.org/10.1007/978-94-007-6516-0_32
Seki K, Uehara K (2009). Adaptive subjective triggers for opinionated document retrieval. Proceedings of the Second ACM International Conference on Web Search and Data Mining, pp. 25-33. Online publication date: 9 Feb 2009.
https://dl.acm.org/doi/10.1145/1498759.1498805
Qi X, Davison B (2009). Web page classification. ACM Computing Surveys, 41(2), pp. 1-31. Online publication date: 23 Feb 2009.
https://dl.acm.org/doi/10.1145/1459352.1459357
Makrehchi M, Kamel M (2008). Automatic extraction of domain-specific stopwords from labeled documents. Proceedings of the IR research, 30th European conference on Advances in information retrieval, pp. 222-233. Online publication date: 30 Mar 2008.
https://dl.acm.org/doi/10.5555/1793274.1793304
Jin R, Si L, Chan C (2008). A Bayesian framework for knowledge driven regression model in micro-array data analysis. International Journal of Data Mining and Bioinformatics, 2(3), pp. 250-267. Online publication date: 1 Sep 2008.
https://dl.acm.org/doi/10.5555/1497170.1497174
Seki K, Mostafa J (2008). Gene ontology annotation as text categorization. Information Processing and Management: an International Journal, 44(5), pp. 1754-1770. Online publication date: 1 Sep 2008.
https://dl.acm.org/doi/10.1016/j.ipm.2008.05.003
