Set-based vector model: An efficient approach for correlation-based ranking

Authors:

Wagner Meira, Jr.,

Berthier Ribeiro-Neto

ACM Transactions on Information Systems (TOIS), Volume 23, Issue 4

Pages 397 - 429

https://doi.org/10.1145/1095872.1095874

Published: 01 October 2005

Abstract

This work presents a new approach for ranking documents in the vector space model. The novelty lies on two fronts. First, patterns of term co-occurrence are taken into account and processed efficiently. Second, term weights are generated using a data mining technique called association rules. This leads to a new ranking mechanism called the set-based vector model. The components of our model are no longer index terms but index termsets, where a termset is a set of index terms. Termsets capture the intuition that semantically related terms appear close to each other in a document. They can be obtained efficiently by limiting the computation to small passages of text. Once termsets have been computed, the ranking is calculated as a function of the termset frequency in the document and its scarcity in the document collection. Experimental results show that the set-based vector model improves average precision for all collections and query types evaluated, while keeping computational costs small. For the 2-gigabyte TREC-8 collection, the set-based vector model yields gains in average precision of 14.7% and 16.4% for disjunctive and conjunctive queries, respectively, relative to the standard vector space model. These gains increase to 24.9% and 30.0%, respectively, when proximity information is taken into account. Query processing times are longer but, on average, still comparable to those of the standard vector model (increases in processing time ranged from 30% to 300%). Our results suggest that the set-based vector model provides a correlation-based ranking formula that is effective with general collections and computationally practical.
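The ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`passages`, `termsets`, `rank`), the fixed-size non-overlapping passages, and the tf-idf-style weight `(1 + log sf) * log(1 + N / ds)` are all assumptions chosen for illustration. Here `sf` counts the passages of a document that contain every term of a termset, `ds` counts the documents in which the termset occurs, and `N` is the collection size. The sketch enumerates every subset of the query terms rather than mining frequent termsets with association rules, and it omits document-length normalization.

```python
import math
from itertools import combinations


def passages(tokens, size):
    """Split a token list into consecutive, non-overlapping passages (as sets)."""
    return [set(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def termsets(query_terms):
    """Enumerate every non-empty subset of the query terms as a frozenset.

    Exhaustive enumeration is exponential in the query length; it stands in
    for the association-rule mining step, which this sketch does not implement.
    """
    terms = sorted(set(query_terms))
    return [frozenset(c)
            for k in range(1, len(terms) + 1)
            for c in combinations(terms, k)]


def rank(docs, query_terms, passage_size=8):
    """Score documents by summing termset weights over all query termsets.

    docs maps a document id to its list of tokens. The weight
    (1 + log sf) * log(1 + N / ds) is an illustrative assumption.
    """
    n_docs = len(docs)
    doc_passages = {d: passages(toks, passage_size) for d, toks in docs.items()}
    scores = {d: 0.0 for d in docs}
    for ts in termsets(query_terms):
        # sf[d]: number of passages of d that contain every term of ts.
        sf = {d: sum(1 for p in ps if ts <= p) for d, ps in doc_passages.items()}
        # ds: number of documents in which the termset occurs at least once.
        ds = sum(1 for count in sf.values() if count > 0)
        if ds == 0:
            continue
        scarcity = math.log(1 + n_docs / ds)
        for d, count in sf.items():
            if count > 0:
                scores[d] += (1 + math.log(count)) * scarcity
    # Highest score first; ties keep the insertion order of docs.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this weighting, a document in which the query terms co-occur inside one passage also receives the weight of the multi-term termset, so it scores higher than a document that contains the same terms only in separate passages. Shrinking `passage_size` makes the co-occurrence requirement stricter.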

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
  2. Information systems applications
    1. Data mining

Published In

ACM Transactions on Information Systems Volume 23, Issue 4

October 2005

135 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1095872

Copyright © 2005 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States
