Set-based vector model: An efficient approach for correlation-based ranking

Published: 01 October 2005

Abstract

This work presents a new approach for ranking documents in the vector space model. The novelty is twofold. First, patterns of term co-occurrence are taken into account and processed efficiently. Second, term weights are generated using a data mining technique, association rule mining. The result is a new ranking mechanism called the set-based vector model. The components of the model are no longer index terms but index termsets, where a termset is a set of index terms. Termsets capture the intuition that semantically related terms appear close to each other in a document, and they can be obtained efficiently by limiting the computation to small passages of text. Once termsets have been computed, the ranking is calculated as a function of the termset frequency in the document and its scarcity in the document collection. Experimental results show that the set-based vector model improves average precision for all collections and query types evaluated, while keeping computational costs small. For the 2-gigabyte TREC-8 collection, the set-based vector model yields gains in average precision of 14.7% and 16.4% for disjunctive and conjunctive queries, respectively, relative to the standard vector space model. These gains increase to 24.9% and 30.0%, respectively, when proximity information is taken into account. Query processing times are longer but, on average, still comparable to those of the standard vector model (increases in processing time ranged from 30% to 300%). Our results suggest that the set-based vector model provides a correlation-based ranking formula that is effective with general collections and computationally practical.
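To make the idea concrete, the following is a minimal sketch of termset-based scoring, not the authors' implementation. It assumes fixed-length, non-overlapping passages, counts a termset once per passage in which all of its terms occur, and uses a tf-idf-style weight, (1 + ln sf) * ln(1 + N / ds), chosen here only for illustration; the weight may not match the formula in the article. The names `termset_counts`, `rank_documents`, `passage_len`, and `max_size` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def termset_counts(tokens, query_terms, passage_len=4, max_size=2):
    """Count, for each termset, the passages in which all of its terms occur.

    Assumption: passages are fixed-length, non-overlapping token windows.
    Only subsets of the query terms are enumerated, up to max_size terms.
    """
    counts = Counter()
    wanted = set(query_terms)
    for start in range(0, len(tokens), passage_len):
        present = sorted(set(tokens[start:start + passage_len]) & wanted)
        for size in range(1, min(max_size, len(present)) + 1):
            for combo in combinations(present, size):
                counts[frozenset(combo)] += 1
    return counts


def rank_documents(docs, query_terms, passage_len=4, max_size=2):
    """Return one score per document, in the order the documents were given.

    Assumed weight: (1 + ln sf) * ln(1 + N / ds), where sf is the termset
    frequency in the document, ds is the number of documents containing the
    termset, and N is the collection size. Query weights and length
    normalization are omitted to keep the sketch short.
    """
    per_doc = [
        termset_counts(doc.split(), query_terms, passage_len, max_size)
        for doc in docs
    ]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())  # each termset counted once per document
    return [
        sum(
            (1 + math.log(sf)) * math.log(1 + n_docs / doc_freq[termset])
            for termset, sf in counts.items()
        )
        for counts in per_doc
    ]


if __name__ == "__main__":
    docs = [
        "information retrieval systems rank documents",
        "information about weather and retrieval of lost luggage at airport",
        "cooking recipes",
    ]
    for doc, score in zip(docs, rank_documents(docs, ["information", "retrieval"])):
        print(f"{score:.3f}  {doc}")
```

In this toy collection the first document scores highest because both query terms fall inside the same passage, so the two-term termset contributes to its score. The second document contains both terms, but in different passages, so only the single-term termsets count.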


Published In

ACM Transactions on Information Systems, Volume 23, Issue 4
October 2005
135 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1095872

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

  1. Information retrieval models
  2. association rule mining
  3. correlation-based ranking
  4. data mining
  5. weighting index term co-occurrences
