ACM Home Page
Please provide us with feedback. Feedback
Incremental cluster-based retrieval using compressed cluster-skipping inverted files
Full text PdfPdf (516 KB)
Source
ACM Transactions on Information Systems (TOIS) archive
Volume 26 ,  Issue 3  (June 2008) table of contents
Article No. 15  
Year of Publication: 2008
ISSN:1046-8188
Authors
Ismail Sengor Altingovde  Bilkent University, Ankara, Turkey
Engin Demir  Bilkent University, Ankara, Turkey
Fazli Can  Bilkent University, Ankara, Turkey
Özgür Ulusoy  Bilkent University, Ankara, Turkey
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 31,   Downloads (12 Months): 135,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1361684.1361688
What is a DOI?

ABSTRACT

We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Altingovde, I. S., Can, F., and Ulusoy, Ö. 2006. Algorithms for within-cluster searches using inverted files. In Proceedings of the International Symposium on Computer and Information Sciences (ISCIS). 707--716.
 
2
Altingovde, I. S., Özcan, R., Öcalan, H. C., Can, F., and Ulusoy, Ö. 2007. Large-Scale cluster-based retrieval experiments on Turkish texts. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 891--892.
 
3
Anh, N. V., Kretser, O. de, and Moffat, A. 2001. Vector-Space ranking with effective early termination. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 35--42
 
4
Anh, V. N. and Moffat, A. 2005a. Inverted index compression using word-aligned binary codes. Inf. Retr. 8, 1, 151--166.
 
5
Anh, V. N. and Moffat, A. 2005b. Simplified similarity scoring using term ranks. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 226--233.
 
6
Anh, V. N. and Moffat, A. 2006. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. Seattle, WA. ACM Press, 372--379.
 
7
Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri, F. 2007. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, 183--190.
 
8
Blandford, D. and Blelloch, G. 2002. Index compression through document reordering. In Proceedings of the Data Compression Conference. IEEE, 342--351.
 
9
Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. 30, 1--7, 107--117.
 
10
Brown, E. W. 1995. Fast evaluation of structured queries for information retrieval. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 30--38.
 
11
Buckley, C. and Lewit, A. F. 1985. Optimization of inverted vector searches. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 97--110.
 
12
Buckley C. and Voorhees E. M. 2000. Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retreival. ACM Press, New York, 33--40.
 
13
Cacheda, F. and Baeza-Yates, R. 2004. An optimistic model for searching Web directories. In Proceedings of the 26th European Conference on IR Research (ECIR). 364--377.
 
14
Cacheda, F., Carneiro, V., Guerrero, C., and Viña, Á. 2003. Optimization of restricted searches in Web directories using hybrid data structures. In Proceedings of the 25th European Conference on IR Research (ECIR). 436--451.
 
15
Cambazoglu, B. B. 2006. Models and algorithms for parallel text retrieval. Ph.D. thesis, Bilkent University, Ankara, Turkey.
 
16
Cambazoglu, B. B. and Aykanat, C. 2006. Performance of query processing implementations in ranking-based text retrieval systems using inverted indices. Inf. Process. Manage. 42, 4, 875--898.
 
17
Can, F. 1993. Incremental clustering for dynamic information processing. ACM Trans. Inf. Syst. 11, 2, 143--164.
 
18
Can, F. 1994. On the efficiency of best-match cluster searches. Inf. Process. Manage. 30, 3, 343--361.
 
19
Can, F., Altingovde, I. S., and Demir, E. 2004. Efficiency and effectiveness of query processing in cluster-based retrieval. Inf. Syst. 29, 8, 697--717.
 
20
Can, F. and Ozkarahan, E. A. 1990. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans. Database Syst. 15, 4, 483--517.
 
21
Croft, W. B. 1980. A model of cluster searching based on classification. Inf. Syst. 5, 3, 189--195.
 
22
Harman, D. 1992. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, eds. Prentice Hall, Englewood Cliffs, NJ, (Chapter 14), 363--392.
 
23
Jain, A. K. and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
 
24
Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Inf. Storage Retr. 7, 217--240.
 
25
Lester, N., Moffat, A., Webber, W., and Zobel, J. 2005. Space-Limited ranked query evaluation using adaptive pruning. In Proceedings of the 6th International Conference on Web Information Systems Engineering, New York, 470--477.
 
26
Liu, X. and Croft, W. B. 2004. Cluster-Based retrieval using language models. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 186--193.
 
27
Long, X. and Suel, T. 2003. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). 129--140.
 
28
Moffat, A. and Zobel, J. 1996. Self-Indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst. 14, 4, 349--379.
 
29
Murray, D. M. 1972. Document retrieval based on clustered files. Ph.D. thesis, Cornell University. Also Rep. ISR-20. National Science Foundation, National Library of Medicine.
 
30
Nuray, R. and Can, F. 2006. Automatic ranking of information retrieval systems using data fusion. Inf. Process. Manage. 42, 3, 595--614.
 
31
Persin, M. 1994. Document filtering for fast ranking. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 339--348.
 
32
Persin, M., Zobel, J., and Sacks-Davis, R. 1996. Filtered document retrieval with frequency-sorted indexes. J. Amer. Soc. Inf. Sci. 47, 10, 749--764.
 
33
Salton, G. 1975. Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.
 
34
Salton, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
 
35
Salton, G. and Buckley, C. 1988. Term weighting approaches in automatic text retrieval. Inf. Process. Manage. 24, 5, 513-523.
 
36
Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill, New York.
 
37
Silvestri, F., Orlando, S., and Perego, R. 2004. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 305--312.
 
38
Silvestri, F. 2007. Sorting out the document identifier assignment problem. In Proceedings of the 29th European Conference on IR Research (ECIR). 101--112.
 
39
Stanford University. 2007. Webbase project Web site. www-diglib.stanford.edu/~testbed/doc2/WebBase.
 
40
Strohman, T. and Croft, W. B. 2007. Efficient document retrieval in main memory. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 175--182.
 
41
Trec 2006. Trec. Web site. http://trec.nist.gov.
 
42
Tombros, A. 2002. The effectiveness of query-based hierarchic clustering of documents for information retrieval. Ph.D. thesis, University of Glasgow, Glasgow, UK.
 
43
Van Rijsbergen, C. J. 1979. Information Retrieval, 2nd ed. Butterworths, London.
 
44
Voorhees, E. M. 1985. Cluster hypothesis revisited. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 188--196.
 
45
Voorhees, E. M. 1986a. The effectiveness and efficiency of agglomerative hierarchical clustering in document retrieval. Ph.D. thesis, Cornell University, Ithaca, NY.
 
46
Voorhees, E. M. 1986b. The efficiency of inverted index and cluster searches. In Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York. 164--174.
 
47
Willett, P. 1988. Recent trends in hierarchical document clustering: A critical review. Inf. Process. Manage. 24, 5, 577--597.
 
48
Witten, I. H., Moffat, A., and Bell, T. C. 1994. Managing Gigabytes Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York.
 
49
Zettair 2007. The Zettair search engine. http://www.seg.rmit.edu.au/zettair/.
 
50
Zobel, J. and Moffat, A. 2006. Inverted files for text search engines. ACM Comput. Surv. 38, 2, 1--56.

Collaborative Colleagues:
Ismail Sengor Altingovde: colleagues
Engin Demir: colleagues
Fazli Can: colleagues
Özgür Ulusoy: colleagues