skip to main content
10.1145/1526709.1526770acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Compressed web indexes

Published: 20 April 2009 Publication History

Abstract

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.

References

[1]
T. M. Apostol. Introduction to Analytic Number Theory. Springer-Verlag, 1976.
[2]
R. Baeza-Yates and G. Navarro. Block-addressing indices for approximate text retrieval. Journal of the American Society for Information Science, 51(1):69--82, 2000.
[3]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[4]
A.-L. Barabasi. Linked: How Everything is Connected to Everything Else and What It Means. Penguin Group, 2003.
[5]
D. Bladford and G. Blelloch. Index compression through document reordering. In Proceedings of the Data Compression Conference, pages 342--351, 2002.
[6]
P. Boldi and S. Vigna. The Webgraph framework i: Compression techniques. In Proceedings of the 13th International Conference on World Wide Web, pages 595--602, 2004.
[7]
P. Boldi and S. Vigna. The Webgraph framework ii: Codes for the world-wide web. In Data Compression Conference, 2004.
[8]
A. Gelbukh and G. Sidorov. Zipf and Heaps laws' coefficients depend on language. In Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing, pages 332--335, 2001.
[9]
L. Q. Ha, E. I. Sicilia-Garcia, J. Ming, and F. J. Smith. Extension of Zipf's law to word and character n-grams for English and Chinese. Computational Linguistics and Chinese Language Processing, 8(1):77--102, 2003.
[10]
H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, 1978.
[11]
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks, 31(11--16):1481--1493, 1999.
[12]
W. Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6):1842--1845, 1992.
[13]
B. Mandelbrot. An information theory of the statistical structure of language. In W. Jackson, editor, Communication Theory, pages 486--502. Academic Press, 1953.
[14]
C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[15]
C. D. Manning and H. Sch¨utze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
[16]
M. Mitzenmacher. Dynamic models for file sizes and double Pareto distributions. Internet Mathematics, 1(3):305--333, 2003.
[17]
M. Molloy and B. Reed. Graph Coloring and the Probabilistic Method. Springer-Verlag, 2002.
[18]
M. Newman, A.-L. Barabasi, and D. J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.
[19]
W. J. Reed and M. Jorgensen. The double Pareto-lognormal distribution - A new parametric model for size distributions. Communications in Statistics: Theory and Methods, 33(8):1733--1753, 2004.
[20]
W.-Y. Shieh, T.-F. Chen, J. J.-J. Shann, and C.-P. Chung. Inverted file compression through document identifier reassignment. Information Processing and Management, 39(1):117--131, 2003.
[21]
F. Silvestri, R. Perego, and S. Orlando. Assigning document identifiers to enhance compressibility of web search indexes. In Proceedings of the Symposium on Applied Computing, pages 600--605, 2004.
[22]
H. A. Simon. On a class of skew distribution functions. Biometrika, 42:425--440, 1955.
[23]
D. C. van Leijenhorst and T. P. van der Weide. A formal derivation of Heap's law. Information Sciences, 170:263--272, 2005.
[24]
D. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton, 2003.
[25]
H. E. Williams and J. Zobel. Searchable words on the web. International Journal on Digital Libraries, 5(2):99--105, 2005.
[26]
I. H. Witten and T. C. Bell. Source models for natural language text. International Journal Man-Machine Studies, 32(5):545--579, 1990.
[27]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.
[28]
G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge MA, 1949

Cited By

View all
  • (2023)A Unified Formulation for the Frequency Distribution of Word Frequencies using the Inverse Zipf's LawProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591942(1776-1780)Online publication date: 19-Jul-2023
  • (2018)An Energy- and Space-Efficient Object Representation Model in Pervasive Computing SystemsIEEE Systems Journal10.1109/JSYST.2016.257628812:2(1456-1466)Online publication date: Jun-2018
  • (2018)Selection Optimization of Bloom Filter-Based Index Services in Ubiquitous Embedded SystemsWeb Services – ICWS 201810.1007/978-3-319-94289-6_15(231-245)Online publication date: 19-Jun-2018
  • Show More Cited By

Index Terms

  1. Compressed web indexes

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    WWW '09: Proceedings of the 18th international conference on World wide web
    April 2009
    1280 pages
    ISBN:9781605584874
    DOI:10.1145/1526709

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 April 2009

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. compression
    2. double-pareto
    3. index size
    4. power law

    Qualifiers

    • Research-article

    Conference

    WWW '09
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)4
    • Downloads (Last 6 weeks)2
    Reflects downloads up to 13 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)A Unified Formulation for the Frequency Distribution of Word Frequencies using the Inverse Zipf's LawProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3539618.3591942(1776-1780)Online publication date: 19-Jul-2023
    • (2018)An Energy- and Space-Efficient Object Representation Model in Pervasive Computing SystemsIEEE Systems Journal10.1109/JSYST.2016.257628812:2(1456-1466)Online publication date: Jun-2018
    • (2018)Selection Optimization of Bloom Filter-Based Index Services in Ubiquitous Embedded SystemsWeb Services – ICWS 201810.1007/978-3-319-94289-6_15(231-245)Online publication date: 19-Jun-2018
    • (2017)On the Power Laws of LanguageProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3077136.3080821(385-394)Online publication date: 7-Aug-2017
    • (2015)A Bloom Filter-Based Index for Distributed Storage SystemsDistributed Computing and Artificial Intelligence, 12th International Conference10.1007/978-3-319-19638-1_34(293-301)Online publication date: 2015
    • (2014)Indexing and Compressing TextEncyclopedia of Information Science and Technology, Third Edition10.4018/978-1-4666-5888-2.ch173(1800-1808)Online publication date: 31-Jul-2014
    • (2014)CAPS: A cloud-assisted approach to handle spikes in peer-to-peer web searchPeer-to-Peer Networking and Applications10.1007/s12083-014-0322-y9:1(193-208)Online publication date: 19-Dec-2014
    • (2012)Improved retrieval effectiveness by efficient combination of term proximity and zone scoring: A simulation-based evaluationSimulation Modelling Practice and Theory10.1016/j.simpat.2011.12.00222(74-91)Online publication date: Mar-2012
    • (2012)A fast indexing algorithm optimization with user behavior patternProceedings of the 2012 international conference on Pervasive Computing and the Networked World10.1007/978-3-642-37015-1_52(592-605)Online publication date: 28-Nov-2012
    • (2012)Positional Data Organization and Compression in Web Inverted IndexesDatabase and Expert Systems Applications10.1007/978-3-642-32600-4_31(422-429)Online publication date: 2012
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media