skip to main content
10.1145/375551.383213acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article

On the effects of dimensionality reduction on high dimensional similarity search

Published:01 May 2001Publication History

ABSTRACT

The dimensionality curse has profound effects on the effectiveness of high-dimensional similarity indexing from the performance perspective. One of the well known techniques for improving the indexing performance is the method of dimensionality reduction. In this technique, the data is transformed to a lower dimensional space by finding a new axis-system in which most of the data variance is preserved in a few dimensions. This reduction may also have a positive effect on the quality of similarity for certain data domains such as text. For other domains, it may lead to loss of information and degradation of search quality. Recent research indicates that the improvement for the text domain is caused by the re-enforcement of the semantic concepts in the data. In this paper, we provide an intuitive model of the effects of dimensionality reduction on arbitrary high dimensional problems. We provide an effective diagnosis of the causality behind the qualitative effects of dimensionality reduction on a given data set. The analysis suggests that these effects are very data dependent. Our analysis also indicates that currently accepted techniques of picking the reduction which results in the least loss of information are useful for maximizing precision and recall, but are not necessarily optimum from a qualitative perspective. We demonstrate that by making simple changes to the implementation details of dimensionality reduction techniques, we can considerably improve the quality of similarity search.

References

  1. 1.C. C. Aggarwal, A. Hinneburg, D. A. Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference Proceedings, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2.C. C. Aggarwal, P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Spaces. ACM SIGMOD Conference Proceedings, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3.C. C. Aggarwal, P. S. Yu. The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space. ACM SIGKDD Conference Proceedings, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. 4.S. Berchtold, D. A. Keim, H.-P. Kriegel. The X-Tree: An Index Structure for High Dimensional Data. VLDB Conference Proceedings, pages 28-39, September 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.K. Beyer et al. When is Nearest Neighbors Meaningful? ICDT Conference, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6.K. Chakrabarti, S. Mehrotra. Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB Conference Proceedings, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. 7.S. Deerwester et al. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): pages 391-407, 1990.]]Google ScholarGoogle ScholarCross RefCross Ref
  8. 8.C. Ding. A Similarity Based Probability Model for Latent Semantic Indexing. ACM SIGIR Conference Proceedings, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.C. Faloutsos, K.-I. Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. ACM SIGMOD Conference Proceedings, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10.A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD Conference Proceedings 1984.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. 11.A. Hinneburg, C. C. Aggarwal, D. A. Keim. What is the Nearest Neighbor in High Dimensional Spaces? VLDB Conference Proceedings, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12.I. T. Jolliffe. Principal Component Analysis, Springer-Verlag, New York, 1986.]]Google ScholarGoogle ScholarCross RefCross Ref
  13. 13.J. Kleinberg, A. Tomkins. Applications of Linear Algebra in Information Retrieval and Hypertext Analysis. ACM PODS Conference Proceedings, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. 14.K.-I. Lin, H. V. Jagadish, C. Faloutsos. The TV-tree: An Index Structure for High Dimensional Data. VLDB Journal, pages 517-542, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15.B.-U. Pagel, F. Korn, C. Faloutsos. De ating the Dimensionality Curse Using Multiple Fractal Dimensions. ICDE Conference Proceedings, 2000.]]Google ScholarGoogle Scholar
  16. 16.C.-H. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala. Latent Semantic Indexing: A Probabilistic Analysis. ACM PODS Conference Proceedings, pages 159-168, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. 17.K. V. Ravi Kanth, D. Agrawal, A. Singh. Dimensionality Reduction for Similarity Search in Dynamic Databases. ACM SIGMOD Conference Proceedings, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. 18.N. Roussopoulos, S. Kelley, F. Vincent. Nearest Neighbor Queries. ACM SIGMOD Conference Proceedings, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. 19.H. Samet. Design and Analysis of Spatial Data Structures. Addison Wesley, 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. 20.T. Sellis, N. Roussopoulos, C. Faloutsos. The R+ Tree: A Dynamic Index for Multidimensional Objects. VLDB Conference Proceedings, pages 507-518, 1987.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. 21.R. Weber, H.-J. Scheck, S. Blott. A Quantitative Analysis and Performance Study for Similarity Search Methods in High Dimensional Spaces. VLDB Conference Proceedings, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. On the effects of dimensionality reduction on high dimensional similarity search

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
          May 2001
          301 pages
          ISBN:1581133618
          DOI:10.1145/375551

          Copyright © 2001 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 May 2001

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • Article

          Acceptance Rates

          PODS '01 Paper Acceptance Rate26of99submissions,26%Overall Acceptance Rate642of2,707submissions,24%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader