ABSTRACT
The curse of dimensionality severely degrades the performance of high-dimensional similarity indexing. One well-known technique for improving indexing performance is dimensionality reduction: the data is transformed to a lower-dimensional space by finding a new axis system in which most of the data variance is preserved in a few dimensions. For certain data domains, such as text, this reduction may also improve the quality of similarity search; for other domains, it may lose information and degrade search quality. Recent research indicates that the improvement in the text domain arises from the reinforcement of the semantic concepts in the data. In this paper, we provide an intuitive model of the effects of dimensionality reduction on arbitrary high-dimensional problems, along with an effective diagnosis of the causes behind its qualitative effects on a given data set. The analysis suggests that these effects are highly data dependent. It also indicates that the currently accepted practice of picking the reduction that loses the least information is useful for maximizing precision and recall, but is not necessarily optimal from a qualitative perspective. We demonstrate that simple changes to the implementation details of dimensionality reduction techniques can considerably improve the quality of similarity search.
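As a concrete sketch of the kind of reduction discussed in the abstract, the following illustrates variance-preserving dimensionality reduction (PCA computed via the SVD) applied before a nearest-neighbor search. The data set, the target dimensionality `k`, and the query are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))        # hypothetical data: 500 points in 50 dimensions

# Center the data and compute the principal axes via SVD.
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the k axes that preserve the most variance (k is an illustrative choice).
k = 10
Z = Xc @ Vt[:k].T                     # reduced representation, shape (500, k)
retained = (S[:k] ** 2).sum() / (S ** 2).sum()  # fraction of variance preserved

# Nearest-neighbor search in the reduced space.
q = rng.normal(size=50)
qz = (q - mean) @ Vt[:k].T
dists = np.linalg.norm(Z - qz, axis=1)
nn = int(np.argmin(dists))
```

Distances in the reduced space lower-bound the original Euclidean distances (projection onto orthonormal axes can only shrink them), which is why such reductions are commonly used as a filtering step in similarity indexing.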
Index Terms
- On the effects of dimensionality reduction on high dimensional similarity search