ABSTRACT
Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping their number low enough that storage and learning costs remain manageable. Naive approaches can become infeasible very quickly. In this paper we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the number of needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics.
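To make the terms concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that the coverage of a source for a query is the fraction of the query's answer tuples that the source returns, and that overlap is the fraction returned by both of two sources. The `roll_up` helper is a hypothetical illustration of threshold-based resolution: query classes seen too rarely are folded into their parent class in an assumed hierarchy, so fewer statistics need to be stored. All function names, the `parent` map, and the `min_support` parameter are assumptions made for this example.

```python
from typing import Dict, Set


def coverage(source_answers: Set[str], all_answers: Set[str]) -> float:
    """Fraction of all answer tuples that one source provides (assumed definition)."""
    if not all_answers:
        return 0.0
    return len(source_answers & all_answers) / len(all_answers)


def overlap(a: Set[str], b: Set[str], all_answers: Set[str]) -> float:
    """Fraction of all answer tuples that both sources provide (assumed definition)."""
    if not all_answers:
        return 0.0
    return len(a & b & all_answers) / len(all_answers)


def roll_up(counts: Dict[str, int], parent: Dict[str, str],
            min_support: float) -> Dict[str, int]:
    """Fold query classes whose relative frequency is below min_support
    into their parent class. Classes without a parent are always kept.
    This is a hypothetical illustration, not the method from the paper."""
    total = sum(counts.values())
    kept = dict(counts)
    if total == 0:
        return kept
    changed = True
    while changed:
        changed = False
        for cls in list(kept):
            if cls in parent and kept[cls] / total < min_support:
                # Merge the rare class into its parent and re-check next pass.
                kept[parent[cls]] = kept.get(parent[cls], 0) + kept.pop(cls)
                changed = True
    return kept


# Toy data: four answer tuples and two sources.
all_answers = {"t1", "t2", "t3", "t4"}
s1 = {"t1", "t2", "t3"}
s2 = {"t3", "t4"}
cov_s1 = coverage(s1, all_answers)
ovl = overlap(s1, s2, all_answers)

# Toy query log: class "db.author" is too rare and is folded into "db".
query_counts = {"db.author": 1, "db.year": 9, "ai": 90}
hierarchy = {"db.author": "db", "db.year": "db"}
classes = roll_up(query_counts, hierarchy, min_support=0.05)
```

In this toy setting, raising `min_support` trades resolution for storage: more classes are merged upward, and statistics are kept only at the coarser level.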
Index Terms
- Mining coverage statistics for websource selection in a mediator