skip to main content
article

Evaluating top-k queries over web-accessible databases

Published:01 June 2004Publication History
Skip Abstract Section

Abstract

A query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or "top" k pages for the query. This top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. For example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. A user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. Processing top-k queries efficiently is challenging for a number of reasons. One critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. In this article, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. We present a sequential algorithm for processing such queries, but observe that any sequential top-k query processing strategy is bound to require unnecessarily long query processing times, since web accesses exhibit high and variable latency. Fortunately, web sources can be probed in parallel, and each source can typically process concurrent requests, although sources may impose some restrictions on the type and number of probes that they are willing to accept. We adapt our sequential query processing technique and introduce an efficient algorithm that maximizes source-access parallelism to minimize query response time, while satisfying source-access constraints. We evaluate our techniques experimentally using both synthetic and real web-accessible data and show that parallel algorithms can be significantly more efficient than their sequential counterparts.

References

  1. Avnur, R. and Hellerstein, J. M. 2000. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD'00). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Blake, C. and Merz, C. 1998. UCI repository of machine learning databases. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/covtype.]]Google ScholarGoogle Scholar
  3. Bruno, N., Chaudhuri, S., and Gravano, L. 2002a. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. ACM Trans. Datab. Syst. 27, 2.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bruno, N., Gravano, L., and Marian, A. 2002b. Evaluating top-k queries over web-accessible databases. In Proceedings of the 2002 International Conference on Data Engineering (ICDE'02).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Carey, M. J. and Kossmann, D. 1997. On saying "Enough Already!" in SQL. In Proceedings of the 1997 ACM International Conference on Management of Data (SIGMOD'97) ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Carey, M. J. and Kossmann, D. 1998. Reducing the braking distance of an SQL query engine. In Proceedings of the 24th International Conference on Very Large Databases (VLDB'98).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chang, K. C.-C. and Hwang, S.-W. 2002. Minimal probing: Supporting expensive predicates for top-k queries. In Proceedings of the 2002 ACM International Conference on Management of Data (SIGMOD'02). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Chaudhuri, S. and Gravano, L. 1996. Optimizing queries over multimedia repositories. In Proceedings of the 1996 ACM International Conference on Management of Data (SIGMOD'96). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chaudhuri, S., Gravano, L., and Marian, A. 2004. Optimizing top-k selection queries over multimedia repositories. IEEE Trans. Knowl. Data Eng. (TKDE).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chen, C.-M. and Ling, Y. 2002. A sampling-based estimator for top-k query. In Proceedings of the 2002 International Conference on Data Engineering (ICDE'02).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Donjerkovic, D. and Ramakrishnan, R. 1999. Probabilistic optimization of top n queries. In Proceedings of the 25th International Conference on Very Large Databases (VLDB'99).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Fagin, R. 1996. Combining fuzzy information from multiple systems. In Proceedings of the 15th ACM Symposium on Principles of Database Systems (PODS'96). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Fagin, R., Lotem, A., and Naor, M. 2001. Optimal aggregation algorithms for middleware. In Proceedings of the 20th ACM Symposium on Principles of Database Systems (PODS'01). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Fagin, R., Lotem, A., and Naor, M. 2003. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66, 4.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Freedman, D., Pisani, R., and Purves, R. 1997. Statistics. W.W. Norton & Company; 3rd edition.]]Google ScholarGoogle Scholar
  16. Goldman, R. and Widom, J. 2000. WSQ/DSQ: A practical approach for combined querying of databases and the web. In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD'00). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Gravano, L., Marian, A., and Bruno, N. 2002. Evaluating top-k queries over web-accessible databases. Tech. Rep., Columbia Univ., New York.]]Google ScholarGoogle Scholar
  18. Güntzer, U., Balke, W.-T., and Kieβling, W. 2000. Optimizing multi-feature queries for image databases. In Proceedings of the 26th International Conference on Very Large Databases (VLDB'00).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Hellerstein, J. M. and Stonebraker, M. 1993. Predicate migration: Optimizing queries with expensive predicates. In Proceedings of the 1993 ACM International Conference on Management of Data (SIGMOD'93). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Hristidis, V., Koudas, N., and Papakonstantinou, Y. 2001. PREFER: A system for the efficient execution of multi-parametric ranked queries. In Proceedings of the 2001 ACM International Conference on Management of Data (SIGMOD'01). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Kemper, A., Moerkotte, G., Peithner, K., and Steinbrunn, M. 1994. Optimizing disjunctive queries with expensive predicates. In Proceedings of the 1994 ACM International Conference on Management of Data (SIGMOD'94). ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Mills, D. 1983. Internet delay experiments; RFC 889. In ARPANET Working Group Requests for Comments. Number 889. SRI International, Menlo Park, Calif.]] Google ScholarGoogle Scholar
  23. Natsev, A., Chang, Y.-C., Smith, J. R., Li, C.-S., and Vitter, J. S. 2001. Supporting incremental join queries on ranked inputs. In Proceedings of the 27th International Conference on Very Large Databases (VLDB'01).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Nepal, S. and Ramakrishna, M. V. 1999. Query processing issues in image (multimedia) databases. In Proceedings of the 1999 International Conference on Data Engineering (ICDE'99).]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ortega, M., Rui, Y., Chakrabarti, K., Porkaew, K., Mehrotra, S., and Huang, T. S. 1998. Supporting ranked Boolean similarity queries in MARS. IEEE Trans. Know. Data Eng. (TKDE) 10, 6, 905--925.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Williams, S. A., Press, H., Flannery, B. P., and Vetterling, W. T. 1993. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Evaluating top-k queries over web-accessible databases

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in

              Full Access

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader