skip to main content
10.1145/1871437.1871459acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Efficient set-correlation operator inside databases

Authors Info & Claims
Published:26 October 2010Publication History

ABSTRACT

Large scale of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, which are often stored as set records in (hidden) databases. Many interesting information retrieval tasks are correspondingly raised on the correlation query over these short text records, such as finding hot topics over news highlights and searching related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly provide support on set correlation query. Thus, in this paper, we address both the effectiveness and efficiency issues of set correlation query over set records in databases. First, we present a framework of set correlation query inside databases. To our best knowledge, only the Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Thereby, we propose a novel correlation coefficient to extend Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies to set up correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that, with a proper setting of filtering threshold, we can improve the query efficiency with a little effectiveness loss. Finally, we conduct extensive experiments to show the effectiveness and efficiency of proposed correlation query and optimization strategies.

References

  1. J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In SIGIR, pages 314--321, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. N. Balasubramanian, J. Allan, and W. B. Croft. A comparison of sentence retrieval techniques. In SIGIR, pages 813--814, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. J. L. Beckmann, A. Halverson, R. Krishnamurthy, and J. F. Naughton. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In ICDE, page 58, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39--48, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD Conference, pages 265--276, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. Cao, J.-Y. Nie, and J. Bai. Integrating word relationships into language models. In SIGIR, pages 298--305, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P.-A. Chirita, C. S. Firan, and W. Nejdl. Personalized query expansion for the web. In SIGIR, pages 7--14, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD Conference, pages 201--212, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391--407, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  13. P. Fung and L. Y. Yee. An ir approach for translating new words from nonparallel, comparable texts. In COLING-ACL, pages 414--420, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an rdbms for web data integration. In WWW, pages 90--101, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: selectivity estimators for set similarity selection queries. PVLDB, 1(1):201--212, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289--296, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50--57, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Jain, A. Doan, and L. Gravano. SQL queries over unstructured text databases. In ICDE, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  19. C. Jermaine. The computational complexity of high dimensional correlation search. In ICDM, pages 249--256, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Jin, J. Y. Chai, and L. Si. Learn to weight terms in information retrieval using category information. In ICML'05, pages 353--360, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. K. Lang. Newsweeder: Learning to filter netnews. In ICML, pages 331--339, 1995.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658--669, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361--397, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. X. Li and W. B. Croft. Novelty detection based on sentence level patterns. In CIKM, pages 744--751, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. X. Li and W. B. Croft. Improving novelty detection for general topics using sentence level information patterns. In CIKM, pages 238--247, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Liu, F. Liu, C. Yu, and W. Meng. An effective approach to document retrieval via utilizing wordnet and recognizing phrases. In SIGIR, pages 266--272, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. D. Metzler, S. T. Dumais, and C. Meek. Similarity measures for short segments of text. In ECIR, pages 16--27, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. V. Murdock and W. B. Croft. A translation model for sentence retrieval. In HLT/EMNLP, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S. Robertson. Understanding inverse document frequency: On theoretical argument for idf. Journal of Documentation, 60(5):503--520, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  30. M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW, pages 377--386, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. K. Sparck Jones. Index term weighting. Information Storage and Retrieval, 9(11):619--633, 1973.Google ScholarGoogle ScholarCross RefCross Ref
  34. M. Theobald, R. Schenkel, and G. Weikum. Efficient and self-tuning incremental query expansion for top-k query processing. In SIGIR, pages 242--249, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. R. W. White and J. M. Jose. A study of topic similarity measures. In SIGIR, pages 520--521, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Exploiting a support-based upper bound of pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In KDD, pages 334--343, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Taper: A two-step approach for all-strong-pairs correlation query in large databases. IEEE Trans. Knowl. Data Eng., 18(4):493--508, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient set-correlation operator inside databases

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
      October 2010
      2036 pages
      ISBN:9781450300995
      DOI:10.1145/1871437

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 October 2010

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader