ABSTRACT
Large scale of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, which are often stored as set records in (hidden) databases. Many interesting information retrieval tasks are correspondingly raised on the correlation query over these short text records, such as finding hot topics over news highlights and searching related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly provide support on set correlation query. Thus, in this paper, we address both the effectiveness and efficiency issues of set correlation query over set records in databases. First, we present a framework of set correlation query inside databases. To our best knowledge, only the Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Thereby, we propose a novel correlation coefficient to extend Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies to set up correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that, with a proper setting of filtering threshold, we can improve the query efficiency with a little effectiveness loss. Finally, we conduct extensive experiments to show the effectiveness and efficiency of proposed correlation query and optimization strategies.
- J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In SIGIR, pages 314--321, 2003. Google ScholarDigital Library
- A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarDigital Library
- A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarDigital Library
- N. Balasubramanian, J. Allan, and W. B. Croft. A comparison of sentence retrieval techniques. In SIGIR, pages 813--814, 2007. Google ScholarDigital Library
- J. L. Beckmann, A. Halverson, R. Krishnamurthy, and J. F. Naughton. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In ICDE, page 58, 2006. Google ScholarDigital Library
- M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39--48, 2003. Google ScholarDigital Library
- S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD Conference, pages 265--276, 1997. Google ScholarDigital Library
- G. Cao, J.-Y. Nie, and J. Bai. Integrating word relationships into language models. In SIGIR, pages 298--305, 2005. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. Google ScholarDigital Library
- P.-A. Chirita, C. S. Firan, and W. Nejdl. Personalized query expansion for the web. In SIGIR, pages 7--14, 2007. Google ScholarDigital Library
- W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD Conference, pages 201--212, 1998. Google ScholarDigital Library
- S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391--407, 1990.Google ScholarCross Ref
- P. Fung and L. Y. Yee. An ir approach for translating new words from nonparallel, comparable texts. In COLING-ACL, pages 414--420, 1998. Google ScholarDigital Library
- L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an rdbms for web data integration. In WWW, pages 90--101, 2003. Google ScholarDigital Library
- M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: selectivity estimators for set similarity selection queries. PVLDB, 1(1):201--212, 2008. Google ScholarDigital Library
- T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289--296, 1999. Google ScholarDigital Library
- T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50--57, 1999. Google ScholarDigital Library
- A. Jain, A. Doan, and L. Gravano. SQL queries over unstructured text databases. In ICDE, 2007.Google ScholarCross Ref
- C. Jermaine. The computational complexity of high dimensional correlation search. In ICDM, pages 249--256, 2001. Google ScholarDigital Library
- R. Jin, J. Y. Chai, and L. Si. Learn to weight terms in information retrieval using category information. In ICML'05, pages 353--360, 2005. Google ScholarDigital Library
- K. Lang. Newsweeder: Learning to filter netnews. In ICML, pages 331--339, 1995.Google ScholarDigital Library
- H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658--669, 2009. Google ScholarDigital Library
- D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361--397, 2004. Google ScholarDigital Library
- X. Li and W. B. Croft. Novelty detection based on sentence level patterns. In CIKM, pages 744--751, 2005. Google ScholarDigital Library
- X. Li and W. B. Croft. Improving novelty detection for general topics using sentence level information patterns. In CIKM, pages 238--247, 2006. Google ScholarDigital Library
- S. Liu, F. Liu, C. Yu, and W. Meng. An effective approach to document retrieval via utilizing wordnet and recognizing phrases. In SIGIR, pages 266--272, 2004. Google ScholarDigital Library
- D. Metzler, S. T. Dumais, and C. Meek. Similarity measures for short segments of text. In ECIR, pages 16--27, 2007. Google ScholarDigital Library
- V. Murdock and W. B. Croft. A translation model for sentence retrieval. In HLT/EMNLP, 2005. Google ScholarDigital Library
- S. Robertson. Understanding inverse document frequency: On theoretical argument for idf. Journal of Documentation, 60(5):503--520, 2004.Google ScholarCross Ref
- M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW, pages 377--386, 2006. Google ScholarDigital Library
- G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989. Google ScholarDigital Library
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002. Google ScholarDigital Library
- K. Sparck Jones. Index term weighting. Information Storage and Retrieval, 9(11):619--633, 1973.Google ScholarCross Ref
- M. Theobald, R. Schenkel, and G. Weikum. Efficient and self-tuning incremental query expansion for top-k query processing. In SIGIR, pages 242--249, 2005. Google ScholarDigital Library
- C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979. Google ScholarDigital Library
- R. W. White and J. M. Jose. A study of topic similarity measures. In SIGIR, pages 520--521, 2004. Google ScholarDigital Library
- H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Exploiting a support-based upper bound of pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In KDD, pages 334--343, 2004. Google ScholarDigital Library
- H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Taper: A two-step approach for all-strong-pairs correlation query in large databases. IEEE Trans. Knowl. Data Eng., 18(4):493--508, 2006. Google ScholarDigital Library
Index Terms
- Efficient set-correlation operator inside databases
Recommendations
On multi-set canonical correlation analysis
IJCNN'09: Proceedings of the 2009 international joint conference on Neural NetworksTwo-and multi-set canonical correlation analysis (CCA) and (MCCA) techniques are used to find linear combinations that give maximal multivariate differences. This paper describes methods for deriving MCCA dynamical systems which converge to the desired ...
Correlation search in graph databases
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data miningCorrelation mining has gained great success in many application domains for its ability to capture the underlying dependency between objects. However, the research of correlation mining from graph databases is still lacking despite the fact that graph ...
Linear correlation discovery in databases: a data mining approach
Very little research in knowledge discovery has studied how to incorporate statistical methods to automate linear correlation discovery (LCD). We present an automatic LCD methodology that adopts statistical measurement functions to discover correlations ...
Comments