research-article

Efficient set-correlation operator inside databases

Authors:
Shaoxu Song

The Hong Kong University of Science and Technology, Hong Kong, Hong Kong

The Hong Kong University of Science and Technology, Hong Kong, Hong Kong
View Profile

,
Lei Chen

The Hong Kong University of Science and Technology, Hong Kong, Hong Kong

The Hong Kong University of Science and Technology, Hong Kong, Hong Kong
View Profile

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementOctober 2010Pages 139–148https://doi.org/10.1145/1871437.1871459

Published:26 October 2010Publication History

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Pages 139–148

ABSTRACT

Large scale of short text records are now prevalent, such as news highlights, scientific paper citations, and posted messages in a discussion forum, which are often stored as set records in (hidden) databases. Many interesting information retrieval tasks are correspondingly raised on the correlation query over these short text records, such as finding hot topics over news highlights and searching related scientific papers on a certain topic. However, current relational database management systems (RDBMS) do not directly provide support on set correlation query. Thus, in this paper, we address both the effectiveness and efficiency issues of set correlation query over set records in databases. First, we present a framework of set correlation query inside databases. To our best knowledge, only the Pearson's correlation can be implemented to construct token correlations by using RDBMS facilities. Thereby, we propose a novel correlation coefficient to extend Pearson's correlation, and provide a pure-SQL implementation inside databases. We further propose optimal strategies to set up correlation filtering threshold, which can greatly reduce the query time. Our theoretical analysis proves that, with a proper setting of filtering threshold, we can improve the query efficiency with a little effectiveness loss. Finally, we conduct extensive experiments to show the effectiveness and efficiency of proposed correlation query and optimization strategies.

References

J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In SIGIR, pages 314--321, 2003. Google ScholarDigital Library
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarDigital Library
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarDigital Library
N. Balasubramanian, J. Allan, and W. B. Croft. A comparison of sentence retrieval techniques. In SIGIR, pages 813--814, 2007. Google ScholarDigital Library
J. L. Beckmann, A. Halverson, R. Krishnamurthy, and J. F. Naughton. Extending RDBMSs to support sparse datasets using an interpreted attribute storage format. In ICDE, page 58, 2006. Google ScholarDigital Library
M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39--48, 2003. Google ScholarDigital Library
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD Conference, pages 265--276, 1997. Google ScholarDigital Library
G. Cao, J.-Y. Nie, and J. Bai. Integrating word relationships into language models. In SIGIR, pages 298--305, 2005. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. Google ScholarDigital Library
P.-A. Chirita, C. S. Firan, and W. Nejdl. Personalized query expansion for the web. In SIGIR, pages 7--14, 2007. Google ScholarDigital Library
W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD Conference, pages 201--212, 1998. Google ScholarDigital Library
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391--407, 1990.Google ScholarCross Ref
P. Fung and L. Y. Yee. An ir approach for translating new words from nonparallel, comparable texts. In COLING-ACL, pages 414--420, 1998. Google ScholarDigital Library
L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an rdbms for web data integration. In WWW, pages 90--101, 2003. Google ScholarDigital Library
M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: selectivity estimators for set similarity selection queries. PVLDB, 1(1):201--212, 2008. Google ScholarDigital Library
T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289--296, 1999. Google ScholarDigital Library
T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50--57, 1999. Google ScholarDigital Library
A. Jain, A. Doan, and L. Gravano. SQL queries over unstructured text databases. In ICDE, 2007.Google ScholarCross Ref
C. Jermaine. The computational complexity of high dimensional correlation search. In ICDM, pages 249--256, 2001. Google ScholarDigital Library
R. Jin, J. Y. Chai, and L. Si. Learn to weight terms in information retrieval using category information. In ICML'05, pages 353--360, 2005. Google ScholarDigital Library
K. Lang. Newsweeder: Learning to filter netnews. In ICML, pages 331--339, 1995.Google ScholarDigital Library
H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658--669, 2009. Google ScholarDigital Library
D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361--397, 2004. Google ScholarDigital Library
X. Li and W. B. Croft. Novelty detection based on sentence level patterns. In CIKM, pages 744--751, 2005. Google ScholarDigital Library
X. Li and W. B. Croft. Improving novelty detection for general topics using sentence level information patterns. In CIKM, pages 238--247, 2006. Google ScholarDigital Library
S. Liu, F. Liu, C. Yu, and W. Meng. An effective approach to document retrieval via utilizing wordnet and recognizing phrases. In SIGIR, pages 266--272, 2004. Google ScholarDigital Library
D. Metzler, S. T. Dumais, and C. Meek. Similarity measures for short segments of text. In ECIR, pages 16--27, 2007. Google ScholarDigital Library
V. Murdock and W. B. Croft. A translation model for sentence retrieval. In HLT/EMNLP, 2005. Google ScholarDigital Library
S. Robertson. Understanding inverse document frequency: On theoretical argument for idf. Journal of Documentation, 60(5):503--520, 2004.Google ScholarCross Ref
M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW, pages 377--386, 2006. Google ScholarDigital Library
G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989. Google ScholarDigital Library
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002. Google ScholarDigital Library
K. Sparck Jones. Index term weighting. Information Storage and Retrieval, 9(11):619--633, 1973.Google ScholarCross Ref
M. Theobald, R. Schenkel, and G. Weikum. Efficient and self-tuning incremental query expansion for top-k query processing. In SIGIR, pages 242--249, 2005. Google ScholarDigital Library
C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 1979. Google ScholarDigital Library
R. W. White and J. M. Jose. A study of topic similarity measures. In SIGIR, pages 520--521, 2004. Google ScholarDigital Library
H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Exploiting a support-based upper bound of pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In KDD, pages 334--343, 2004. Google ScholarDigital Library
H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Taper: A two-step approach for all-strong-pairs correlation query in large databases. IEEE Trans. Knowl. Data Eng., 18(4):493--508, 2006. Google ScholarDigital Library

Index Terms

Efficient set-correlation operator inside databases
1. Information systems
  1. Information systems applications

Recommendations

On multi-set canonical correlation analysis
IJCNN'09: Proceedings of the 2009 international joint conference on Neural Networks

Two-and multi-set canonical correlation analysis (CCA) and (MCCA) techniques are used to find linear combinations that give maximal multivariate differences. This paper describes methods for deriving MCCA dynamical systems which converge to the desired ...
Read More
Correlation search in graph databases
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Correlation mining has gained great success in many application domains for its ability to capture the underlying dependency between objects. However, the research of correlation mining from graph databases is still lacking despite the fact that graph ...
Read More
Linear correlation discovery in databases: a data mining approach

Very little research in knowledge discovery has studied how to incorporate statistical methods to automate linear correlation discovery (LCD). We present an automatic LCD methodology that adopts statistical measurement functions to discover correlations ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437
General Chair:
Jimmy Huang
York University, Canada
,
Program Chairs:
Nick Koudas
University of Toronto, Canada
,
Gareth Jones
Dublin City University, Ireland
,
Xindong Wu
University of Vermont, USA
,
Kevyn Collins-Thompson
Microsoft Research, USA
,
Aijun An
York University, Canada
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 October 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
correlation query
set record
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 3
  Total Citations
  View Citations
- 233
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient set-correlation operator inside databases

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

On multi-set canonical correlation analysis

Correlation search in graph databases

Linear correlation discovery in databases: a data mining approach