research-article

Automatic retrieval of similar content using search engine query interface

Authors:
Ali Dasdan

Yahoo! Inc., Sunnyvale, CA, USA

Yahoo! Inc., Sunnyvale, CA, USA
View Profile

,
Paolo D'Alberto

Yahoo! Inc., Sunnyvale, CA, USA

Yahoo! Inc., Sunnyvale, CA, USA
View Profile

,
Santanu Kolay

Yahoo! Inc., Sunnyvale, CA, USA

Yahoo! Inc., Sunnyvale, CA, USA
View Profile

,
Chris Drome

Yahoo! Inc., Sunnyvale, CA, USA

Yahoo! Inc., Sunnyvale, CA, USA
View Profile

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementNovember 2009Pages 701–710https://doi.org/10.1145/1645953.1646043

Published:02 November 2009Publication History

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

Pages 701–710

ABSTRACT

We consider the coverage testing problem where we are given a document and a corpus with a limited query interface and asked to find if the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve this problem, we propose approaches that work in three main steps: generate a query signature from the document, query the corpus using the query signature and scrape the returned results, and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that these methods perform well across different search engine corpora and documents in multiple languages. They also are robust against performance parameter variations.

References

Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. J. ACM, 55(5):1--74, 2008. Google ScholarDigital Library
P. Baxendale. Machine-made index for technical literature experiment. IBM J. Research and Development, 2:354--361, October 1958.Google ScholarDigital Library
K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. Comput. Netw. ISDN Syst., 30(1-7):379--388, 1998. Google ScholarDigital Library
T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proc. Joint Conf. Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858--867. ACL, 2007.Google Scholar
S. Brin, J. Davis, and H. García-Molina. Copy detection mechanisms for digital documents. In Proc. Int. Conf. Management of Data (SIGMOD), pp. 398--409. ACM, 1995. Google ScholarDigital Library
A. Broder. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (SEQUENCES), page 21. IEEE, 1997. Google ScholarDigital Library
A. Broder, M. Fontura, V. Josifovski, R. Kumar, R. Motwani, S. Nabar, R. Panigrahy, A. Tomkins, and Y. Xu. Estimating corpus size via queries. In Proc. Int. Conf. Info. and Knowledge Management (CIKM), pp. 594--603. ACM, 2006. Google ScholarDigital Library
A. Z. Broder. Some applications of rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pp. 143--152. Springer-Verlag, 1993.Google ScholarCross Ref
A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc. Symp. Combinatorial Pattern Matching (COM), pp. 1--10. Springer-Verlag, 2000. Google ScholarDigital Library
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proc. Symp. Theory of Computing (STOC), pp. 327--336. ACM, 1998. Google ScholarDigital Library
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Comput. Netw. ISDN Syst., 29(8-13):1157--1166, 1997. Google ScholarDigital Library
S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2003.Google Scholar
M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. Symp. Theory of Computing (STOC), pp. 380--388. ACM, 2002. Google ScholarDigital Library
A. Dasdan, K. Tsioutsiouliklis, and E. Velipasaoglu. Web search engine metrics. Tutorial in Int. Conf. World Wide Web (WWW), ACM, 2009.Google Scholar
R. Ghani, R. Jones, and D. Mladenic. Automatic web search query generation to create minority language corpora. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 432--433. ACM, 2001. Google ScholarDigital Library
R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1994. Google ScholarDigital Library
M. Henzinger. Finding near-duplicate web pp.: a large-scale evaluation of algorithms. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 284--291. ACM, 2006. Google ScholarDigital Library
K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422--446, 2002. Google ScholarDigital Library
H. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM J. Research and Development, 1(4):309--317, 1957.Google ScholarDigital Library
H. Luhn. The automatic creation of literature abstracts. IBM J. Research and Development, 2(2), 1958.Google ScholarDigital Library
G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. Int. Conf. World Wide Web (WWW), pp. 141--150. ACM, 2007. Google ScholarDigital Library
F. D. McSherry, K. Talwar, and M. D. Manasse. Consistent weighted sampling of multisets and distributions. U.S. Patent Appl., Sep 2008.Google Scholar
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarDigital Library
T. Noreault, M. McGill, and M. Koll. A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 57--76. ACM, 1981. Google ScholarDigital Library
A. Pereira Jr. and N. Ziviani. Retrieving similar documents from the Web. J. Web Engineering, 2(4):247--261, 2004. Google ScholarDigital Library
N. Shivakumar and H. García-Molina. SCAM: A copy detection mechanism for digital documents. In Proc. Int. Conf. Digital Libraries (DL). ACM, 1995.Google Scholar
N. Shivakumar and H. García-Molina. Building a scalable and accurate copy detection mechanism. In Proc. Int. Conf. Digital Libraries (DL), pp. 160--168. ACM, 1996. Google ScholarDigital Library
D. R. Swanson. Historical note: Information retrieval and the future of an illusion. J. American Society for Information Science, 39(2):92--98, 1988.Google ScholarCross Ref
Q. Tan, Z. Zhuang, P. Mitra, and C. L. Giles. Designing efficient sampling techniques to detect webpage updates. In Proc. Int. Conf. World Wide Web (WWW), pp. 1147--1148. ACM, 2007. Google ScholarDigital Library
Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, and D. Papadias. Query by document. In Proc. Int. Conf. Web Search and Data Mining, pp. 34--43. ACM, 2009. Google ScholarDigital Library

Index Terms

Automatic retrieval of similar content using search engine query interface
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Recommendations

Re-ranking search results using query logs
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

This work addresses two common problems in search, frequently occurring with underspecified user queries: the top-ranked results for such queries may not contain documents relevant to the user's search intent, and fresh and relevant pages may not get ...
Read More
A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records
WI-IAT '11: Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01

The retrieval of similar documents from the Web using documents as input instead of key-term queries is not currently supported by traditional Web search engines. One approach for solving the problem consists of fingerprint the document's content into a ...
Read More
Automatic search interface clustering and search result processing in metasearch engine
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
November 2009
2162 pages
ISBN:9781605585123
DOI:10.1145/1645953
General Chairs:
David Cheung
University of Hong Kong, Hong Kong
,
Il-Yeol Song
Drexel University, USA
,
Program Chairs:
Wesley Chu
UCLA, USA
,
Xiaohua Hu
Drexel University, USA
,
Jimmy Lin
University of Maryland, USA
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 2 November 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
coverage
keyword extraction
query generation
random sampling
strong query
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 17
  Total Citations
  View Citations
- 379
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Automatic retrieval of similar content using search engine query interface

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Re-ranking search results using query logs

A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records

Automatic search interface clustering and search result processing in metasearch engine

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Automatic retrieval of similar content using search engine query interface

CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Re-ranking search results using query logs

A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records

Automatic search interface clustering and search result processing in metasearch engine

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media