ABSTRACT
We consider the coverage testing problem where we are given a document and a corpus with a limited query interface and asked to find if the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve this problem, we propose approaches that work in three main steps: generate a query signature from the document, query the corpus using the query signature and scrape the returned results, and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that these methods perform well across different search engine corpora and documents in multiple languages. They also are robust against performance parameter variations.
- Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. J. ACM, 55(5):1--74, 2008. Google ScholarDigital Library
- P. Baxendale. Machine-made index for technical literature experiment. IBM J. Research and Development, 2:354--361, October 1958.Google ScholarDigital Library
- K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. Comput. Netw. ISDN Syst., 30(1-7):379--388, 1998. Google ScholarDigital Library
- T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proc. Joint Conf. Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858--867. ACL, 2007.Google Scholar
- S. Brin, J. Davis, and H. García-Molina. Copy detection mechanisms for digital documents. In Proc. Int. Conf. Management of Data (SIGMOD), pp. 398--409. ACM, 1995. Google ScholarDigital Library
- A. Broder. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (SEQUENCES), page 21. IEEE, 1997. Google ScholarDigital Library
- A. Broder, M. Fontura, V. Josifovski, R. Kumar, R. Motwani, S. Nabar, R. Panigrahy, A. Tomkins, and Y. Xu. Estimating corpus size via queries. In Proc. Int. Conf. Info. and Knowledge Management (CIKM), pp. 594--603. ACM, 2006. Google ScholarDigital Library
- A. Z. Broder. Some applications of rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pp. 143--152. Springer-Verlag, 1993.Google ScholarCross Ref
- A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc. Symp. Combinatorial Pattern Matching (COM), pp. 1--10. Springer-Verlag, 2000. Google ScholarDigital Library
- A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proc. Symp. Theory of Computing (STOC), pp. 327--336. ACM, 1998. Google ScholarDigital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Comput. Netw. ISDN Syst., 29(8-13):1157--1166, 1997. Google ScholarDigital Library
- S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2003.Google Scholar
- M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. Symp. Theory of Computing (STOC), pp. 380--388. ACM, 2002. Google ScholarDigital Library
- A. Dasdan, K. Tsioutsiouliklis, and E. Velipasaoglu. Web search engine metrics. Tutorial in Int. Conf. World Wide Web (WWW), ACM, 2009.Google Scholar
- R. Ghani, R. Jones, and D. Mladenic. Automatic web search query generation to create minority language corpora. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 432--433. ACM, 2001. Google ScholarDigital Library
- R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1994. Google ScholarDigital Library
- M. Henzinger. Finding near-duplicate web pp.: a large-scale evaluation of algorithms. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 284--291. ACM, 2006. Google ScholarDigital Library
- K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422--446, 2002. Google ScholarDigital Library
- H. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM J. Research and Development, 1(4):309--317, 1957.Google ScholarDigital Library
- H. Luhn. The automatic creation of literature abstracts. IBM J. Research and Development, 2(2), 1958.Google ScholarDigital Library
- G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. Int. Conf. World Wide Web (WWW), pp. 141--150. ACM, 2007. Google ScholarDigital Library
- F. D. McSherry, K. Talwar, and M. D. Manasse. Consistent weighted sampling of multisets and distributions. U.S. Patent Appl., Sep 2008.Google Scholar
- R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarDigital Library
- T. Noreault, M. McGill, and M. Koll. A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 57--76. ACM, 1981. Google ScholarDigital Library
- A. Pereira Jr. and N. Ziviani. Retrieving similar documents from the Web. J. Web Engineering, 2(4):247--261, 2004. Google ScholarDigital Library
- N. Shivakumar and H. García-Molina. SCAM: A copy detection mechanism for digital documents. In Proc. Int. Conf. Digital Libraries (DL). ACM, 1995.Google Scholar
- N. Shivakumar and H. García-Molina. Building a scalable and accurate copy detection mechanism. In Proc. Int. Conf. Digital Libraries (DL), pp. 160--168. ACM, 1996. Google ScholarDigital Library
- D. R. Swanson. Historical note: Information retrieval and the future of an illusion. J. American Society for Information Science, 39(2):92--98, 1988.Google ScholarCross Ref
- Q. Tan, Z. Zhuang, P. Mitra, and C. L. Giles. Designing efficient sampling techniques to detect webpage updates. In Proc. Int. Conf. World Wide Web (WWW), pp. 1147--1148. ACM, 2007. Google ScholarDigital Library
- Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, and D. Papadias. Query by document. In Proc. Int. Conf. Web Search and Data Mining, pp. 34--43. ACM, 2009. Google ScholarDigital Library
Index Terms
- Automatic retrieval of similar content using search engine query interface
Recommendations
Re-ranking search results using query logs
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge managementThis work addresses two common problems in search, frequently occurring with underspecified user queries: the top-ranked results for such queries may not contain documents relevant to the user's search intent, and fresh and relevant pages may not get ...
A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records
WI-IAT '11: Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01The retrieval of similar documents from the Web using documents as input instead of key-term queries is not currently supported by traditional Web search engines. One approach for solving the problem consists of fingerprint the document's content into a ...
Comments