skip to main content
10.1145/1645953.1646043acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Automatic retrieval of similar content using search engine query interface

Published:02 November 2009Publication History

ABSTRACT

We consider the coverage testing problem where we are given a document and a corpus with a limited query interface and asked to find if the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve this problem, we propose approaches that work in three main steps: generate a query signature from the document, query the corpus using the query signature and scrape the returned results, and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that these methods perform well across different search engine corpora and documents in multiple languages. They also are robust against performance parameter variations.

References

  1. Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. J. ACM, 55(5):1--74, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. P. Baxendale. Machine-made index for technical literature experiment. IBM J. Research and Development, 2:354--361, October 1958.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. Comput. Netw. ISDN Syst., 30(1-7):379--388, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proc. Joint Conf. Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858--867. ACL, 2007.Google ScholarGoogle Scholar
  5. S. Brin, J. Davis, and H. García-Molina. Copy detection mechanisms for digital documents. In Proc. Int. Conf. Management of Data (SIGMOD), pp. 398--409. ACM, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Broder. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (SEQUENCES), page 21. IEEE, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Broder, M. Fontura, V. Josifovski, R. Kumar, R. Motwani, S. Nabar, R. Panigrahy, A. Tomkins, and Y. Xu. Estimating corpus size via queries. In Proc. Int. Conf. Info. and Knowledge Management (CIKM), pp. 594--603. ACM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Z. Broder. Some applications of rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pp. 143--152. Springer-Verlag, 1993.Google ScholarGoogle ScholarCross RefCross Ref
  9. A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc. Symp. Combinatorial Pattern Matching (COM), pp. 1--10. Springer-Verlag, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In Proc. Symp. Theory of Computing (STOC), pp. 327--336. ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Comput. Netw. ISDN Syst., 29(8-13):1157--1166, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2003.Google ScholarGoogle Scholar
  13. M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. Symp. Theory of Computing (STOC), pp. 380--388. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Dasdan, K. Tsioutsiouliklis, and E. Velipasaoglu. Web search engine metrics. Tutorial in Int. Conf. World Wide Web (WWW), ACM, 2009.Google ScholarGoogle Scholar
  15. R. Ghani, R. Jones, and D. Mladenic. Automatic web search query generation to create minority language corpora. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 432--433. ACM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Henzinger. Finding near-duplicate web pp.: a large-scale evaluation of algorithms. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 284--291. ACM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422--446, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. H. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM J. Research and Development, 1(4):309--317, 1957.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. H. Luhn. The automatic creation of literature abstracts. IBM J. Research and Development, 2(2), 1958.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. Int. Conf. World Wide Web (WWW), pp. 141--150. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. F. D. McSherry, K. Talwar, and M. D. Manasse. Consistent weighted sampling of multisets and distributions. U.S. Patent Appl., Sep 2008.Google ScholarGoogle Scholar
  23. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. T. Noreault, M. McGill, and M. Koll. A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In Proc. of Conf. on Research and Dev. in Info. Retrieval (SIGIR), pp. 57--76. ACM, 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Pereira Jr. and N. Ziviani. Retrieving similar documents from the Web. J. Web Engineering, 2(4):247--261, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. N. Shivakumar and H. García-Molina. SCAM: A copy detection mechanism for digital documents. In Proc. Int. Conf. Digital Libraries (DL). ACM, 1995.Google ScholarGoogle Scholar
  27. N. Shivakumar and H. García-Molina. Building a scalable and accurate copy detection mechanism. In Proc. Int. Conf. Digital Libraries (DL), pp. 160--168. ACM, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. R. Swanson. Historical note: Information retrieval and the future of an illusion. J. American Society for Information Science, 39(2):92--98, 1988.Google ScholarGoogle ScholarCross RefCross Ref
  29. Q. Tan, Z. Zhuang, P. Mitra, and C. L. Giles. Designing efficient sampling techniques to detect webpage updates. In Proc. Int. Conf. World Wide Web (WWW), pp. 1147--1148. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, and D. Papadias. Query by document. In Proc. Int. Conf. Web Search and Data Mining, pp. 34--43. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Automatic retrieval of similar content using search engine query interface

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
      November 2009
      2162 pages
      ISBN:9781605585123
      DOI:10.1145/1645953

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 2 November 2009

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader