research-article

Detecting visually similar Web pages: Application to phishing detection

Published: 10 June 2010

Abstract

We propose a novel approach for detecting visual similarity between two Web pages. The proposed approach applies Gestalt theory and considers a Web page as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that Web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We illustrate our approach by applying it to the problem of detecting phishing scams. Via a large-scale, real-world case study, we demonstrate that 1) our approach effectively detects similar Web pages; and 2) it accurately distinguishes legitimate and phishing pages.
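
To make the idea of comparing whole pages with algorithmic complexity theory concrete: Kolmogorov complexity is uncomputable, so comparisons of this kind are usually approximated with a real compressor. The sketch below computes the normalized compression distance (NCD) between two byte strings using Python's `lzma` module. The abstract does not name the distance measure, the compressor, or how a page is turned into bytes, so all three choices here, along with the function names `compressed_size` and `ncd`, are assumptions for illustration only.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Length of the LZMA-compressed data, used as a stand-in for complexity."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their structure;
    values near 1 suggest they share very little. Because a real
    compressor is only an approximation, results can fall slightly
    outside the [0, 1] range.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # cost of describing both inputs together
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical inputs: in practice each would be one whole page
    # (for example a rendered screenshot) serialized to bytes.
    page_a = b"<html><body><h1>Sign in</h1><form>user, password</form></body></html>" * 40
    page_b = page_a.replace(b"Sign in", b"Sign-in")
    print(f"NCD(page_a, page_b) = {ncd(page_a, page_b):.3f}")
```

Under these assumptions, a candidate page would be flagged as visually similar to a known legitimate page when its NCD falls below a threshold; that threshold would have to be chosen empirically on labeled data.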

      • Published in

        ACM Transactions on Internet Technology, Volume 10, Issue 2
        May 2010
        123 pages
        ISSN: 1533-5399
        EISSN: 1557-6051
        DOI: 10.1145/1754393

        Copyright © 2010 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 10 June 2010
        • Accepted: 1 December 2009
        • Revised: 1 June 2009
        • Received: 1 January 2009
        Published in TOIT, Volume 10, Issue 2

        Qualifiers

        • research-article
        • Research
        • Refereed
