Abstract
We propose a novel approach for detecting visual similarity between two Web pages. The proposed approach applies Gestalt theory and considers a Web page as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that Web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We illustrate our approach by applying it to the problem of detecting phishing scams. Via a large-scale, real-world case study, we demonstrate that 1) our approach effectively detects similar Web pages; and 2) it accurately distinguishes legitimate pages from phishing pages.
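The abstract does not name the specific measure used to compare pages. As an illustration only, one widely used way to approximate an algorithmic-complexity comparison is the normalized compression distance (NCD), which replaces the uncomputable Kolmogorov complexity with the output size of a real compressor. The sketch below assumes `zlib` as the compressor and raw byte strings as stand-ins for rendered pages; the function names and the sample inputs are hypothetical and are not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input, a rough stand-in for its complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: lower values suggest more shared structure."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # complexity of the two inputs taken together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical inputs: two near-identical pages and one unrelated byte pattern.
page_a = b"<html><body><h1>Sign in</h1><form>user pass</form></body></html>" * 20
page_b = page_a.replace(b"Sign in", b"Log in")
page_c = bytes(range(256)) * 5

if __name__ == "__main__":
    print("similar pair:  ", ncd(page_a, page_b))
    print("unrelated pair:", ncd(page_a, page_c))
```

In this sketch the whole input is compressed as one unit, in the spirit of treating a page as indivisible. A practical system would still need to choose what byte representation of a page to compress and which compressor to use, and both choices are left open here.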
Index Terms
- Detecting visually similar Web pages: Application to phishing detection