research-article

Detecting visually similar Web pages: Application to phishing detection

Authors:
Teh-Chung Chen, University of Alberta, Canada
Scott Dick, University of Alberta, Canada
James Miller, University of Alberta, Canada

ACM Transactions on Internet Technology, Volume 10, Issue 2, Article No. 5, pp. 1–38. https://doi.org/10.1145/1754393.1754394

Published: 10 June 2010

Abstract

We propose a novel approach for detecting visual similarity between two Web pages. The proposed approach applies Gestalt theory and considers a Web page as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that Web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We illustrate our approach by applying it to the problem of detecting phishing scams. Via a large-scale, real-world case study, we demonstrate that (1) our approach effectively detects similar Web pages; and (2) it accurately distinguishes legitimate from phishing pages.
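To make the idea of comparing two whole pages through algorithmic complexity more concrete, the sketch below uses the normalized compression distance (NCD), a widely used computable approximation of Kolmogorov-complexity-based similarity. The abstract does not name the measure, the compressor, or the page representation actually used, so all of these are assumptions: Python's standard `lzma` module stands in for the compressor, and the helper `make_demo_pages` is hypothetical and generates synthetic byte strings in place of real page content.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Length of the LZMA-compressed input, a rough proxy for its complexity."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar inputs, close to 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # each input is compressed as one undivided whole
    return (cxy - min(cx, cy)) / max(cx, cy)


def make_demo_pages(seed: int = 0, size: int = 4000):
    """Hypothetical stand-ins for page content: an original, a near copy, an unrelated page."""
    rng = random.Random(seed)
    original = bytes(rng.getrandbits(8) for _ in range(size))
    near_copy = bytearray(original)
    for i in range(0, size, 500):  # alter a handful of bytes, as an imitation might
        near_copy[i] ^= 0xFF
    unrelated = bytes(rng.getrandbits(8) for _ in range(size))
    return original, bytes(near_copy), unrelated


if __name__ == "__main__":
    original, near_copy, unrelated = make_demo_pages()
    print("near copy:", round(ncd(original, near_copy), 3))
    print("unrelated:", round(ncd(original, unrelated), 3))
```

In this sketch a page is never split into features or regions; the only operation is compressing each input, and the concatenation of both, as a single unit. A detector built on such a measure would still need a decision threshold, which would have to be chosen empirically.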

Index Terms

1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing systems and tools
2. Information systems
  1. World Wide Web

Published in

ACM Transactions on Internet Technology Volume 10, Issue 2
May 2010
123 pages
ISSN:1533-5399
EISSN:1557-6051
DOI:10.1145/1754393

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 June 2010
- Accepted: 1 December 2009
- Revised: 1 June 2009
- Received: 1 January 2009

Author Tags
Algorithmic complexity theory
Gestalt theory
Web page similarity
anti-phishing technologies
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 78
- Total Downloads: 2,127
- Downloads (last 12 months): 52
- Downloads (last 6 weeks): 3